The R package tidyr and its new functions pivot_longer and pivot_wider

The tidyr package is part of the core of one of the most popular libraries in the R language, the tidyverse.
The main purpose of the package is to bring data into a tidy format.

There is already a publication on Habr devoted to this package, but it dates back to 2015. I want to tell you about the most significant recent changes, which its author, Hadley Wickham, announced a few days ago.

S.J.K.: Will gather() and spread() be deprecated?

Hadley Wickham: To some extent. We will no longer recommend using these functions or fix bugs in them, but they will remain in the package in their current state.

If you are interested in data analysis, you may be interested in my telegram and youtube channels. Most of the content is devoted to the R language.

The tidy data concept

The goal of tidyr is to help you bring data into what is called a tidy format. Tidy data is data where:

  • Each variable is in a column.
  • Each observation is a row.
  • Each value is a cell.

Data presented in tidy form is much simpler and more convenient to work with during analysis.

Main functions included in the tidyr package

tidyr contains a set of functions designed to transform tables:

  • fill() - fills missing values in a column with the previous values;
  • separate() - splits one field into several using a separator;
  • unite() - combines several fields into one, the inverse of separate();
  • pivot_longer() - converts data from wide format to long format;
  • pivot_wider() - converts data from long format to wide format, the inverse of pivot_longer();
  • gather() - legacy function that converts data from wide format to long format;
  • spread() - legacy function that converts data from long format to wide format, the inverse of gather().
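As a rough illustration of the first three helpers in the list, here is a minimal sketch on a made-up data frame. The column names id and date and all values are assumptions for the example, not data from the article.

```r
library(tidyr)

# Hypothetical toy data: one missing id and a date stored as a single string.
df <- data.frame(
  id   = c("a", NA, "b"),
  date = c("2019-01-15", "2019-02-20", "2019-03-25"),
  stringsAsFactors = FALSE
)

# fill(): carry the previous non-missing id downwards.
filled <- fill(df, id)

# separate(): split the date string into three columns at the "-" separator.
parts <- separate(filled, date, into = c("year", "month", "day"), sep = "-")

# unite(): glue the three columns back together, undoing separate().
joined <- unite(parts, "date", year, month, day, sep = "-")
```

After these steps joined has the same date strings as df, which shows that unite() reverses separate().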

A new concept for converting data from wide to long format and vice versa

Previously, the functions gather() and spread() were used for this kind of transformation. Over the years these functions have existed, it became clear that for most users, including the package author, their names and arguments were not obvious. It was hard to find them and to remember which of the two converts a data frame from wide to long format and which does the opposite.

For this reason, two new, important functions designed to transform data frames were added to tidyr.

The new functions pivot_longer() and pivot_wider() were inspired by some of the features of the cdata package, created by John Mount and Nina Zumel.

Installing the most current version, tidyr 0.8.3.9000

To install the newest version of the package, tidyr 0.8.3.9000, where the new functions are available, use the following code.

devtools::install_github("tidyverse/tidyr")

At the time of writing, these functions are only available in the dev version of the package on GitHub.
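The pivot functions were later included in the CRAN release tidyr 1.0.0, so on a current installation the GitHub install may not be needed. A small sketch for checking what the installed copy offers (the variable names are only illustrative):

```r
# Which tidyr version is installed, and does it export the pivot functions?
installed_version <- packageVersion("tidyr")
has_pivot <- all(
  c("pivot_longer", "pivot_wider") %in% getNamespaceExports("tidyr")
)

if (!has_pivot) {
  message(
    "tidyr ", format(installed_version),
    " has no pivot_*() functions; update the package."
  )
}
```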

Migrating to the new functions

In fact, porting old scripts to the new functions is not difficult. To make this clearer, I will take the examples from the documentation of the old functions and show how to perform the same operations with the new pivot_*() functions.

Converting wide format to long format.

Example code from the gather() documentation

# example
library(tidyr)
library(dplyr)
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# old
stocks_gather <- stocks %>% gather(key   = stock, 
                                   value = price, 
                                   -time)

# new
stocks_long   <- stocks %>% pivot_longer(cols      = -time, 
                                       names_to  = "stock", 
                                       values_to = "price")

Converting long format to wide format.

Example code from the spread() documentation

# old
stocks_spread <- stocks_gather %>% spread(key = stock, 
                                          value = price) 

# new 
stock_wide    <- stocks_long %>% pivot_wider(names_from  = "stock",
                                            values_from = "price")

In the examples above with pivot_longer() and pivot_wider(), the original stocks table does not contain the columns listed in the names_to and values_to arguments, so their names must be given in quotes.
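To see that the two new functions really undo each other, here is a minimal round-trip sketch on a shortened version of the stocks table. The seed and the three-row size are assumptions made only to keep the example small and reproducible.

```r
library(tidyr)

set.seed(42)  # only to make the made-up prices reproducible
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:2,
  X = rnorm(3),
  Y = rnorm(3)
)

# Wide to long: the new columns do not exist yet, so their names are quoted.
stocks_long <- pivot_longer(
  stocks,
  cols      = -time,
  names_to  = "stock",
  values_to = "price"
)

# Long to wide: stock and price now exist, so they can be referenced unquoted.
stocks_wide <- pivot_wider(
  stocks_long,
  names_from  = stock,
  values_from = price
)
```

stocks_wide again has the columns time, X and Y with the original values.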

Below is a table that makes it easier to work out how to switch to the new tidyr concept.

[Image: migration table from the old functions to the new ones]

A note from the author

All the text below is an adaptation, I would even say a free translation, of the vignette from the official tidyverse website.

A simple example of converting data from wide to long format

pivot_longer() makes a dataset longer by reducing the number of columns and increasing the number of rows.

To run the examples presented in the article, you first need to load the required packages:

library(tidyr)
library(dplyr)
library(readr)

Suppose we have a table with the results of a survey that (among other things) asked people about their religion and annual income:

#> # A tibble: 18 x 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#>  1 Agnostic      27        34        60        81        76       137
#>  2 Atheist       12        27        37        52        35        70
#>  3 Buddhist      27        21        30        34        33        58
#>  4 Catholic     418       617       732       670       638      1116
#>  5 Don’t k…      15        14        15        11        10        35
#>  6 Evangel…     575       869      1064       982       881      1486
#>  7 Hindu          1         9         7         9        11        34
#>  8 Histori…     228       244       236       238       197       223
#>  9 Jehovah…      20        27        24        24        21        30
#> 10 Jewish        19        19        25        25        30        95
#> # … with 8 more rows, and 4 more variables: `$75-100k` <dbl>,
#> #   `$100-150k` <dbl>, `>150k` <dbl>, `Don't know/refused` <dbl>

This table holds the respondents' religion in the rows, while the income levels are spread across the column names. The number of respondents in each category is stored in the cell at the intersection of religion and income level. To bring the table into a tidy format, it is enough to use pivot_longer():

pew %>% 
  pivot_longer(cols = -religion, names_to = "income", values_to = "count")
#> # A tibble: 180 x 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # … with 170 more rows

Arguments of pivot_longer()

  • The first argument, cols, describes which columns need to be merged. In this case, all columns except religion.
  • The names_to argument gives the name of the variable that will be created from the names of the merged columns.
  • values_to gives the name of the variable that will be created from the data stored in the cells of the merged columns.
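The pew table itself is not defined in the text. Released versions of tidyr (1.0.0 and later) bundle a Pew religion and income survey as the relig_income dataset, so, assuming such a version is installed, the three arguments can be tried like this:

```r
library(tidyr)

income_long <- pivot_longer(
  relig_income,
  cols      = -religion,  # pivot every column except religion
  names_to  = "income",   # the old column names are stored here
  values_to = "count"     # the cell values are stored here
)
```

The 18 rows and 10 income columns of relig_income become 180 rows with the columns religion, income and count.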

Specifications

This is new functionality in the tidyr package that was not previously available with the legacy functions.

A specification is a data frame in which each row corresponds to one column of the new output data frame, with two special columns whose names start with a dot:

  • .name contains the original column name.
  • .value contains the name of the column that will hold the cell values.

The remaining columns of the specification describe how the names of the collapsed columns from .name will appear in the new columns.

A specification describes the metadata stored in the column names, with one row for each column and one column for each variable combined into the column name. This definition may seem confusing at the moment, but after looking at a few examples it will become much clearer.

The point of a specification is that you can retrieve, modify, and define new metadata for the data frame being transformed.

To work with specifications when converting a table from wide to long format, use the pivot_longer_spec() function.

This function takes any data frame and generates its metadata in the way described above.
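A note for readers on a released tidyr: in version 1.0.0 and later the function that builds a specification is named build_longer_spec(), while pivot_longer_spec() applies an existing specification to a data frame. A minimal sketch under that assumption, on a made-up data frame (the columns id, x_a and x_b are hypothetical):

```r
library(tidyr)

# Hypothetical wide table: an id and two measurement columns.
df <- data.frame(id = 1:2, x_a = c(1, 2), x_b = c(3, 4))

# One spec row per pivoted column; .value names the output value column.
spec <- build_longer_spec(df, cols = -id, values_to = "val")

# Applying the spec performs the actual wide-to-long conversion.
long <- pivot_longer_spec(df, spec)
```

Here spec has the columns .name, .value and name, and long has the columns id, name and val.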

As an example, let's take the who dataset that ships with the tidyr package. This dataset contains information provided by the World Health Organization on the incidence of tuberculosis.

who
#> # A tibble: 7,240 x 60
#>    country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
#>    <chr>   <chr> <chr> <int>       <int>        <int>        <int>
#>  1 Afghan… AF    AFG    1980          NA           NA           NA
#>  2 Afghan… AF    AFG    1981          NA           NA           NA
#>  3 Afghan… AF    AFG    1982          NA           NA           NA
#>  4 Afghan… AF    AFG    1983          NA           NA           NA
#>  5 Afghan… AF    AFG    1984          NA           NA           NA
#>  6 Afghan… AF    AFG    1985          NA           NA           NA
#>  7 Afghan… AF    AFG    1986          NA           NA           NA
#>  8 Afghan… AF    AFG    1987          NA           NA           NA
#>  9 Afghan… AF    AFG    1988          NA           NA           NA
#> 10 Afghan… AF    AFG    1989          NA           NA           NA
#> # … with 7,230 more rows, and 53 more variables

Let's build its specification.

spec <- who %>%
  pivot_longer_spec(new_sp_m014:newrel_f65, values_to = "count")

#> # A tibble: 56 x 3
#>    .name        .value name        
#>    <chr>        <chr>  <chr>       
#>  1 new_sp_m014  count  new_sp_m014 
#>  2 new_sp_m1524 count  new_sp_m1524
#>  3 new_sp_m2534 count  new_sp_m2534
#>  4 new_sp_m3544 count  new_sp_m3544
#>  5 new_sp_m4554 count  new_sp_m4554
#>  6 new_sp_m5564 count  new_sp_m5564
#>  7 new_sp_m65   count  new_sp_m65  
#>  8 new_sp_f014  count  new_sp_f014 
#>  9 new_sp_f1524 count  new_sp_f1524
#> 10 new_sp_f2534 count  new_sp_f2534
#> # … with 46 more rows

The country, iso2, iso3 and year fields are already variables. Our task is to pivot the columns from new_sp_m014 to newrel_f65.

The names of these columns store the following information:

  • The prefix new_ indicates that the column contains data on new tuberculosis cases. The current data frame contains only information about new cases, so this prefix carries no meaning in the current context.
  • sp/rel/sn/ep describes the method of diagnosis.
  • m/f is the patient's gender.
  • 014/1524/2534/3544/4554/5564/65 is the patient's age range.

We can split these columns using the extract() function with a regular expression.

spec <- spec %>%
        extract(name, c("diagnosis", "gender", "age"), "new_?(.*)_(.)(.*)")

#> # A tibble: 56 x 5
#>    .name        .value diagnosis gender age  
#>    <chr>        <chr>  <chr>     <chr>  <chr>
#>  1 new_sp_m014  count  sp        m      014  
#>  2 new_sp_m1524 count  sp        m      1524 
#>  3 new_sp_m2534 count  sp        m      2534 
#>  4 new_sp_m3544 count  sp        m      3544 
#>  5 new_sp_m4554 count  sp        m      4554 
#>  6 new_sp_m5564 count  sp        m      5564 
#>  7 new_sp_m65   count  sp        m      65   
#>  8 new_sp_f014  count  sp        f      014  
#>  9 new_sp_f1524 count  sp        f      1524 
#> 10 new_sp_f2534 count  sp        f      2534 
#> # … with 46 more rows

Note that the .name column must remain unchanged, since it is our index into the column names of the original dataset.

Gender and age (the gender and age columns) have fixed, known values, so it is recommended to convert these columns to factors:

spec <-  spec %>%
            mutate(
              gender = factor(gender, levels = c("f", "m")),
              age = factor(age, levels = unique(age), ordered = TRUE)
            ) 

Finally, to apply the specification we created to the original data frame who, we need to use the spec argument of the pivot_longer() function.

who %>% pivot_longer(spec = spec)

#> # A tibble: 405,440 x 8
#>    country     iso2  iso3   year diagnosis gender age   count
#>    <chr>       <chr> <chr> <int> <chr>     <fct>  <ord> <int>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # … with 405,430 more rows
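The calls above use the development API from the time of writing. In released tidyr (1.0.0 and later) the same steps are spelled build_longer_spec() and pivot_longer_spec(). The following is a sketch of the whole pipeline under that assumption; the factor conversion is done with base R here only so that the example needs no package other than tidyr.

```r
library(tidyr)

# Build the spec and split the old column names with a regular expression.
spec <- who %>%
  build_longer_spec(cols = new_sp_m014:newrel_f65, values_to = "count") %>%
  extract(name, c("diagnosis", "gender", "age"), "new_?(.*)_(.)(.*)")

# Gender and age have a fixed set of values, so store them as factors.
spec$gender <- factor(spec$gender, levels = c("f", "m"))
spec$age    <- factor(spec$age, levels = unique(spec$age), ordered = TRUE)

# Apply the finished spec to the original data frame.
who_long <- pivot_longer_spec(who, spec)
```

The result should have the same shape as the output above: one row per country, year, diagnosis, gender and age combination.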

Everything we have just done can be depicted schematically as follows:

[Image: diagram of the steps above]

Specifications with multiple values (.value)

In the example above, the .value column of the specification contained only one value. In most cases this is so.

But occasionally a situation arises where you need to collect data from columns whose values have different data types. With the legacy function gather() this would be quite difficult.

The example below is taken from the vignette for the data.table package.

Let's create a training data frame.

family <- tibble::tribble(
  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,
       1L, "1998-11-26", "2000-01-29",             1L,             2L,
       2L, "1996-06-22",           NA,             2L,             NA,
       3L, "2002-07-11", "2004-04-05",             2L,             2L,
       4L, "2004-10-10", "2009-08-27",             1L,             1L,
       5L, "2000-12-05", "2005-02-28",             2L,             1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)

#> # A tibble: 5 x 5
#>   family dob_child1 dob_child2 gender_child1 gender_child2
#>    <int> <date>     <date>             <int>         <int>
#> 1      1 1998-11-26 2000-01-29             1             2
#> 2      2 1996-06-22 NA                     2            NA
#> 3      3 2002-07-11 2004-04-05             2             2
#> 4      4 2004-10-10 2009-08-27             1             1
#> 5      5 2000-12-05 2005-02-28             2             1

The resulting data frame contains data on the children of one family in each row. Families can have one or two children. For each child, the date of birth and gender are given, and each child's data sits in separate columns. Our task is to bring this data into a format suitable for analysis.

Note that we have two variables with information about each child: their gender and date of birth (columns with the dob prefix contain the date of birth, columns with the gender prefix contain the child's gender). In the expected result they should appear in separate columns. We can do this by generating a specification in which the .value column has two different values.

spec <- family %>%
  pivot_longer_spec(-family) %>%
  separate(col = name, into = c(".value", "child"))%>%
  mutate(child = parse_number(child))

#> # A tibble: 4 x 3
#>   .name         .value child
#>   <chr>         <chr>  <dbl>
#> 1 dob_child1    dob        1
#> 2 dob_child2    dob        2
#> 3 gender_child1 gender     1
#> 4 gender_child2 gender     2

So, let's walk step by step through what the code above does.

  • pivot_longer_spec(-family) - creates a specification that collapses all existing columns except the family column.
  • separate(col = name, into = c(".value", "child")) - splits the name column, which contains the names of the source fields, at the underscore and puts the resulting values into the .value and child columns.
  • mutate(child = parse_number(child)) - converts the values of the child field from text to a numeric data type.

Now we can apply the resulting specification to the original data frame and bring the table into the desired form.

family %>% 
    pivot_longer(spec = spec, na.rm = TRUE)

#> # A tibble: 9 x 4
#>   family child dob        gender
#>    <int> <dbl> <date>      <int>
#> 1      1     1 1998-11-26      1
#> 2      1     2 2000-01-29      2
#> 3      2     1 1996-06-22      2
#> 4      3     1 2002-07-11      2
#> 5      3     2 2004-04-05      2
#> 6      4     1 2004-10-10      1
#> 7      4     2 2009-08-27      1
#> 8      5     1 2000-12-05      2
#> 9      5     2 2005-02-28      1

We use the argument na.rm = TRUE because the current shape of the data forces the creation of extra rows for observations that do not exist. Since family 2 has only one child, na.rm = TRUE guarantees that family 2 will have a single row in the output.
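In released tidyr (1.0.0 and later) the na.rm argument of the pivot functions is called values_drop_na, and build_longer_spec() can produce the two-valued .value column directly through names_to and names_sep. A sketch under those assumptions, using only the first two families and base R date parsing to stay self-contained:

```r
library(tidyr)

# The first two families from the example above.
family <- tibble::tribble(
  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,
       1L, "1998-11-26", "2000-01-29",             1L,             2L,
       2L, "1996-06-22",           NA,             2L,             NA
)
family$dob_child1 <- as.Date(family$dob_child1)
family$dob_child2 <- as.Date(family$dob_child2)

# ".value" in names_to says: the part before "_" names the output column.
spec <- build_longer_spec(
  family,
  cols      = -family,
  names_to  = c(".value", "child"),
  names_sep = "_"
)

# Drop the row for the non-existent second child of family 2.
family_long <- pivot_longer_spec(family, spec, values_drop_na = TRUE)
```

In this variant the child column keeps the text labels child1 and child2; converting them to numbers is left out for brevity.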

Converting data frames from long to wide format

pivot_wider() is the inverse transformation: it increases the number of columns of a data frame by reducing the number of rows.

This kind of transformation is rarely used to bring data into a tidy format. However, the technique can be useful for creating pivot tables for presentations or for integrating with other tools.

In fact, pivot_longer() and pivot_wider() are symmetric and perform inverse operations, i.e. df %>% pivot_longer(spec = spec) %>% pivot_wider(spec = spec) and df %>% pivot_wider(spec = spec) %>% pivot_longer(spec = spec) return the original df.

The simplest example of converting a table to wide format

To demonstrate how pivot_wider() works, we will use the fish_encounters dataset, which stores information about how different stations record the movement of fish along a river.

#> # A tibble: 114 x 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # … with 104 more rows

In most cases, this table will be more informative and easier to use if the data for each station is presented in a separate column.

fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1
#>  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA
#>  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA
#>  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA
#>  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA
#>  8 4850        1     1     NA     1       1     1     1    NA    NA    NA
#>  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA
#> 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA
#> # … with 9 more rows, and 1 more variable: MAW <int>

This dataset only records information when a fish was detected by a station, i.e. if a fish was not recorded by some station, that data will not be in the table. This means the output will be filled with NA.

However, in this case we know that the absence of a record means the fish was not seen, so we can use the values_fill argument of pivot_wider() and fill these missing values with zeros:

fish_encounters %>% pivot_wider(
  names_from = station, 
  values_from = seen,
  values_fill = list(seen = 0)
)

#> # A tibble: 19 x 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1
#>  4 4845        1     1      1     1       1     0     0     0     0     0
#>  5 4847        1     1      1     0       0     0     0     0     0     0
#>  6 4848        1     1      1     1       0     0     0     0     0     0
#>  7 4849        1     1      0     0       0     0     0     0     0     0
#>  8 4850        1     1      0     1       1     1     1     0     0     0
#>  9 4851        1     1      0     0       0     0     0     0     0     0
#> 10 4854        1     1      0     0       0     0     0     0     0     0
#> # … with 9 more rows, and 1 more variable: MAW <int>
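The effect of values_fill is easiest to see on a tiny table. The following sketch uses made-up sightings (the fish and station labels are hypothetical) in which one fish and station combination is absent:

```r
library(tidyr)

# Hypothetical sightings: fish "b" was never recorded at station "s2".
sightings <- data.frame(
  fish    = c("a", "a", "b"),
  station = c("s1", "s2", "s1"),
  seen    = 1L,
  stringsAsFactors = FALSE
)

wide <- pivot_wider(
  sightings,
  names_from  = station,
  values_from = seen,
  values_fill = list(seen = 0L)  # absent combinations become 0 instead of NA
)
```

Without values_fill the cell for fish "b" at station "s2" would be NA; with it the cell holds 0.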

Generating column names from multiple source variables

Imagine we have a table containing combinations of product, country and year. To generate a test data frame, you can run the following code:

df <- expand_grid(
  product = c("A", "B"), 
  country = c("AI", "EI"), 
  year = 2000:2014
) %>%
  filter((product == "A" & country == "AI") | product == "B") %>% 
  mutate(value = rnorm(nrow(.)))

#> # A tibble: 45 x 4
#>    product country  year    value
#>    <chr>   <chr>   <int>    <dbl>
#>  1 A       AI       2000 -2.05   
#>  2 A       AI       2001 -0.676  
#>  3 A       AI       2002  1.60   
#>  4 A       AI       2003 -0.353  
#>  5 A       AI       2004 -0.00530
#>  6 A       AI       2005  0.442  
#>  7 A       AI       2006 -0.610  
#>  8 A       AI       2007 -2.77   
#>  9 A       AI       2008  0.899  
#> 10 A       AI       2009 -0.106  
#> # … with 35 more rows

Our task is to widen the data frame so that a single column contains the data for each combination of product and country. To do this, simply pass to the names_from argument a vector containing the names of the fields to be combined.

df %>% pivot_wider(names_from = c(product, country),
                 values_from = "value")

#> # A tibble: 15 x 4
#>     year     A_AI    B_AI    B_EI
#>    <int>    <dbl>   <dbl>   <dbl>
#>  1  2000 -2.05     0.607   1.20  
#>  2  2001 -0.676    1.65   -0.114 
#>  3  2002  1.60    -0.0245  0.501 
#>  4  2003 -0.353    1.30   -0.459 
#>  5  2004 -0.00530  0.921  -0.0589
#>  6  2005  0.442   -1.55    0.594 
#>  7  2006 -0.610    0.380  -1.28  
#>  8  2007 -2.77     0.830   0.637 
#>  9  2008  0.899    0.0175 -1.30  
#> 10  2009 -0.106   -0.195   1.03  
#> # … with 5 more rows
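The combined column names are joined with an underscore by default. A minimal sketch on a made-up three-row table (the sales data frame is hypothetical) shows the default and how the names_sep argument changes the separator:

```r
library(tidyr)

# Hypothetical data: product A is sold in one country, product B in two.
sales <- data.frame(
  product = c("A", "B", "B"),
  country = c("AI", "AI", "EI"),
  year    = 2000L,
  value   = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Default: names are joined with "_".
wide <- pivot_wider(
  sales,
  names_from  = c(product, country),
  values_from = value
)

# names_sep chooses a different separator.
wide_dot <- pivot_wider(
  sales,
  names_from  = c(product, country),
  values_from = value,
  names_sep   = "."
)
```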

You can also apply a specification to pivot_wider(). But when passed to pivot_wider(), the specification performs the opposite transformation to pivot_longer(): the columns specified in .name are created, using the values from .value and the other columns.

For this dataset, you can generate a custom specification if you want every possible combination of country and product to have its own column, not just those present in the data:

spec <- df %>% 
  expand(product, country, .value = "value") %>% 
  unite(".name", product, country, remove = FALSE)

#> # A tibble: 4 x 4
#>   .name product country .value
#>   <chr> <chr>   <chr>   <chr> 
#> 1 A_AI  A       AI      value 
#> 2 A_EI  A       EI      value 
#> 3 B_AI  B       AI      value 
#> 4 B_EI  B       EI      value

df %>% pivot_wider(spec = spec) %>% head()

#> # A tibble: 6 x 5
#>    year     A_AI  A_EI    B_AI    B_EI
#>   <int>    <dbl> <dbl>   <dbl>   <dbl>
#> 1  2000 -2.05       NA  0.607   1.20  
#> 2  2001 -0.676      NA  1.65   -0.114 
#> 3  2002  1.60       NA -0.0245  0.501 
#> 4  2003 -0.353      NA  1.30   -0.459 
#> 5  2004 -0.00530    NA  0.921  -0.0589
#> 6  2005  0.442      NA -1.55    0.594
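In released tidyr (1.0.0 and later) a wide specification is applied with pivot_wider_spec() rather than through a spec argument. A sketch under that assumption, on a made-up three-row table (the sales data frame is hypothetical):

```r
library(tidyr)

# Hypothetical data: the combination A / EI never occurs.
sales <- data.frame(
  product = c("A", "B", "B"),
  country = c("AI", "AI", "EI"),
  year    = 2000L,
  value   = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# expand() lists every product and country combination, present or not;
# unite() builds the output column name for each of them.
spec <- sales %>%
  expand(product, country, .value = "value") %>%
  unite(".name", product, country, remove = FALSE)

wide <- pivot_wider_spec(sales, spec)
```

The A_EI column is created even though no such rows exist, and it is filled with NA.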

A few more advanced examples of working with the new tidyr concept

Tidying data, using the US Census income and rent dataset as an example.

The us_rent_income dataset contains median income and rent information for each US state for 2017 (the dataset is available in the tidycensus package).

us_rent_income
#> # A tibble: 104 x 5
#>    GEOID NAME       variable estimate   moe
#>    <chr> <chr>      <chr>       <dbl> <dbl>
#>  1 01    Alabama    income      24476   136
#>  2 01    Alabama    rent          747     3
#>  3 02    Alaska     income      32940   508
#>  4 02    Alaska     rent         1200    13
#>  5 04    Arizona    income      27517   148
#>  6 04    Arizona    rent          972     4
#>  7 05    Arkansas   income      23789   165
#>  8 05    Arkansas   rent          709     5
#>  9 06    California income      29454   109
#> 10 06    California rent         1358     3
#> # … with 94 more rows

In the form in which the data is stored in the us_rent_income dataset, working with it is very inconvenient, so we would like to create a dataset with the columns rent, rent_moe, income and income_moe. There are many ways to create this specification, but the key point is that we need to generate every combination of the values of variable and estimate/moe, and then generate the column name.

  spec <- us_rent_income %>% 
    expand(variable, .value = c("estimate", "moe")) %>% 
    mutate(
      .name = paste0(variable, ifelse(.value == "moe", "_moe", ""))
    )

#> # A tibble: 4 x 3
#>   variable .value   .name     
#>   <chr>    <chr>    <chr>     
#> 1 income   estimate income    
#> 2 income   moe      income_moe
#> 3 rent     estimate rent      
#> 4 rent     moe      rent_moe

Supplying this specification to pivot_wider() gives us the result we are looking for:

us_rent_income %>% pivot_wider(spec = spec)

#> # A tibble: 52 x 6
#>    GEOID NAME                 income income_moe  rent rent_moe
#>    <chr> <chr>                 <dbl>      <dbl> <dbl>    <dbl>
#>  1 01    Alabama               24476        136   747        3
#>  2 02    Alaska                32940        508  1200       13
#>  3 04    Arizona               27517        148   972        4
#>  4 05    Arkansas              23789        165   709        5
#>  5 06    California            29454        109  1358        3
#>  6 08    Colorado              32401        109  1125        5
#>  7 09    Connecticut           35326        195  1123        5
#>  8 10    Delaware              31560        247  1076       10
#>  9 11    District of Columbia  43198        681  1424       17
#> 10 12    Florida               25952         70  1077        3
#> # … with 42 more rows
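For comparison, here is what happens without a specification. This sketch assumes a released tidyr (1.0.0 or later), where us_rent_income ships with the package and values_from accepts several columns:

```r
library(tidyr)

# Two value columns at once: the default names are <value column>_<variable>.
rent_wide <- pivot_wider(
  us_rent_income,
  names_from  = variable,
  values_from = c(estimate, moe)
)
```

The default column names come out as estimate_income, estimate_rent, moe_income and moe_rent, which is why a custom specification is needed to get names such as income and income_moe.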

The World Bank

Sometimes bringing a dataset into the desired form takes several steps.
The world_bank_pop dataset contains World Bank data on the population of each country from 2000 to 2017.

#> # A tibble: 1,056 x 20
#>    country indicator `2000` `2001` `2002` `2003`  `2004`  `2005`   `2006`
#>    <chr>   <chr>      <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
#>  1 ABW     SP.URB.T… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4
#>  2 ABW     SP.URB.G… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
#>  3 ABW     SP.POP.T… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5
#>  4 ABW     SP.POP.G… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1
#>  5 AFG     SP.URB.T… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6
#>  6 AFG     SP.URB.G… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0  4.12e+0
#>  7 AFG     SP.POP.T… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7  2.59e+7
#>  8 AFG     SP.POP.G… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0  3.23e+0
#>  9 AGO     SP.URB.T… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7  1.15e+7
#> 10 AGO     SP.URB.G… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0  4.92e+0
#> # … with 1,046 more rows, and 11 more variables: `2007` <dbl>,
#> #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
#> #   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>

Our goal is to produce a tidy dataset with each variable in its own column. It is not clear exactly which steps are needed, but we will start with the most obvious problem: year is spread across multiple columns.

To fix this, you need to use the pivot_longer() function.

pop2 <- world_bank_pop %>% 
  pivot_longer(`2000`:`2017`, names_to = "year")

#> # A tibble: 19,008 x 4
#>    country indicator   year  value
#>    <chr>   <chr>       <chr> <dbl>
#>  1 ABW     SP.URB.TOTL 2000  42444
#>  2 ABW     SP.URB.TOTL 2001  43048
#>  3 ABW     SP.URB.TOTL 2002  43670
#>  4 ABW     SP.URB.TOTL 2003  44246
#>  5 ABW     SP.URB.TOTL 2004  44669
#>  6 ABW     SP.URB.TOTL 2005  44889
#>  7 ABW     SP.URB.TOTL 2006  44881
#>  8 ABW     SP.URB.TOTL 2007  44686
#>  9 ABW     SP.URB.TOTL 2008  44375
#> 10 ABW     SP.URB.TOTL 2009  44052
#> # … with 18,998 more rows

The next step is to look at the indicator variable.

pop2 %>% count(indicator)

#> # A tibble: 4 x 2
#>   indicator       n
#>   <chr>       <int>
#> 1 SP.POP.GROW  4752
#> 2 SP.POP.TOTL  4752
#> 3 SP.URB.GROW  4752
#> 4 SP.URB.TOTL  4752

Here SP.POP.GROW is population growth, SP.POP.TOTL is total population, and SP.URB.* are the same, but for urban areas only. Let's split these values into two variables: area - the area type (total or urban) and variable - the actual measure (population or growth):

pop3 <- pop2 %>% 
  separate(indicator, c(NA, "area", "variable"))

#> # A tibble: 19,008 x 5
#>    country area  variable year  value
#>    <chr>   <chr> <chr>    <chr> <dbl>
#>  1 ABW     URB   TOTL     2000  42444
#>  2 ABW     URB   TOTL     2001  43048
#>  3 ABW     URB   TOTL     2002  43670
#>  4 ABW     URB   TOTL     2003  44246
#>  5 ABW     URB   TOTL     2004  44669
#>  6 ABW     URB   TOTL     2005  44889
#>  7 ABW     URB   TOTL     2006  44881
#>  8 ABW     URB   TOTL     2007  44686
#>  9 ABW     URB   TOTL     2008  44375
#> 10 ABW     URB   TOTL     2009  44052
#> # … with 18,998 more rows

Now all we have to do is split variable into two columns:

pop3 %>% 
  pivot_wider(names_from = variable, values_from = value)

#> # A tibble: 9,504 x 5
#>    country area  year   TOTL    GROW
#>    <chr>   <chr> <chr> <dbl>   <dbl>
#>  1 ABW     URB   2000  42444  1.18  
#>  2 ABW     URB   2001  43048  1.41  
#>  3 ABW     URB   2002  43670  1.43  
#>  4 ABW     URB   2003  44246  1.31  
#>  5 ABW     URB   2004  44669  0.951 
#>  6 ABW     URB   2005  44889  0.491 
#>  7 ABW     URB   2006  44881 -0.0178
#>  8 ABW     URB   2007  44686 -0.435 
#>  9 ABW     URB   2008  44375 -0.698 
#> 10 ABW     URB   2009  44052 -0.731 
#> # … with 9,494 more rows

Contact list

One last example: imagine you have a contact list that you copied and pasted from a website:

contacts <- tribble(
  ~field, ~value,
  "name", "Jiena McLellan",
  "company", "Toyota", 
  "name", "John Smith", 
  "company", "google", 
  "email", "[email protected]",
  "name", "Huxley Ratcliffe"
)

Tabulating this list is quite difficult because there is no variable that identifies which data belongs to which contact. We can fix this by noting that the data for each new contact starts with "name", so we can create a unique identifier and increment it by one every time the field column contains the value "name":

contacts <- contacts %>% 
  mutate(
    person_id = cumsum(field == "name")
  )
contacts

#> # A tibble: 6 x 3
#>   field   value            person_id
#>   <chr>   <chr>                <int>
#> 1 name    Jiena McLellan           1
#> 2 company Toyota                   1
#> 3 name    John Smith               2
#> 4 company google                   2
#> 5 email   [email protected]          2
#> 6 name    Huxley Ratcliffe         3
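The counter works because R treats TRUE as 1 and FALSE as 0 when summing. The trick can be sketched in isolation with base R, using just the field values from the table above:

```r
field <- c("name", "company", "name", "company", "email", "name")

# The comparison gives TRUE at every "name" row; the running sum
# therefore increases by one exactly where a new contact starts.
person_id <- cumsum(field == "name")
person_id
#> [1] 1 1 2 2 2 3
```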

Now that we have a unique ID for each contact, we can pivot field and value into columns:

contacts %>% 
  pivot_wider(names_from = field, values_from = value)

#> # A tibble: 3 x 4
#>   person_id name             company email          
#>       <int> <chr>            <chr>   <chr>          
#> 1         1 Jiena McLellan   Toyota  <NA>           
#> 2         2 John Smith       google  [email protected]
#> 3         3 Huxley Ratcliffe <NA>    <NA>

Conclusion

My personal opinion is that the new tidyr concept really is more intuitive, and considerably superior in functionality to the legacy functions spread() and gather(). I hope this article has helped you get to grips with pivot_longer() and pivot_wider().

source: www.habr.com
