The R package tidyr and its new functions pivot_longer and pivot_wider

The tidyr package is part of the core of one of the most popular libraries in the R language, the tidyverse.
The main purpose of the package is to bring data into a tidy form.

Habr already has a publication devoted to this package, but it dates back to 2015. I want to tell you about the latest changes, which were announced a few days ago by the package author, Hadley Wickham.

S.J.K.: Will gather() and spread() be deprecated?

Hadley Wickham: To some extent. We will no longer recommend these functions or fix bugs in them, but they will remain in the package in their current state.

If you are interested in data analysis, you may also be interested in my telegram and youtube channels. Most of their content is devoted to the R language.

The tidy data concept

The goal of tidyr is to help you bring data into a so-called tidy form. Tidy data is data in which:

  • Each variable is in a column.
  • Each observation is a row.
  • Each value is a cell.

Data presented in tidy form is much easier and more convenient to work with during analysis.

Main functions in the tidyr package

tidyr contains a set of functions designed to transform tables:

  • fill() - fills missing values in a column with the previous value;
  • separate() - splits one field into several using a separator;
  • unite() - combines several fields into one, the inverse of separate();
  • pivot_longer() - converts data from wide format to long format;
  • pivot_wider() - converts data from long format to wide format, the inverse of pivot_longer();
  • gather() (obsolete) - converts data from wide format to long format;
  • spread() (obsolete) - converts data from long format to wide format, the inverse of gather().
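
As a quick illustration of the first three helpers, here is a minimal sketch on a made-up data frame (the visits table and its column names are invented for this example and do not come from the original article):

```r
library(tidyr)

# Hypothetical toy data: the group label is only written on its first row.
visits <- data.frame(
  group  = c("a", NA, "b", NA),
  period = c("2019-01", "2019-02", "2019-01", "2019-02"),
  stringsAsFactors = FALSE
)

# fill(): carry the last non-missing group label downwards.
filled <- fill(visits, group)

# separate(): split "period" into two fields at the "-" separator.
parts <- separate(filled, period, into = c("year", "month"), sep = "-")

# unite(): the reverse of separate(), gluing the two fields back together.
joined <- unite(parts, "period", year, month, sep = "-")
```

After the last step, joined has the same period values as the original table, which shows that unite() undoes separate().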

A new concept for converting data from wide to long format and back

Previously, gather() and spread() were used for this kind of transformation. Over the years these functions have existed, it became clear that for most users, including the package author, their names and arguments were not obvious. This made it hard to find them and to remember which one converts a data frame from wide to long format and which one does the reverse.

For this reason, two important new functions for transforming data frames have been added to tidyr.

The new functions pivot_longer() and pivot_wider() were inspired by some features of the cdata package, created by John Mount and Nina Zumel.

Installing the most current version, tidyr 0.8.3.9000

To install the newest version of the package, tidyr 0.8.3.9000, in which the new functions are available, use the following code.

devtools::install_github("tidyverse/tidyr")

At the time of writing, these functions are only available in the dev version of the package on GitHub.
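
The dev-version requirement reflects the time of writing; to my knowledge both functions have been part of the CRAN releases since tidyr 1.0.0. A small, hedged check of whether the installed tidyr already exports them (the has_pivot variable name is only used in this sketch):

```r
# Check whether the installed tidyr exports both pivot functions.
# The && short-circuits, so nothing is looked up if tidyr is missing.
has_pivot <- requireNamespace("tidyr", quietly = TRUE) &&
  all(c("pivot_longer", "pivot_wider") %in% getNamespaceExports("tidyr"))

if (has_pivot) {
  message("pivot_longer() and pivot_wider() are available")
} else {
  message("Install a newer tidyr, e.g. install.packages(\"tidyr\")")
}
```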

Moving to the new functions

In fact, porting old scripts to the new functions is not difficult. To make this clearer, I will take the examples from the documentation of the old functions and show how the same operations are performed with the new pivot_*() functions.

Converting wide format to long format.

Example code from the documentation of the gather() function


# example
library(tidyr)  # gather(), spread(), pivot_longer(), pivot_wider()
library(dplyr)  # the %>% pipe
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)

# old
stocks_gather <- stocks %>% gather(key   = stock, 
                                   value = price, 
                                   -time)

# new
stocks_long   <- stocks %>% pivot_longer(cols      = -time, 
                                       names_to  = "stock", 
                                       values_to = "price")

Converting long format to wide format.

Example code from the documentation of the spread() function

# old
stocks_spread <- stocks_gather %>% spread(key = stock, 
                                          value = price) 

# new 
stock_wide    <- stocks_long %>% pivot_wider(names_from  = "stock",
                                            values_from = "price")

In the examples above with pivot_longer() and pivot_wider(), the original stocks table has no columns with the names given in the names_to and values_to arguments, so those names must be written in quotes.
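
The quoting rule can be seen in a compact sketch (the tiny stocks_small table is made up for illustration; the calls are the same as above). New column names are passed as strings because they do not exist yet, while columns that already exist can be referred to without quotes:

```r
library(tidyr)

# Hypothetical two-row version of the stocks table.
stocks_small <- data.frame(
  time = as.Date("2009-01-01") + 0:1,
  X = c(1.5, 2.5),
  Y = c(10, 20)
)

# "stock" and "price" do not exist in stocks_small yet, so they are quoted.
long <- pivot_longer(stocks_small, cols = -time,
                     names_to = "stock", values_to = "price")

# stock and price now exist in `long`, so bare names work as well.
wide <- pivot_wider(long, names_from = stock, values_from = price)
```

One difference worth knowing when porting scripts: pivot_longer() keeps the rows of each original observation together (X and Y for the first date, then X and Y for the second), whereas gather() stacks the source columns one after another.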

The following table is the easiest way to understand how to switch to the new tidyr concept.

[Image not preserved: table comparing the old and new functions]

Author's note

All the text below is an adaptation, I would even say a free translation, of the vignette from the official tidyverse website.

A simple example of converting data from wide to long format

pivot_longer() makes a data set longer by decreasing the number of columns and increasing the number of rows.

To run the examples presented in this article, you need to load the required packages:

library(tidyr)
library(dplyr)
library(readr)

Suppose we have a table with the results of a survey that (among other things) asked people about their religion and annual income:

#> # A tibble: 18 x 11
#>    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
#>    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#>  1 Agnostic      27        34        60        81        76       137
#>  2 Atheist       12        27        37        52        35        70
#>  3 Buddhist      27        21        30        34        33        58
#>  4 Catholic     418       617       732       670       638      1116
#>  5 Don’t k…      15        14        15        11        10        35
#>  6 Evangel…     575       869      1064       982       881      1486
#>  7 Hindu          1         9         7         9        11        34
#>  8 Histori…     228       244       236       238       197       223
#>  9 Jehovah…      20        27        24        24        21        30
#> 10 Jewish        19        19        25        25        30        95
#> # … with 8 more rows, and 4 more variables: `$75-100k` <dbl>,
#> #   `$100-150k` <dbl>, `>150k` <dbl>, `Don't know/refused` <dbl>

This table holds the respondents' religion in the rows, while the income levels are spread across the column names. The number of respondents in each category is stored in the cell at the intersection of religion and income level. To bring the table into a tidy, correct format, it is enough to use pivot_longer():

pew %>% 
  pivot_longer(cols = -religion, names_to = "income", values_to = "count")
#> # A tibble: 180 x 3
#>    religion income             count
#>    <chr>    <chr>              <dbl>
#>  1 Agnostic <$10k                 27
#>  2 Agnostic $10-20k               34
#>  3 Agnostic $20-30k               60
#>  4 Agnostic $30-40k               81
#>  5 Agnostic $40-50k               76
#>  6 Agnostic $50-75k              137
#>  7 Agnostic $75-100k             122
#>  8 Agnostic $100-150k            109
#>  9 Agnostic >150k                 84
#> 10 Agnostic Don't know/refused    96
#> # … with 170 more rows

Arguments of the pivot_longer() function

  • The first argument, cols, describes which columns need to be merged. In this case, all columns except religion.
  • The names_to argument gives the name of the variable that will be created from the names of the merged columns.
  • values_to gives the name of the variable that will be created from the data stored in the cells of the merged columns.
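
A hedged sketch of the three arguments in action. It assumes a released tidyr (1.0.0 or later), which to my knowledge ships this survey table as the relig_income dataset; the article itself calls the table pew:

```r
library(tidyr)

income_long <- pivot_longer(
  relig_income,
  cols      = -religion,  # every column except religion is pivoted
  names_to  = "income",   # the old column names end up here
  values_to = "count"     # the old cell values end up here
)

dim(relig_income)  # 18 rows, 11 columns
dim(income_long)   # 180 rows (18 religions x 10 income columns), 3 columns
```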

Specification

This is new functionality in the tidyr package that was not available with the legacy functions.

A specification is a data frame in which each row corresponds to one column of the wide-format data, with two special columns whose names start with a dot:

  • .name contains the original column name.
  • .value contains the name of the column that will hold the cell values.

The remaining columns of the specification describe how the new columns will represent the information packed into the column names in .name.

The specification describes the metadata stored in the column names, with one row per column and one column per variable, combined with the column name. This definition may seem confusing, but after a few examples it becomes much clearer.

The point of a specification is that you can retrieve, modify, and define new metadata for the data frame being transformed.

To work with specifications when converting a table from wide to long format, use the pivot_longer_spec() function.

This function takes a data frame and generates its metadata in the way described above.

As an example, let's take the who dataset supplied with the tidyr package. It contains information provided by the World Health Organization on the incidence of tuberculosis.

who
#> # A tibble: 7,240 x 60
#>    country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
#>    <chr>   <chr> <chr> <int>       <int>        <int>        <int>
#>  1 Afghan… AF    AFG    1980          NA           NA           NA
#>  2 Afghan… AF    AFG    1981          NA           NA           NA
#>  3 Afghan… AF    AFG    1982          NA           NA           NA
#>  4 Afghan… AF    AFG    1983          NA           NA           NA
#>  5 Afghan… AF    AFG    1984          NA           NA           NA
#>  6 Afghan… AF    AFG    1985          NA           NA           NA
#>  7 Afghan… AF    AFG    1986          NA           NA           NA
#>  8 Afghan… AF    AFG    1987          NA           NA           NA
#>  9 Afghan… AF    AFG    1988          NA           NA           NA
#> 10 Afghan… AF    AFG    1989          NA           NA           NA
#> # … with 7,230 more rows, and 53 more variables

Let's build the specification.

spec <- who %>%
  pivot_longer_spec(new_sp_m014:newrel_f65, values_to = "count")

#> # A tibble: 56 x 3
#>    .name        .value name        
#>    <chr>        <chr>  <chr>       
#>  1 new_sp_m014  count  new_sp_m014 
#>  2 new_sp_m1524 count  new_sp_m1524
#>  3 new_sp_m2534 count  new_sp_m2534
#>  4 new_sp_m3544 count  new_sp_m3544
#>  5 new_sp_m4554 count  new_sp_m4554
#>  6 new_sp_m5564 count  new_sp_m5564
#>  7 new_sp_m65   count  new_sp_m65  
#>  8 new_sp_f014  count  new_sp_f014 
#>  9 new_sp_f1524 count  new_sp_f1524
#> 10 new_sp_f2534 count  new_sp_f2534
#> # … with 46 more rows

The fields country, iso2 and iso3 are already variables. Our task is to pivot the columns from new_sp_m014 to newrel_f65.

The names of these columns store the following information:

  • The prefix new_ indicates that the column contains data on new cases of tuberculosis. The current data frame only contains information on new cases, so in this context the prefix carries no meaning.
  • sp/rel/sn/ep describes the method used to diagnose the disease.
  • m/f is the patient's gender.
  • 014/1524/2534/3544/4554/5564/65 is the patient's age range.

We can split these columns using the extract() function with a regular expression.

spec <- spec %>%
        extract(name, c("diagnosis", "gender", "age"), "new_?(.*)_(.)(.*)")

#> # A tibble: 56 x 5
#>    .name        .value diagnosis gender age  
#>    <chr>        <chr>  <chr>     <chr>  <chr>
#>  1 new_sp_m014  count  sp        m      014  
#>  2 new_sp_m1524 count  sp        m      1524 
#>  3 new_sp_m2534 count  sp        m      2534 
#>  4 new_sp_m3544 count  sp        m      3544 
#>  5 new_sp_m4554 count  sp        m      4554 
#>  6 new_sp_m5564 count  sp        m      5564 
#>  7 new_sp_m65   count  sp        m      65   
#>  8 new_sp_f014  count  sp        f      014  
#>  9 new_sp_f1524 count  sp        f      1524 
#> 10 new_sp_f2534 count  sp        f      2534 
#> # … with 46 more rows

Note that the .name column must remain unchanged, because it is our index into the column names of the original dataset.

Gender and age (the gender and age columns) have fixed, known values, so it is a good idea to convert these columns to factors:

spec <-  spec %>%
            mutate(
              gender = factor(gender, levels = c("f", "m")),
              age = factor(age, levels = unique(age), ordered = TRUE)
            ) 

Finally, to apply the specification we created to the original data frame, we use the spec argument of the pivot_longer() function.

who %>% pivot_longer(spec = spec)

#> # A tibble: 405,440 x 8
#>    country     iso2  iso3   year diagnosis gender age   count
#>    <chr>       <chr> <chr> <int> <chr>     <fct>  <ord> <int>
#>  1 Afghanistan AF    AFG    1980 sp        m      014      NA
#>  2 Afghanistan AF    AFG    1980 sp        m      1524     NA
#>  3 Afghanistan AF    AFG    1980 sp        m      2534     NA
#>  4 Afghanistan AF    AFG    1980 sp        m      3544     NA
#>  5 Afghanistan AF    AFG    1980 sp        m      4554     NA
#>  6 Afghanistan AF    AFG    1980 sp        m      5564     NA
#>  7 Afghanistan AF    AFG    1980 sp        m      65       NA
#>  8 Afghanistan AF    AFG    1980 sp        f      014      NA
#>  9 Afghanistan AF    AFG    1980 sp        f      1524     NA
#> 10 Afghanistan AF    AFG    1980 sp        f      2534     NA
#> # … with 405,430 more rows

Everything we have done can be illustrated by the following diagram:

[Image not preserved: diagram of the transformation]
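
Note that the function names in this section are those of the 0.8.3.9000 development version. In the released versions I am aware of (tidyr 1.0.0 and later), the specification is built with build_longer_spec() and applied with pivot_longer_spec(). A sketch of the same workflow under that assumption:

```r
library(tidyr)

# Build the specification, then split the metadata packed into `name`.
spec <- who %>%
  build_longer_spec(new_sp_m014:newrel_f65, values_to = "count") %>%
  extract(name, c("diagnosis", "gender", "age"), "new_?(.*)_(.)(.*)")

# Apply the specification to the original data frame.
who_long <- pivot_longer_spec(who, spec)

dim(who_long)
```

The factor conversion of gender and age is left out here to keep the sketch short; it can be applied to spec in exactly the same way as shown above.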

A specification using multiple values (.value)

In the example above, the .value column of the specification contained only one value, and in most cases that is how it will be.

But occasionally you need to collect data from columns whose values have different data types. With the legacy spread() function this would be quite hard to do.

The example below is taken from the vignette of the data.table package.

Let's create a practice data frame.

family <- tibble::tribble(
  ~family,  ~dob_child1,  ~dob_child2, ~gender_child1, ~gender_child2,
       1L, "1998-11-26", "2000-01-29",             1L,             2L,
       2L, "1996-06-22",           NA,             2L,             NA,
       3L, "2002-07-11", "2004-04-05",             2L,             2L,
       4L, "2004-10-10", "2009-08-27",             1L,             1L,
       5L, "2000-12-05", "2005-02-28",             2L,             1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)

#> # A tibble: 5 x 5
#>   family dob_child1 dob_child2 gender_child1 gender_child2
#>    <int> <date>     <date>             <int>         <int>
#> 1      1 1998-11-26 2000-01-29             1             2
#> 2      2 1996-06-22 NA                     2            NA
#> 3      3 2002-07-11 2004-04-05             2             2
#> 4      4 2004-10-10 2009-08-27             1             1
#> 5      5 2000-12-05 2005-02-28             2             1

Each row of the resulting data frame contains data on the children of one family. A family can have one or two children. For each child we have the date of birth and the gender, and the data for each child sit in separate columns. Our task is to bring this data into the right format for analysis.

Note that we have two variables with information about each child: gender and date of birth (columns with the dob prefix contain the date of birth, columns with the gender prefix contain the child's gender). In the expected result they should appear in separate columns. We can do this by generating a specification in which the .value column takes two different values.

spec <- family %>%
  pivot_longer_spec(-family) %>%
  separate(col = name, into = c(".value", "child"))%>%
  mutate(child = parse_number(child))

#> # A tibble: 4 x 3
#>   .name         .value child
#>   <chr>         <chr>  <dbl>
#> 1 dob_child1    dob        1
#> 2 dob_child2    dob        2
#> 3 gender_child1 gender     1
#> 4 gender_child2 gender     2

So, let's go step by step through what the code above does.

  • pivot_longer_spec(-family) - creates a specification that collapses all existing columns except the family column.
  • separate(col = name, into = c(".value", "child")) - splits the name column, which contains the source column names, at the underscore and puts the resulting values into the .value and child columns.
  • mutate(child = parse_number(child)) - converts the values of the child field from text to a numeric data type.

Now we can apply the resulting specification to the original data frame and bring the table into the desired form.

family %>% 
    pivot_longer(spec = spec, na.rm = TRUE)

#> # A tibble: 9 x 4
#>   family child dob        gender
#>    <int> <dbl> <date>      <int>
#> 1      1     1 1998-11-26      1
#> 2      1     2 2000-01-29      2
#> 3      2     1 1996-06-22      2
#> 4      3     1 2002-07-11      2
#> 5      3     2 2004-04-05      2
#> 6      4     1 2004-10-10      1
#> 7      4     2 2009-08-27      1
#> 8      5     1 2000-12-05      2
#> 9      5     2 2005-02-28      1

We use the argument na.rm = TRUE because the current shape of the data forces the creation of extra rows for observations that do not exist. Because family 2 has only one child, na.rm = TRUE ensures that family 2 gets a single row in the output.
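
In released tidyr versions (1.0.0 and later, as far as I know) the same reshaping can be written without an explicit specification: the special name ".value" in names_to plays the role of the .value column, and values_drop_na = TRUE replaces na.rm = TRUE. A sketch on a shortened copy of the table, with the dates kept as text for brevity:

```r
library(tidyr)

# Shortened, hypothetical copy of the family table (two families only).
family_small <- tibble::tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~gender_child1, ~gender_child2,
       1L, "1998-11-26", "2000-01-29",             1L,             2L,
       2L, "1996-06-22",           NA,             2L,             NA
)

family_long <- pivot_longer(
  family_small,
  cols           = -family,
  names_to       = c(".value", "child"),  # text before "_" becomes a column name
  names_sep      = "_",
  values_drop_na = TRUE                   # drop the missing second child of family 2
)
```

Here the child column still holds the text labels "child1" and "child2"; converting it to a number would take one more mutate() step, as in the specification above.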

Converting a data frame from long to wide format

pivot_wider() is the inverse transformation: it increases the number of columns in a data frame by decreasing the number of rows.

This kind of transformation is rarely used to make data tidy, but it can be useful for creating pivot tables for presentations, or for integration with some other tools.

In essence, pivot_longer() and pivot_wider() are symmetric and perform opposite actions, that is: df %>% pivot_longer(spec = spec) %>% pivot_wider(spec = spec) and df %>% pivot_wider(spec = spec) %>% pivot_longer(spec = spec) both return the original df.
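
A small sketch of this symmetry, written against the released API as I understand it (build_longer_spec(), pivot_longer_spec() and pivot_wider_spec() in tidyr 1.0.0 and later) rather than the dev-version spec argument. The scores table is made up for the example:

```r
library(tidyr)

# Hypothetical wide table with two value columns.
scores <- tibble::tibble(id = 1:2, x_a = c(1, 2), x_b = c(3, 4))

# One specification describes the mapping in both directions.
spec <- build_longer_spec(scores, cols = -id, names_to = "key", values_to = "x")

long <- pivot_longer_spec(scores, spec)  # 4 rows: id, key, x
back <- pivot_wider_spec(long, spec)     # 2 rows again: id, x_a, x_b
```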

The simplest example of converting a table to wide format

To demonstrate how pivot_wider() works, we will use the fish_encounters dataset, which stores information on how different stations recorded the movement of fish along a river.

#> # A tibble: 114 x 3
#>    fish  station  seen
#>    <fct> <fct>   <int>
#>  1 4842  Release     1
#>  2 4842  I80_1       1
#>  3 4842  Lisbon      1
#>  4 4842  Rstr        1
#>  5 4842  Base_TD     1
#>  6 4842  BCE         1
#>  7 4842  BCW         1
#>  8 4842  BCE2        1
#>  9 4842  BCW2        1
#> 10 4842  MAE         1
#> # … with 104 more rows

In some cases this table will be more informative and easier to use if the information for each station is presented in a separate column.

fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1
#>  4 4845        1     1      1     1       1    NA    NA    NA    NA    NA
#>  5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA
#>  6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA
#>  7 4849        1     1     NA    NA      NA    NA    NA    NA    NA    NA
#>  8 4850        1     1     NA     1       1     1     1    NA    NA    NA
#>  9 4851        1     1     NA    NA      NA    NA    NA    NA    NA    NA
#> 10 4854        1     1     NA    NA      NA    NA    NA    NA    NA    NA
#> # … with 9 more rows, and 1 more variable: MAW <int>

This dataset records information only when a fish was detected by a station, i.e. if a fish was not recorded by some station, there is no row for it in the table. This means the output is filled with NA.

However, in this case we know that the absence of a record means the fish was not seen, so we can use the values_fill argument of pivot_wider() to fill these missing values with zeros:

fish_encounters %>% pivot_wider(
  names_from = station, 
  values_from = seen,
  values_fill = list(seen = 0)
)

#> # A tibble: 19 x 12
#>    fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE
#>    <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int>
#>  1 4842        1     1      1     1       1     1     1     1     1     1
#>  2 4843        1     1      1     1       1     1     1     1     1     1
#>  3 4844        1     1      1     1       1     1     1     1     1     1
#>  4 4845        1     1      1     1       1     0     0     0     0     0
#>  5 4847        1     1      1     0       0     0     0     0     0     0
#>  6 4848        1     1      1     1       0     0     0     0     0     0
#>  7 4849        1     1      0     0       0     0     0     0     0     0
#>  8 4850        1     1      0     1       1     1     1     0     0     0
#>  9 4851        1     1      0     0       0     0     0     0     0     0
#> 10 4854        1     1      0     0       0     0     0     0     0     0
#> # … with 9 more rows, and 1 more variable: MAW <int>
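
The effect of values_fill is easier to see on a tiny table. A minimal sketch, with a made-up sightings data frame standing in for fish_encounters:

```r
library(tidyr)

# Hypothetical data: fish "b" was never recorded at station "s2".
sightings <- tibble::tibble(
  fish    = c("a", "a", "b"),
  station = c("s1", "s2", "s1"),
  seen    = c(1L, 1L, 1L)
)

# Without values_fill the missing combination becomes NA.
wide_na <- pivot_wider(sightings, names_from = station, values_from = seen)

# With values_fill the missing combination becomes an explicit zero.
wide_zero <- pivot_wider(
  sightings,
  names_from  = station,
  values_from = seen,
  values_fill = list(seen = 0L)
)
```

Using 0L rather than 0 keeps the filled column an integer, matching the type of seen.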

Generating column names from multiple source variables

Imagine we have a table containing combinations of product, country and year. To create a test data frame, you can run the following code:

df <- expand_grid(
  product = c("A", "B"), 
  country = c("AI", "EI"), 
  year = 2000:2014
) %>%
  filter((product == "A" & country == "AI") | product == "B") %>% 
  mutate(value = rnorm(nrow(.)))

#> # A tibble: 45 x 4
#>    product country  year    value
#>    <chr>   <chr>   <int>    <dbl>
#>  1 A       AI       2000 -2.05   
#>  2 A       AI       2001 -0.676  
#>  3 A       AI       2002  1.60   
#>  4 A       AI       2003 -0.353  
#>  5 A       AI       2004 -0.00530
#>  6 A       AI       2005  0.442  
#>  7 A       AI       2006 -0.610  
#>  8 A       AI       2007 -2.77   
#>  9 A       AI       2008  0.899  
#> 10 A       AI       2009 -0.106  
#> # … with 35 more rows

Our task is to widen the data frame so that there is one column for each combination of product and country. To do this, simply pass to the names_from argument a vector with the names of the columns to be combined.

df %>% pivot_wider(names_from = c(product, country),
                 values_from = "value")

#> # A tibble: 15 x 4
#>     year     A_AI    B_AI    B_EI
#>    <int>    <dbl>   <dbl>   <dbl>
#>  1  2000 -2.05     0.607   1.20  
#>  2  2001 -0.676    1.65   -0.114 
#>  3  2002  1.60    -0.0245  0.501 
#>  4  2003 -0.353    1.30   -0.459 
#>  5  2004 -0.00530  0.921  -0.0589
#>  6  2005  0.442   -1.55    0.594 
#>  7  2006 -0.610    0.380  -1.28  
#>  8  2007 -2.77     0.830   0.637 
#>  9  2008  0.899    0.0175 -1.30  
#> 10  2009 -0.106   -0.195   1.03  
#> # … with 5 more rows
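
The combined names are joined with an underscore by default. As far as I know, pivot_wider() also has a names_sep argument for choosing another separator; a hedged sketch on a made-up sales table:

```r
library(tidyr)

# Hypothetical data: one year, three product/country combinations.
sales <- tibble::tibble(
  product = c("A", "B", "B"),
  country = c("AI", "AI", "EI"),
  year    = c(2000L, 2000L, 2000L),
  value   = c(1, 2, 3)
)

sales_wide <- pivot_wider(
  sales,
  names_from  = c(product, country),
  values_from = value,
  names_sep   = "."   # join product and country with a dot instead of "_"
)

names(sales_wide)
```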

You can also apply specifications to the pivot_wider() function. But when passed to pivot_wider(), a specification performs the opposite conversion to pivot_longer(): the columns listed in .name are created, using the values from .value and the other columns.

For this dataset, you can create a custom specification if you want every possible combination of country and product to have its own column, not just the combinations present in the data:

spec <- df %>% 
  expand(product, country, .value = "value") %>% 
  unite(".name", product, country, remove = FALSE)

#> # A tibble: 4 x 4
#>   .name product country .value
#>   <chr> <chr>   <chr>   <chr> 
#> 1 A_AI  A       AI      value 
#> 2 A_EI  A       EI      value 
#> 3 B_AI  B       AI      value 
#> 4 B_EI  B       EI      value

df %>% pivot_wider(spec = spec) %>% head()

#> # A tibble: 6 x 5
#>    year     A_AI  A_EI    B_AI    B_EI
#>   <int>    <dbl> <dbl>   <dbl>   <dbl>
#> 1  2000 -2.05       NA  0.607   1.20  
#> 2  2001 -0.676      NA  1.65   -0.114 
#> 3  2002  1.60       NA -0.0245  0.501 
#> 4  2003 -0.353      NA  1.30   -0.459 
#> 5  2004 -0.00530    NA  0.921  -0.0589
#> 6  2005  0.442      NA -1.55    0.594
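
In released tidyr versions (1.0.0 and later, to my knowledge) a hand-built specification is applied with pivot_wider_spec() instead of the spec argument. A sketch under that assumption, on a reduced made-up table:

```r
library(tidyr)

# Hypothetical reduced data: product A is only sold in country AI.
sales <- tibble::tibble(
  product = c("A", "B", "B"),
  country = c("AI", "AI", "EI"),
  year    = c(2000L, 2000L, 2000L),
  value   = c(1, 2, 3)
)

# All product/country combinations, including the unobserved A_EI.
spec <- sales %>%
  expand(product, country, .value = "value") %>%
  unite(".name", product, country, remove = FALSE)

sales_wide <- pivot_wider_spec(sales, spec)

names(sales_wide)
```

The A_EI column exists in the result even though no such row is in the data; it is simply filled with NA.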

Some advanced examples of working with the new tidyr concept

Tidying data, using the US Census income and rent dataset as an example.

The us_rent_income dataset contains median income and rent information for each US state for 2017 (the dataset is available in the tidycensus package).

us_rent_income
#> # A tibble: 104 x 5
#>    GEOID NAME       variable estimate   moe
#>    <chr> <chr>      <chr>       <dbl> <dbl>
#>  1 01    Alabama    income      24476   136
#>  2 01    Alabama    rent          747     3
#>  3 02    Alaska     income      32940   508
#>  4 02    Alaska     rent         1200    13
#>  5 04    Arizona    income      27517   148
#>  6 04    Arizona    rent          972     4
#>  7 05    Arkansas   income      23789   165
#>  8 05    Arkansas   rent          709     5
#>  9 06    California income      29454   109
#> 10 06    California rent         1358     3
#> # … with 94 more rows

In the form in which the data is stored in us_rent_income, it is very inconvenient to work with, so we want to create a dataset with the columns rent, rent_moe, income, income_moe. There are many ways to create this specification, but the key point is that we need to generate every combination of the values of variable and estimate/moe, and then generate the column names.

  spec <- us_rent_income %>% 
    expand(variable, .value = c("estimate", "moe")) %>% 
    mutate(
      .name = paste0(variable, ifelse(.value == "moe", "_moe", ""))
    )

#> # A tibble: 4 x 3
#>   variable .value   .name     
#>   <chr>    <chr>    <chr>     
#> 1 income   estimate income    
#> 2 income   moe      income_moe
#> 3 rent     estimate rent      
#> 4 rent     moe      rent_moe

Supplying this specification to pivot_wider() gives the result we are looking for:

us_rent_income %>% pivot_wider(spec = spec)

#> # A tibble: 52 x 6
#>    GEOID NAME                 income income_moe  rent rent_moe
#>    <chr> <chr>                 <dbl>      <dbl> <dbl>    <dbl>
#>  1 01    Alabama               24476        136   747        3
#>  2 02    Alaska                32940        508  1200       13
#>  3 04    Arizona               27517        148   972        4
#>  4 05    Arkansas              23789        165   709        5
#>  5 06    California            29454        109  1358        3
#>  6 08    Colorado              32401        109  1125        5
#>  7 09    Connecticut           35326        195  1123        5
#>  8 10    Delaware              31560        247  1076       10
#>  9 11    District of Columbia  43198        681  1424       17
#> 10 12    Florida               25952         70  1077        3
#> # … with 42 more rows
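
If the exact column names do not matter, a specification is not strictly needed here. A hedged sketch, assuming a released tidyr (1.0.0 or later) that ships us_rent_income and accepts several columns in values_from:

```r
library(tidyr)

rent_wide <- pivot_wider(
  us_rent_income,
  names_from  = variable,
  values_from = c(estimate, moe)  # two value columns are spread at once
)

names(rent_wide)
```

The generated names then take the form estimate_income, estimate_rent, moe_income and moe_rent; the hand-written specification above is what makes the shorter names income, income_moe, rent and rent_moe possible.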

The World Bank

Sometimes bringing a dataset into the desired form takes several steps.
The world_bank_pop dataset contains World Bank data on the population of each country between 2000 and 2017.

#> # A tibble: 1,056 x 20
#>    country indicator `2000` `2001` `2002` `2003`  `2004`  `2005`   `2006`
#>    <chr>   <chr>      <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
#>  1 ABW     SP.URB.T… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4
#>  2 ABW     SP.URB.G… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
#>  3 ABW     SP.POP.T… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5
#>  4 ABW     SP.POP.G… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1
#>  5 AFG     SP.URB.T… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6
#>  6 AFG     SP.URB.G… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0  4.12e+0
#>  7 AFG     SP.POP.T… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7  2.59e+7
#>  8 AFG     SP.POP.G… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0  3.23e+0
#>  9 AGO     SP.URB.T… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7  1.15e+7
#> 10 AGO     SP.URB.G… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0  4.92e+0
#> # … with 1,046 more rows, and 11 more variables: `2007` <dbl>,
#> #   `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
#> #   `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>

Our goal is to create a tidy dataset with each variable in its own column. It is not clear exactly which steps are needed, but we will start with the most obvious problem: the year is spread across multiple columns.

To fix this, use the pivot_longer() function.

pop2 <- world_bank_pop %>% 
  pivot_longer(`2000`:`2017`, names_to = "year")

#> # A tibble: 19,008 x 4
#>    country indicator   year  value
#>    <chr>   <chr>       <chr> <dbl>
#>  1 ABW     SP.URB.TOTL 2000  42444
#>  2 ABW     SP.URB.TOTL 2001  43048
#>  3 ABW     SP.URB.TOTL 2002  43670
#>  4 ABW     SP.URB.TOTL 2003  44246
#>  5 ABW     SP.URB.TOTL 2004  44669
#>  6 ABW     SP.URB.TOTL 2005  44889
#>  7 ABW     SP.URB.TOTL 2006  44881
#>  8 ABW     SP.URB.TOTL 2007  44686
#>  9 ABW     SP.URB.TOTL 2008  44375
#> 10 ABW     SP.URB.TOTL 2009  44052
#> # … with 18,998 more rows

The next step is to look at the indicator variable.

pop2 %>% count(indicator)

#> # A tibble: 4 x 2
#>   indicator       n
#>   <chr>       <int>
#> 1 SP.POP.GROW  4752
#> 2 SP.POP.TOTL  4752
#> 3 SP.URB.GROW  4752
#> 4 SP.URB.TOTL  4752

Here SP.POP.GROW is population growth, SP.POP.TOTL is total population, and SP.URB.* are the same, but only for urban areas. Let's split these values into two variables: area (total or urban) and a variable containing the actual data (population or growth):

pop3 <- pop2 %>% 
  separate(indicator, c(NA, "area", "variable"))

#> # A tibble: 19,008 x 5
#>    country area  variable year  value
#>    <chr>   <chr> <chr>    <chr> <dbl>
#>  1 ABW     URB   TOTL     2000  42444
#>  2 ABW     URB   TOTL     2001  43048
#>  3 ABW     URB   TOTL     2002  43670
#>  4 ABW     URB   TOTL     2003  44246
#>  5 ABW     URB   TOTL     2004  44669
#>  6 ABW     URB   TOTL     2005  44889
#>  7 ABW     URB   TOTL     2006  44881
#>  8 ABW     URB   TOTL     2007  44686
#>  9 ABW     URB   TOTL     2008  44375
#> 10 ABW     URB   TOTL     2009  44052
#> # … with 18,998 more rows

Now all that remains is to spread variable into two columns:

pop3 %>% 
  pivot_wider(names_from = variable, values_from = value)

#> # A tibble: 9,504 x 5
#>    country area  year   TOTL    GROW
#>    <chr>   <chr> <chr> <dbl>   <dbl>
#>  1 ABW     URB   2000  42444  1.18  
#>  2 ABW     URB   2001  43048  1.41  
#>  3 ABW     URB   2002  43670  1.43  
#>  4 ABW     URB   2003  44246  1.31  
#>  5 ABW     URB   2004  44669  0.951 
#>  6 ABW     URB   2005  44889  0.491 
#>  7 ABW     URB   2006  44881 -0.0178
#>  8 ABW     URB   2007  44686 -0.435 
#>  9 ABW     URB   2008  44375 -0.698 
#> 10 ABW     URB   2009  44052 -0.731 
#> # … with 9,494 more rows

Contact list

A final example: imagine you have a contact list copied and pasted from a website:

contacts <- tribble(
  ~field, ~value,
  "name", "Jiena McLellan",
  "company", "Toyota", 
  "name", "John Smith", 
  "company", "google", 
  "email", "[email protected]",
  "name", "Huxley Ratcliffe"
)

Tabulating this list is quite hard because there is no variable that identifies which data belongs to which contact. We can fix this by noting that the data for each new contact starts with "name", so we can create a unique identifier and increase it by one every time the field column contains the value "name":

contacts <- contacts %>% 
  mutate(
    person_id = cumsum(field == "name")
  )
contacts

#> # A tibble: 6 x 3
#>   field   value            person_id
#>   <chr>   <chr>                <int>
#> 1 name    Jiena McLellan           1
#> 2 company Toyota                   1
#> 3 name    John Smith               2
#> 4 company google                   2
#> 5 email   [email protected]          2
#> 6 name    Huxley Ratcliffe         3

Now that we have a unique ID for each contact, we can turn field and value into columns:

contacts %>% 
  pivot_wider(names_from = field, values_from = value)

#> # A tibble: 3 x 4
#>   person_id name             company email          
#>       <int> <chr>            <chr>   <chr>          
#> 1         1 Jiena McLellan   Toyota  <NA>           
#> 2         2 John Smith       google  [email protected]
#> 3         3 Huxley Ratcliffe <NA>    <NA>

Conclusion

My personal opinion is that the new tidyr concept is genuinely more intuitive and significantly superior in functionality to the legacy spread() and gather() functions. I hope this article has helped you get to grips with pivot_longer() and pivot_wider().

Source: www.habr.com
