The tidyr package is part of the core of tidyverse, one of the most popular libraries in the R language.
The main purpose of the package is to bring data into a tidy form.
The package has already been covered on Habr.
SJK: Will gather() and spread() be deprecated?
Hadley Wickham: To some extent. We will stop recommending these functions and will no longer fix bugs in them, but they will remain in the package in their current state.
Contents
- The TidyData concept
- Main functions included in the tidyr package
- A new concept for converting data from wide to long format and back
- Installing the latest version, tidyr 0.8.3.9000
- Migrating to the new functions
- A simple example of converting data from wide to long format
- Specifications
- Specification using multiple values (.value)
- Converting data frames from long to wide format
- Several advanced examples of working with the new tidyr concept
- Conclusion
The TidyData concept
The goal of tidyr is to help you bring data into a so-called tidy form. Tidy data is data in which:
- each variable is in a column;
- each observation is in a row;
- each value is in a cell.
During analysis, data presented in tidy form is much easier and more convenient to work with.
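As a small illustration of these three rules, here is a sketch with made-up data (the table, its column names and its numbers are hypothetical and are not taken from the article). The first table hides the year variable in its column names; the second is the tidy equivalent.

```r
library(tibble)

# Untidy: the variable "year" is hidden in the column names.
untidy <- tribble(
  ~country, ~`2019`, ~`2020`,
  "A",      10,      12,
  "B",      20,      25
)

# Tidy: one column per variable (country, year, cases),
# one row per observation, one value per cell.
tidy <- tribble(
  ~country, ~year, ~cases,
  "A",      2019,  10,
  "A",      2020,  12,
  "B",      2019,  20,
  "B",      2020,  25
)

# With tidy data, a per-country total is a single aggregate() call.
totals <- aggregate(cases ~ country, data = tidy, FUN = sum)
totals
```

The same total on the untidy table would require naming every year column by hand, which is the kind of friction the tidy form removes.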
Main functions included in the tidyr package
tidyr contains a set of functions designed to transform tables:
- fill(): fills missing values in a column with the previous values;
- separate(): splits one field into several using a separator;
- unite(): combines several fields into one; the inverse of separate();
- pivot_longer(): converts data from wide to long format;
- pivot_wider(): converts data from long to wide format; the inverse of pivot_longer();
- gather() (deprecated): converts data from wide to long format;
- spread() (deprecated): converts data from long to wide format; the inverse of gather().
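The first three functions in this list are not demonstrated later in the article, so here is a minimal sketch of how they fit together. The sales table, its column names and its values are hypothetical, invented only for this illustration.

```r
library(tidyr)
library(tibble)

# Hypothetical data: the region is written only once per group.
sales <- tribble(
  ~region, ~period,   ~amount,
  "North", "2019-Q1", 100,
  NA,      "2019-Q2", 120,
  "South", "2019-Q1", 80,
  NA,      "2019-Q2", 95
)

# fill(): carry the last non-missing region downwards.
filled <- fill(sales, region)

# separate(): split "period" on the dash into two fields.
parts <- separate(filled, period, into = c("year", "quarter"), sep = "-")

# unite(): the inverse step, gluing the two fields back together.
joined <- unite(parts, "period", year, quarter, sep = "-")
```

After the last step the period column holds the same strings it started with, which shows that unite() undoes separate().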
A new concept for converting data from wide to long format and back
Previously, the gather() and spread() functions were used for this kind of transformation. Over the years it became clear that for most users, including the author of the package, the names of these functions and of their arguments were not obvious. That made the functions hard to find, and made it hard to remember which one converts a data frame from wide to long format and which one does the reverse.
For this reason, two new and important functions for transforming data frames were added to tidyr.
The new functions, pivot_longer() and pivot_wider(), were inspired by some of the features of the cdata package, created by John Mount and Nina Zumel.
Installing the latest version, tidyr 0.8.3.9000
To install the newest version of the package, tidyr 0.8.3.9000, in which the new functions are available, use the following code.
devtools::install_github("tidyverse/tidyr")
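At the time of writing, these functions are available only in the development version of the package on GitHub.
Migrating to the new functions
In practice, migrating old scripts to the new functions is not difficult. To make this clearer, I will take the examples from the documentation of the old functions and show how the same operations are performed with the new pivot_*() functions.
Converting wide format to long.
Code example from the documentation of the gather function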
# example
library(tidyr)  # gather(), spread(), pivot_longer(), pivot_wider()
library(dplyr)  # the %>% pipe
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
# old
stocks_gather <- stocks %>% gather(key = stock,
value = price,
-time)
# new
stocks_long <- stocks %>% pivot_longer(cols = -time,
names_to = "stock",
values_to = "price")
Converting long format to wide.
Code example from the documentation of the spread function
# old
stocks_spread <- stocks_gather %>% spread(key = stock,
value = price)
# new
stock_wide <- stocks_long %>% pivot_wider(names_from = "stock",
values_from = "price")
In the pivot_longer() and pivot_wider() examples above, the original stocks table has no columns with the names listed in the names_to and values_to arguments, so those names must be given in quotes.
A table that makes it easiest to see how to switch to working with the new tidyr concept.
Note from the author
All of the text below is an adaptation, I would even say a free translation, of the vignette from the official website of the tidyverse library.
A simple example of converting data from wide to long format
pivot_longer() makes a data set longer by reducing the number of columns and increasing the number of rows.
To run the examples presented in this article, you first need to load the required packages:
library(tidyr)
library(dplyr)
library(readr)
Suppose we have a table with the results of a survey that (among other things) asked people about their religion and annual income:
#> # A tibble: 18 x 11
#> religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137
#> 2 Atheist 12 27 37 52 35 70
#> 3 Buddhist 27 21 30 34 33 58
#> 4 Catholic 418 617 732 670 638 1116
#> 5 Don’t k… 15 14 15 11 10 35
#> 6 Evangel… 575 869 1064 982 881 1486
#> 7 Hindu 1 9 7 9 11 34
#> 8 Histori… 228 244 236 238 197 223
#> 9 Jehovah… 20 27 24 24 21 30
#> 10 Jewish 19 19 25 25 30 95
#> # … with 8 more rows, and 4 more variables: `$75-100k` <dbl>,
#> # `$100-150k` <dbl>, `>150k` <dbl>, `Don't know/refused` <dbl>
In this table the respondents' religion is stored in the rows, while the income levels are spread across the column names. The number of respondents in each category is stored in the cell at the intersection of religion and income level. To bring the table into a tidy form, it is enough to use pivot_longer():
pew %>%
pivot_longer(cols = -religion, names_to = "income", values_to = "count")
#> # A tibble: 180 x 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # … with 170 more rows
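Arguments of the pivot_longer() function
- The first argument, cols, describes which columns need to be merged. In this case, all columns except religion.
- The names_to argument gives the name of the variable that will be created from the names of the merged columns.
- The values_to argument gives the name of the variable that will be created from the data stored in the cells of the merged columns.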
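The article does not show how the pew table was loaded. As a sketch, assuming the same survey data that released versions of tidyr bundle as the relig_income dataset (that equivalence is my assumption, not something stated above), the three arguments can be tried like this:

```r
library(tidyr)

# relig_income ships with released tidyr; it is assumed here to match `pew`.
pew_long <- pivot_longer(
  relig_income,
  cols      = -religion,  # merge every column except religion
  names_to  = "income",   # old column names go into this new variable
  values_to = "count"     # old cell values go into this new variable
)

dim(pew_long)
```

Each of the 18 religions is repeated once per income column, which is where the 180 rows in the output above come from.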
Specifications
This is new functionality in the tidyr package that was not available with the legacy functions.
A specification is a data frame in which each row corresponds to one column of the new output data frame, and which has two special columns whose names begin with a dot:
- .name contains the original column name.
- .value contains the name of the column that will hold the cell values.
The remaining columns of the specification describe how the names of the collapsed columns from .name will be represented in the new columns.
The specification describes the metadata stored in the column names, with one row per column and one column per variable, combined with the column name. This definition may seem confusing at the moment, but it becomes much clearer after a few examples.
The point of a specification is that you can retrieve, modify and define new metadata for the data frame being transformed.
To work with specifications when converting a table from wide to long format, use the pivot_longer_spec() function.
This function takes any data frame and generates its metadata in the way described above.
As an example, let's take the who dataset that ships with the tidyr package. It contains information on the incidence of tuberculosis provided by the World Health Organization.
who
#> # A tibble: 7,240 x 60
#> country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
#> <chr> <chr> <chr> <int> <int> <int> <int>
#> 1 Afghan… AF AFG 1980 NA NA NA
#> 2 Afghan… AF AFG 1981 NA NA NA
#> 3 Afghan… AF AFG 1982 NA NA NA
#> 4 Afghan… AF AFG 1983 NA NA NA
#> 5 Afghan… AF AFG 1984 NA NA NA
#> 6 Afghan… AF AFG 1985 NA NA NA
#> 7 Afghan… AF AFG 1986 NA NA NA
#> 8 Afghan… AF AFG 1987 NA NA NA
#> 9 Afghan… AF AFG 1988 NA NA NA
#> 10 Afghan… AF AFG 1989 NA NA NA
#> # … with 7,230 more rows, and 53 more variables
Let's build its specification.
spec <- who %>%
pivot_longer_spec(new_sp_m014:newrel_f65, values_to = "count")
#> # A tibble: 56 x 3
#> .name .value name
#> <chr> <chr> <chr>
#> 1 new_sp_m014 count new_sp_m014
#> 2 new_sp_m1524 count new_sp_m1524
#> 3 new_sp_m2534 count new_sp_m2534
#> 4 new_sp_m3544 count new_sp_m3544
#> 5 new_sp_m4554 count new_sp_m4554
#> 6 new_sp_m5564 count new_sp_m5564
#> 7 new_sp_m65 count new_sp_m65
#> 8 new_sp_f014 count new_sp_f014
#> 9 new_sp_f1524 count new_sp_f1524
#> 10 new_sp_f2534 count new_sp_f2534
#> # … with 46 more rows
The country, iso2 and iso3 fields are already variables. Our task is to pivot the columns from new_sp_m014 through newrel_f65.
The names of these columns store the following information:
- The prefix new_ indicates that the column contains data on new tuberculosis cases. The current data frame contains only new cases, so in this context the prefix carries no meaning.
- sp/rel/sn/ep describes how the disease was diagnosed.
- m/f is the patient's gender.
- 014/1524/2534/3544/4554/5564/65 is the patient's age range.
We can split these columns using the extract() function with a regular expression.
spec <- spec %>%
extract(name, c("diagnosis", "gender", "age"), "new_?(.*)_(.)(.*)")
#> # A tibble: 56 x 5
#> .name .value diagnosis gender age
#> <chr> <chr> <chr> <chr> <chr>
#> 1 new_sp_m014 count sp m 014
#> 2 new_sp_m1524 count sp m 1524
#> 3 new_sp_m2534 count sp m 2534
#> 4 new_sp_m3544 count sp m 3544
#> 5 new_sp_m4554 count sp m 4554
#> 6 new_sp_m5564 count sp m 5564
#> 7 new_sp_m65 count sp m 65
#> 8 new_sp_f014 count sp f 014
#> 9 new_sp_f1524 count sp f 1524
#> 10 new_sp_f2534 count sp f 2534
#> # … with 46 more rows
Note the .name column: it must remain unchanged, because it is the index into the column names of the original dataset.
Gender and age (the gender and age columns) have fixed, known values, so it is recommended to convert these columns to factors:
spec <- spec %>%
mutate(
gender = factor(gender, levels = c("f", "m")),
age = factor(age, levels = unique(age), ordered = TRUE)
)
Finally, to apply the specification we created to the original who data frame, we pass it to the spec argument of the pivot_longer() function.
who %>% pivot_longer(spec = spec)
#> # A tibble: 405,440 x 8
#> country iso2 iso3 year diagnosis gender age count
#> <chr> <chr> <chr> <int> <chr> <fct> <ord> <int>
#> 1 Afghanistan AF AFG 1980 sp m 014 NA
#> 2 Afghanistan AF AFG 1980 sp m 1524 NA
#> 3 Afghanistan AF AFG 1980 sp m 2534 NA
#> 4 Afghanistan AF AFG 1980 sp m 3544 NA
#> 5 Afghanistan AF AFG 1980 sp m 4554 NA
#> 6 Afghanistan AF AFG 1980 sp m 5564 NA
#> 7 Afghanistan AF AFG 1980 sp m 65 NA
#> 8 Afghanistan AF AFG 1980 sp f 014 NA
#> 9 Afghanistan AF AFG 1980 sp f 1524 NA
#> 10 Afghanistan AF AFG 1980 sp f 2534 NA
#> # … with 405,430 more rows
Everything we have just done can be represented schematically as follows:
Specification using multiple values (.value)
In the example above the .value column of the specification contained only one value, and in most cases that is how it will be.
Occasionally, however, you may need to collect data from columns whose values have different data types. With the legacy gather() function this would be quite difficult.
The following example is a borrowed one.
Let's create a training data frame.
family <- tibble::tribble(
~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
1L, "1998-11-26", "2000-01-29", 1L, 2L,
2L, "1996-06-22", NA, 2L, NA,
3L, "2002-07-11", "2004-04-05", 2L, 2L,
4L, "2004-10-10", "2009-08-27", 1L, 1L,
5L, "2000-12-05", "2005-02-28", 2L, 1L,
)
family <- family %>% mutate_at(vars(starts_with("dob")), parse_date)
#> # A tibble: 5 x 5
#> family dob_child1 dob_child2 gender_child1 gender_child2
#> <int> <date> <date> <int> <int>
#> 1 1 1998-11-26 2000-01-29 1 2
#> 2 2 1996-06-22 NA 2 NA
#> 3 3 2002-07-11 2004-04-05 2 2
#> 4 4 2004-10-10 2009-08-27 1 1
#> 5 5 2000-12-05 2005-02-28 2 1
Each row of the resulting data frame contains data about the children of one family. A family may have one or two children. For each child we have the date of birth and the gender, and the data for each child sits in separate columns. Our task is to bring this data into the right form for analysis.
Note that we have two variables with information about each child: the date of birth and the gender (columns with the dob prefix contain the date of birth, columns with the gender prefix contain the child's gender). In the expected result they should end up in separate columns. To achieve that, we generate a specification in which the .value column takes two different values.
spec <- family %>%
pivot_longer_spec(-family) %>%
separate(col = name, into = c(".value", "child"))%>%
mutate(child = parse_number(child))
#> # A tibble: 4 x 3
#> .name .value child
#> <chr> <chr> <dbl>
#> 1 dob_child1 dob 1
#> 2 dob_child2 dob 2
#> 3 gender_child1 gender 1
#> 4 gender_child2 gender 2
Now let's go step by step through what the code above does.
- pivot_longer_spec(-family): creates a specification that collapses all existing columns except the family column.
- separate(col = name, into = c(".value", "child")): splits the name column, which holds the names of the source fields, on the underscore and puts the resulting values into the .value and child columns.
- mutate(child = parse_number(child)): converts the values of the child field from the text data type to a numeric one.
We can now apply the resulting specification to the original data frame and bring the table into the desired form.
family %>%
pivot_longer(spec = spec, na.rm = T)
#> # A tibble: 9 x 4
#> family child dob gender
#> <int> <dbl> <date> <int>
#> 1 1 1 1998-11-26 1
#> 2 1 2 2000-01-29 2
#> 3 2 1 1996-06-22 2
#> 4 3 1 2002-07-11 2
#> 5 3 2 2004-04-05 2
#> 6 4 1 2004-10-10 1
#> 7 4 2 2009-08-27 1
#> 8 5 1 2000-12-05 2
#> 9 5 2 2005-02-28 1
We use the na.rm = TRUE argument because the current form of the data forces the creation of extra rows for observations that do not exist. Family 2 has only one child, so na.rm = TRUE guarantees that family 2 gets exactly one row in the output.
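The calls above target the development version described in this article. In the tidyr releases I know of (1.0.0 and later), the specification is built with build_longer_spec(), applied with pivot_longer_spec(), and na.rm is spelled values_drop_na. A minimal sketch under that assumption, using a cut-down two-family copy of the table and the names_sep argument in place of the separate() step:

```r
library(tidyr)
library(tibble)

# Cut-down copy of the family table: family 2 has only one child.
family <- tribble(
  ~family, ~dob_child1,  ~dob_child2,  ~gender_child1, ~gender_child2,
  1L,      "1998-11-26", "2000-01-29", 1L,             2L,
  2L,      "1996-06-22", NA,           2L,             NA
)
family$dob_child1 <- as.Date(family$dob_child1)
family$dob_child2 <- as.Date(family$dob_child2)

# Build the specification; ".value" tells tidyr that the first part of
# each column name ("dob" or "gender") names an output column.
spec <- build_longer_spec(
  family,
  cols      = -family,
  names_to  = c(".value", "child"),
  names_sep = "_"
)

# Apply it and drop the rows in which every value column is missing.
children <- pivot_longer_spec(family, spec, values_drop_na = TRUE)
children
```

Here the child column keeps the text labels "child1" and "child2"; converting them to numbers is left out to keep the sketch short.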
Converting data frames from long to wide format
pivot_wider() is the inverse transformation: it increases the number of columns in a data frame by reducing the number of rows.
This kind of transformation is rarely used to bring data into a tidy form. The technique is useful, however, for building pivot tables for presentations or for integration with other tools.
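To see what "inverse transformation" means in practice, here is a minimal round-trip sketch. The wide table and the key and val names are hypothetical, and the calls use column names rather than a specification.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table with two measurement columns.
wide <- tribble(
  ~id, ~x, ~y,
  1,   10, 20,
  2,   30, 40
)

# Wide to long: one row per (id, key) pair.
long <- pivot_longer(wide, cols = -id, names_to = "key", values_to = "val")

# Long back to wide: the key values become column names again.
back <- pivot_wider(long, names_from = key, values_from = val)

all.equal(as.data.frame(wide), as.data.frame(back))
```

The comparison at the end checks that the round trip gives back the same columns and values as the starting table.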
In fact, the pivot_longer() and pivot_wider() functions are symmetric and perform mutually inverse actions. That is, both df %>% pivot_longer(spec = spec) %>% pivot_wider(spec = spec) and df %>% pivot_wider(spec = spec) %>% pivot_longer(spec = spec) return the original df.
The simplest example of converting a table to wide format
To demonstrate how the pivot_wider() function works, we will use the fish_encounters dataset, which stores information about how various stations record the movement of fish along a river.
#> # A tibble: 114 x 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> 7 4842 BCW 1
#> 8 4842 BCE2 1
#> 9 4842 BCW2 1
#> 10 4842 MAE 1
#> # … with 104 more rows
In most cases this table will be more informative and easier to use if the information for each station is presented in a separate column.
fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA
#> 5 4847 1 1 1 NA NA NA NA NA NA NA
#> 6 4848 1 1 1 1 NA NA NA NA NA NA
#> 7 4849 1 1 NA NA NA NA NA NA NA NA
#> 8 4850 1 1 NA 1 1 1 1 NA NA NA
#> 9 4851 1 1 NA NA NA NA NA NA NA NA
#> 10 4854 1 1 NA NA NA NA NA NA NA NA
#> # … with 9 more rows, and 1 more variable: MAW <int>
This dataset records information only when a fish was detected by a station. If a fish was not recorded at some station, that observation is simply absent from the table, which means the output is filled with NA.
In this case, however, we know that the absence of a record means the fish was not seen, so we can use the values_fill argument of the pivot_wider() function to fill these missing values with zeros.
fish_encounters %>% pivot_wider(
names_from = station,
values_from = seen,
values_fill = list(seen = 0)
)
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 0 0 0 0 0
#> 5 4847 1 1 1 0 0 0 0 0 0 0
#> 6 4848 1 1 1 1 0 0 0 0 0 0
#> 7 4849 1 1 0 0 0 0 0 0 0 0
#> 8 4850 1 1 0 1 1 1 1 0 0 0
#> 9 4851 1 1 0 0 0 0 0 0 0 0
#> 10 4854 1 1 0 0 0 0 0 0 0 0
#> # … with 9 more rows, and 1 more variable: MAW <int>
Generating a column name from multiple source variables
Imagine we have a table containing a combination of product, country and year. To generate a test data frame, you can run the following code:
df <- expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
filter((product == "A" & country == "AI") | product == "B") %>%
mutate(value = rnorm(nrow(.)))
#> # A tibble: 45 x 4
#> product country year value
#> <chr> <chr> <int> <dbl>
#> 1 A AI 2000 -2.05
#> 2 A AI 2001 -0.676
#> 3 A AI 2002 1.60
#> 4 A AI 2003 -0.353
#> 5 A AI 2004 -0.00530
#> 6 A AI 2005 0.442
#> 7 A AI 2006 -0.610
#> 8 A AI 2007 -2.77
#> 9 A AI 2008 0.899
#> 10 A AI 2009 -0.106
#> # … with 35 more rows
Our task is to widen the data frame so that there is one column of data for each combination of product and country. To do this, simply pass the names_from argument a vector containing the names of the fields to be combined.
df %>% pivot_wider(names_from = c(product, country),
values_from = "value")
#> # A tibble: 15 x 4
#> year A_AI B_AI B_EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 -2.05 0.607 1.20
#> 2 2001 -0.676 1.65 -0.114
#> 3 2002 1.60 -0.0245 0.501
#> 4 2003 -0.353 1.30 -0.459
#> 5 2004 -0.00530 0.921 -0.0589
#> 6 2005 0.442 -1.55 0.594
#> 7 2006 -0.610 0.380 -1.28
#> 8 2007 -2.77 0.830 0.637
#> 9 2008 0.899 0.0175 -1.30
#> 10 2009 -0.106 -0.195 1.03
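#> # … with 5 more rows
You can also apply specifications to pivot_wider(). When passed to pivot_wider(), however, a specification performs the opposite transformation to pivot_longer(): the columns named in .name are created using the values from .value and the other columns.
For this dataset you can generate a custom specification if you want every possible combination of country and product to have its own column, not only the combinations present in the data: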
spec <- df %>%
expand(product, country, .value = "value") %>%
unite(".name", product, country, remove = FALSE)
#> # A tibble: 4 x 4
#> .name product country .value
#> <chr> <chr> <chr> <chr>
#> 1 A_AI A AI value
#> 2 A_EI A EI value
#> 3 B_AI B AI value
#> 4 B_EI B EI value
df %>% pivot_wider(spec = spec) %>% head()
#> # A tibble: 6 x 5
#> year A_AI A_EI B_AI B_EI
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 -2.05 NA 0.607 1.20
#> 2 2001 -0.676 NA 1.65 -0.114
#> 3 2002 1.60 NA -0.0245 0.501
#> 4 2003 -0.353 NA 1.30 -0.459
#> 5 2004 -0.00530 NA 0.921 -0.0589
#> 6 2005 0.442 NA -1.55 0.594
Several advanced examples of working with the new tidyr concept
Tidying up data, using the US Census income and rent dataset as an example.
The us_rent_income dataset contains median income and rent information for each US state in 2017 (the dataset is available in the tidycensus package).
us_rent_income
#> # A tibble: 104 x 5
#> GEOID NAME variable estimate moe
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 01 Alabama income 24476 136
#> 2 01 Alabama rent 747 3
#> 3 02 Alaska income 32940 508
#> 4 02 Alaska rent 1200 13
#> 5 04 Arizona income 27517 148
#> 6 04 Arizona rent 972 4
#> 7 05 Arkansas income 23789 165
#> 8 05 Arkansas rent 709 5
#> 9 06 California income 29454 109
#> 10 06 California rent 1358 3
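#> # … with 94 more rows
In the form in which the data is stored in us_rent_income, working with it is extremely inconvenient, so we would like to create a dataset with the columns rent, rent_moe, income and income_moe. There are many ways to create this specification, but the main point is that we need to generate every combination of the values of variable with estimate/moe, and then generate the column name.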
spec <- us_rent_income %>%
expand(variable, .value = c("estimate", "moe")) %>%
mutate(
.name = paste0(variable, ifelse(.value == "moe", "_moe", ""))
)
#> # A tibble: 4 x 3
#> variable .value .name
#> <chr> <chr> <chr>
#> 1 income estimate income
#> 2 income moe income_moe
#> 3 rent estimate rent
#> 4 rent moe rent_moe
Supplying this specification to pivot_wider() gives us the result we are looking for.
us_rent_income %>% pivot_wider(spec = spec)
#> # A tibble: 52 x 6
#> GEOID NAME income income_moe rent rent_moe
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 136 747 3
#> 2 02 Alaska 32940 508 1200 13
#> 3 04 Arizona 27517 148 972 4
#> 4 05 Arkansas 23789 165 709 5
#> 5 06 California 29454 109 1358 3
#> 6 08 Colorado 32401 109 1125 5
#> 7 09 Connecticut 35326 195 1123 5
#> 8 10 Delaware 31560 247 1076 10
#> 9 11 District of Columbia 43198 681 1424 17
#> 10 12 Florida 25952 70 1077 3
#> # … with 42 more rows
World Bank
Sometimes bringing a dataset into the desired form requires several steps.
The world_bank_pop dataset contains World Bank data on the population of each country from 2000 to 2017.
#> # A tibble: 1,056 x 20
#> country indicator `2000` `2001` `2002` `2003` `2004` `2005` `2006`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ABW SP.URB.T… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4
#> 2 ABW SP.URB.G… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
#> 3 ABW SP.POP.T… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5
#> 4 ABW SP.POP.G… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1
#> 5 AFG SP.URB.T… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6
#> 6 AFG SP.URB.G… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0
#> 7 AFG SP.POP.T… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7
#> 8 AFG SP.POP.G… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0
#> 9 AGO SP.URB.T… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7
#> 10 AGO SP.URB.G… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0
#> # … with 1,046 more rows, and 11 more variables: `2007` <dbl>,
#> # `2008` <dbl>, `2009` <dbl>, `2010` <dbl>, `2011` <dbl>, `2012` <dbl>,
#> # `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>, `2017` <dbl>
Our goal is to create a tidy dataset with each variable in its own column. It is not yet clear exactly which steps are needed, so we start with the most obvious problem: the year is spread across several columns.
To fix this, we need to use the pivot_longer() function.
pop2 <- world_bank_pop %>%
pivot_longer(`2000`:`2017`, names_to = "year")
#> # A tibble: 19,008 x 4
#> country indicator year value
#> <chr> <chr> <chr> <dbl>
#> 1 ABW SP.URB.TOTL 2000 42444
#> 2 ABW SP.URB.TOTL 2001 43048
#> 3 ABW SP.URB.TOTL 2002 43670
#> 4 ABW SP.URB.TOTL 2003 44246
#> 5 ABW SP.URB.TOTL 2004 44669
#> 6 ABW SP.URB.TOTL 2005 44889
#> 7 ABW SP.URB.TOTL 2006 44881
#> 8 ABW SP.URB.TOTL 2007 44686
#> 9 ABW SP.URB.TOTL 2008 44375
#> 10 ABW SP.URB.TOTL 2009 44052
#> # … with 18,998 more rows
Next, let's look at the indicator variable.
pop2 %>% count(indicator)
#> # A tibble: 4 x 2
#> indicator n
#> <chr> <int>
#> 1 SP.POP.GROW 4752
#> 2 SP.POP.TOTL 4752
#> 3 SP.URB.GROW 4752
#> 4 SP.URB.TOTL 4752
Here SP.POP.GROW is population growth, SP.POP.TOTL is total population, and SP.URB.* are the same measures restricted to urban areas. Let's split these values into two variables: area (total or urban) and a variable containing the actual data (population or growth):
pop3 <- pop2 %>%
separate(indicator, c(NA, "area", "variable"))
#> # A tibble: 19,008 x 5
#> country area variable year value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 ABW URB TOTL 2000 42444
#> 2 ABW URB TOTL 2001 43048
#> 3 ABW URB TOTL 2002 43670
#> 4 ABW URB TOTL 2003 44246
#> 5 ABW URB TOTL 2004 44669
#> 6 ABW URB TOTL 2005 44889
#> 7 ABW URB TOTL 2006 44881
#> 8 ABW URB TOTL 2007 44686
#> 9 ABW URB TOTL 2008 44375
#> 10 ABW URB TOTL 2009 44052
#> # … with 18,998 more rows
Now all that remains is to split the variable column into two columns:
pop3 %>%
pivot_wider(names_from = variable, values_from = value)
#> # A tibble: 9,504 x 5
#> country area year TOTL GROW
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 ABW URB 2000 42444 1.18
#> 2 ABW URB 2001 43048 1.41
#> 3 ABW URB 2002 43670 1.43
#> 4 ABW URB 2003 44246 1.31
#> 5 ABW URB 2004 44669 0.951
#> 6 ABW URB 2005 44889 0.491
#> 7 ABW URB 2006 44881 -0.0178
#> 8 ABW URB 2007 44686 -0.435
#> 9 ABW URB 2008 44375 -0.698
#> 10 ABW URB 2009 44052 -0.731
#> # … with 9,494 more rows
Contact list
One last example: imagine you have a contact list that you copied and pasted from a website:
contacts <- tribble(
~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "[email protected]",
"name", "Huxley Ratcliffe"
)
Turning this list into a table is quite difficult because there is no variable that identifies which data belongs to which contact. We can fix this by noticing that the data for each new contact starts with "name", so we can create a unique identifier and increase it by one every time the field column contains the value "name".
contacts <- contacts %>%
mutate(
person_id = cumsum(field == "name")
)
contacts
#> # A tibble: 6 x 3
#> field value person_id
#> <chr> <chr> <int>
#> 1 name Jiena McLellan 1
#> 2 company Toyota 1
#> 3 name John Smith 2
#> 4 company google 2
#> 5 email [email protected] 2
#> 6 name Huxley Ratcliffe 3
Now that we have a unique ID for each contact, we can pivot the field and value into columns:
contacts %>%
pivot_wider(names_from = field, values_from = value)
#> # A tibble: 3 x 4
#> person_id name company email
#> <int> <chr> <chr> <chr>
#> 1 1 Jiena McLellan Toyota <NA>
#> 2 2 John Smith google [email protected]
#> 3 3 Huxley Ratcliffe <NA> <NA>
Conclusion
In my personal opinion, the new tidyr concept is genuinely more intuitive, and considerably more capable, than the legacy spread() and gather() functions. I hope this article has helped you get to grips with pivot_longer() and pivot_wider().
Source: habr.com