Unnesting list-columns in R (the tidyr package and the unnest family of functions)
Most of the time, when you work with a response received from an API, or with any other data that has a complex tree structure, you are dealing with the JSON and XML formats.
These formats have many advantages: they store data compactly and let you avoid unnecessary duplication of information.
Their drawback is the complexity of processing and analyzing them. Unstructured data cannot be used in calculations, and visualizations cannot be built on it.
This article is a logical continuation of the publication "R package tidyr and its new functions pivot_longer and pivot_wider". It will help you bring unstructured data into the familiar tabular form suitable for analysis, using the tidyr package, which is part of the tidyverse core, and its unnest_*() family of functions.
If you are interested in data analysis, you may also be interested in my Telegram and YouTube channels. Most of the content there is devoted to the R language.
Rectangling (translator's note: I did not find an adequate translation for this term, so we leave it as is) is the process of bringing unstructured data with nested arrays into a two-dimensional table made of familiar rows and columns. tidyr has several functions that help you expand list-columns and reduce the data to a rectangular, tabular form:
unnest_auto() automatically determines which function is best to use, unnest_longer() or unnest_wider().
hoist() is similar to unnest_wider(), but it picks out only the specified components and lets you work with several levels of nesting.
Most problems that involve bringing unstructured data with several levels of nesting into a two-dimensional table can be solved by combining the listed functions with dplyr.
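As a minimal sketch of how the two basic functions differ, the toy tibble below is first widened and then lengthened: unnest_wider() turns each named component of a list-column into a column, while unnest_longer() turns each element of a list-column into a row. The people data and its field names are invented for illustration and do not come from repurrrsive.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: one list-column, each element is a named list
people <- tibble(
  person = list(
    list(name = "Ann", langs = c("R", "SQL")),
    list(name = "Bob", langs = "Python")
  )
)

# Each named component (name, langs) becomes its own column;
# langs stays a list-column because its elements differ in length
wide <- unnest_wider(people, person)
wide

# Each element of langs becomes its own row; name is repeated as needed
long <- unnest_longer(wide, langs)
long
```

Chaining the two calls in this order is the same pattern the larger examples in this article follow.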
To demonstrate these techniques we will use the repurrrsive package, which provides several complex, multi-level lists obtained from web APIs.
We start with gh_users, a list containing information about six GitHub users. First, let's convert the gh_users list into a tibble frame:
users <- tibble( user = gh_users )
This seems a little counterintuitive: why put the gh_users list into a more complex data structure? But a data frame has a big advantage: it bundles several vectors together so that everything is tracked in a single object.
Fully expanding user with unnest_wider() would give a table with 30 columns, and we will not need most of them, so instead of unnest_wider() we can use hoist(). hoist() lets us pull out selected components using the same syntax as purrr::pluck():
users %>% hoist(user,
  followers = "followers",
  login = "login",
  url = "html_url"
)
#> # A tibble: 6 x 4
#> followers login url user
#> <int> <chr> <chr> <list>
#> 1 303 gaborcsardi https://github.com/gaborcsardi <named list [27]>
#> 2 780 jennybc https://github.com/jennybc <named list [27]>
#> 3 3958 jtleek https://github.com/jtleek <named list [27]>
#> 4 115 juliasilge https://github.com/juliasilge <named list [27]>
#> 5 213 leeper https://github.com/leeper <named list [27]>
#> 6 34 masalmon https://github.com/masalmon <named list [27]>
hoist() removes the named components from the user list-column, so you can think of hoist() as moving components from the inner list of a data frame up to its top level.
GitHub repositories
We start handling the gh_repos list in the same way, by converting it into a tibble:
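The conversion step just described could look like the sketch below. It assumes the tibble and repurrrsive packages are installed (repurrrsive supplies gh_repos); the column name repo is chosen here to match the unnest_longer(repo) call used in this section.

```r
library(tibble)
library(repurrrsive)  # supplies the gh_repos example list

# Wrap the list in a one-column tibble; each row holds one user's repositories
repos <- tibble(repo = gh_repos)
repos
```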
This time each element of the repo column is a list of the repositories belonging to one user. Each repository is a separate observation, so according to the tidy data concept they should become new rows, which is why we use unnest_longer() rather than unnest_wider():
repos <- repos %>% unnest_longer(repo)
repos
#> # A tibble: 176 x 1
#> repo
#> <list>
#> 1 <named list [68]>
#> 2 <named list [68]>
#> 3 <named list [68]>
#> 4 <named list [68]>
#> 5 <named list [68]>
#> 6 <named list [68]>
#> 7 <named list [68]>
#> 8 <named list [68]>
#> 9 <named list [68]>
#> 10 <named list [68]>
#> # … with 166 more rows
got_chars has a structure similar to gh_users: it is a set of named lists, where each element of the inner list describes some attribute of a Game of Thrones character. To bring got_chars into tabular form we start by creating a data frame, just as in the earlier examples, and then turn each element into a separate column:
chars <- tibble(char = got_chars)
chars
#> # A tibble: 30 x 1
#> char
#> <list>
#> 1 <named list [18]>
#> 2 <named list [18]>
#> 3 <named list [18]>
#> 4 <named list [18]>
#> 5 <named list [18]>
#> 6 <named list [18]>
#> 7 <named list [18]>
#> 8 <named list [18]>
#> 9 <named list [18]>
#> 10 <named list [18]>
#> # … with 20 more rows
chars2 <- chars %>% unnest_wider(char)
chars2
#> # A tibble: 30 x 18
#> url id name gender culture born died alive titles aliases father
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <list> <chr>
#> 1 http… 1022 Theo… Male Ironbo… In 2… "" TRUE <chr … <chr [… ""
#> 2 http… 1052 Tyri… Male "" In 2… "" TRUE <chr … <chr [… ""
#> 3 http… 1074 Vict… Male Ironbo… In 2… "" TRUE <chr … <chr [… ""
#> 4 http… 1109 Will Male "" "" In 2… FALSE <chr … <chr [… ""
#> 5 http… 1166 Areo… Male Norvos… In 2… "" TRUE <chr … <chr [… ""
#> 6 http… 1267 Chett Male "" At H… In 2… FALSE <chr … <chr [… ""
#> 7 http… 1295 Cres… Male "" In 2… In 2… FALSE <chr … <chr [… ""
#> 8 http… 130 Aria… Female Dornish In 2… "" TRUE <chr … <chr [… ""
#> 9 http… 1303 Daen… Female Valyri… In 2… "" TRUE <chr … <chr [… ""
#> 10 http… 1319 Davo… Male Wester… In 2… "" TRUE <chr … <chr [… ""
#> # … with 20 more rows, and 7 more variables: mother <chr>, spouse <chr>,
#> # allegiances <list>, books <list>, povBooks <list>, tvSeries <list>,
#> # playedBy <list>
The structure of got_chars is somewhat more complex than that of gh_users, because some components of the char list are themselves lists, so we end up with list-columns such as titles and aliases in the output above.
What you do next depends on the goals of your analysis. You may need a row for every book and TV series in which each character appears:
chars2 %>%
  select(name, books, tvSeries) %>%
  pivot_longer(c(books, tvSeries), names_to = "media", values_to = "value") %>%
  unnest_longer(value)
#> # A tibble: 180 x 3
#> name media value
#> <chr> <chr> <chr>
#> 1 Theon Greyjoy books A Game of Thrones
#> 2 Theon Greyjoy books A Storm of Swords
#> 3 Theon Greyjoy books A Feast for Crows
#> 4 Theon Greyjoy tvSeries Season 1
#> 5 Theon Greyjoy tvSeries Season 2
#> 6 Theon Greyjoy tvSeries Season 3
#> 7 Theon Greyjoy tvSeries Season 4
#> 8 Theon Greyjoy tvSeries Season 5
#> 9 Theon Greyjoy tvSeries Season 6
#> 10 Tyrion Lannister books A Feast for Crows
#> # … with 170 more rows
Or you may want to build a table that lets you match characters with their titles:
chars2 %>%
  select(name, title = titles) %>%
  unnest_longer(title)
#> # A tibble: 60 x 2
#> name title
#> <chr> <chr>
#> 1 Theon Greyjoy Prince of Winterfell
#> 2 Theon Greyjoy Captain of Sea Bitch
#> 3 Theon Greyjoy Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister Acting Hand of the King (former)
#> 5 Tyrion Lannister Master of Coin (former)
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet
#> 7 Victarion Greyjoy Master of the Iron Victory
#> 8 Will ""
#> 9 Areo Hotah Captain of the Guard at Sunspear
#> 10 Chett ""
#> # … with 50 more rows
(Note the empty "" values in the title field. They come from data-entry errors in got_chars: for characters who have no titles, the title field should contain a vector of length 0, not a vector of length 1 holding an empty string.)
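The difference between the two shapes mentioned in this note can be checked directly in base R. The sketch below is only an illustration: drop_empty() is a hypothetical helper, not part of tidyr or repurrrsive, showing one possible way to turn the stored shape into the ideal one.

```r
# The two ways of recording "this character has no titles"
ideal  <- character(0)  # vector of length 0
actual <- ""            # vector of length 1 holding an empty string

length(ideal)
length(actual)

# Hypothetical helper: drop empty strings from a vector of titles
drop_empty <- function(x) x[x != ""]

drop_empty(actual)
drop_empty(c("Prince of Winterfell", ""))
```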
We can rewrite the example above using unnest_auto(). This approach is convenient for one-off analysis, but you should not rely on unnest_auto() for regular use. The point is that if your data changes, unnest_auto() may change the conversion mechanism it selects: if it initially expanded list-columns into rows using unnest_longer(), then once the structure of the incoming data changes the choice may shift in favor of unnest_wider(), and using this approach on an ongoing basis can lead to unexpected errors.
tibble(char = got_chars) %>%
  unnest_auto(char) %>%
  select(name, title = titles) %>%
  unnest_auto(title)
#> Using `unnest_wider(char)`; elements have 18 names in common
#> Using `unnest_longer(title)`; no element has names
#> # A tibble: 60 x 2
#> name title
#> <chr> <chr>
#> 1 Theon Greyjoy Prince of Winterfell
#> 2 Theon Greyjoy Captain of Sea Bitch
#> 3 Theon Greyjoy Lord of the Iron Islands (by law of the green lands)
#> 4 Tyrion Lannister Acting Hand of the King (former)
#> 5 Tyrion Lannister Master of Coin (former)
#> 6 Victarion Greyjoy Lord Captain of the Iron Fleet
#> 7 Victarion Greyjoy Master of the Iron Victory
#> 8 Will ""
#> 9 Areo Hotah Captain of the Guard at Sunspear
#> 10 Chett ""
#> # … with 50 more rows
Geocoding with Google
Next we look at a more complex structure: data obtained from Google's geocoding service. Caching the results is against the terms of use of the Google Maps API, so I will first write a simple wrapper around the API. It relies on the Google Maps API key being stored in an environment variable; if you do not have a key stored in your environment variables, the code chunks in this section will not run.
# Check whether a Google Maps API key is available in the environment
has_key <- !identical(Sys.getenv("GOOGLE_MAPS_API_KEY"), "")
if (!has_key) {
  message("No Google Maps API key found; code chunks will not be run")
}

# https://developers.google.com/maps/documentation/geocoding
# Query the geocoding API for one address and return the parsed JSON as a list
geocode <- function(address, api_key = Sys.getenv("GOOGLE_MAPS_API_KEY")) {
  url <- "https://maps.googleapis.com/maps/api/geocode/json"
  url <- paste0(url, "?address=", URLencode(address), "&key=", api_key)
  jsonlite::read_json(url)
}
The list that this function returns is quite complex.
Fortunately, we can solve the problem of converting this data into tabular form step by step using tidyr functions. To make the task a little harder and more realistic, I will start by geocoding several cities:
city <- c("Houston", "LA", "New York", "Chicago", "Springfield")
city_geo <- purrr::map(city, geocode)
I will convert the result into a tibble and, for convenience, add a column with the corresponding city name.
loc <- tibble(city = city, json = city_geo)
loc
#> # A tibble: 5 x 2
#> city json
#> <chr> <list>
#> 1 Houston <named list [2]>
#> 2 LA <named list [2]>
#> 3 New York <named list [2]>
#> 4 Chicago <named list [2]>
#> 5 Springfield <named list [2]>
The first level contains the components status and results, which we can expand with unnest_wider():
loc %>%
  unnest_wider(json)
#> # A tibble: 5 x 3
#> city results status
#> <chr> <list> <chr>
#> 1 Houston <list [1]> OK
#> 2 LA <list [1]> OK
#> 3 New York <list [1]> OK
#> 4 Chicago <list [1]> OK
#> 5 Springfield <list [1]> OK
Notice that results is a list of lists. Most cities have one element (representing a unique match from the geocoding API), but Springfield has two. We can pull these into separate rows with unnest_longer():
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results)
#> # A tibble: 5 x 3
#> city results status
#> <chr> <list> <chr>
#> 1 Houston <named list [5]> OK
#> 2 LA <named list [5]> OK
#> 3 New York <named list [5]> OK
#> 4 Chicago <named list [5]> OK
#> 5 Springfield <named list [5]> OK
Now they all have the same components, which we can verify with unnest_wider():
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results)
#> # A tibble: 5 x 7
#> city address_componen… formatted_addre… geometry place_id types status
#> <chr> <list> <chr> <list> <chr> <lis> <chr>
#> 1 Houst… <list [4]> Houston, TX, USA <named … ChIJAYWN… <lis… OK
#> 2 LA <list [4]> Los Angeles, CA… <named … ChIJE9on… <lis… OK
#> 3 New Y… <list [3]> New York, NY, U… <named … ChIJOwg_… <lis… OK
#> 4 Chica… <list [4]> Chicago, IL, USA <named … ChIJ7cv0… <lis… OK
#> 5 Sprin… <list [5]> Springfield, MO… <named … ChIJP5jI… <lis… OK
We can get the latitude and longitude coordinates of each city by expanding the geometry list:
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results) %>%
  unnest_wider(geometry)
#> # A tibble: 5 x 10
#> city address_compone… formatted_addre… bounds location location_type
#> <chr> <list> <chr> <list> <list> <chr>
#> 1 Hous… <list [4]> Houston, TX, USA <name… <named … APPROXIMATE
#> 2 LA <list [4]> Los Angeles, CA… <name… <named … APPROXIMATE
#> 3 New … <list [3]> New York, NY, U… <name… <named … APPROXIMATE
#> 4 Chic… <list [4]> Chicago, IL, USA <name… <named … APPROXIMATE
#> 5 Spri… <list [5]> Springfield, MO… <name… <named … APPROXIMATE
#> # … with 4 more variables: viewport <list>, place_id <chr>, types <list>,
#> # status <chr>
And then the location column, which also needs to be expanded:
loc %>%
  unnest_wider(json) %>%
  unnest_longer(results) %>%
  unnest_wider(results) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
#> # A tibble: 5 x 11
#> city address_compone… formatted_addre… bounds lat lng location_type
#> <chr> <list> <chr> <list> <dbl> <dbl> <chr>
#> 1 Hous… <list [4]> Houston, TX, USA <name… 29.8 -95.4 APPROXIMATE
#> 2 LA <list [4]> Los Angeles, CA… <name… 34.1 -118. APPROXIMATE
#> 3 New … <list [3]> New York, NY, U… <name… 40.7 -74.0 APPROXIMATE
#> 4 Chic… <list [4]> Chicago, IL, USA <name… 41.9 -87.6 APPROXIMATE
#> 5 Spri… <list [5]> Springfield, MO… <name… 37.2 -93.3 APPROXIMATE
#> # … with 4 more variables: viewport <list>, place_id <chr>, types <list>,
#> # status <chr>
Again, unnest_auto() simplifies the process described above, at the cost of the same risks that a change in the structure of the incoming data may bring:
loc %>%
  unnest_auto(json) %>%
  unnest_auto(results) %>%
  unnest_auto(results) %>%
  unnest_auto(geometry) %>%
  unnest_auto(location)
#> Using `unnest_wider(json)`; elements have 2 names in common
#> Using `unnest_longer(results)`; no element has names
#> Using `unnest_wider(results)`; elements have 5 names in common
#> Using `unnest_wider(geometry)`; elements have 4 names in common
#> Using `unnest_wider(location)`; elements have 2 names in common
#> # A tibble: 5 x 11
#> city address_compone… formatted_addre… bounds lat lng location_type
#> <chr> <list> <chr> <list> <dbl> <dbl> <chr>
#> 1 Hous… <list [4]> Houston, TX, USA <name… 29.8 -95.4 APPROXIMATE
#> 2 LA <list [4]> Los Angeles, CA… <name… 34.1 -118. APPROXIMATE
#> 3 New … <list [3]> New York, NY, U… <name… 40.7 -74.0 APPROXIMATE
#> 4 Chic… <list [4]> Chicago, IL, USA <name… 41.9 -87.6 APPROXIMATE
#> 5 Spri… <list [5]> Springfield, MO… <name… 37.2 -93.3 APPROXIMATE
#> # … with 4 more variables: viewport <list>, place_id <chr>, types <list>,
#> # status <chr>
We could also look only at the first address for each city:
loc %>%
  unnest_wider(json) %>%
  hoist(results, first_result = 1) %>%
  unnest_wider(first_result) %>%
  unnest_wider(geometry) %>%
  unnest_wider(location)
#> # A tibble: 5 x 11
#> city address_compone… formatted_addre… bounds lat lng location_type
#> <chr> <list> <chr> <list> <dbl> <dbl> <chr>
#> 1 Hous… <list [4]> Houston, TX, USA <name… 29.8 -95.4 APPROXIMATE
#> 2 LA <list [4]> Los Angeles, CA… <name… 34.1 -118. APPROXIMATE
#> 3 New … <list [3]> New York, NY, U… <name… 40.7 -74.0 APPROXIMATE
#> 4 Chic… <list [4]> Chicago, IL, USA <name… 41.9 -87.6 APPROXIMATE
#> 5 Spri… <list [5]> Springfield, MO… <name… 37.2 -93.3 APPROXIMATE
#> # … with 4 more variables: viewport <list>, place_id <chr>, types <list>,
#> # status <chr>
Or use hoist() to dive through several levels at once and go straight to lat and lng.
loc %>%
  hoist(json,
    lat = list("results", 1, "geometry", "location", "lat"),
    lng = list("results", 1, "geometry", "location", "lng")
  )
#> # A tibble: 5 x 4
#> city lat lng json
#> <chr> <dbl> <dbl> <list>
#> 1 Houston 29.8 -95.4 <named list [2]>
#> 2 LA 34.1 -118. <named list [2]>
#> 3 New York 40.7 -74.0 <named list [2]>
#> 4 Chicago 41.9 -87.6 <named list [2]>
#> 5 Springfield 37.2 -93.3 <named list [2]>
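For readers without an API key, the multi-level hoist() call can be tried on a hand-built stand-in. This is only a sketch: fake_response() is a hypothetical helper that mimics just the part of the response shape used here (results, geometry, location, lat, lng, status), and the city names and coordinates are made up.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for one geocoding response; only the fields
# needed for this example are included, and the values are invented
fake_response <- function(lat, lng) {
  list(
    results = list(
      list(geometry = list(location = list(lat = lat, lng = lng)))
    ),
    status = "OK"
  )
}

loc_toy <- tibble(
  city = c("A", "B"),
  json = list(fake_response(10.5, 20.5), fake_response(-1.25, 3.75))
)

# Each path is a list of names and positions, followed level by level
coords <- hoist(loc_toy, json,
  lat = list("results", 1, "geometry", "location", "lat"),
  lng = list("results", 1, "geometry", "location", "lng")
)
coords
```

The position 1 in each path selects the first result only, so any further matches for a city are ignored by this approach.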
Sharla Gelfand's discography
Finally, we look at the most complex structure: Sharla Gelfand's discography. As in the examples above, we start by converting the list into a single-column data frame and then expand it so that each component becomes a separate column. I also convert the date_added column to a proper date-time type in R.
discs <- tibble(disc = discog) %>%
  unnest_wider(disc) %>%
  mutate(date_added = as.POSIXct(strptime(date_added, "%Y-%m-%dT%H:%M:%S")))
discs
#> # A tibble: 155 x 5
#> instance_id date_added basic_information id rating
#> <int> <dttm> <list> <int> <int>
#> 1 354823933 2019-02-16 17:48:59 <named list [11]> 7496378 0
#> 2 354092601 2019-02-13 14:13:11 <named list [11]> 4490852 0
#> 3 354091476 2019-02-13 14:07:23 <named list [11]> 9827276 0
#> 4 351244906 2019-02-02 11:39:58 <named list [11]> 9769203 0
#> 5 351244801 2019-02-02 11:39:37 <named list [11]> 7237138 0
#> 6 351052065 2019-02-01 20:40:53 <named list [11]> 13117042 0
#> 7 350315345 2019-01-29 15:48:37 <named list [11]> 7113575 0
#> 8 350315103 2019-01-29 15:47:22 <named list [11]> 10540713 0
#> 9 350314507 2019-01-29 15:44:08 <named list [11]> 11260950 0
#> 10 350314047 2019-01-29 15:41:35 <named list [11]> 11726853 0
#> # … with 145 more rows
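The format string passed to strptime() above covers only the date and time portion of each value. The sketch below shows the effect on a single string. The sample value, including its trailing UTC offset, is an assumption about how date_added is stored; strptime() ignores any characters left over once the format has been matched, so the offset is not applied and the time is read as local time.

```r
# Assumed shape of one raw date_added value (illustrative)
raw <- "2019-02-16T17:48:59-08:00"

# Parse the date and time part; the trailing "-08:00" is ignored
parsed <- as.POSIXct(strptime(raw, "%Y-%m-%dT%H:%M:%S"))

class(parsed)
format(parsed, "%Y-%m-%d %H:%M:%S")
```

If the offset matters for an analysis, it would have to be handled separately, for example by parsing it with a format that includes the time zone.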
At this level we get information about when each disc was added to Sharla's discography, but we do not see any data about the discs themselves. To do that we need to expand the basic_information column:
discs %>% unnest_wider(basic_information)
#> Column name `id` must not be duplicated.
#> Use .name_repair to specify repair.
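The error arises because basic_information has its own id component, which clashes with the id column that already exists. One possible way around this, sketched here on invented toy data rather than on discog itself, is the names_sep argument of unnest_wider(), which prefixes every new column with the name of the list-column.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data reproducing the clash: id exists at both levels
discs_toy <- tibble(
  id = c(1L, 2L),
  basic_information = list(
    list(id = 1L, title = "First"),
    list(id = 2L, title = "Second")
  )
)

# names_sep prefixes the new columns, so the inner id no longer collides
wide <- unnest_wider(discs_toy, basic_information, names_sep = "_")
names(wide)
```

Other options include the names_repair argument of unnest_wider() or dropping the duplicated column before expanding; which one fits best depends on whether the inner id is needed.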
Tables extracted in this way can be joined back to the original dataset if needed.
Conclusion
The core of the tidyverse library includes many useful packages united by a common philosophy of data processing.
In this article we examined the unnest_*() family of functions, which is designed for extracting elements from nested lists. The tidyr package contains many other useful features that make it easier to transform data according to the Tidy Data concept.