R package tidyr and its new functions pivot_longer and pivot_wider
Plastic bag tidyr is included in the core of one of the most popular libraries in the R language - tidyverse.
The main purpose of the package is to bring the data to a neat form.
Already on Habré publication dedicated to this package, but it dates back to 2015. And I want to tell you about the most relevant changes, which were announced a few days ago by its author Hedley Wickham.
SJK: Will the gather() and spread() functions be deprecated?
Hadley Wickham: To some extent. We will no longer recommend the use of these functions, and fix bugs in them, but they will continue to be present in the package in their current state.
Content
If you are interested in data analysis, you might be interested in my telegram и youtube channels. Most of the content of which is devoted to the R language.
Goal tidyr - to help you bring the data to the so-called neat form. Neat data is data where:
Each variable is in a column.
Each observation is a line.
Each value is a cell.
It is much easier and more convenient to work with data that are brought to tidy data during analysis.
The main functions included in the tidyr package
tidyr contains a set of functions for transforming tables:
fill() - filling in the missing values in the column with the previous values;
separate() - splits one field into several through a separator;
unite() - performs the operation of combining several fields into one, the action is the opposite of the function separate();
pivot_longer() - a function that converts data from a wide format to a long one;
pivot_wider() - a function that converts data from long format to wide format. The operation is the reverse of that which the function performs. pivot_longer().
gather()outdated - a function that converts data from a wide format to a long one;
spread()outdated - a function that converts data from long format to wide format. The operation is the reverse of that which the function performs. gather().
New concept for converting data from wide to long format and vice versa
Previously, for this kind of transformation, the functions gather() и spread(). Over the years of the existence of these functions, it became obvious that for most users, including the author of the package, the names of these functions, and their arguments, were rather obscure, and caused difficulties in finding them and understanding which of these functions converts a date frame from wide to long format and vice versa.
In connection with this, tidyr two new, important functions have been added that are designed to transform data frames.
New Features pivot_longer() и pivot_wider() were inspired by some of the features in the package cdatacreated by John Mount and Nina Zumel.
Installing the latest version of tidyr 0.8.3.9000
To install a new, most up-to-date package version tidyr0.8.3.9000, where new features are available, use the following code.
devtools::install_github("tidyverse/tidyr")
At the time of writing, these functions are only available in the dev version of the package on GitHub.
Transition to new features
In fact, it is not difficult to translate old scripts to work with new functions, for better understanding, I will take an example from the documentation of old functions and show how the same operations are performed using new ones. pivot_*() functions.
Convert wide format to long format.
Sample code from the documentation of the gather function
# example
library(dplyr)
stocks <- data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
# old
stocks_gather <- stocks %>% gather(key = stock,
value = price,
-time)
# new
stocks_long <- stocks %>% pivot_longer(cols = -time,
names_to = "stock",
values_to = "price")
Convert long format to wide format.
Sample code from the documentation of the spread function
# old
stocks_spread <- stocks_gather %>% spread(key = stock,
value = price)
# new
stock_wide <- stocks_long %>% pivot_wider(names_from = "stock",
values_from = "price")
Because in the above examples of working with pivot_longer() и pivot_wider(), in the original table stocks no columns listed in arguments names_to и values_to their names must be enclosed in quotation marks.
A table with the help of which it will be most easy for you to figure out how to switch to working with a new concept tidyr.
Note from the author
All the text below is adaptive, I would even say free translation vignettes from the official website of the tidyverse library.
A simple example of converting data from wide to long
pivot_longer () - makes datasets longer by reducing the number of columns and increasing the number of rows.
To execute the examples presented in the article, you first need to connect the necessary packages:
library(tidyr)
library(dplyr)
library(readr)
Let's say we have a table with the results of a survey that (among other things) asked people about their religion and annual income:
This table contains the religion data of the respondents in rows, and the income level is scattered across the column names. The number of respondents from each category is stored in cell values at the intersection of religion and income level. To bring the table to a neat, correct format, it is enough to use pivot_longer():
The first argument cols, describes which columns to merge. In this case, all columns except team.
Argument names_to gives the name of the variable that will be created from the column names we have concatenated.
values_to gives the name of the variable that will be created from the data stored in the cell values of the merged columns.
Specifications
This is a new package feature. tidyr, which was previously unavailable when working with legacy features.
A specification is a data frame, each row of which corresponds to one column in the new output data frame, and two special columns that start with:
. Name contains the original column name.
.value contains the name of the column that will contain the cell values.
The remaining columns of the specification reflect how the name of the compressed columns will be displayed in the new column. . Name.
The spec describes the metadata stored in a column name, with one row for each column and one column for each variable concatenated with the column name, which may seem confusing at the moment, but after looking at a few examples it will become much clearer.
The point of the specification is that you can retrieve, modify, and set new metadata for the converted dataframe.
To work with specifications when converting a table from wide format to long format, use the function pivot_longer_spec().
The way this function works is that it takes any data frame and generates its metadata in the manner described above.
For example, let's take the who dataset that comes with the package tidyr. This dataset contains information provided by the International Health Organization on tuberculosis incidence.
who
#> # A tibble: 7,240 x 60
#> country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
#> <chr> <chr> <chr> <int> <int> <int> <int>
#> 1 Afghan… AF AFG 1980 NA NA NA
#> 2 Afghan… AF AFG 1981 NA NA NA
#> 3 Afghan… AF AFG 1982 NA NA NA
#> 4 Afghan… AF AFG 1983 NA NA NA
#> 5 Afghan… AF AFG 1984 NA NA NA
#> 6 Afghan… AF AFG 1985 NA NA NA
#> 7 Afghan… AF AFG 1986 NA NA NA
#> 8 Afghan… AF AFG 1987 NA NA NA
#> 9 Afghan… AF AFG 1988 NA NA NA
#> 10 Afghan… AF AFG 1989 NA NA NA
#> # … with 7,230 more rows, and 53 more variables
Let's build its specification.
spec <- who %>%
pivot_longer_spec(new_sp_m014:newrel_f65, values_to = "count")
fields country, iso2, iso3 are already variables. Our task is to flip the columns with new_sp_m014 by newrel_f65.
The following information is stored in the names of these columns:
Prefix new_ indicates that the column contains data on new cases of tuberculosis, the current date frame contains information only on new diseases, so this prefix in the current context does not carry any semantic load.
sp/rel/sp/ep describes a method for diagnosing a disease.
m/f gender of the patient.
014/1524/2535/3544/4554/65 patient age range.
We can split these columns with the function extract(), using a regular expression.
Finally, in order to apply the specification we created to the original data frame who we need to use the argument spec in function pivot_longer().
who %>% pivot_longer(spec = spec)
#> # A tibble: 405,440 x 8
#> country iso2 iso3 year diagnosis gender age count
#> <chr> <chr> <chr> <int> <chr> <fct> <ord> <int>
#> 1 Afghanistan AF AFG 1980 sp m 014 NA
#> 2 Afghanistan AF AFG 1980 sp m 1524 NA
#> 3 Afghanistan AF AFG 1980 sp m 2534 NA
#> 4 Afghanistan AF AFG 1980 sp m 3544 NA
#> 5 Afghanistan AF AFG 1980 sp m 4554 NA
#> 6 Afghanistan AF AFG 1980 sp m 5564 NA
#> 7 Afghanistan AF AFG 1980 sp m 65 NA
#> 8 Afghanistan AF AFG 1980 sp f 014 NA
#> 9 Afghanistan AF AFG 1980 sp f 1524 NA
#> 10 Afghanistan AF AFG 1980 sp f 2534 NA
#> # … with 405,430 more rows
Everything we have just done can be schematically depicted as follows:
Specification using multiple values(.value)
In the example above, the specification column .value contained only one value, in most cases this is the case.
But occasionally a situation may arise when you need to collect data from columns with different data types in values. With a deprecated function spread() this would be quite difficult to do.
The example below is taken from vignettes to the package data.table.
The created date frame in each line contains data about the children of one family. Families may have one or two children. For each child, data on the date of birth and gender is provided, and the data for each child goes in separate columns, our task is to bring this data into the correct format for analysis.
Note that we have two variables with information about each child: their gender and date of birth (columns prefixed baptism contain date of birth, prefixed columns gender contain the gender of the child). In the expected result, they should go in separate columns. We can do this by generating a specification where the column .value will have two different meanings.
spec <- family %>%
pivot_longer_spec(-family) %>%
separate(col = name, into = c(".value", "child"))%>%
mutate(child = parse_number(child))
#> # A tibble: 4 x 3
#> .name .value child
#> <chr> <chr> <dbl>
#> 1 dob_child1 dob 1
#> 2 dob_child2 dob 2
#> 3 gender_child1 gender 1
#> 4 gender_child2 gender 2
So, let's take a look at the steps of the actions that are performed by the above code.
pivot_longer_spec(-family) - create a specification that compresses all available columns, except for the family column.
separate(col = name, into = c(".value", "child")) - split the column . Name, which contains the names of the original fields, by underscore and put the resulting values into columns .value и child.
mutate(child = parse_number(child)) - transform the field values child from text to numeric data type.
Now we can apply the resulting specification to the original dataframe, and bring the table to the desired form.
We use the argument na.rm = TRUE, because the current form of the data forces the creation of extra rows for nonexistent observations. Because family 2 has only one child, na.rm = TRUE guarantees that family 2 will have one row in the output.
Converting data frames from long format to wide format
pivot_wider() - is the inverse transformation, and vice versa increases the number of columns of the date frame by reducing the number of rows.
This kind of transformation is extremely rarely used to bring data to a neat form, however, this technique can be useful for creating pivot tables used in presentations, or for integrating with some other tools.
Actually the functions pivot_longer() и pivot_wider() are symmetrical, and perform the opposite actions to each other, i.e.: df %>% pivot_longer(spec = spec) %>% pivot_wider(spec = spec) и df %>% pivot_wider(spec = spec) %>% pivot_longer(spec = spec) will return the original df.
The simplest example of converting a table to a wide format
To demonstrate how the function works pivot_wider() we will use the dataset fish_encounters, which stores information about how various stations record the movement of fish along the river.
#> # A tibble: 114 x 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> 7 4842 BCW 1
#> 8 4842 BCE2 1
#> 9 4842 BCW2 1
#> 10 4842 MAE 1
#> # … with 104 more rows
In most cases, this table will be more informative and easier to use if the information for each station is presented in a separate column.
fish_encounters %>% pivot_wider(names_from = station, values_from = seen)
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA
#> 5 4847 1 1 1 NA NA NA NA NA NA NA
#> 6 4848 1 1 1 1 NA NA NA NA NA NA
#> 7 4849 1 1 NA NA NA NA NA NA NA NA
#> 8 4850 1 1 NA 1 1 1 1 NA NA NA
#> 9 4851 1 1 NA NA NA NA NA NA NA NA
#> 10 4854 1 1 NA NA NA NA NA NA NA NA
#> # … with 9 more rows, and 1 more variable: MAW <int>
This dataset records information only when fish were detected by the station, i.e. if any fish was not recorded by some station, then these data will not be in the table. This means that the output will be filled with NA.
However, in this case, we know that the absence of the entry means that the fish was not seen, so we can use the argument values_fill in function pivot_wider() and fill those missing values with zeros:
Generating a column name from multiple source variables
Imagine we have a table containing a combination of product, country, and year. To generate a test date frame, you can run the following code:
df <- expand_grid(
product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
) %>%
filter((product == "A" & country == "AI") | product == "B") %>%
mutate(value = rnorm(nrow(.)))
#> # A tibble: 45 x 4
#> product country year value
#> <chr> <chr> <int> <dbl>
#> 1 A AI 2000 -2.05
#> 2 A AI 2001 -0.676
#> 3 A AI 2002 1.60
#> 4 A AI 2003 -0.353
#> 5 A AI 2004 -0.00530
#> 6 A AI 2005 0.442
#> 7 A AI 2006 -0.610
#> 8 A AI 2007 -2.77
#> 9 A AI 2008 0.899
#> 10 A AI 2009 -0.106
#> # … with 35 more rows
Our task is to expand the data frame so that one column contains data for each combination of product and country. To do this, it is enough to pass in the argument names_from a vector containing the names of the fields to merge.
You can also apply specifications to a function pivot_wider(). But when submitted to pivot_wider() the specification performs the opposite conversion pivot_longer(): creates the columns specified in . Name, using values from .value and other columns.
For this data set, you can generate a custom specification if you want every possible combination of country and product to have its own column, not just those present in the data:
#> # A tibble: 4 x 4
#> .name product country .value
#> <chr> <chr> <chr> <chr>
#> 1 A_AI A AI value
#> 2 A_EI A EI value
#> 3 B_AI B AI value
#> 4 B_EI B EI value
df %>% pivot_wider(spec = spec) %>% head()
#> # A tibble: 6 x 5
#> year A_AI A_EI B_AI B_EI
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 2000 -2.05 NA 0.607 1.20
#> 2 2001 -0.676 NA 1.65 -0.114
#> 3 2002 1.60 NA -0.0245 0.501
#> 4 2003 -0.353 NA 1.30 -0.459
#> 5 2004 -0.00530 NA 0.921 -0.0589
#> 6 2005 0.442 NA -1.55 0.594
Some advanced examples of working with the new tidyr concept
Tidying Up the Data Using the US Census of Income and Rent Dataset
Data set us_rent_income contains information on median income and rent for each state in the US for 2017 (dataset available in package tidycensus).
us_rent_income
#> # A tibble: 104 x 5
#> GEOID NAME variable estimate moe
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 01 Alabama income 24476 136
#> 2 01 Alabama rent 747 3
#> 3 02 Alaska income 32940 508
#> 4 02 Alaska rent 1200 13
#> 5 04 Arizona income 27517 148
#> 6 04 Arizona rent 972 4
#> 7 05 Arkansas income 23789 165
#> 8 05 Arkansas rent 709 5
#> 9 06 California income 29454 109
#> 10 06 California rent 1358 3
#> # … with 94 more rows
In the form in which the data is stored in the dataset us_rent_income working with them is extremely inconvenient, so we would like to create a dataset with columns: rent, rent_moe, like, income_moe. There are many ways to create this specification, but the main thing is that we need to generate each combination of variable values and estimate/moeand then generate the column name.
Sometimes it takes several steps to get a data set into the right shape.
Dataset world_bank_pop contains World Bank data on the population of each country between 2000 and 2018.
Our goal is to create a neat dataset where each variable is in its own column. It's not yet clear exactly what steps are needed, but we'll start with the most obvious problem: the year is spread across multiple columns.
In order to fix this, you need to use the function pivot_longer().
The next step is to consider the indicator variable. pop2 %>% count(indicator)
#> # A tibble: 4 x 2
#> indicator n
#> <chr> <int>
#> 1 SP.POP.GROW 4752
#> 2 SP.POP.TOTL 4752
#> 3 SP.URB.GROW 4752
#> 4 SP.URB.TOTL 4752
Where SP.POP.GROW is population growth, SP.POP.TOTL is total population, and SP.URB. * the same, but only for urban areas. Let's split these values into two variables: area - the area (total or urban) and a variable containing the actual data (population or growth):
Tabulating this list is difficult because there is no variable to identify which data belongs to which contact. We can fix this by noticing that the data for each new contact starts with a name ("name"), so we can create a unique ID, and increment it by one each time the value "name" is encountered in the field column:
#> # A tibble: 6 x 3
#> field value person_id
#> <chr> <chr> <int>
#> 1 name Jiena McLellan 1
#> 2 company Toyota 1
#> 3 name John Smith 2
#> 4 company google 2
#> 5 email [email protected] 2
#> 6 name Huxley Ratcliffe 3
Now that we have a unique ID for each contact, we can turn the field and value into columns:
#> # A tibble: 3 x 4
#> person_id name company email
#> <int> <chr> <chr> <chr>
#> 1 1 Jiena McLellan Toyota <NA>
#> 2 2 John Smith google [email protected]
#> 3 3 Huxley Ratcliffe <NA> <NA>
Conclusion
My personal opinion is that the new concept tidyr is really more intuitive, and significantly outperforms deprecated functions in terms of functionality spread() и gather(). I hope this article has helped you understand pivot_longer() и pivot_wider().