Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

When you search for R or Python on the Internet, you will find millions of articles and kilometer-long discussions on the topic of which one is better, faster and more convenient for working with data. But unfortunately, all these articles and disputes do not carry much benefit.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

The purpose of this article is to compare the main data processing techniques in the most popular packages of both languages. And to help readers as quickly as possible to master what they do not yet know. For those who write in Python, learn how to do the same in R, and vice versa.

In the course of the article, we will analyze the syntax of the most popular R packages. These are the packages included in the library tidyverseand also the package data.table. And compare their syntax with pandas, the most popular package for data analysis in Python.

We will step by step go through the entire path of data analysis from data loading to the execution of analytical, window functions using Python and R.

Content

This article can be used as a cheat sheet if you forgot how to perform some data processing operation in one of the packages under consideration.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

  1. Main Syntax Differences in R and Python
    1.1. Calling Package Functions
    1.2. Assignment
    1.3. Индексация
    1.4. Methods and OOP
    1.5. Pipelines
    1.6. Data structures
  2. A few words about the packages that we will use
    2.1. tidyverse
    2.2. data.table
    2.3. pandas
  3. Package Installation
  4. Loading data
  5. Creating dataframes
  6. Selecting the desired columns
  7. Row filtering
  8. Grouping and aggregation
  9. Vertical join of tables (UNION)
  10. Horizontal table join (JOIN)
  11. Basic Window Functions and Computed Columns
  12. Correspondence table of data processing methods in R and Python
  13. Conclusion
  14. A little poll about which package you are using

If you are interested in data analysis, you may find my telegram и youtube channels. Most of the content of which is devoted to the R language.

Main Syntax Differences in R and Python

To make it easier for you to switch from Python to R, or vice versa, here are a few main points that you need to pay attention to.

Calling Package Functions

Once a package has been loaded into R, there is no need to specify the package name to call its functions. In most cases, this is not accepted in R, but it is allowed. You can generally not import the package if you need any one of its functions in the code, but simply call it by specifying the package name and the function name. The separator between a package name and a function in R is a double colon package_name::function_name().

In Python, on the contrary, it is classic to call the functions of a package by explicitly specifying its name. When a package is loaded, it is usually given an abbreviated name, for example for pandas commonly used alias pd. Calling a package function goes through a dot package_name.function_name().

Assignment

In R, it is customary to use an arrow to assign a value to an object. obj_name <- value, although a single equals sign is also allowed, the single equals sign in R is mainly used to pass values ​​to function arguments.

In Python, assignment is done exclusively with a single equals sign. obj_name = value.

Индексация

Here, too, there are significant differences. In R, indexing starts at one and includes all specified elements in the resulting range,

in Python, indexing starts at zero and the selectable range does not include the last element specified in the index. So design x[i:j] in Python will not include the j element.

There are also differences in negative indexing, in R the notation x[-1] will return all elements of the vector except the last one. In Python, a similar notation will only return the last element.

Methods and OOP

OOP is implemented in R in its own way, I wrote about this in an article "OOP in R (part 1): S3 classes". In general, R is a functional language, and everything in it is built on functions. Therefore, for example, for Excel users, go to tydiverse will be easier than pandas. Although perhaps this is my subjective opinion.

In short, objects in R do not have methods (if we talk about S3 classes, but there are other OOP implementations that are much less common). There are only generic functions that, depending on the class of the object, process them differently.

Pipelines

Perhaps this is the name for pandas will not be entirely correct, but I will try to explain the meaning.

In order not to save intermediate calculations and not to produce unnecessary objects in the working environment, you can use a kind of pipeline. Those. pass the result of a calculation from one function to the next, and do not store intermediate results.

Let's take the following code example, in which we save intermediate calculations into separate objects:

temp_object <- func1()
temp_object2 <- func2(temp_object )
obj <- func3(temp_object2 )

We sequentially performed 3 operations, and the result of each was saved in a separate object. But in fact, we do not need these intermediate objects.

Or even worse, but more familiar to Excel users.

obj  <- func3(func2(func1()))

In this case, we did not save intermediate results of calculations, but it is extremely inconvenient to read code with functions nested inside each other.

We will consider several approaches to data processing in R, and they perform such operations in different ways.

Pipelines in the library tidyverse implemented by the operator %>%.

obj <- func1() %>% 
            func2() %>%
            func3()

Thus, we take the result of the work func1() and pass it as the first argument to func2(), then we pass the result of this calculation as the first argument func3(). And in the end, all the calculations performed are written to the object obj <-.

Better than words, all of the above illustrates this meme:
Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

В data.table chains are used in a similar way.

newDT <- DT[where, select|update|do, by][where, select|update|do, by][where, select|update|do, by]

In each of the square brackets, you can use the result of the previous operation.

В pandas such operations are separated by a dot.

obj = df.fun1().fun2().fun3()

Those. we take our table df and use its method fun1(), then apply the method to the result obtained fun2()after fun3(). The result is saved to an object. obj .

Data structures

The data structures in R and Python are similar but have different names.

Description
Title in R
Name in Python/pandas

Table structure
data.frame, data.table, tibble
DataFrame

One-Dimensional List of Values
Vector
Series in pandas or list (list) in pure Python

Multilevel non-table structure
List
Dictionary (dict)

Some other features and syntax differences will be discussed later.

A few words about the packages that we will use

To begin with, I will tell you a little about the packages that you will get acquainted with in the course of this article.

tidyverse

Official site: tidyverse.org
Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back
Library tidyverse written by Hadley Wickham, Senior Scientist at RStudio. tidyverse consists of an impressive set of packages that simplify data processing, 5 of which are in the top 10 downloaded from the CRAN repository.

The library core consists of the following packages: ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats. Each of these packages is aimed at solving a specific problem. For example dplyr designed to manipulate data tidyr to bring the data to a neat form, stringr makes it easier to work with strings, and ggplot2 is one of the most popular data visualization tools.

The advantage tidyverse is the simplicity and readability of the syntax, which is in many ways similar to the SQL query language.

data.table

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and backOfficial site: r-datatable.com

By the author data.table is Matt Dole of H2O.ai.

The first release of the library took place in 2006.

Package syntax is not as convenient as in tidyverse and more reminiscent of classic dataframes in R, but at the same time significantly expanded in functionality.

All manipulations with the table in this package are described in square brackets, and if you translate the syntax data.table in SQL, you get something like this: data.table[ WHERE, SELECT, GROUP BY ]

The strength of this package is the speed of processing large amounts of data.

pandas

Official site: pandas.pydata.org Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

The name of the library comes from the econometric term "panel data", used to describe multidimensional structured sets of information.

By the author pandas is an American Wes McKinney.

When it comes to data analysis in Python, equal pandas No. A very multifunctional, high-level package that allows you to perform any manipulations with data, from loading data from any source to visualizing it.

Installing additional packages

The packages discussed in this article are not included in the base distributions of R and Python. Although there is a small caveat, if you installed the Anaconda distribution then install additionally pandas not required.

Installing packages in R

If you have ever opened the RStudio development environment, you probably already know how to install the required package in R. To install packages, use the standard command install.packages() by running it directly in R itself.

# установка пакетов
install.packages("vroom")
install.packages("readr")
install.packages("dplyr")
install.packages("data.table")

After installation, the packages must be connected, for which in most cases the command is used library().

# подключение или импорт пакетов в рабочее окружение
library(vroom)
library(readr)
library(dplyr)
library(data.table)

Installing packages in Python

So, if you have pure Python installed, then pandas you need to install manually. Open a command prompt or terminal, depending on your operating system, and enter the following command.

pip install pandas

Then we return to Python and import the installed package with the command import.

import pandas as pd

Loading data

Data mining is one of the most important steps in data analysis. Both Python and R provide you with extensive options for getting data from any source, if you wish: local files, files from the Internet, websites, all kinds of databases.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

In the course of the article, we will use several data sets:

  1. Two downloads from Google Analytics.
  2. Titanic Passenger Dataset.

All data is on my GitHub as csv and tsv files. From where we will ask them.

Loading data into R: tidyverse, vroom, readr

To load data in the library tidyverse there are two packages: vroom, readr. vroom more up-to-date, but in the future the packages may be merged.

Quote from the official documentation vroom.

vroom vs reader
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but likely we will unite the packages in the future. One disadvantage to vroom's lazy reading is certain data problems can't be reported up front, so how best to unify them requires some thought.

vroom vs readr
What does release mean vroom for readr? At the moment we plan to develop both packages separately, but we will probably merge them in the future. One of the disadvantages of lazy reading vroom is that some problems with the data cannot be reported in advance, therefore, it is necessary to think about how best to combine them.

In this article, we will look at both packages for loading data:

Loading data into R: vroom package

# install.packages("vroom")
library(vroom)

# Чтение данных
## vroom
ga_nov  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- vroom("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading data into R: readr

# install.packages("readr")
library(readr)

# Чтение данных
## readr
ga_nov  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- read_tsv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

In the package vroom, regardless of the csv / tsv data format, loading is carried out by the function of the same name vroom(), in the package readr we use a different function for each format read_tsv() и read_csv().

Loading data into R: data.table

В data.table there is a function to load data fread().

Loading data into R: package data.table

# install.packages("data.table")
library(data.table)

## data.table
ga_nov  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_nowember.csv")
ga_dec  <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/ga_december.csv")
titanic <- fread("https://raw.githubusercontent.com/selesnow/publications/master/data_example/r_python_data/titanic.csv")

Loading Data in Python: pandas

When compared with R packages, in this case, the closest in syntax to pandas will readrBecause pandas can request data from anywhere, and this package has a whole family of functions read_*().

  • read_csv()
  • read_excel()
  • read_sql()
  • read_json()
  • read_html()

And many other functions designed to read data from various formats. But for our purposes it is enough read_table() or read_csv() using argument Sep to specify the column separator.

Loading Data in Python: pandas

import pandas as pd

ga_nov  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_nowember.csv", sep = "t")
ga_dec  = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/ga_december.csv", sep = "t")
titanic = pd.read_csv("https://raw.githubusercontent.com/selesnow/publications/master/data_example/russian_text_in_r/titanic.csv")

Creating dataframes

Table titanicthat we have uploaded has a field Sex, which stores the gender ID of the passenger.

But for a more convenient presentation of data in terms of the passenger's gender, you should use not the gender code, but the name.

To do this, we will create a small reference book, a table in which there will be only 2 columns (code and gender name) and 2 rows, respectively.

Creating a dataframe in R: tidyverse, dplyr

In the code example below, we create the desired dataframe using the function tibble() .

Creating a dataframe in R: dplyr

## dplyr
### создаём справочник
gender <- tibble(id = c(1, 2),
                 gender = c("female", "male"))

Creating a dataframe in R: data.table

Creating a dataframe in R: data.table

## data.table
### создаём справочник
gender <- data.table(id = c(1, 2),
                    gender = c("female", "male"))

Creating a dataframe in Python: pandas

В pandas the creation of frames is carried out in several stages, first we create a dictionary, and then we convert the dictionary into a dataframe.

Creating a dataframe in Python: pandas

# создаём дата фрейм
gender_dict = {'id': [1, 2],
               'gender': ["female", "male"]}
# преобразуем словарь в датафрейм
gender = pd.DataFrame.from_dict(gender_dict)

Column selection

The tables you work with can contain dozens or even hundreds of columns of data. But for analysis, as a rule, you do not need all the columns that are available in the source table.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

Therefore, one of the first operations that you will perform with the original table is to clear it of unnecessary information, and free the memory that this information occupies.

Selecting columns in R: tidyverse, dplyr

Syntax dplyr very similar to the SQL query language, if you are familiar with it, you will quickly master this package.

To select columns use the function select().

Below are code examples with which you can select columns in the following ways:

  • By listing the names of the required columns
  • Access column names using regular expressions
  • By data type or any other property of the data contained in the column

Selecting columns in R: dplyr

# Выбор нужных столбцов
## dplyr
### выбрать по названию столбцов
select(ga_nov, date, source, sessions)
### исключь по названию столбцов
select(ga_nov, -medium, -bounces)
### выбрать по регулярному выражению, стобцы имена которых заканчиваются на s
select(ga_nov, matches("s$"))
### выбрать по условию, выбираем только целочисленные столбцы
select_if(ga_nov, is.integer)

Selecting columns in R: data.table

The same operations in data.table are performed a little differently, at the beginning of the article I gave a description of what arguments are inside square brackets in data.table.

DT[i,j,by]

Where:
i - where, i.e. string filtering
j - select|update|do, i.e. selecting columns and converting them
by - data grouping

Selecting columns in R: data.table

## data.table
### выбрать по названию столбцов
ga_nov[ , .(date, source, sessions) ]
### исключь по названию столбцов
ga_nov[ , .SD, .SDcols = ! names(ga_nov) %like% "medium|bounces" ]
### выбрать по регулярному выражению
ga_nov[, .SD, .SDcols = patterns("s$")]

Variable .SD allows you to access all columns, and .SDcols filter the desired columns using regular expressions, or other functions to filter the names of the columns you need.

Selecting columns in Python, pandas

To select columns by name in pandas it is enough to pass a list of their names. And to select or exclude columns by name using regular expressions, you must use the functions drop() и filter(), and the argument axis=1, with which you indicate that it is necessary to process not rows but columns.

To select a field by data type, use the function select_dtypes(), and into the arguments includes or exclude pass in a list of data types corresponding to the fields you need to select.

Selecting columns in Python: pandas

# Выбор полей по названию
ga_nov[['date', 'source', 'sessions']]
# Исключить по названию
ga_nov.drop(['medium', 'bounces'], axis=1)
# Выбрать по регулярному выражению
ga_nov.filter(regex="s$", axis=1)
# Выбрать числовые поля
ga_nov.select_dtypes(include=['number'])
# Выбрать текстовые поля
ga_nov.select_dtypes(include=['object'])

Row filtering

For example, the source table may contain data for several years, but you only need to analyze the last month. Again, extra lines will slow down the processing of data and will clog up the PC's memory.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

Row filtering in R: tydyverse, dplyr

В dplyr function is used to filter rows filter(). It takes a dataframe as its first argument, then you list the filter conditions.

When writing logical expressions to filter a table, in this case, specify the column names without quotes, and without declaring the table name.

When applying multiple boolean expressions to filter, use the following operators:

  • & or comma - logical AND
  • | - logical OR

Row filtering in R: dplyr

# фильтрация строк
## dplyr
### фильтрация строк по одному условию
filter(ga_nov, source == "google")
### фильтр по двум условиям соединённым логическим и
filter(ga_nov, source == "google" & sessions >= 10)
### фильтр по двум условиям соединённым логическим или
filter(ga_nov, source == "google" | sessions >= 10)

Row filtering in R: data.table

As I wrote above, in data.table the data conversion syntax is enclosed in square brackets.

DT[i,j,by]

Where:
i - where, i.e. string filtering
j - select|update|do, i.e. selecting columns and converting them
by - data grouping

The argument is used to filter rows. i, which has the first position in square brackets.

Columns are referred to in logical expressions without quotes and without specifying the table name.

Logical expressions are related to each other in the same way as in dplyr through the & and | operators.

Row filtering in R: data.table

## data.table
### фильтрация строк по одному условию
ga_nov[source == "google"]
### фильтр по двум условиям соединённым логическим и
ga_nov[source == "google" & sessions >= 10]
### фильтр по двум условиям соединённым логическим или
ga_nov[source == "google" | sessions >= 10]

Row filtering in Python: pandas

Filtering by rows in pandas similar to filtering data.table, and is enclosed in square brackets.

The reference to columns in this case is carried out necessarily with the name of the dataframe, then the name of the column can also be specified in quotation marks in square brackets (example df['col_name']), or without quotes after the dot (example df.col_name).

If you need to filter a dataframe by several conditions, each of the conditions must be enclosed in parentheses. The logical conditions are interconnected by operators & и |.

Row filtering in Python: pandas

# Фильтрация строк таблицы
### фильтрация строк по одному условию
ga_nov[ ga_nov['source'] == "google" ]
### фильтр по двум условиям соединённым логическим и
ga_nov[(ga_nov['source'] == "google") & (ga_nov['sessions'] >= 10)]
### фильтр по двум условиям соединённым логическим или
ga_nov[(ga_nov['source'] == "google") | (ga_nov['sessions'] >= 10)]

Grouping and aggregation of data

One of the most commonly used operations in data analysis is grouping and aggregation.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

The syntax for performing these operations is fragmented in all the packages we consider.

In this case, we will take a dataframe as an example. titanic, and calculate the number and average cost of tickets depending on the cabin class.

Grouping and aggregating data in R: tidyverse, dplyr

В dplyr grouping function is used group_by(), and for aggregation summarise(). Actually at dplyr there is a whole family of functions summarise_*(), but the purpose of this article is to compare the basic syntax, so we will not get into such a jungle.

Main aggregation functions:

  • sum() - summation
  • min() / max() - minimum and maximum value
  • mean() - average
  • median() - median
  • length() - quantity

Grouping and aggregation in R: dplyr

## dplyr
### группировка и агрегация строк
group_by(titanic, Pclass) %>%
  summarise(passangers = length(PassengerId),
            avg_price  = mean(Fare))

In function group_by() we passed a table as the first argument titanic, and then indicated the field Pclass, by which we will group our table. The result of this operation using the operator %>% passed as the first argument to the function summarise(), and added 2 more fields: passangers и avg_price. In the first, using the function length() calculated the number of tickets, and in the second using the function mean() received the average ticket price.

Grouping and aggregating data in R: data.table

В data.table for aggregation is an argument j which has the second position in square brackets, and for grouping by or keyby, which have the third position.

The list of aggregating functions in this case is identical to that described in dplyr, because these are functions from basic R syntax.

Grouping and aggregation in R: data.table

## data.table
### фильтрация строк по одному условию
titanic[, .(passangers = length(PassengerId),
            avg_price  = mean(Fare)),
        by = Pclass]

Grouping and Aggregating Data in Python: pandas

Grouping in pandas similar to dplyr, but the aggregation is not similar to dplyr not on data.table.

For grouping, use the method groupby(), to which you need to pass a list of columns by which the dataframe will be grouped.

For aggregation, you can use the method agg(), which accepts a dictionary. The keys of the dictionary are the columns to which you will apply the aggregate functions, and the values ​​will be the names of the aggregate functions.

Aggregating functions:

  • sum() - summation
  • min() / max() - minimum and maximum value
  • mean() - average
  • median() - median
  • count() - quantity

Function reset_index() in the example below is used to reset nested indexes that pandas by default sets after data aggregation.

Symbol allows you to move to the next line.

Grouping and aggregation in Python: pandas

# группировка и агрегация данных
titanic.groupby(["Pclass"]).
    agg({'PassengerId': 'count', 'Fare': 'mean'}).
        reset_index()

Vertical join of tables

An operation where you join two or more tables of the same structure. There are tables in the data we uploaded ga_nov и ga_dec. These tables are identical in structure, i.e. have the same columns, and the data types in those columns.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

This is a download from Google Analytics for the month of November and December, in this section we will combine this data into one table.

Vertical table joins in R: tidyverse, dplyr

В dplyr You can merge 2 tables into one using the function bind_rows(), passing tables as its arguments.

Row filtering in R: dplyr

# Вертикальное объединение таблиц
## dplyr
bind_rows(ga_nov, ga_dec)

Vertical join of tables in R: data.table

Nothing complicated, just use rbind().

Row filtering in R: data.table

## data.table
rbind(ga_nov, ga_dec)

Vertical table joins in Python: pandas

В pandas function to join tables concat(), to which you need to pass a list of frames to merge them.

Row filtering in Python: pandas

# вертикальное объединение таблиц
pd.concat([ga_nov, ga_dec])

Horizontal join of tables

An operation in which columns from the second table are added to the first table by key. It is often used when enriching a fact table (for example, a table with sales data) with some reference data (for example, the cost of goods).

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

There are several types of union:

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

In the previously loaded table titanic we have a column Sex, which corresponds to the passenger's gender code:

1 - female
2 - male

Also, we have created a table - a reference gender. For a more convenient presentation of data on the gender of passengers, we need to add the name of the gender, from the directory gender to the table titanic.

Horizontal table joins in R: tidyverse, dplyr

В dplyr for horizontal joining there is a whole family of functions:

  • inner_join()
  • left_join()
  • right_join()
  • full_join()
  • semi_join()
  • nest_join()
  • anti_join()

The most commonly used in my practice is left_join().

As the first two arguments, the functions listed above take two tables to join, and as the third argument by you must specify the columns to merge.

Horizontal table joins in R: dplyr

# объединяем таблицы
left_join(titanic, gender,
          by = c("Sex" = "id"))

Horizontally joining tables in R: data.table

В data.table join tables by key is necessary using the function merge().

merge() function arguments in data.table

  • x, y — Tables to merge
  • by — The column that is the key to join if it has the same name in both tables
  • by.x, by.y — Column names to merge if they have different names in tables
  • all, all.x, all.y - Join type, all will return all rows from both tables, all.x corresponds to the LEFT JOIN operation (leaves all the rows of the first table), all.y - corresponds to the RIGHT JOIN operation (leaves all the rows of the second table ).

Horizontally joining tables in R: data.table

# объединяем таблицы
merge(titanic, gender, by.x = "Sex", by.y = "id", all.x = T)

Horizontal table join in Python: pandas

As well as in data.tableIn pandas function is used to join tables merge().

merge() function arguments in pandas

  • how — Connection type: left, right, outer, inner
  • on - The column that is the key if it has the same name in both tables
  • left_on, right_on — Key column names, in case they have different names in tables

Horizontal table join in Python: pandas

# объединяем по ключу
titanic.merge(gender, how = "left", left_on = "Sex", right_on = "id")

Basic Window Functions and Computed Columns

Window functions are similar in meaning to aggregating functions, and are also often used in data analysis. But unlike aggregate functions, window functions do not change the number of rows in the output dataframe.

Which language to choose for working with data - R or Python? Both! Migrating from pandas to tidyverse and data.table and back

In fact, with the help of window functions, we break the incoming dataframe into parts according to some attribute, i.e. by field value, or multiple fields. And we perform arithmetic operations on each window. The result of these operations will be returned in each row, i.e. without changing the total number of rows in the table.

For example, let's take a table titanic. We can calculate what percentage was the cost of each ticket within its cabin class.

To do this, we need in each line to get the total cost of the ticket for the current class of cabins to which the ticket in this line belongs, then divide the cost of each ticket by the total cost of all tickets of the same class of cabins.

Window functions in R: tidyverse, dplyr

To add new columns without applying row grouping, in dplyr serves function mutate().

You can solve the problem described above by grouping data by field Pclass and summing the field in the new column Do. Next, ungroup the table and divide the field values Do to what happened in the last step.

Window functions in R: dplyr

group_by(titanic, Pclass) %>%
  mutate(Pclass_cost = sum(Fare)) %>%
  ungroup() %>%
  mutate(ticket_fare_rate = Fare / Pclass_cost)

Window functions in R: data.table

The solution algorithm remains the same as in dplyr, we need to split the table into windows by field Pclass. Display in a new column the sum for the group corresponding to each row, and add a column in which we calculate the share of the cost of each ticket in its group.

To add new columns to data.table there is an operator :=. Below is an example of solving the problem using the package data.table

Window functions in R: data.table

titanic[,c("Pclass_cost","ticket_fare_rate") := .(sum(Fare), Fare / Pclass_cost), 
        by = Pclass]

Window functions in Python: pandas

One way to add a new column in pandas - use function assign(). To sum the cost of tickets by class of cabins, without grouping rows, we will use the function transform().

Below is an example solution in which we add to the table titanic the same 2 columns.

Window functions in Python: pandas

titanic.assign(Pclass_cost      =  titanic.groupby('Pclass').Fare.transform(sum),
               ticket_fare_rate = lambda x: x['Fare'] / x['Pclass_cost'])

Correspondence table of functions and methods

Next, I give a table of correspondence of methods for performing various operations with data in the packages we have considered.

Description
tidyverse
data.table
pandas

Loading data
vroom()/ readr::read_csv() / readr::read_tsv()
fread()
read_csv()

Creating dataframes
tibble()
data.table()
dict() + from_dict()

Column selection
select()
argument j, second position in square brackets
we pass the list of required columns in square brackets / drop() / filter() / select_dtypes()

Row filtering
filter()
argument i, first position in square brackets
list the filtering conditions in square brackets / filter()

Grouping and aggregation
group_by() + summarise()
arguments j + by
groupby() + agg()

Vertical join of tables (UNION)
bind_rows()
rbind()
concat()

Horizontal table join (JOIN)
left_join() / *_join()
merge()
merge()

Basic Window Functions and Adding Calculated Columns
group_by() + mutate()
argument j using the operator := + argument by
transform() + assign()

Conclusion

Perhaps in the article I described not the most optimal implementations of data processing, so I will be glad if you correct my mistakes in the comments, or simply supplement the information provided in the article with other methods of working with data in R / Python.

As I wrote above, the purpose of the article was not to impose my opinion on which language is better, but to simplify the ability to learn both languages, or, if necessary, migrate between them.

If you liked the article, I will be glad to new subscribers to my youtube и telegrams channels.

Interview

Which of the following packages do you use in your work?

You can write the reason for your choice in the comments.

Only registered users can participate in the survey. Sign in, you are welcome.

Which data processing package are you using (multiple choices possible)

  • 45,2%tidyverse19

  • 33,3%data.table14

  • 54,8%pandas23

42 users voted. 9 users abstained.

Source: habr.com

Add a comment