Data Scientist Notes: A Personal Review of Data Query Languages

Here I share, from personal experience, which query languages came in handy for me, where, and when. It is an overview with pointers, so that it is clear what to dig into further; keep in mind that this is purely subjective personal experience, and for you everything may be completely different.

Why is it important to know and be able to handle query languages? At its core, Data Science has several key stages of work, and the very first and most important one (without it, nothing will work at all!) is obtaining or extracting data. Most often, the data sits somewhere in some form and needs to be "gotten out" of there.

Query languages are exactly what lets you extract that data! Today I will talk about the query languages that came in handy for me, show exactly where and how I used them, and explain why they are worth learning.

In total, there will be three main blocks of query-language types that we will look at in this article:

  • "Standard" query languages: what is commonly understood when people talk about query languages, such as relational algebra or SQL.
  • Scripting query languages: for example, Python with pandas and NumPy, or shell scripting.
  • Query languages for knowledge graphs and graph databases.

Everything written here is just personal experience that proved useful, with a description of the situations and why it was needed. You can judge how likely you are to run into similar situations and prepare for them in advance by getting familiar with these languages before you have to (urgently) apply them on a project, or before you land on a project where they are needed.

"Standard" query languages

Standard query languages, in the sense in which we usually think of them when we talk about queries.

Relational algebra

Why is relational algebra needed today? To have a good idea of why query languages are designed the way they are, and to use them consciously, you need to understand the core underneath them.

What is relational algebra?

The formal definition is as follows: relational algebra is a closed system of operations on relations in the relational data model. In slightly more human terms, it is a system of operations on tables such that the result is always a table as well.

All the relational operations are covered in this article on Habr; here we only describe why you need to know them and where they come in handy.
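To make the closure property tangible, here is a toy sketch in Python (not how a real DBMS implements these operations; the tables and column names are invented for the example): tables are lists of dicts, and each operation takes tables and returns a table.

```python
# Toy relational algebra: a table is a list of dicts with identical keys.

def selection(table, predicate):
    """sigma: keep only the rows satisfying the predicate."""
    return [row for row in table if predicate(row)]

def projection(table, columns):
    """pi: keep only the given columns of every row."""
    return [{c: row[c] for c in columns} for row in table]

def natural_join(left, right):
    """Join rows that agree on all shared column names."""
    shared = set(left[0]) & set(right[0]) if left and right else set()
    return [{**l, **r} for l in left for r in right
            if all(l[c] == r[c] for c in shared)]

employees = [{"name": "Ann", "dept_id": 1}, {"name": "Bob", "dept_id": 2}]
departments = [{"dept_id": 1, "dept": "R&D"}, {"dept_id": 2, "dept": "Sales"}]

# Every intermediate result is again a table, so operations compose freely.
joined = natural_join(employees, departments)
result = projection(selection(joined, lambda r: r["dept"] == "R&D"), ["name"])
print(result)  # [{'name': 'Ann'}]
```

The point is composability: because each operation returns a table, expressions nest arbitrarily, which is exactly what makes the algebra "closed".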

What for?

You begin to understand what query languages are in general and which operations stand behind the expressions of specific query languages, which often gives a deeper understanding of what works in those languages and how.

Taken from this article. Operation example: join, which joins tables.

Materials for study:

A good introductory course from Stanford. In general, there are a lot of materials on relational algebra and database theory on Coursera and Udacity. There is also a huge amount of material online, including good academic courses. My personal advice: you need to understand relational algebra very well; it is the foundation of everything else.

SQL


SQL is, in fact, an implementation of relational algebra, with one important caveat: SQL is declarative! When you write a query in the language of relational algebra, you actually say how to compute the result, whereas in SQL you specify what you want to extract, and the DBMS then generates (efficient) expressions in the language of relational algebra (their equivalence is known to us as Codd's theorem).
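As a small self-contained illustration of the declarative side (a sketch using Python's built-in sqlite3 module with an invented toy table): we only state what we want, and the engine decides how to execute it.

```python
import sqlite3

# Toy in-memory database; the table and data are invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (station TEXT, duration INTEGER)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("A", 100), ("A", 300), ("B", 200)])

# Declarative: we say WHAT we want (average duration per station),
# not HOW to compute it; the engine builds the execution plan itself.
rows = conn.execute(
    "SELECT station, AVG(duration) FROM trips GROUP BY station ORDER BY station"
).fetchall()
print(rows)  # [('A', 200.0), ('B', 200.0)]
```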


What for?

Relational DBMSs (Oracle, Postgres, SQL Server, etc.) are still virtually everywhere, and there is an incredibly high chance that you will have to interact with them. That means you will either have to read SQL (which is very likely) or write it (which is also quite possible).

What to read and study

Via the same links above (about relational algebra), there is an incredible amount of material; for example, this.

By the way, what is NoSQL?

"It's worth emphasizing once again that the term 'NoSQL' is completely spontaneous in origin and has no generally accepted definition or scientific institution behind it." Relevant article on Habr.

In fact, people realized that the full relational model is not needed to solve many problems, especially those where performance is fundamental and simple queries with aggregation dominate: it is critical to compute metrics quickly and write them to the database, and most relational features turned out to be not just unnecessary but harmful. Why normalize anything if it spoils the thing that matters most to us (for some specific task), namely performance?

Also, flexible schemas are often needed instead of the fixed schemas of the classical relational model. This makes it incredibly easy to develop applications when it is critical to deploy the system and start working and processing results quickly, or when the schema and the types of the stored data are not that important.

For example, suppose we are creating an expert system and want to store information on a specific domain along with some meta-information. We may not know all the fields in advance and can simply store JSON for each record, which gives us a very flexible environment for extending the data model and iterating quickly; in this case, NoSQL is even preferable and more readable. Here is an example record (from one of my projects, where NoSQL was exactly where it needed to be):

{
  "en_wikipedia_url": "https://en.wikipedia.org/wiki/Johnny_Cash",
  "ru_wikipedia_url": "https://ru.wikipedia.org/wiki/?curid=301643",
  "ru_wiki_pagecount": 149616,
  "entity": [42775, "Джонни Кэш", "ru"],
  "en_wiki_pagecount": 2338861
}

You can read more about NoSQL here.
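To illustrate why a flexible schema makes iteration easy, here is a toy sketch in plain Python (the records and fields are invented, and this is in no way a real document store): records with completely different sets of fields live side by side, and a query just checks for the fields it needs.

```python
# A toy "document store": records may have completely different fields.
records = [
    {"entity": "Johnny Cash", "en_wiki_pagecount": 2338861},
    {"entity": "Some Entity", "ru_wiki_pagecount": 149616,
     "extra_meta": {"source": "expert system"}},
]

def find(store, **conditions):
    """Return records that have every given field with the given value."""
    return [r for r in store
            if all(r.get(k) == v for k, v in conditions.items())]

print(find(records, entity="Johnny Cash"))
```

Note that adding "extra_meta" to only one record required no schema migration at all; that is the flexibility being described above.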

What to study?

Here you mainly just need to analyze your task well: what properties it has and which NoSQL systems fit that description, and then study that particular system.

Scripting Query Languages

At first it may seem: what does Python have to do with this? It is a programming language, not about queries at all.


  • Pandas: a real Swiss Army knife of Data Science; a huge amount of data transformation, aggregation, and so on happens in it.
  • NumPy: vector computations, matrices, and linear algebra.
  • SciPy: a lot of mathematics in this package, especially statistics.
  • JupyterLab: a lot of exploratory data analysis fits well into notebooks; useful to know.
  • Requests: working with the network.
  • PySpark: very popular among data engineers; most likely you will have to interact with it or with Spark, simply because of their popularity.
  • Selenium: very useful for collecting data from sites and resources; sometimes there is simply no other way to get the data.

My top tip: Learn Python!

pandas

Let's take the following code as an example:

import numpy as np
import pandas as pd

df = pd.read_csv("data/dataset.csv")
# Calculate and rename aggregations
all_together = (df[df['trip_type'] == 'return']
    .groupby(['start_station_name', 'end_station_name'])
    .agg({'trip_duration_seconds': [np.size, np.mean, np.min, np.max]})
    .rename(columns={'size': 'num_trips',
                     'mean': 'avg_duration_seconds',
                     'amin': 'min_duration_seconds',
                     'amax': 'max_duration_seconds'}))

In fact, we see that the code follows the classic SQL pattern.

SELECT start_station_name, end_station_name, count(trip_duration_seconds) as size, ...
FROM dataset
WHERE trip_type = 'return'
GROUP BY start_station_name, end_station_name

But the important part is that this code is part of a script and a pipeline; in effect, we are embedding queries into a Python pipeline. In this situation, the query language comes to us from libraries such as pandas or PySpark.
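For a fully self-contained variant of the same pattern (a sketch with a tiny invented DataFrame standing in for data/dataset.csv, using pandas named aggregation instead of the rename step):

```python
import pandas as pd

# Tiny invented dataset standing in for data/dataset.csv.
df = pd.DataFrame({
    "trip_type": ["return", "return", "oneway"],
    "start_station_name": ["A", "A", "B"],
    "end_station_name": ["B", "B", "C"],
    "trip_duration_seconds": [100, 300, 50],
})

# WHERE ... GROUP BY ... aggregate: the same shape as the SQL query above.
agg = (df[df["trip_type"] == "return"]
       .groupby(["start_station_name", "end_station_name"])
       .agg(num_trips=("trip_duration_seconds", "size"),
            avg_duration_seconds=("trip_duration_seconds", "mean")))
print(agg)
```

Named aggregation gives the output columns their final names directly, which avoids the separate rename(columns=...) call.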

In general, PySpark offers a similar kind of data transformation through a query language, in the spirit of:

df.filter(df.trip_type == "return")
  .groupBy("day")
  .agg({"duration": "mean"})
  .sort("day")

Where and what to read

Finding study materials for Python in general is not a problem. There are a huge number of tutorials online on pandas and PySpark, and courses on Spark (as well as on Data Science itself). In general, the materials here are easy to google, and if I had to choose one package to focus on, it would of course be pandas. There are also lots of materials on DS + Python.

Shell as a query language

A lot of the data processing and analysis projects that I have worked with are, in fact, shell scripts that call Python code, Java code, and actual shell commands. Therefore, in general, pipelines in bash/zsh/etc. can be considered a kind of high-level query (you can, of course, put loops in there, but that is not typical for DS code in shell languages). Let me give a simple example: I needed to map Wikidata QIDs to full links to the Russian and English wikis. For this, I wrote a simple query out of bash commands plus a simple Python script for the output, and put them together like this:

pv "data/latest-all.json.gz" |
unpigz -c |
jq --stream "$JQ_QUERY" |
python3 scripts/post_process.py "output.csv"

where

JQ_QUERY='select((.[0][1] == "sitelinks" and (.[0][2]=="enwiki" or .[0][2] =="ruwiki") and .[0][3] =="title") or .[0][1] == "id")'

This was, in fact, the entire pipeline that created the required mapping; as we can see, it all worked in streaming mode:

  • pv filepath: gives a progress bar based on the file size and passes the file's contents on
  • unpigz -c: decompresses the archive chunk by chunk and feeds the output to jq
  • jq with the --stream flag: produces results immediately and passes them to the post-processor (just as in the very first example) in Python
  • inside the post-processor, a simple state machine formats the output

In total: a complex pipeline running in streaming mode over big data (0.5 TB), without significant resources, built from a simple pipe and a couple of tools.
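The post-processor itself is not shown above; what follows is purely my reconstruction of what such a simple state machine might look like in Python (the jq --stream event paths are simplified, and the function name and sample lines are invented for illustration). It accumulates the fields of the current entity and flushes a CSV row whenever the entity index changes.

```python
import csv
import io
import json

def process(lines, out):
    """Toy state machine: group jq --stream events by entity index
    and emit one CSV row (qid, enwiki title, ruwiki title) per entity."""
    writer = csv.writer(out)
    current_idx, row = None, {}
    for line in lines:
        event = json.loads(line)
        if len(event) != 2:        # skip jq --stream closing events [path]
            continue
        path, value = event
        idx = path[0]              # index of the entity in the dump array
        if idx != current_idx:     # new entity: flush the previous one
            if row:
                writer.writerow([row.get("id", ""),
                                 row.get("enwiki", ""), row.get("ruwiki", "")])
            current_idx, row = idx, {}
        if path[1:] == ["id"]:
            row["id"] = value
        elif len(path) >= 3 and path[1] == "sitelinks":
            row[path[2]] = value   # e.g. [idx, "sitelinks", "enwiki", "title"]
    if row:                        # flush the last entity
        writer.writerow([row.get("id", ""),
                         row.get("enwiki", ""), row.get("ruwiki", "")])

# Invented sample of jq --stream output lines:
sample = [
    '[[0,"id"],"Q42"]',
    '[[0,"sitelinks","enwiki","title"],"Douglas Adams"]',
    '[[0,"sitelinks","ruwiki","title"],"Дуглас Адамс"]',
    '[[1,"id"],"Q1"]',
]
buf = io.StringIO()
process(sample, buf)
print(buf.getvalue())
```

The only state kept between lines is the current entity index and the partially filled row, which is exactly why the whole thing works in streaming mode on a 0.5 TB input.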

Another important tip: be comfortable and efficient in the terminal, and be able to write bash/zsh/etc.

Where will it be useful? Almost everywhere; again, there are VERY many study materials online. In particular, see my previous post.

R scripting

Again, the reader may exclaim: but this is a whole programming language! And of course they will be right. However, I usually had to deal with R in contexts where it was, in fact, very similar to a query language.

R is an environment and language for statistical computing and visualization (according to this).

Taken from this source. By the way, I recommend it; good material.

Why does a data scientist need to know R? At the very least because there is a huge layer of non-IT people doing data analysis in R. I have met it in the following places:

  • the pharmaceutical sector;
  • biologists;
  • the financial sector;
  • people with a purely mathematical background working on statistics;
  • specialized statistical and machine learning models (often available only in the author's version, as an R package).

Why is it effectively a query language? In the form in which it most often occurs, it is actually a request to build a model, including reading the data and fixing the query (model) parameters, as well as visualizing data in packages such as ggplot2, which is also a form of writing queries.

A sample query for visualization:

ggplot(data = beav, 
       aes(x = id, y = temp, 
           group = activ, color = activ)) +
  geom_line() + 
  geom_point() +
  scale_color_manual(values = c("red", "blue"))

In general, many ideas from R, such as data frames and data vectorization, have migrated into Python packages like pandas, NumPy, and SciPy, so a lot of things in R will feel familiar and comfortable to you.

There are many sources to study, for example, this.

Knowledge Graphs

Here my experience is slightly unusual, because I quite often have to work with knowledge graphs and query languages for graphs. Therefore, we will only briefly go over the basics, since this part is a little more exotic.

In classical relational databases we have a fixed schema; here the schema is flexible, each predicate is in effect its own "column", and even more than that.

Imagine that you want to model a person and describe the key things about them. For example, let's take a specific person, Douglas Adams, and use this description as a basis.

www.wikidata.org/wiki/Q42

If we used a relational database, we would have to create one huge table, or many tables, with an enormous number of columns, most of which would be NULL or filled with some default False value; for example, it is unlikely that many of us have an entry in the Korean national library. Of course, we could put these into separate tables, but that would ultimately be an attempt to model a flexible logical schema with predicates on top of a fixed relational one.

So imagine that all data is stored as a graph or as binary and unary boolean expressions.

Where might you encounter this? First of all, when working with Wikidata, and with any graph databases or linked data.

The following are the main query languages that I have used and worked with.

SPARQL

From the wiki:
SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is a query language for data represented in the RDF model, as well as a protocol for transmitting these queries and the responses to them. SPARQL is a recommendation of the W3C consortium and one of the technologies of the semantic web.

But in reality it is a query language over logical unary and binary predicates. You simply state, conditionally, what is fixed in the boolean expression and what is not (very simplistically).

The RDF (Resource Description Framework) base itself, over which SPARQL queries are executed, consists of triples (subject, predicate, object), and a query selects the triples that satisfy the specified constraints, in the spirit of: find all X such that p_55(X, q_33) is true, where p_55 is some relation with id 55 and q_33 is an object with id 33 (here and throughout, all sorts of details are omitted).

Data representation example:

The pictures and the example with countries are taken from this source.

Basic Query Example


In essence, we want to find the values of the ?country variable such that the predicate member_of holds: member_of(?country, q458), where q458 is the ID of the European Union.
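The query from the picture can be reconstructed roughly as follows (my sketch, using the real Wikidata identifiers: wdt:P463 is the "member of" property and wd:Q458 is the European Union):

```sparql
SELECT ?country WHERE {
  ?country wdt:P463 wd:Q458 .   # member_of(?country, EU)
}
```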

(Image: an example of a real SPARQL query inside a Python engine.)

As a rule, I had to read SPARQL rather than write it; in such a situation it will most likely be useful to understand the language at least at a basic level, so that you can understand exactly how the data is retrieved.
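A hedged sketch of what such a call from Python typically looks like (my reconstruction with only the standard library and the public Wikidata endpoint; the query and function names are invented, and the actual network call is left commented out):

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

def eu_members_query():
    """SPARQL: find every ?country that is a member of the EU (wd:Q458)."""
    return """
    SELECT ?country ?countryLabel WHERE {
      ?country wdt:P463 wd:Q458 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

def run(query):
    """Send the query to the endpoint and return the JSON result bindings."""
    url = WDQS + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "sparql-example/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# bindings = run(eu_members_query())  # requires network access
```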

There are a lot of materials to study online: for example, this and this. I myself usually google specific constructs and examples, and so far that has been enough.

Logical query languages

You can read more on this topic in my article here. Here we will only briefly look at why logical languages are well suited for writing queries. Essentially, RDF is just a collection of logical assertions like p(X) and h(X,Y), and a logical query looks like this:

output(X) :- country(X), member_of(X, "EU").

Here we are talking about creating a new predicate output/1 (/1 means unary), provided that for X it is true that country(X), i.e. X is a country, and also member_of(X, "EU").

That is, both the data and the rules are represented in the same way in this case, which makes it very easy and natural to model tasks.
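A toy illustration of that uniformity in Python (a sketch with invented facts; real logic engines evaluate rules far more cleverly): both the facts and the rule are just data, and evaluating the rule is a simple set comprehension.

```python
# Facts: unary country/1 and binary member_of/2, stored uniformly as tuples.
country = {("germany",), ("france",), ("usa",)}
member_of = {("germany", "EU"), ("france", "EU"), ("usa", "NATO")}

# Rule: output(X) :- country(X), member_of(X, "EU").
output = {(x,) for (x,) in country if (x, "EU") in member_of}

print(sorted(output))  # [('france',), ('germany',)]
```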

Where I have met it in industry: a whole large project with a company that writes queries in such a language, as well as the core of the system on my current project. It would seem a rather exotic thing, but it does occur.

(Image: an example of a code fragment in a logical language that processes Wikidata.)

Materials: I will give a couple of links here to the modern logic programming language Answer Set Programming; I recommend studying it:


Source: habr.com
