The most in-demand skills in the data engineer profession

According to 2019 statistics, data engineering is currently the profession for which demand is growing faster than for any other. A data engineer plays a critical role in an organization: they create and maintain the pipelines and databases used to process, transform, and store data. Which skills do data engineers need most? Does the list differ from what is required of data scientists? This article answers both questions.

I analyzed job postings for data engineer positions as of January 2020 to find out which technology skills are most in demand. I then compared the results with the statistics for data scientist postings, which revealed some interesting differences.

Without further ado, here are the ten technologies mentioned most often in job postings:

Technology mentions in data engineer job postings in 2020

Let's dig in.

Responsibilities of a data engineer

Today, the work data engineers do is of great importance to organizations: they are responsible for storing information and putting it into a form that other employees can work with. Data engineers build pipelines that bring in data, as streams or in batches, from multiple sources. The pipelines then perform extract, transform, and load operations (in other words, ETL processes), making the data more suitable for further use. After that, the data is handed over to analysts and data scientists for deeper processing. Finally, the data ends its journey in dashboards, reports, and machine learning models.
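To make the extract, transform, and load steps concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sample CSV data, the `signups` table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for a real data store.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical raw input; a real pipeline would read from files, APIs, or queues.
RAW_CSV = """user_id,signup_date,country
1,2020-01-03,us
2,2020-01-04,
3,not-a-date,de
"""

def extract(source):
    """Extract: read raw rows from a CSV source (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(source)))

def transform(rows):
    """Transform: skip rows with a malformed date and normalize the country code."""
    cleaned = []
    for row in rows:
        try:
            date.fromisoformat(row["signup_date"])
        except ValueError:
            continue  # drop rows whose date cannot be parsed
        country = row["country"].strip().upper() or "UNKNOWN"
        cleaned.append((int(row["user_id"]), row["signup_date"], country))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, signup_date, country FROM signups ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions means each one can be tested and replaced independently, which is the usual reason pipelines are structured this way.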

I wanted to find out which technologies are currently most in demand in the work of a data engineer.

Methods

I collected information from three job search sites, SimplyHired, Indeed, and Monster, and looked at which keywords appeared together with “data engineer” in the text of job postings aimed at US residents. For this task, I used two Python libraries, Requests and Beautiful Soup. The keywords included both those from my earlier analysis of data scientist postings and those I selected manually by reading job postings for data engineers. LinkedIn was not among the sources, since I was banned there after my last attempt to collect data.
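The keyword check itself can be sketched as follows. This is not the code used for the article; it is a minimal illustration that assumes the posting text has already been downloaded and stripped of HTML (for instance with Requests and Beautiful Soup's `get_text()`). The function names and the sample posting are hypothetical. The one design point worth showing is matching on word boundaries, so that "SQL" is not counted inside "NoSQL" and "Java" is not counted inside "JavaScript".

```python
import re

def mentions(text, keyword):
    """Return True if keyword occurs in text as a standalone term (case-insensitive)."""
    # Lookarounds reject matches glued to other letters or digits,
    # so "SQL" does not match inside "NoSQL" or "PostgreSQL".
    pattern = r"(?<![A-Za-z0-9])" + re.escape(keyword) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def keywords_in_posting(text, keywords):
    """Return the subset of keywords that a single posting mentions."""
    return {kw for kw in keywords if mentions(text, kw)}

# Hypothetical posting text, for illustration only.
posting = "Data engineer wanted: NoSQL stores, JavaScript dashboards, Python and Spark."
found = keywords_in_posting(posting, ["SQL", "NoSQL", "Java", "Python", "Spark"])
print(sorted(found))
```

Very short keywords such as "R" remain prone to false positives even with this check, so a real analysis would need to inspect a sample of matches by hand.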

For each keyword, I calculated the percentage of postings that mention it on each site separately, and then averaged the three per-site percentages.
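The calculation can be sketched in a few lines of Python. The site names and counts below are made up purely to show the arithmetic; they are not figures from the analysis.

```python
from statistics import mean

def site_percentage(hits, total_postings):
    """Share of a site's postings that mention the keyword, in percent."""
    return 100.0 * hits / total_postings

def average_across_sites(counts):
    """counts maps a site name to (postings mentioning the keyword, total postings)."""
    return mean(site_percentage(hits, total) for hits, total in counts.values())

# Made-up counts for one keyword on three hypothetical sites.
example = {"site_a": (50, 100), "site_b": (30, 50), "site_c": (140, 200)}
avg = average_across_sites(example)
print(avg)  # per-site shares are 50%, 60% and 70%
```

Averaging the per-site percentages gives every site equal weight regardless of how many postings it has. Pooling all postings first would give a different figure (220 of 350 in the made-up data above), so the two approaches should not be mixed.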

The results

Below are the data engineering terms mentioned most often across all three job sites.

And here are the same numbers, but arranged in the form of a table:

Let's go in order.

Results Overview

Both SQL and Python appear in more than two-thirds of the postings reviewed. These are the two technologies that make sense to learn first. Python is a very popular programming language used for working with data, building websites, and writing scripts. SQL stands for Structured Query Language; it is a standard implemented by a family of languages and is used to retrieve data from relational databases. It has been around for a long time and has proven to be highly resilient.
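As a small illustration of the two skills working together, the sketch below runs a SQL query from Python. It assumes SQLite (through Python's built-in `sqlite3` module) as a stand-in for any relational database; the `postings` table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (site TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO postings VALUES (?, ?)",
    [
        ("site_a", "Data Engineer"),
        ("site_a", "Data Scientist"),
        ("site_b", "Data Engineer"),
        ("site_b", "Senior Data Engineer"),
    ],
)

# Count data engineer postings per site.
query = """
    SELECT site, COUNT(*) AS n
    FROM postings
    WHERE title LIKE '%Data Engineer%'
    GROUP BY site
    ORDER BY site
"""
rows = conn.execute(query).fetchall()
print(rows)
```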

Spark is mentioned in about half of the postings. Apache Spark is a "unified big data analytics engine with built-in modules for streaming, SQL, machine learning, and graph processing." It is especially popular among those who work with large datasets.

AWS appears in about 45% of job postings. It is a cloud computing platform offered by Amazon, and it has the largest market share of all cloud platforms.

Next come Java and Hadoop, at a little over 40% each. Java is a widely used, battle-tested language that ranked tenth among the most dreaded languages in the 2019 Stack Overflow Developer Survey. In contrast, Python was the second most loved language. Java is managed by Oracle, and everything you need to know about it can be understood from this screenshot of the official page from January 2020.

It's like riding a time machine

Apache Hadoop uses the MapReduce programming model on clusters of servers to process big data. This model is now increasingly being abandoned.
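For readers unfamiliar with MapReduce, the idea can be shown with a word count in plain Python. This is only a single-process illustration of the programming model, not Hadoop code: in a real cluster the map and reduce steps run in parallel on many machines, and the framework handles the grouping step between them. The input documents are made up.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

documents = ["spark and hadoop", "hadoop and hive", "spark"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```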

Then we see Hive, Scala, Kafka, and NoSQL, each mentioned in about a quarter of the postings. Apache Hive is data warehouse software that "makes it easy to read, write, and manage large datasets located in distributed stores using SQL." Scala is a programming language that is widely used for working with big data. In particular, Spark is written in Scala. In the ranking of most dreaded languages mentioned above, Scala comes eleventh. Apache Kafka is a distributed platform for processing streams of messages. It is very popular as a means of streaming data.

NoSQL databases position themselves as the opposite of SQL databases. They are non-relational, can hold unstructured data, and scale horizontally. NoSQL has gained some popularity, but the hype around this approach, which went as far as predictions that it would replace SQL as the dominant storage paradigm, seems to be over.

Comparison with terms in data scientist vacancies

Here are the thirty technology terms most commonly used by employers in data science postings. I obtained this list in the same way as described above for data engineering.

Technology mentions in data scientist job postings in 2020

In terms of total numbers, there were 28% more data scientist postings than data engineer postings (12 versus 013). Let's see which technologies are less common in postings for data scientists than in postings for data engineers.

More popular in data engineering

The chart below shows the keywords whose average values differ by more than 10 percentage points in either direction (a difference greater than 10% or less than -10%).
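The filter behind such a chart can be sketched as follows. The function name is hypothetical. The AWS and R figures are the approximate percentages quoted in this article; "ExampleTool" is an invented keyword added only to show a difference that falls below the threshold.

```python
def notable_differences(engineer, scientist, threshold=10.0):
    """Return keywords whose share differs by more than threshold percentage points.

    Positive values mean the keyword is more common in data engineer postings.
    The result is sorted from the largest increase to the largest decrease.
    """
    diffs = {}
    for keyword in engineer.keys() & scientist.keys():
        diff = engineer[keyword] - scientist[keyword]
        if abs(diff) > threshold:
            diffs[keyword] = diff
    return dict(sorted(diffs.items(), key=lambda item: item[1], reverse=True))

# Percent of postings mentioning each keyword (approximate or invented, see above).
engineer = {"AWS": 45.0, "R": 17.0, "ExampleTool": 30.0}
scientist = {"AWS": 20.0, "R": 56.0, "ExampleTool": 25.0}
result = notable_differences(engineer, scientist)
print(result)
```

Note that these are differences in percentage points, not relative changes: a move from 20% to 45% is 25 points, although the share itself more than doubles.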

The biggest differences in keyword frequency between data engineer and data scientist

AWS shows the largest increase: it appears 25 percentage points more often in data engineering postings than in data science postings (approximately 45% and 20% of all postings, respectively). The difference is substantial!

Here is the same data in a slightly different view: in the graph, the results for each keyword in data engineer and data scientist postings are placed side by side.

The biggest differences in keyword frequency between data engineer and data scientist

The next biggest jump I noticed was for Spark, since a data engineer often has to work with big data. Kafka is also up by 20 percentage points, which is almost four times its share in data scientist postings. Moving data is one of the key responsibilities of a data engineer. Finally, Java, NoSQL, Redshift, SQL, and Hadoop were each mentioned about 15 percentage points more often in data engineering postings.

Less popular in data engineering

Now let's see which technologies are less popular in data engineer postings.

The sharpest drop compared to data science is for R: there it appeared in approximately 56% of postings, here in only 17%. Impressive. R is a programming language that is popular with scientists and statisticians, and it ranks eighth among the most dreaded languages.

SAS also appears significantly less often in data engineer postings: the difference is 14 percentage points. SAS is a proprietary language designed for working with statistics and data. An interesting point: judging by the results of my research on data scientist job postings, it has lost a lot of ground lately, more than any other technology.

In demand in both data engineering and data science

It should be noted that eight of the top ten positions are the same in both sets. SQL, Python, Spark, AWS, Java, Hadoop, Hive, and Scala made the top ten for both data engineering and data science. In the graph below, you can see the fifteen technologies most popular with data engineer employers, and next to them the corresponding figures for data scientist postings.

Recommendations

If you want to go into data engineering, I would advise mastering the following technologies, listed in approximate order of priority.

Learn SQL. I lean towards PostgreSQL because it is open source, very popular in the community, and still growing. You can learn how to use the language from the book My Memorable SQL; its pilot version is available here.

Master Python, even if not at the most advanced level. The book My Memorable Python is designed specifically for beginners. It can be bought on Amazon as an e-book or a physical copy, or downloaded in PDF or EPUB format from this site.

Once you're familiar with Python, move on to pandas, a Python library used for data cleaning and manipulation. If you're aiming for a job at a company that requires the ability to write Python (which is the majority), you can be sure that knowledge of pandas will be assumed by default. I am currently finishing an introductory tutorial on working with pandas; you can subscribe so as not to miss its release.

Master AWS. If you want to become a data engineer, you can't do without a cloud platform in your toolkit, and AWS is the most popular of them. The Linux Academy courses helped me a lot when I was studying data engineering on Google Cloud, and I think they also have good material on AWS.

If you have already mastered this entire list and want to become even more attractive to employers as a data engineer, I suggest adding Apache Spark for working with big data. Although my research on data science postings showed declining interest in it, for data engineers it still appears in almost every second posting.

In closing

I hope you found this overview of the most in-demand technologies for data engineers helpful. If you're wondering how things stand with analyst postings, read my other article. Happy engineering!

Source: habr.com
