According to
I analyzed data engineer job postings as of January 2020 to find out which technology skills are most in demand. Then I compared the results with the statistics for data scientist postings, which revealed some interesting differences.
Without long introductions, here are the top ten technologies that are most often mentioned in job postings:
Technology references in data engineer jobs in 2020
Responsibilities of a data engineer
Today, the work of data engineers is of great importance to organizations: they are responsible for storing information and putting it into a form that other employees can work with. Data engineers build pipelines that bring in data, streamed or in batches, from multiple sources. The pipelines then perform extract, transform, and load (ETL) operations, making the data more suitable for further use. After that, the data is handed over to analysts and data scientists for deeper processing. Finally, the data ends its journey in dashboards, reports, and machine learning models.
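To make the extract, transform, load sequence concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the inline CSV stands in for a real source, the `payments` table and its columns are hypothetical, and an in-memory SQLite database plays the role of the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a real source (file, API, queue).
RAW_CSV = """user_id,amount,currency
1, 10.50 ,usd
2,,usd
3,7.25,USD
"""

def extract(text):
    """Extract: read the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount and normalize types and casing."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        clean.append((int(row["user_id"]), float(amount), row["currency"].strip().upper()))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a table that analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, illustration only
load(transform(extract(RAW_CSV)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(row_count, total)
```

Real pipelines usually add scheduling, error handling, and incremental loads, but the three stages keep the same shape.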
I wanted to find out which technologies are currently most in demand for data engineers.
Methods
I collected information from three job search sites -
For each keyword, I calculated the percentage of postings that mention it on each site separately, and then averaged the percentages across the three sources.
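The calculation described above can be sketched roughly as follows. This is an illustration, not the original analysis code: the site names and posting texts are invented, and the simple case-insensitive substring match is an assumption (it would, for example, count "JavaScript" as a hit for "Java").

```python
from statistics import mean

def keyword_share(postings, keyword):
    """Percentage of postings whose text mentions the keyword (case-insensitive)."""
    if not postings:
        return 0.0
    hits = sum(1 for text in postings if keyword.lower() in text.lower())
    return 100.0 * hits / len(postings)

def average_share(postings_by_site, keyword):
    """Average the per-site percentages so that every site has equal weight."""
    return mean(keyword_share(texts, keyword) for texts in postings_by_site.values())

# Hypothetical toy data: three sites with different numbers of postings.
postings_by_site = {
    "site_a": ["SQL and Python required", "Spark, AWS", "Python developer", "Java"],
    "site_b": ["Python and SQL", "Hadoop and Hive"],
    "site_c": ["Kafka", "Scala", "Python", "SQL", "AWS"],
}

python_avg = average_share(postings_by_site, "Python")
sql_avg = average_share(postings_by_site, "SQL")
print(round(python_avg, 1), round(sql_avg, 1))
```

Averaging the per-site percentages, rather than pooling all postings, keeps a site with many postings from dominating the result.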
The results
Below are the most frequently mentioned data engineering terms across all three job sites.
And here are the same numbers, but arranged in the form of a table:
Let's go in order.
Results Overview
Both SQL and Python appear in more than two-thirds of the postings reviewed. These are the two technologies it makes sense to learn first.
Spark is mentioned in about half of the vacancies.
AWS appears in about 45% of job postings. It is a cloud computing platform offered by Amazon; it has the largest market share among all cloud platforms.
Next come Java and Hadoop, at a little over 40% each.
It's like riding a time machine
Then we see Hive, Scala, Kafka and NoSQL: each of these technologies is mentioned in a quarter of the postings reviewed. Apache Hive is data warehouse software that "makes it easy to read, write, and manage large datasets located in distributed stores using SQL."
Comparison with terms in data scientist vacancies
Here are the thirty technology terms most commonly used by employers hiring data scientists. I compiled this list in the same way as described above for data engineering.
Technology mentions in vacancies for the position of data scientist in 2020
In terms of totals, the data scientist set contained 28% more postings than the data engineer set considered earlier. Let's see which technologies are more common in data engineer postings than in data scientist postings.
More popular in data engineering
The chart below shows keywords whose average frequency differs between the two roles by more than 10 percentage points in either direction (a difference greater than 10% or less than -10%).
The biggest differences in keyword frequency between data engineer and data scientist
AWS shows the most significant increase: it appears 25 percentage points more often in data engineering than in data science (approximately 45% and 20% of all postings, respectively). The difference is palpable!
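The comparison behind this chart comes down to subtracting the two averages for each keyword and keeping only the large gaps. A minimal sketch, assuming the averages are already available as dictionaries: only the AWS figures (about 45% and 20%) come from the text, while `KeywordA` and `KeywordB` and their values are placeholders invented for illustration.

```python
# Average percentage of postings mentioning each keyword, per role.
# Only the AWS values are taken from the article; the rest are placeholders.
engineer = {"AWS": 45.0, "KeywordA": 30.0, "KeywordB": 12.0}
scientist = {"AWS": 20.0, "KeywordA": 25.0, "KeywordB": 40.0}

def big_differences(a, b, threshold=10.0):
    """Return keywords whose difference (a - b) exceeds the threshold
    in either direction, sorted from largest increase to largest decrease."""
    diffs = {k: a[k] - b[k] for k in a.keys() & b.keys()}
    kept = [(k, d) for k, d in diffs.items() if abs(d) > threshold]
    return dict(sorted(kept, key=lambda kv: kv[1], reverse=True))

result = big_differences(engineer, scientist)
print(result)
```

The differences are in percentage points: 45% minus 20% gives 25 points for AWS, while a 5-point gap such as the one for `KeywordA` falls below the threshold and is left out.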
Here's the same data in a slightly different view: in the chart, the results for each keyword in data engineer and data scientist postings are placed side by side.
The biggest differences in keyword frequency between data engineer and data scientist
The next biggest jump I noticed was for Spark: a data engineer often has to work with big data.
Less popular in data engineering
Now let's see which technologies are less popular in data engineer jobs.
The sharpest decline compared to the field of data science happened in
Demanded in both data engineering and data science
It should be noted that eight of the top ten entries are the same in both sets. SQL, Python, Spark, AWS, Java, Hadoop, Hive, and Scala made the top ten for both data engineering and data science. In the chart below, you can see the fifteen technologies most in demand among data engineer employers, with the corresponding figure for data scientists next to each.
Recommendations
If you want to go into data engineering, I would advise you to master the following technologies, listed in approximate order of priority.
Learn SQL. I lean towards PostgreSQL because it is open source, very popular in the community, and still growing. You can learn how to use the language from the book My Memorable SQL; its pilot version is available
Master Python, even if not at the most advanced level. The book My Memorable Python is designed for beginners. It can be bought at
Once you're familiar with Python, move on to pandas, a Python library used for data cleansing and manipulation. If you're aiming for a job at a company that requires the ability to write in Python (which is the majority), you can be sure that knowledge of pandas will be assumed by default. I am currently finishing up an introductory tutorial for working with pandas - you can
Master AWS. If you want to become a data engineer, you can't do without a cloud platform under your belt, and AWS is the most popular of them. The courses helped me a lot.
If you have already mastered this entire list and want to stand out further to employers as a data engineer, I suggest adding Apache Spark for working with big data. Although my research on data science postings showed a decline in interest, for data engineers it still appears in almost every second posting.
In conclusion
I hope you found this overview of the most in-demand technologies for data engineers helpful. If you're wondering what's going on with analyst vacancies, read
Source: habr.com