Who are data engineers and how do they become one?

Hello again! The title of the article speaks for itself. Ahead of the launch of our Data Engineer course, we thought it was worth figuring out who these data engineers actually are. The article contains plenty of useful links. Happy reading!


A simple guide on how to catch the wave of Data Engineering and not let it drag you into the abyss.

It seems like everyone wants to be a data scientist these days. But what about data engineering? In essence, it is a hybrid of a data analyst and a data scientist: a data engineer is typically responsible for managing workflows, processing pipelines, and ETL processes. Because these functions are so important, "data engineer" is another popular job title that is rapidly gaining momentum.

The high salary and huge demand are only a small part of what makes this job extremely attractive! If you want to join the ranks of heroes, it's never too late to start learning. In this post, I have collected all the information you need to help you take your first steps.

So, let's begin!

What is Data Engineering?

Honestly, there is no better explanation than this:

"A scientist can discover a new star, but he cannot create one. He would have to ask an engineer to do it for him."

– Gordon Lindsay Glegg

Thus, the role of a data engineer is quite significant.

As the name implies, data engineering deals with data: its delivery, storage, and processing. Accordingly, the main task of data engineers is to provide a reliable infrastructure for data. If we look at the AI hierarchy of needs, data engineering occupies the first two or three steps: collection, movement and storage, and data preparation.


What does a data engineer do?

With the advent of big data, the scope of responsibility has changed dramatically. Where these specialists once wrote large SQL queries and moved data around with tools such as Informatica ETL, Pentaho ETL, or Talend, the requirements for data engineers have since grown considerably.

Most companies with open vacancies for the position of data engineer have the following requirements:

  • Excellent knowledge of SQL and Python.
  • Experience with cloud platforms, in particular Amazon Web Services.
  • Knowledge of Java/Scala preferred.
  • Good understanding of SQL and NoSQL databases (data modeling, data storage).

Keep in mind, these are only the bare essentials. From this list it is reasonable to conclude that data engineers are specialists in software development and backend engineering.
For example, if a company starts generating large amounts of data from different sources, your task as a data engineer is to organize the collection of that information and its processing and storage.

The set of tools used will vary; it all depends on the volume of data, the speed at which it arrives, and its heterogeneity. Most companies do not deal with big data at all, so as a centralized repository (a so-called data warehouse) you can use a SQL database (PostgreSQL, MySQL, etc.) with a small set of scripts that feed data into the warehouse.
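As an illustration, a "small set of scripts that feed data into the warehouse" can be as simple as the following sketch. It uses SQLite as a stand-in for the warehouse database, and the table and column names are invented for the example:

```python
import sqlite3

def load_events(db_path, events):
    """Append a batch of (user_id, action) events to a warehouse table."""
    conn = sqlite3.connect(db_path)
    try:
        # Create the destination table on first run.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            "  id INTEGER PRIMARY KEY,"
            "  user_id INTEGER NOT NULL,"
            "  action TEXT NOT NULL)"
        )
        # Batch insert is far faster than row-by-row inserts.
        conn.executemany(
            "INSERT INTO events (user_id, action) VALUES (?, ?)", events
        )
        conn.commit()
    finally:
        conn.close()

load_events("warehouse.db", [(1, "signup"), (2, "login")])
```

In a real setup the same shape of script would point at PostgreSQL or MySQL and be triggered on a schedule (cron, Airflow, etc.).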

IT giants such as Google, Amazon, Facebook, and Dropbox have higher requirements:

  • Knowledge of Python, Java, or Scala.
  • Experience with big data: Hadoop, Spark, Kafka.
  • Knowledge of algorithms and data structures.
  • An understanding of the basics of distributed systems.
  • Experience with data visualization tools such as Tableau or ElasticSearch is a big plus.

That is, there is a clear shift towards big data, specifically towards processing it under high load. These companies also have stricter requirements for system fault tolerance.

Data Engineers vs. Data Scientists

Okay, that was a simple and playful comparison (nothing personal), but in reality things are much more complicated.

First, you should be aware that there is a lot of confusion about the roles and skills of a data scientist and a data engineer. That is, you can easily be puzzled by what skills are needed to be a successful data engineer. Of course, there are certain skills that overlap with both roles. But there are also a number of diametrically opposed skills.

Data science is serious business, but we are moving towards a world of functional data science where practitioners are able to do their own analytics. To enable data pipelines and integrated data structures, you need data engineers, not scientists.

Is a data engineer more in demand than a data scientist?

- Yes, because before you can bake a carrot cake, you must first gather, wash, and peel the carrots!

A data engineer understands programming better than any data scientist, but when it comes to statistics, everything is exactly the opposite.

But here is the advantage of a data engineer:

without one, the value of a prototype model (most often a piece of terrible-quality code in a Python file, obtained from a data scientist, that somehow produces a result) tends to zero.

Without a data engineer, this code will never become a project, and no business problem will be effectively solved. It is the data engineer who turns it all into a product.

Basic information that a data engineer should know


So, if this job sparks a light in you and you are full of enthusiasm, then you can learn it: you can master all the necessary skills and become a real rock star in the data field. And yes, you can do it even without programming skills or other technical knowledge. It's difficult, but possible!

What are the first steps?

You should have a general idea of what is what.

First of all, Data Engineering refers to computer science. More specifically, you must understand efficient algorithms and data structures. Secondly, since data engineers work with data, it is necessary to understand how databases work and the structures that underlie them.

For example, conventional SQL databases are built on the B-Tree data structure, while modern distributed stores rely on LSM-Trees and other modifications of hash tables.
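To make the LSM-Tree idea concrete, here is a toy sketch (not production code, and not how any real database implements it): recent writes land in an in-memory table, which, once full, is frozen into an immutable sorted segment; reads check the memtable first and then the segments from newest to oldest.

```python
import bisect

class TinyLSM:
    """A toy LSM-tree: writes go to an in-memory memtable; when it fills up,
    it is flushed as an immutable sorted segment. Reads check the memtable
    first, then segments from newest to oldest."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []  # each segment: a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: freeze the memtable into a sorted, immutable segment.
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

Real LSM stores (LevelDB, RocksDB, Cassandra's storage engine) add write-ahead logs, on-disk SSTables, and compaction on top of this basic shape.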

*These steps are based on a great article by Adilya Khashtamova. So, if you know Russian, support the author and read the post.

1. Algorithms and data structures

Using the right data structure can drastically improve an algorithm's performance. Ideally, we would all learn data structures and algorithms in school, but they are rarely covered well. In any case, it's never too late to learn.
So, here are my favorite free courses for learning data structures and algorithms:

Plus, don't forget Thomas Cormen's classic work on algorithms, Introduction to Algorithms. It is the perfect reference when you need to refresh your memory.
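The claim that the right data structure matters can be demonstrated in a few lines: membership tests on a Python list require a linear scan, while a set uses a hash lookup. The numbers are machine-dependent; only the relative gap matters:

```python
import timeit

items = list(range(100_000))
as_list = items
as_set = set(items)

# Membership in a list is O(n); in a hash set it is O(1) on average.
# Searching for the last element forces the list's worst case.
t_list = timeit.timeit(lambda: 99_999 in as_list, number=200)
t_set = timeit.timeit(lambda: 99_999 in as_set, number=200)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

On typical hardware the set is faster by several orders of magnitude, which is exactly the kind of difference that separates a pipeline that finishes overnight from one that never finishes.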

You can also dive into the world of databases with these amazing Carnegie Mellon YouTube videos:

2. Learn SQL

Our whole life is data. And in order to retrieve this data from the database, you need to "speak" the same language with it.

SQL (Structured Query Language) is the language of communication in the data domain. Regardless of what anyone says, SQL has lived, is alive and will live for a very long time.

If you've been in development for a long time, you've probably noticed that rumors of SQL's imminent death pop up from time to time. The language was developed in the early 1970s and is still hugely popular among analysts, developers, and enthusiasts.
You cannot get anywhere in data engineering without SQL, since you will inevitably have to write queries to retrieve data. All modern big data stores support SQL:

  • Amazon RedShift
  • HP Vertica
  • Oracle
  • SQL Server

… and many others.

To analyze large volumes of data stored in distributed systems like HDFS, SQL engines were invented: Apache Hive, Impala, and others. See, it isn't going anywhere.

How to learn SQL? Just do it in practice.

To do this, I recommend the excellent (and, by the way, free) tutorials from Mode Analytics:

  1. Intermediate SQL
  2. Joining data in SQL

A distinctive feature of these courses is an interactive environment in which you can write and execute SQL queries right in the browser. The Modern SQL resource won't be superfluous either. And you can apply this knowledge to the Leetcode problems in the database section.
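If you want to practice outside the browser as well, the JOIN-and-aggregate queries these courses teach can be tried with nothing but Python's built-in sqlite3 module. The tables and data below are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 5.00), (3, 2, 20.00);
""")

# Total spent per user: a typical JOIN + GROUP BY exercise.
rows = conn.execute("""
    SELECT u.name, SUM(o.total) AS spent
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY spent DESC
""").fetchall()

print(rows)  # Bob first, since he has the highest total
```

The same query text would run unchanged against PostgreSQL or MySQL; only the connection line differs.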

3. Programming in Python and Java/Scala

I already explained why the Python programming language is worth learning in the article Python vs R. Choosing the best tool for AI, ML and Data Science. As for Java and Scala, most tools for storing and processing huge amounts of data are written in these languages. For example:

  • Apache Kafka (Scala)
  • Hadoop, HDFS (Java)
  • Apache Spark (Scala)
  • Apache Cassandra (Java)
  • HBase (Java)
  • Apache Hive (Java)

To understand how these tools work, you need to know the languages they are written in. Scala's functional approach lets you effectively solve parallel data processing problems. Python, unfortunately, cannot boast of speed or parallel processing. In general, knowing several languages and programming paradigms broadens your approaches to problem solving.
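As one illustration of Python's parallelism limitation: threads won't speed up CPU-bound work because of the GIL, but the standard multiprocessing module can sidestep it by using separate processes. A minimal sketch, with an arbitrary CPU-bound workload chosen for the example:

```python
from multiprocessing import Pool

def heavy(n):
    """A CPU-bound stand-in task: sum of squares below n."""
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Threads would serialize on the GIL here; separate processes
    # can actually use multiple cores.
    with Pool(processes=4) as pool:
        results = pool.map(heavy, [100_000] * 4)
    print(len(results), "results computed in parallel")
```

For real pipelines this role is usually played by a framework (Spark, Dask) rather than raw process pools, but the underlying constraint is the same.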

To dive into Scala, you can read Programming in Scala by the language's author. Twitter has also published a good introductory guide, Scala School.

As far as Python is concerned, I consider Fluent Python the best intermediate-level book.

4. Tools for working with big data

Here is a list of the most popular tools in the big data world:

  • Apache Spark
  • Apache Kafka
  • Apache Hadoop (HDFS, HBase, Hive)
  • Apache Cassandra

You can find more information about the building blocks of big data systems in this amazing interactive environment. The most popular tools are Spark and Kafka. They are definitely worth studying, and ideally you should understand how they work from the inside. Jay Kreps (co-author of Kafka) published a monumental essay in 2013, The Log: What every software engineer should know about real-time data's unifying abstraction; incidentally, the core ideas from this tome were used to create Apache Kafka.
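The central idea of that essay, an append-only log that producers write to and consumers read from at their own offsets, can be sketched in a few lines. This is a toy model of the abstraction, not how Kafka is actually implemented:

```python
class Log:
    """A toy append-only log: producers append records, and each consumer
    tracks its own read offset independently."""

    def __init__(self):
        self.records = []

    def append(self, record):
        """Append a record and return its offset (position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at the given offset."""
        return self.records[offset:offset + max_records]

log = Log()
log.append({"event": "click", "user": 1})
log.append({"event": "view", "user": 2})

# A consumer replays the log from its own offset and advances it.
offset = 0
batch = log.read(offset)
offset += len(batch)
```

Because the log is immutable and ordered, any number of consumers can replay it independently, which is what makes the abstraction so useful for connecting systems.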

5. Cloud platforms


Knowledge of at least one cloud platform is among the basic requirements for data engineer applicants. Employers prefer Amazon Web Services, followed by Google Cloud Platform, with Microsoft Azure rounding out the top three.

You should be familiar with Amazon EC2, AWS Lambda, Amazon S3, DynamoDB.

6. Distributed systems

Working with big data implies clusters of independently operating computers that communicate over a network. The larger the cluster, the more likely it is that some of its nodes will fail. To become a great data expert, you need to understand the problems of distributed systems and their existing solutions. This area is old and complex.
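The point about node failures can be quantified. Assuming independent failures with a fixed per-node probability (a simplifying assumption; real failures are often correlated), the chance that at least one node fails grows quickly with cluster size:

```python
def p_any_node_fails(n_nodes, p_node):
    """Probability that at least one of n_nodes fails in a time window,
    assuming independent failures with per-node probability p_node."""
    return 1 - (1 - p_node) ** n_nodes

# With a 0.1% daily failure rate per node, the per-day probability of
# losing at least one node is roughly 1%, 9.5%, and 63% respectively.
for n in (10, 100, 1000):
    print(n, round(p_any_node_fails(n, 0.001), 3))
```

This is why distributed frameworks like HDFS and Spark treat node failure as routine (replication, task re-execution) rather than exceptional.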

Andrew Tanenbaum is considered a pioneer in this field. For those not afraid of theory, I recommend his book Distributed Systems; it may seem difficult for beginners, but it will really help you hone your skills.

I consider Designing Data-Intensive Applications by Martin Kleppmann the best introductory book. By the way, Martin has a wonderful blog. His work will help you systematize your knowledge of building a modern infrastructure for storing and processing big data.
For those who prefer video, there is a YouTube course, Distributed Computer Systems.

7. Data pipelines


Data pipelines are something you can't live without as a data engineer.

Most of the time, a data engineer builds so-called data pipelines, that is, processes for delivering data from one place to another. These can be custom scripts that call an external service's API or run a SQL query, enrich the data, and place it in centralized storage (a data warehouse) or unstructured storage (a data lake).
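The extract-transform-load flow just described can be sketched end to end. Here a CSV string stands in for an external API response, SQLite stands in for the warehouse, and all table and field names are invented for the example:

```python
import csv
import io
import sqlite3

def extract(raw_csv):
    """Extract: parse the raw CSV (stand-in for an external API response)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: cast types and drop incomplete records."""
    return [
        (row["user_id"], float(row["amount"]))
        for row in rows
        if row.get("user_id") and row.get("amount")
    ]

def load(conn, records):
    """Load: write the cleaned records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?)", records)
    conn.commit()

raw = "user_id,amount\n1,9.99\n2,\n3,5.00\n"  # note: row 2 is incomplete
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(raw)))
```

Orchestrators like Airflow exist precisely to schedule, retry, and monitor chains of steps like these; the steps themselves stay this simple in shape.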

To sum up: the basic data engineer checklist


In summary, a good understanding of the following is required:

  • Information Systems;
  • Software development (Agile, DevOps, Design Techniques, SOA);
  • Distributed systems and parallel programming;
  • Database fundamentals - planning, design, operation and troubleshooting;
  • Designing experiments: A/B tests to prove concepts, determine reliability and system performance, and develop robust paths to deliver good solutions quickly.

These are just some of the requirements for becoming a data engineer, so study and understand data systems, information systems, continuous delivery/deployment/integration, programming languages, and other computer science topics (though not in every subject area).

And finally, the last but very important thing I want to say.

The path to becoming a data engineer is not as easy as it might seem. It is unforgiving and frustrating, and you must be prepared for that. Some moments on this journey may push you to quit everything. But it is real work and a real learning process.

Just don't romanticize it from the start. The whole point of the journey is to learn as much as possible and be ready for new challenges.
Here's a great picture I came across that illustrates this point well:


And yes, don't forget to avoid burnout and rest. This is also very important. Good luck!

How do you like the article, friends? We invite you to a free webinar, which takes place today at 20:00. During the webinar, we will discuss how to build an efficient and scalable data processing system for a small company or startup at minimal cost, and as practice we will get acquainted with Google Cloud's data processing tools. See you there!

Source: habr.com
