Data Engineer and Data Scientist: what's the difference?

The professions of Data Scientist and Data Engineer are often confused. Each company has its own way of working with data, its own goals for analysis, and its own idea of which specialist should handle which part of the work, so each sets its own requirements.

In this series we look at how these specialists differ, which business problems they solve, what skills they have, and how much they earn. The material turned out to be long, so we split it into two publications.

In the first article, Elena Gerasimova, head of the "Data Science and Analytics" faculty at Netology, explains how a Data Scientist differs from a Data Engineer and which tools each works with.

How do the roles of engineers and scientists differ?

A data engineer is a specialist who, on the one hand, develops, tests and maintains the infrastructure for working with data: databases, storage, and large-scale processing systems. On the other hand, this is the person who cleans and "combs" data for use by analysts and data scientists, that is, builds data processing pipelines.

A Data Scientist creates and trains predictive (and other) models using machine learning algorithms and neural networks, helping businesses find hidden patterns, forecast developments, and optimize key business processes.

The main difference between a Data Scientist and a Data Engineer is that they usually have different goals. Both work to keep data accessible and of high quality. But the Data Scientist looks for answers to their questions and tests hypotheses within the data ecosystem (for example, one based on Hadoop), while the Data Engineer builds the pipeline that serves the machine learning algorithm the data scientist wrote, running on a Spark cluster within the same ecosystem.

A data engineer brings value to a business by working as part of a team. Their mission is to act as an important link between different participants, from developers to business users of reporting, and to increase the productivity of analysts, from marketing and product to BI.

The Data Scientist, on the other hand, takes an active part in shaping the company's strategy: extracting insights, supporting decisions, implementing automation algorithms, modeling, and generating value from data.

Working with data is subject to the GIGO (garbage in, garbage out) principle: if analysts and data scientists work with unprepared and potentially incorrect data, the results of even the most sophisticated analysis algorithms will be incorrect.

Data engineers solve this problem by building pipelines that process, clean and transform data, so that the data scientist can work with high-quality data.
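As a rough illustration of the cleaning step described above, here is a minimal sketch in Python. The record layout (user_id, amount, country) and the rules for rejecting a row are assumptions made up for this example, not something prescribed by the article; a real pipeline would take its rules from the actual data sources.

```python
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical raw records; the field names are illustrative assumptions.
RAW_ROWS = [
    {"user_id": "1", "amount": " 19.90 ", "country": "ru"},
    {"user_id": "2", "amount": "n/a", "country": "DE"},      # unparseable amount
    {"user_id": "", "amount": "5.00", "country": "US"},      # missing key
    {"user_id": "1", "amount": " 19.90 ", "country": "ru"},  # exact duplicate
]


def parse_row(row: Dict[str, str]) -> Optional[dict]:
    """Normalize one record, or return None if it cannot be trusted."""
    try:
        user_id = int(row.get("user_id", "").strip())
        amount = float(row.get("amount", "").strip())
    except ValueError:
        # Empty or non-numeric values: reject the row instead of guessing.
        return None
    country = row.get("country", "").strip().upper()
    return {"user_id": user_id, "amount": amount, "country": country}


def clean(rows: Iterable[Dict[str, str]]) -> Iterator[dict]:
    """Drop invalid rows and exact duplicates, yielding normalized records."""
    seen = set()
    for row in rows:
        parsed = parse_row(row)
        if parsed is None:
            continue
        key = tuple(sorted(parsed.items()))
        if key in seen:
            continue
        seen.add(key)
        yield parsed


CLEAN_ROWS = list(clean(RAW_ROWS))
```

Of the four raw rows only one survives; the rest are garbage that would otherwise flow straight into the analysis.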

There are many data tools on the market that cover each stage, from the moment data appears to its output on a dashboard for the board of directors. It is important that the engineer chooses a tool not because it is fashionable, but because it will genuinely help the other participants in the process do their work.

For example, if a company needs to connect BI and ETL (loading data and updating reports), that is a typical legacy foundation a Data Engineer will have to deal with (ideally with an architect on the team as well).

Responsibilities of a Data Engineer

  • Development, construction and maintenance of infrastructure for working with data.
  • Error handling and building robust data processing pipelines.
  • Bringing unstructured data from various dynamic sources to the form necessary for the work of analysts.
  • Providing recommendations to improve the consistency and quality of data.
  • Providing and maintaining the data architecture used by data scientists and data analysts.
  • Processing and storing data consistently and efficiently in a distributed cluster of tens or hundreds of servers.
  • Assessing the technical trade-offs of tools to create simple yet robust architectures that can survive failures.
  • Control and support of data flows and related systems (setting up monitoring and alerts).
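Two of the responsibilities above, error handling and alerting, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: run_with_retry and FlakySource are hypothetical names invented for the example, and the "alert" is just a log message where a real system would notify a monitoring service.

```python
import logging
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")
log = logging.getLogger("pipeline")


def run_with_retry(step: Callable[[], T], attempts: int = 3,
                   base_delay: float = 0.0) -> T:
    """Run a pipeline step, retrying on failure with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                # Stand-in for a real alert (pager, chat message, ticket).
                log.error("step failed permanently, alerting on-call")
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class FlakySource:
    """Hypothetical source that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def fetch(self) -> List[int]:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("source temporarily unavailable")
        return [1, 2, 3]


source = FlakySource(failures=2)
result = run_with_retry(source.fetch, attempts=3)
```

Here the source fails twice and succeeds on the third call. If it kept failing, the last exception would propagate after the error is logged, so the failure is visible rather than silently swallowed.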

There is another specialization within the Data Engineer track: the ML engineer. In short, these engineers specialize in bringing machine learning models into production use. A data scientist's model is often part of a research study and may not work in production as is.

Responsibilities of a Data Scientist

  • Extracting features from data to apply machine learning algorithms.
  • Using various machine learning tools to predict and classify patterns in data.
  • Improving the performance and accuracy of machine learning algorithms through fine-tuning and optimization.
  • Formulating "strong" hypotheses, in line with the company's strategy, that need to be tested.
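To make the first two items concrete, here is a deliberately tiny sketch of feature extraction followed by classification. The two features and the nearest-centroid classifier are assumptions chosen for brevity; they are not the method described in the article, and real projects would normally rely on an established machine learning library.

```python
from statistics import mean
from typing import Dict, List, Tuple

Features = Tuple[float, float]


def extract_features(text: str) -> Features:
    """Turn a raw message into two numbers:
    the share of uppercase letters and the count of exclamation marks."""
    letters = [c for c in text if c.isalpha()]
    upper_share = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    return (upper_share, float(text.count("!")))


def fit_centroids(samples: List[Tuple[str, str]]) -> Dict[str, Features]:
    """Compute the mean feature vector (centroid) for every label."""
    by_label: Dict[str, List[Features]] = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(extract_features(text))
    return {
        label: (mean(f[0] for f in feats), mean(f[1] for f in feats))
        for label, feats in by_label.items()
    }


def predict(centroids: Dict[str, Features], text: str) -> str:
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    x = extract_features(text)
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )


# Hypothetical labelled examples, invented for this sketch.
TRAIN = [
    ("WIN A PRIZE NOW!!!", "spam"),
    ("FREE money!!", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("See you tomorrow", "ham"),
]

CENTROIDS = fit_centroids(TRAIN)
```

Calling predict(CENTROIDS, "CLAIM your reward!!") lands closer to the "spam" centroid, while a calm message such as "Lunch at noon?" lands closer to "ham". Tuning, the third item in the list, would mean choosing better features or a better model and measuring the effect on held-out data.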

Both the Data Engineer and the Data Scientist make a tangible contribution to building a culture of working with data, through which a company can increase profits or reduce costs.

What languages and tools do engineers and scientists work with?

Today, expectations of data specialists have changed. Previously, engineers built large SQL queries, wrote MapReduce jobs by hand, and processed data using tools such as Informatica ETL, Pentaho ETL, and Talend.

In 2020, a specialist cannot do without knowledge of Python and modern data tools (for example, Airflow), or without an understanding of how to work with cloud platforms (using them to save on hardware while observing security principles).
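Workflow tools such as Airflow describe a pipeline as a directed acyclic graph (DAG) of tasks and run each task only after its upstream tasks have finished. The sketch below shows that idea using only the Python standard library (graphlib, available from Python 3.9). It deliberately does not use the Airflow API; the task names and the run_dag helper are assumptions made for the illustration.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple


# Each task reads from and writes to a shared context dictionary.
def extract(ctx: dict) -> None:
    ctx["raw"] = ["3", "1", "2"]


def transform(ctx: dict) -> None:
    ctx["clean"] = sorted(int(v) for v in ctx["raw"])


def load(ctx: dict) -> None:
    ctx["loaded"] = len(ctx["clean"])


def report(ctx: dict) -> None:
    ctx["report"] = f"loaded {ctx['loaded']} rows"


TASKS: Dict[str, Callable[[dict], None]] = {
    "extract": extract,
    "transform": transform,
    "load": load,
    "report": report,
}

# Maps every task to the set of tasks that must finish before it starts.
DEPENDENCIES: Dict[str, Set[str]] = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_dag(tasks: Dict[str, Callable[[dict], None]],
            dependencies: Dict[str, Set[str]]) -> Tuple[List[str], dict]:
    """Run tasks in an order that respects their dependencies."""
    ctx: dict = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](ctx)
    return order, ctx


ORDER, CONTEXT = run_dag(TASKS, DEPENDENCIES)
```

A real orchestrator adds what this sketch leaves out: scheduling, retries, parallel execution of independent tasks, and a UI for monitoring runs.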

SAP, Oracle, MySQL and Redis are traditional data engineering tools in large companies. They are good, but the license costs of the commercial products are so high that it only makes sense to learn them on industrial projects. At the same time, Postgres is a free alternative that is suitable not only for learning.

Historically, Java and Scala have often been required, although as technologies and approaches develop, these languages are fading into the background.

However, hardcore Big Data (Hadoop, Spark and the rest of the zoo) is no longer a prerequisite for a data engineer, but rather a set of tools for problems that traditional ETL cannot solve.

The trend is toward services that let you use tools without knowing the language they are written in (for example, Hadoop without knowledge of Java), and toward ready-made services for processing streaming data (such as voice recognition or image recognition in video).

Industrial solutions from SAS and SPSS are popular, while Tableau, RapidMiner, Stata and Julia are also widely used by data scientists for local tasks.

Analysts and data scientists gained the ability to build pipelines themselves only a couple of years ago: for example, it is already possible to send data to PostgreSQL-based storage with relatively simple scripts.
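A "relatively simple script" of this kind might look like the sketch below. To keep the example self-contained and runnable, it uses Python's built-in sqlite3 module as a stand-in for a PostgreSQL-based storage; that substitution, the orders table, and the sample CSV are all assumptions made for illustration. With PostgreSQL you would connect through a database driver instead, and the upsert would be written with INSERT ... ON CONFLICT.

```python
import csv
import io
import sqlite3

# Hypothetical input data, invented for this sketch.
CSV_DATA = """order_id,customer,amount
1,alice,19.90
2,bob,5.00
3,alice,12.50
"""


def load_orders(conn: sqlite3.Connection, csv_text: str) -> int:
    """Parse CSV text and load it into the orders table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount REAL NOT NULL)"
    )
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # The connection as a context manager commits on success, rolls back on error.
    with conn:
        # INSERT OR REPLACE is SQLite syntax; it makes reloading the same file safe.
        conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real storage
loaded = load_orders(conn, CSV_DATA)
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
```

Because the load is keyed on order_id, running the script twice on the same file leaves three rows in the table rather than six.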

Typically, working with pipelines and integrated data structures is left to data engineers. But today the trend toward T-shaped specialists (those with broad competencies in related fields) is stronger than ever, because the tools are constantly being simplified.

Why a Data Engineer and a Data Scientist Work Together

By working closely with engineers, a Data Scientist can focus on the research side, building machine learning algorithms that are ready to use.
Engineers, in turn, should focus on scalability and data reuse, and ensure that the data input and output pipelines in each individual project comply with the global architecture.

This separation of responsibilities ensures consistency across teams working on different machine learning projects.

Collaboration helps create new products efficiently. Speed and quality are achieved through a balance between building a service for everyone (global storage or dashboard integration) and meeting each specific need or project (a highly specialized pipeline, connecting external sources).

Working closely with data scientists and analysts helps engineers develop analytical and research skills to write better code. Knowledge sharing is improved between users of data warehouses and data lakes, making projects more flexible and delivering more sustainable long-term results.

In companies that aim to develop a culture of working with data and to build business processes on it, the Data Scientist and the Data Engineer complement each other and together form a complete data analysis system.

In the next article, we will talk about what kind of education Data Engineers and Data Scientists should have, what skills they need to develop, and how the market works.

Source: habr.com
