Pavel Klemenkov, NVIDIA: We are trying to reduce the gap between what a data scientist can do and what he should be able to do

Enrollment has opened for the second intake of Ozon Masters, a master's program in data science and business intelligence. To make it easier to decide whether to apply and take the online test, we asked the program's teachers what to expect from the training and from working with data.

Pavel Klemenkov, Chief Data Scientist at NVIDIA and instructor of the Big Data and Data Engineering course, talked about why mathematicians should write code and spend two years studying at Ozon Masters.

- Are there many companies that use data science algorithms?

- Quite a lot, actually. Many large companies with genuinely big data are either starting to work with it effectively or have been doing so for a long time. Admittedly, half of the market uses data that fits in an Excel spreadsheet or can be processed on one large server, but it would be wrong to say that only a handful of businesses know how to work with data.

- Tell us a little about the projects where data science is applied.

- For example, at Rambler we were building an advertising system based on RTB (Real Time Bidding). We had to build many models that would optimize ad purchases or predict the probability of a click, a conversion, and so on. An advertising auction also generates a lot of data: logs of requests from sites to potential ad buyers, ad impression logs, click logs. That adds up to tens of terabytes of data per day.

Moreover, in these tasks we observed an interesting phenomenon: the more data you give the model for training, the higher its quality. Usually, past a certain amount of data, forecast quality stops improving, and to raise accuracy further you need a fundamentally different model, a different approach to preparing the data and features, and so on. Here we kept adding data and the quality kept growing.
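
One way to check whether a model is still hungry for data is a learning curve: refit on growing slices of the log and score every fit on the same held-out set. The sketch below does this on synthetic data with a deliberately simple per-site click-rate model. The data, the model, and all names are assumptions made for illustration and are not the Rambler system. A toy like this flattens out quickly, whereas the models described above kept improving.

```python
import math
import random

# Synthetic stand-in for click logs: every name and number here is illustrative.
random.seed(0)
N_SITES = 50
TRUE_CTR = [0.02 + 0.48 * i / (N_SITES - 1) for i in range(N_SITES)]


def sample_events(n):
    """Draw n (site, clicked) pairs from the synthetic click model."""
    events = []
    for _ in range(n):
        site = random.randrange(N_SITES)
        events.append((site, random.random() < TRUE_CTR[site]))
    return events


def fit(events):
    """Estimate a click rate per site, with Laplace smoothing."""
    shows = [0] * N_SITES
    clicks = [0] * N_SITES
    for site, clicked in events:
        shows[site] += 1
        clicks[site] += clicked
    return [(c + 1) / (s + 2) for c, s in zip(clicks, shows)]


def log_loss(model, events):
    """Average negative log-likelihood of the events under the model."""
    total = 0.0
    for site, clicked in events:
        p = model[site]
        total -= math.log(p if clicked else 1.0 - p)
    return total / len(events)


holdout = sample_events(5_000)
train = sample_events(20_000)

# Refit on growing prefixes of the training log and score on the held-out set.
curve = {n: log_loss(fit(train[:n]), holdout) for n in (50, 500, 5_000, 20_000)}
for n, loss in curve.items():
    print(f"{n:>6} events: log-loss {loss:.4f}")
```

If the loss is still falling at the largest slice, more data is likely to help; if it has flattened, the model or the features are the bottleneck.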

This is a typical case where analysts, first, had to work with large data sets just to run an experiment; a small sample that fits on a cozy MacBook was not enough. Second, we needed distributed models, because they could not be trained otherwise. As computer vision moves into production, such cases are becoming more common: images are large, and training a large model takes millions of them.

Questions immediately arise: how to store all this information, how to process it efficiently, how to use distributed learning algorithms. The focus shifts from pure mathematics to engineering. Even if you don't write production code, you need to be able to work with engineering tools to run an experiment.

- How has the approach to data science vacancies changed in recent years?

- Big data has stopped being hype and become a reality. Hard drives are cheap, so it became possible to collect all the data there is, so that later there would be enough of it to test any hypothesis. As a result, knowledge of big data tools is in high demand, and more and more data engineer vacancies are appearing.

As I see it, the result of a data scientist's work is not an experiment but a product that has reached production. From this point of view, the process was simpler before the big data hype: engineers did machine learning to solve specific problems, and bringing algorithms to production was not a problem.

- What does it take to remain a sought-after specialist?

- A lot of people have now come to data science who studied mathematics and machine learning theory and took part in data analysis competitions, where the infrastructure is ready-made: the data is cleaned, the metrics are defined, and nothing requires the solution to be reproducible or fast.

As a result, people who are poorly prepared for business realities come to work, and a gap forms between beginners and experienced developers.

As tools for assembling your own model from ready-made modules develop (Microsoft, Google, and many others already offer such solutions), and as machine learning gets automated, this gap will become even more pronounced. In the future, the profession will need serious researchers who invent new algorithms and employees with strong engineering skills who implement models and automate processes. The Ozon Masters data engineering course is designed precisely to develop engineering skills and the ability to use distributed machine learning algorithms on big data. We are trying to reduce the gap between what a data scientist can do and what he should be able to do in practice.

- Why should a mathematician with a degree go and study at a company?

- The Russian data science community has come to understand that skill and experience convert into money very quickly. As soon as a specialist gains practical experience, their market value starts to grow fast, and the most skilled people are very expensive. That is true at the current stage of the market's development.

A big part of a data scientist's job is to dig into the data, understand what is in it, consult the people who are responsible for the business processes and generate that data, and only then use it to build models. To start working with big data, engineering skills are extremely important: with them it is much easier to get around the sharp corners, and data science has plenty.

A typical story: you write an SQL query that Hive runs over big data. The query takes ten minutes at best and an hour or two at worst, and often, once you get the extract, you realize that you forgot some factor or some additional piece of information. You have to resubmit the query and wait those minutes or hours again. If you are an efficiency genius, you will pick up another task in the meantime, but practice shows that efficiency geniuses are rare, and people just wait. So in the course we will devote a lot of time to working efficiently, so that from the start you write queries that run for a few minutes rather than two hours. This skill multiplies productivity, and with it a specialist's value.
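
One habit that saves such reruns is to try the query on a small sample first and check that the result has every column you will need before submitting it against the full table. A minimal sketch, assuming Python's built-in sqlite3 as a stand-in for a Hive connection; the table and column names (`clicks`, `clicks_sample`, `campaign`) are hypothetical.

```python
import sqlite3

# Hypothetical names throughout: sqlite3 stands in for Hive, and the
# `clicks` table with its columns is made up for illustration.


def build_sample(conn, source, sample, limit):
    """Copy the first `limit` rows of `source` into a small scratch table."""
    conn.execute(f"DROP TABLE IF EXISTS {sample}")
    conn.execute(f"CREATE TABLE {sample} AS SELECT * FROM {source} LIMIT {limit}")


def check_query(conn, query, required_columns):
    """Run `query` and return the required columns missing from its result."""
    cursor = conn.execute(query)
    returned = {col[0] for col in cursor.description}
    return sorted(set(required_columns) - returned)


QUERY = "SELECT day, site, COUNT(*) AS clicks FROM {table} GROUP BY day, site"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (day TEXT, site TEXT, campaign TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("2020-04-01", f"site{i % 3}", f"c{i % 5}") for i in range(10_000)],
)

# Dry run on 100 rows: cheap, and it reveals the forgotten column right away.
build_sample(conn, "clicks", "clicks_sample", limit=100)
missing = check_query(
    conn,
    QUERY.format(table="clicks_sample"),
    ["day", "site", "campaign", "clicks"],
)
print(missing)  # ['campaign']
```

Only once the check returns an empty list would the same query text be pointed at the full table.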

- How is Ozon Masters different from other courses?

- Ozon Masters is taught by Ozon employees, and the assignments are based on real business cases that companies are solving. Besides the lack of engineering skills, a person who studied data science at a university has another problem: a business task is formulated in the language of business, and its goal is quite simple, to make more money. A mathematician knows well how to optimize mathematical metrics, but finding an indicator that correlates with a business metric is hard. You have to understand that you are solving a business problem and, together with the business, formulate metrics that can be optimized mathematically. This skill comes from real cases, and Ozon provides them.

Even setting the cases aside, many of the school's teachers are practitioners who solve business problems in real companies. As a result, the teaching itself is more practice-oriented. In my course, at least, I will try to shift the focus to how to use the tools, what approaches exist, and so on. Together with the students, we will see that there is a tool for each task and that each tool has its area of applicability.

- The best-known data analysis training program is, of course, ShAD (the Yandex School of Data Analysis). How exactly does Ozon Masters differ from it?

- Clearly, both ShAD and Ozon Masters, besides their educational role, solve the local problem of training staff. The top ShAD graduates are recruited primarily by Yandex, but the catch is that Yandex, because of its specifics (it is large and was created when there were few good tools for working with big data), has its own infrastructure and data tools, which means you have to learn them. Ozon Masters has a different message: if you have completed the program successfully and Ozon, or one of the 99% of other companies, invites you to work, it will be much easier to start bringing value to the business. The skill set acquired at Ozon Masters will be enough to simply start working.

- The course lasts two years. Why do you need to spend so much time on this?

- Good question. It takes a long time because, in its content and in the level of its teachers, this is a full-fledged master's program that takes a lot of time to complete, homework included.

In my course, a student should expect to spend 2-3 hours a week on assignments. First, tasks run on a training cluster, and any shared cluster is used by several people at once. That means you have to wait for your job to start, and some of its resources may be taken away and given to a higher-priority queue. Second, any work with big data simply takes a lot of time.

If you still have questions about the program, working with big data, or engineering skills, Ozon Masters is holding an online open day on Saturday, April 25, at 12:00. Meet the teachers and students on Zoom and YouTube.

Source: habr.com
