Data Engineer or die: the story of one developer

In early December, I made a fatal mistake, or rather a pivotal decision in my life as a developer: I moved to the Data Engineering (DE) team within the company. In this article, I will share some of the observations I made during my first two months on the DE team.

Why Data Engineering?

My journey to DE began in the summer of 2019, when xneg and I went to the School of Distributed Computing, and there I attained enlightenment. I became interested in the topic, started studying algorithms and even writing about them, and then I thought about where all this applies and quickly found out that the practical application in our company is distributed databases.

What exactly does our team do? Like all the fashionable boys and girls, we want to become a Data Driven Company. For that to be possible, we need, at a minimum, to build a reliable data warehouse on which any report the company needs can be built. Most importantly, the data in this warehouse must be trustworthy. Moreover, from this data one must be able to restore the state of the system at time t. All this is complicated by the fact that we live in a brave new world of microservices, and this ideology implies that each service implements its own small piece of functionality, its database is its own business, and it may even drop that database every day, yet we must still be able to receive and process the state of the service.
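To make "restore the state of the system at time t" a bit more concrete, here is a minimal sketch in Python. It assumes that changes are kept as an append-only list of records; the Event class, its fields, and the state_at function are hypothetical names invented for this illustration, not a description of any real pipeline.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Event:
    """One recorded change to an entity (hypothetical shape, for illustration only)."""
    entity_id: str
    occurred_at: datetime
    changes: dict[str, Any]  # fields set or overwritten by this event


def state_at(events: list[Event], t: datetime) -> dict[str, dict[str, Any]]:
    """Rebuild the state of every entity as of time t by replaying events in order."""
    state: dict[str, dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e.occurred_at):
        if event.occurred_at > t:
            break  # later events have not happened yet from the point of view of t
        state.setdefault(event.entity_id, {}).update(event.changes)
    return state


# Made-up sample data: two orders and three changes.
events = [
    Event("order-1", datetime(2020, 1, 1, 10, 0), {"status": "created", "total": 500}),
    Event("order-1", datetime(2020, 1, 1, 10, 30), {"status": "delivered"}),
    Event("order-2", datetime(2020, 1, 1, 10, 15), {"status": "created", "total": 300}),
]

snapshot = state_at(events, datetime(2020, 1, 1, 10, 20))
print(snapshot)
```

The point of the sketch is that nothing is overwritten: as long as every change is kept with its timestamp, any past state can be recomputed, even if the service itself has long since dropped its own database.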

If you want to be Data Driven, first become Event Driven

Not so simple. Events are different, and the developer and the data engineer look at them differently. Talking about events is a topic for a separate article, so I won’t go into it here. Besides, such an article has already been written by a certain Martin Fowler; I won't steal his laurels, let him become famous too.

In general, there is plenty to think about, and the area is attractive. It just so happens that in our company a Data Engineer has a much broader area of responsibility than just a person who writes ETL / ELT pipelines (if you don’t know what these abbreviations mean, come to the meetup; consider this contextual advertising).

We deal with the warehouse architecture, data modeling, data security issues, and, of course, the pipelines themselves. We also need to make sure that, on the one hand, our presence is not too burdensome for product developers, so that our requirements distract them as little as possible when they add new features to the system, and on the other hand, that analysts and BI teams get data that is conveniently arranged in the warehouse. That's how we live.

Difficulties in the transition from development

On the very first day of my work, I encountered a number of difficulties that I want to share with you.

1. The first thing I noticed was the lack of tooling and of some practices. Take test coverage, for example. In development, we have hundreds of testing frameworks. With data, everything is more complicated. Yes, we can test ETL pipelines on test data, but we have to do it all by hand and find a solution for each specific case. As a result, test coverage is much worse. Fortunately, there is another layer of feedback in the form of monitoring and logs, but this forces us to work reactively rather than proactively, which is infuriating and unnerving.

2. The world from the DE position is not at all what it seems to an ordinary product developer (the reader, of course, is not like that and already knows everything, but I didn’t, and now I’m paying for it). As a developer, I wrote my microservice, put the data in [database of your choice], saved my state there, fetched something by ID, and everything was fine. The service runs, the orders come in, that's all. If another service asks me to share my state, well, I'll throw an event into some RabbitMQ and that's it. And here we are back at the issue of events described above.

What a service needs for operational work does not suit us for historical data, so the reworking of service contracts and close work with development teams begins. You can’t even imagine how many hours it took us to agree on what Event Driven means in our company.

3. You need to think with your head. No, I don’t mean that developers don’t think (although who am I to speak for everyone). It’s just that in product development you very often already have some kind of architecture, and you churn out tasks from the backlog. Of course, this requires planning and thinking, but it is conveyor-belt work, where the main problem is simply to do it well and with high quality.

It’s not so easy for us, because moving various system components from a warm and cozy monolith into the wild microservice jungle is not simple. When a service starts pouring out events, you need to reconsider the logic of filling the warehouse, because the data now looks different. Here you need to think a lot and thoroughly, not as a developer, but as a data engineer. It is perfectly normal to spend days with a notebook and a pen, or with a marker at the whiteboard. It’s very difficult; I don’t like to think, I'd rather crank out sloppy code and ship it straight to production.

4. Perhaps the most important thing is information. What do we do when we lack knowledge? Who said Stack Overflow? Get this person out of the room. We go and read the docs and books on the topic, and there is also a community that organizes forums, meetups and conferences. Documentation is great, but, unfortunately, it can be incomplete. We use Cosmos DB in a number of projects. Good luck reading the documentation for this product. Books are the only salvation; fortunately, they exist and can be found, they contain a lot of fundamental knowledge, and you have to read a lot and constantly. But the community is where the trouble is.

Right now it is difficult to find even one decent conference or meetup in our field. Of course, there are a lot of meetups with the word Data in the name, but next to this word there are usually strange abbreviations like ML or AI. That is not for us: we talk about how to build data warehouses, not about how to smear yourself with neural networks. These hipsters are all over the place. As a result, we are left without a community. By the way, if you are a Data Engineer and know good communities, please write in the comments.

Conclusions and announcement of the meetup

What do we end up with? My first experience tells me that it is useful for every developer to spend some time in the shoes of a data engineer. It lets you look at things differently and stop being surprised that our eyes bleed when we see how developers treat their data. So, if your company has a DE team, just chat with these folks; you will learn a lot of new things (about yourself).

And finally, the announcement. Since meetups on our topic are nowhere to be found, we decided to make our own. Why not? Are we any worse? Luckily, we have the amazing Schvepsss and our friends from New Professions Lab, who, like us, feel that data engineers are unfairly deprived of attention.

I take this opportunity to invite everyone who cares to come to our first community meetup with the promising name “DE or DIE”, which will be held on February 27, 2020 at the Dodo Pizza office. Details on Timepad.

In any case, I'll be there, so you can tell me in person how wrong I am about developers.

Source: habr.com
