While everyone was celebrating my birthday, I was fixing the cluster until the morning - and the developers blamed their mistakes on me

Here is a story that forever changed my approach to DevOps work. Back in pre-Covid times, long, long before all that, when the guys and I were only thinking about starting our own business and freelancing on random orders, one offer landed in our inbox.

The company that wrote to us did data analytics and handled thousands of requests a day. They came to us saying: guys, we have ClickHouse and we want to automate its installation and configuration. We want Ansible, Terraform, Docker, and all of it stored in git. We want a cluster of four nodes with two replicas each.

A standard request, one of dozens like it, and it needed an equally solid standard solution. We said okay, and in two or three weeks everything was ready. They accepted the work and started migrating to the new ClickHouse cluster using our tooling.

None of them wanted to, or knew how to, mess around with ClickHouse. At the time we thought that was their main problem, and that the company's CTO had simply given my team the go-ahead to automate everything as much as possible so he would never have to touch it again.

We supported the migration, and new tasks appeared: setting up backups and monitoring. Around the same time the company's CTO bailed to another project, leaving one of their own in command of us - Leonid. Lenya was not an especially gifted guy. Just a regular developer who was suddenly put in charge of ClickHouse. It seems it was his first time supervising anything, and the honor that had fallen on him quickly went to his head.

Together we started on backups. I suggested backing up the raw source data right away: just take it, zip it, and neatly drop it into some S3 bucket. The source data is gold. There was another option - back up the tables inside ClickHouse itself, using FREEZE and copying the parts. But Lenya came up with his own solution.
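If you're curious, the FREEZE route looks roughly like this (a minimal sketch; the table name and partition are made up, not theirs):

```sql
-- FREEZE hardlinks the table's current parts into the shadow/ directory
-- on each node; that snapshot can then be copied off to S3 with any sync tool.
-- 'events' and the partition value are hypothetical; the partition form
-- assumes monthly PARTITION BY toYYYYMM(event_time).
ALTER TABLE events FREEZE PARTITION 201911;

-- or snapshot the whole table at once:
ALTER TABLE events FREEZE;

-- after that, /var/lib/clickhouse/shadow/<increment>/ holds the frozen parts.
```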

He announced that we needed a second ClickHouse cluster, and that from then on we would write data to both clusters - the main one and the "backup". I told him: Lenya, that's not a backup, that's an active replica. If data starts getting lost in production, your "backup" will lose it too.

But Lenya gripped the wheel firmly and refused to listen to my arguments. We went back and forth for a long time, but there was nothing to be done - Lenya was driving the project, and we were just hired guys off the street.

We monitored the state of the cluster and billed only for admin work. Pure ClickHouse administration, without touching the data. The cluster was up, the disks were fine, the nodes were fine.

We did not yet suspect that we had gotten this contract because of a terrible misunderstanding inside their team.

The head of the company was unhappy that ClickHouse was slow and that data sometimes got lost. He gave his CTO the task of sorting it out. The CTO figured it out as best he could and concluded that all they needed was to automate ClickHouse - that's it. But as it soon became clear, a DevOps team wasn't what they needed at all.

All of this turned out to be very, very painful. And the worst part: it happened on my birthday.

Friday evening. I booked a table at my favorite wine bar and called my homies.

Right before heading out, we got a task to run an ALTER. We ran it, everything was fine: the ALTER went through, ClickHouse confirmed it. We were already gathering at the bar when they wrote to us that there wasn't enough data. I figured there was exactly enough. And we went off to celebrate.

The restaurant was noisy, as Fridays are. We ordered drinks and food and collapsed onto the sofas. All this time my Slack was slowly filling up with messages. Something about missing data. I figured the morning is wiser than the evening. Especially today.

Closer to eleven the calls started. It was the head of the company... "Probably calling to wish me a happy birthday," I thought, not very convincingly, and picked up.

And I heard something like: "You've lost our data! I'm paying you and nothing works! You were responsible for the backups and you didn't do shit! Fix it!" - only rougher.

"You know what? Go fuck yourself! It's my birthday today, and I'm going to go drink, not babysit your junior-grade contraption held together with shit and sticks!"

That's what I didn't say. Instead, I took out my laptop and got to work.

Oh, I was fuming, fuming like hell! I poured caustic "I told you so" into the chat - because the backup that wasn't a backup had, of course, saved nothing.

The guys and I worked out how to manually stop the writes and check everything. And we confirmed it: some of the data really wasn't being written.

We stopped the writes and counted how many events there were per day. Then we threw in more data, and a third of it simply wasn't written. Three shards with 2 replicas each. You insert 100,000 rows - 33,000 of them never land.
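The check itself was nothing fancy - roughly this kind of thing (a sketch with made-up table and column names, not their real schema):

```sql
-- Insert a known, labeled batch through the distributed table...
INSERT INTO events_distributed (event_time, batch_id, value)
SELECT now(), 'probe_batch_1', number
FROM numbers(100000);

-- ...then count what actually landed across the shards.
SELECT count()
FROM events_distributed
WHERE batch_id = 'probe_batch_1';
-- expected: 100000; observed: roughly 66000 - about a third never made it
```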

It was complete chaos. Everyone told everyone else to go to hell in turn: Lenya first, then me, then the founder of the company. Only the CTO, who had rejoined the conversation, tried to steer the shouting calls and the chat threads toward actually finding a solution.

What had actually happened, no one understood.

The guys and I just about lost it when we realized that a third of all the data wasn't merely unwritten - it was gone! It turned out the company's process worked like this: after insertion, the source data was permanently deleted, so the events were being lost in whole batches. I pictured Sergey converting all of this into lost rubles.

My birthday was going into the trash along with it. We sat at the bar generating ideas, trying to crack the riddle we had been handed. Why ClickHouse was dropping data wasn't clear. Maybe the network, maybe the Linux settings. It could be anything; there was no shortage of hypotheses.

I never took a developer's oath, but abandoning the guys on the other end of the line would have been dishonorable, even if they were blaming us for everything. I was 99% sure the problem wasn't in our decisions, wasn't on our side. The remaining 1% chance that we had screwed up anyway burned like acid. But whichever side the trouble was on, it had to be fixed. Leaving clients, whatever they're like, alone with data loss like that is too cruel.

Until three in the morning we worked right at the restaurant table. We generated events, ran INSERT SELECTs, and set about filling in the gaps. When you've lost data, that's how it's done: you take the averages from the previous days and insert them into the holes.
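In ClickHouse terms it comes down to something like this (a very rough sketch with an invented schema; the dates and the lost window are purely illustrative):

```sql
-- Rebuild the lost evening of 2019-11-15 from the average of the same
-- seconds-of-day over the three previous days. Table and columns are hypothetical.
INSERT INTO events (event_time, metric, value)
SELECT
    toDateTime('2019-11-15 00:00:00') + second_of_day,
    metric,
    avg_value
FROM
(
    SELECT
        event_time - toStartOfDay(event_time) AS second_of_day,  -- offset within the day
        metric,
        avg(value) AS avg_value
    FROM events
    WHERE event_time >= toDateTime('2019-11-12 00:00:00')
      AND event_time <  toDateTime('2019-11-15 00:00:00')
      AND toHour(event_time) >= 20                                -- only the hours that were lost
    GROUP BY second_of_day, metric
);
```

The rebuilt rows are synthetic, of course - that's the whole point of the trick: the charts stop showing holes while you chase the real cause.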

Sometime after three, a friend and I went back to my place and ordered beer delivery. I sat there with my laptop and ClickHouse's problems while he talked at me about something. An hour later he got offended that I was working instead of drinking beer with him, and left. A classic: such is the life of a DevOps engineer's friend.

By 6 a.m. I had recreated the table, and the data started flowing in again. Everything worked without loss.

What came next was hard. Everyone blamed each other for the data loss. If one more bug had turned up, I'm sure it would have come to a shootout.

In the course of these battles we finally began to understand: the company thought we were the guys who work with the data and look after the structure of the tables. They had confused admins with DBAs. And when they came demanding answers, it was not as admins that they held us to account.

Their main complaint: what the fuck, you were responsible for the backups, you didn't do them properly, and you lost the data. All of it laced with heavy swearing.

I wanted justice. I dug up the old chats and attached all the screenshots where Leonid insists with all his might on exactly the kind of "backup" that got made. Their CTO took our side after I called him. Later Lenya admitted his fault too.

The head of the company, on the other hand, had no desire to blame one of his own. Screenshots and words had no effect on him. He believed that since we were the experts, we should have convinced everyone and insisted on our solution. Apparently our job was to educate Lenya and, beyond that, to go over the head of the man he himself had appointed project manager, reach the boss, and personally lay out all our doubts about the backup concept to him.

The chat oozed hatred and aggression, both veiled and completely open. I didn't know what to do. Everything had ground to a standstill. Then someone suggested the simplest way out: message the boss directly and set up a meeting. "Vasya, people in real life aren't as brazen as they are in chat." The boss replied to my message: come over, no problem.

It was the scariest meeting of my career. My ally on the client's side - the CTO - couldn't find the time. I was going in to face the boss and Lenya alone.

Over and over I replayed our possible conversation in my head. I managed to arrive well ahead of time, half an hour early. The nerves kicked in; I smoked ten cigarettes. I understood perfectly well: I was fucking alone in this, I wasn't going to convince them. And I stepped into the elevator.

On the way up I kept flicking the lighter so hard that I broke it.

In the end, Lenya wasn't at the meeting. And the boss and I had a great talk about everything! Sergey told me about his pain. He had never wanted to "automate ClickHouse" - he wanted "to make the queries work."

I saw not a jerk, but a good guy who cares about his business and is immersed in it 24/7. Chat often paints people as villains, scoundrels, and idiots. In real life they're people just like you.

Sergey didn't need a couple of hired DevOps engineers. Their problem was much bigger.

I said I could solve his problems - it's just that it's a completely different kind of job, and I know a DBA for it. If we had understood from the start what the business actually needed, we would have avoided a lot. Late, but we finally realized the problem was in the shoddy work with the data, not in the infrastructure.

We shook hands. The fee went up two and a half times, on the condition that I take on absolutely all the hassle with their data and ClickHouse. Back in the elevator I messaged that DBA, Max, and brought him onto the job. The whole cluster had to be shoveled out.

There was garbage in the inherited project by the truckload, starting with the aforementioned "backup". It turned out that this same "backup" cluster wasn't even isolated: everything was tested on it, and sometimes things were even rolled into production on it.

The in-house developers had built their own custom data inserter. It worked like this: it batched up files, ran a script, and poured the data into a table. But the main problem was that one simple request pulled a huge amount of data. The query joined data second by second - all for the sake of a single number, the total for the day.

The in-house developers were also using their analytics tool incorrectly. They went into Grafana and wrote themselves a monster query that pulled two weeks of data. It made a lovely chart. But in reality that query was being fired every 10 seconds. It all piled up in a queue, because ClickHouse simply couldn't keep up with the processing. That was the root cause. Nothing in Grafana worked, queries sat in the queue, and stale, irrelevant data kept arriving.
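To give a feel for the mismatch - this is a purely hypothetical reconstruction, not their actual query - the dashboard was effectively doing something like the first query every 10 seconds, when the number it wanted costs about as much as the second:

```sql
-- Roughly the shape of the dashboard query (tables are invented): join two
-- second-granularity streams over two weeks, then collapse them to daily sums.
SELECT
    toDate(e.event_time) AS day,
    sum(e.value * c.rate) AS total
FROM events AS e
INNER JOIN conversions AS c ON c.event_time = e.event_time
WHERE e.event_time >= now() - INTERVAL 14 DAY
GROUP BY day
ORDER BY day;

-- Versus the single number that was actually needed for today:
SELECT sum(value) AS total
FROM events
WHERE event_time >= toStartOfDay(now());
```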

We reconfigured the cluster and reworked the ingestion. The in-house developers rewrote their "inserter", and it started sharding the data correctly.
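The standard ClickHouse way to get that behavior is to write through a Distributed table with a sharding key - sketched here with assumed names (cluster "analytics", table "events", user_id as the key), none of which are the client's real ones:

```sql
-- Local replicated tables live on every shard...
CREATE TABLE events_local ON CLUSTER analytics
(
    event_time DateTime,
    user_id    UInt64,
    value      Float64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY (event_time, user_id);

-- ...and a Distributed table on top routes each inserted row to a shard
-- by the sharding key, so the inserter no longer picks shards by hand.
CREATE TABLE events ON CLUSTER analytics AS events_local
ENGINE = Distributed(analytics, default, events_local, cityHash64(user_id));

-- Writers insert only into the Distributed table:
INSERT INTO events VALUES (now(), 42, 1.0);
```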

Max did a full audit of the infrastructure and drew up a plan for moving to a proper backend. But that didn't suit the company. They were waiting for Max to reveal some magic secret that would let them keep working the old way, only effectively. The project was still run by Lenya, who hadn't learned a thing. Out of everything on offer he once again chose his own alternative. As always, it was the most, let's say... courageous decision. Lenya believed his company had its own special path. Thorny and full of icebergs.

And on that note we parted ways - we had done what we could.

Bruised but wiser from this story, we opened our own business and set down a few principles for ourselves. We will never again start work the way we did back then.

DBA Max joined us after that project, and we still work great together. The ClickHouse case taught me to do a complete, thorough infrastructure audit before starting work. First we understand how everything works, and only then we take on tasks. Where we used to rush straight into maintaining the infrastructure, we now start with a one-off project that shows us how to bring it into working shape.

And yes, we steer clear of projects with rotten infrastructure. Even for big money, even as a favor to friends. Running sick projects doesn't pay. Knowing that has helped us grow. Either a one-off infrastructure clean-up project followed by a maintenance contract, or we just sail on by. Past another iceberg.

P.S. So if you have questions about your infrastructure, feel free to send us a request.

We do two free audits a month; maybe your project will be one of them.

Source: habr.com
