Inheriting legacy systems and processes, or the first 90 days as a CTO

They say a CTO's competence is only truly tested the second time around in the role. It's one thing to work at a company for several years, evolving with it and, within the same cultural context, gradually taking on more responsibility. It's quite another to walk straight into the CTO position at a company with a load of legacy baggage and a bunch of problems neatly swept under the carpet.

In that sense, the experience Leon Fire shared at DevOpsConf is not exactly unique, but multiplied by the breadth of roles he has tried on over 20 years, it is very useful. Below the cut is a 90-day timeline of events and plenty of stories that are fun to laugh at when they happen to someone else, but far less fun to face in person.

Leon is a very colorful speaker in Russian, so if you have 35-40 minutes, I recommend watching the video. The text version, to save you time, is below.


The first version of this talk was a well-structured description of working with people and processes, full of useful recommendations. But it didn't convey all the surprises that met me along the way. So I changed the format and laid out, in chronological order, the problems that popped up in front of me at the new company like a jack-in-the-box, along with the ways I solved them.

One month before

Like many good stories, this one started with alcohol. We were sitting in a bar with friends and, as you'd expect from IT people, everyone was complaining about their problems. One of them had just changed jobs and was talking about his troubles with technology, people, and the team. The more I listened, the more I realized he should simply hire me, because these were the problems I had been solving for the last 15 years. I told him so, and the next day we met in a work setting. The company was called Teaching Strategies.

Teaching Strategies is the market leader in educational programs for very young children, from birth to age three. The traditional "paper" company is already 40 years old, while the digital SaaS version of the platform is 10. The work of bringing the digital side up to the company's standards had started relatively recently: the "new" version launched in 2017 and was almost like the old one, only it worked worse.

The most interesting thing is that this company's traffic is very predictable: day to day and year to year, you can tell quite precisely how many people will come and when. For example, between 13:00 and 15:00 all the children in the kindergartens take their naps and the teachers start entering information. And this happens every day except weekends, because almost nobody works on weekends.


Looking ahead a little, I'll note that I started during the period of peak annual traffic, which made things interesting for various reasons.

The platform, which was supposedly only 2 years old, had a peculiar stack: ColdFusion & SQL Server 2008. ColdFusion, in case you don't know it (and you most likely don't), is a kind of enterprise PHP that came out in the mid-90s and that I hadn't heard of since. There were also Ruby, MySQL, PostgreSQL, Java, Go, and Python, but the main monolith ran on ColdFusion and SQL Server.

Problems

The more I spoke with the company's employees about their work and the problems they ran into, the more I realized the problems were not purely technical. Fine, the technology was old and nobody had been working on that, but there were also problems with the team and with the processes, and the company was starting to realize it.

Traditionally, the techies had sat in a corner doing their own thing, but more and more of the business was going digital. So in the year before I started, new people appeared at the company: a board of directors, a CTO, a CPO, and a QA director. In other words, the company had begun investing in its technology.

The traces of heavy legacy weren't only in the systems. The company had legacy processes, legacy people, legacy culture. All of this had to change. I figured it definitely wouldn't be boring, and decided to give it a try.

Two days before

Two days before starting the new job, I came by the office, filled out the last of the paperwork, got to know the team, and found them wrestling with a problem: the average page load time had jumped to 4 seconds, that is, it had doubled.

[Chart: average page load time]

Judging by the graph, something had clearly happened, but it wasn't clear what. It turned out the problem was network latency in the data center: 5 ms of latency inside the data center turned into 2 seconds for users. Why it happened, I didn't know, but in any case it was now established that the problem was in the data center.

Day one

Two days passed, and on my first day at work, I found that the problem had not gone away.


For two days, users' pages had been loading in an average of 4 seconds. I asked whether they had found the cause.

— Yes, we opened a ticket.
— And?
— Well, they haven't answered us yet.

That's when I realized that everything I had been told about earlier was just the tip of the iceberg I would have to fight.

There is a good quote that is very appropriate for this occasion:

“Sometimes you have to change the organization to change the technology.”

But since I started at the busiest time of the year, I had to look at both kinds of solutions, quick fixes and long-term ones, and begin with what was critical right now.

Day three

So, pages were loading in 4 seconds, with the biggest peaks between 13:00 and 15:00.


On the third day, during that window, the load times looked like this:

[Chart: page load times between 13:00 and 15:00]

From my point of view, nothing was working at all. From everyone else's point of view, things were just a little slower than usual. But that kind of thing doesn't happen on its own; it's a serious problem.

I tried to convince the team of that, and they answered that they simply needed more servers. That is, of course, one way to solve the problem, but far from the only one and rarely the most effective. I asked why there weren't enough servers and how much traffic we had. Extrapolating the data, I got roughly 150 requests per second, which, in principle, is well within reasonable limits.

But we must not forget that before you can get the right answer, you have to ask the right question. My next question was: how many frontend servers do we have? The answer baffled me a little: we had 17 frontend servers!

— I'm embarrassed to ask, but 150 divided by 17 comes out to about 8, right? Are you telling me that each server handles 8 requests per second, and if tomorrow we get 160 requests per second we'll need 2 more servers?

Of course, we didn't need additional servers. The solution was in the code itself, lying right on the surface:

var currentClass = classes.getCurrentClass();
return currentClass;

There was a getCurrentClass() function, because everything on the site works in the context of a class; fair enough. But for this one function, every page made 200+ calls.

The fix was very simple; nothing even had to be rewritten: just don't request the same information again.

if ( !isDefined("REQUEST.currentClass") ) {
    // The REQUEST scope lives for the duration of a single HTTP request,
    // so the lookup now hits the database only once per page.
    var classes = new api.private.classes.base();
    REQUEST.currentClass = classes.getCurrentClass();
}
return REQUEST.currentClass;

I was delighted, having decided that on day three I had found the main problem. Naive as I was, it was only one of very many.

[Chart: page load time after the fix]

But solving that first problem pushed the chart much lower.

At the same time, we were working on other optimizations; there was plenty to fix. For example, on that same third day I discovered that the system did have a cache after all (at first I thought every query went straight to the database). When I think of a cache, I think of the standard Redis or Memcached. But that was only me, because in this system MongoDB and SQL Server were used for caching: the very same SQL Server the data had just been read from.
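
For comparison, here is a minimal cache-aside sketch in CFScript using ColdFusion's built-in cacheGet()/cachePut(). The component name is borrowed from the earlier snippet, but getClassSettings()/getSettings() and the cache key are invented for illustration; this is not the code that was in the system:

// Minimal cache-aside sketch (illustrative names, not the production code):
// check the local cache first, fall back to the database on a miss,
// and store the result with a short TTL.
function getClassSettings(required numeric classId) {
    var cacheKey = "classSettings_" & arguments.classId;

    // cacheGet() returns null when the key is missing or expired
    var cached = cacheGet(cacheKey);
    if ( !isNull(cached) ) {
        return cached;
    }

    // Cache miss: read from the primary database once
    var settings = new api.private.classes.base().getSettings(arguments.classId);

    // Keep it for 5 minutes so repeated page loads don't hit SQL Server again
    cachePut(cacheKey, settings, createTimeSpan(0, 0, 5, 0));
    return settings;
}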

Day ten

The first week I dealt with the problems that had to be solved right away. Sometime in the second week, I came to the stand-up for the first time to talk with the team and see what was going on and how the whole process worked.

And again something interesting turned up. The team consisted of 18 developers, 8 testers, 3 managers, and 2 architects, and they all took part in shared rituals: more than 30 people came to the stand-up every morning to say what they were doing. Naturally, the meeting took well over 5 or 15 minutes. Nobody listened to anybody, because everyone worked on different systems. In that format, 2-3 tickets per hour at a grooming session was already a good result.

The first thing we did was split the team into several product lines: we assigned separate teams to different sections and systems, each including developers, testers, product managers, and business analysts.

As a result, we got:

  • Shorter stand-ups and meetings.
  • Domain knowledge of the product.
  • A sense of ownership. When people were constantly shuffled between systems, they knew that someone else, not they themselves, would most likely end up dealing with their bugs.
  • Collaboration between groups. It's not that QA barely talked to the programmers before, or that product was off doing its own thing, but now they share a single point of responsibility.

We focused mainly on efficiency, productivity, and quality; those are the problems we were trying to solve by transforming the team.

Day eleven

While changing the team structure, I discovered how Story Points were counted: 1 SP equaled one day, and every ticket contained SP for both development and QA, so at least 2 SP.

How did I discover it?


A bug was found: in one of the reports, where you enter the start and end dates of the period you want, the last day was not being counted. That is, somewhere in the query there was a plain < instead of <=. I was told this was three Story Points, in other words, 3 days.
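
Roughly what that off-by-one looks like (table, column, and form field names are invented for illustration, and queryExecute() is just the script-style shorthand; the real code may well have used cfquery tags):

// Buggy version: the strict < on the end date silently drops the last day
// of the reporting period (a report "through March 31" loses March 31).
report = queryExecute("
    SELECT *
    FROM   observations
    WHERE  created_at >= :startDate
      AND  created_at <  :endDate
", { startDate = form.startDate, endDate = form.endDate });

// Fixed version: compare against the start of the next day,
// so the whole last day of the range is included.
report = queryExecute("
    SELECT *
    FROM   observations
    WHERE  created_at >= :startDate
      AND  created_at <  :endDateExclusive
", {
    startDate        = form.startDate,
    endDateExclusive = dateAdd("d", 1, form.endDate)
});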

After that we:

  • Revised the Story Point estimation system. Minor bug fixes that can move through the system quickly now reach users faster.
  • Started merging related development and testing tickets. Previously every ticket, every bug was a closed ecosystem attached to nothing else; changing three buttons on the same page could mean three different tickets with three different QA processes instead of a single automated test for the page.
  • Started working with developers on how they estimate effort. Three days to change one button is not funny.

Day twenty

Somewhere around the middle of the first month the situation stabilized a little, I had a rough picture of what was going on, and I started looking ahead and thinking about long-term solutions.

Long-term goals:

  • A manageable platform. Hundreds of queries per page is not serious.
  • Predictable trends. There were periodic traffic spikes that at first glance didn't correlate with any other metrics; we needed to understand why they happened and learn to predict them.
  • Scaling the platform. The business keeps growing, more and more users are coming, and traffic is increasing.

People had often said in the past: "Let's rewrite everything in [language/framework] and everything will work better!"

In most cases this doesn't work; you're lucky if the rewrite works at all. So we needed a roadmap: a concrete strategy showing step by step how the business goals will be achieved (what we will do and why), one that:

  • reflects the mission and goals of the project;
  • prioritizes the main goals;
  • contains a schedule for achieving them.

Before this, nobody had talked to the team about why any given change was being made. That requires the right indicators of success. For the first time in the company's history, we set KPIs for the technology group, and those indicators were tied to the organizational ones.

[Diagram: organizational KPIs supported by team KPIs, supported in turn by individual KPIs]

That is, organizational KPIs are supported by team KPIs, and team KPIs in turn by individual ones. Otherwise, if the technology KPIs don't line up with the organizational ones, everyone just pulls the blanket their own way.

For example, one of the organizational KPIs is to increase market share through new products.

How can you support the goal of having more new products?

  • First, we want to spend more time developing new products instead of fixing defects. That's a logical target, and it's easy to measure.
  • Second, we want to support growth in transaction volume, because the bigger the market share, the more users and, accordingly, the more traffic.


Individual KPIs within a group can then target, for example, the place where most of the defects come from. If you focus on that area, you can bring the number of defects way down, which frees up time for developing new products and, again, supports the organizational KPIs.

So every decision, including rewriting code, has to support the specific goals the company has set for us (growth of the organization, new features, recruitment).

Along the way, an interesting rule emerged that was news not just for the techies but for the company as a whole: every ticket must target at least one KPI. So if product says they want a new feature, the first question to ask is: "Which KPI does this feature support?" If none, then sorry, it looks like an unnecessary feature.

Day thirty

At the end of the month I discovered another nuance: nobody on my Ops team had ever seen the contracts we sign with clients. You may ask why they would need to see contracts at all.

  • First, because the SLAs are written into the contracts.
  • Second, the SLAs are all different: each client came with their own requirements, and the sales department signed them without looking.

Another interesting detail: the contract with one of our largest clients stipulates that all software versions used by the platform must be n-1, that is, not the latest version but the one before it.

You can imagine how far we were from n-1 when the platform ran on ColdFusion and SQL Server 2008, which as of July was no longer supported at all.

Day forty-five

Around the middle of the second month I finally had enough time to sit down and do value stream mapping for the entire process: all the steps needed to go from creating a product to delivering it to the consumer, laid out in as much detail as possible.

You break the process into small pieces and see what takes too long and what can be optimized or improved. For example: how long does a request from product take to go through grooming and become a ticket a developer can pick up, how long does QA take, and so on. You look at each individual step in detail and think about what can be improved.

When I did this, two things caught my eye:

  • a high percentage of tickets returned from QA back to developers;
  • pull request review was taking too long.

The problem was that these were conclusions of the form "it seems to take a long time, but we're not sure how long."

"You can't improve what you can't measure."

How do you prove how serious the problem is? Is it costing days or hours?

To measure this, we added a couple of statuses to the Jira workflow, "ready for dev" and "ready for QA", to see how long each ticket waits and how many times it bounces back to a given step.


We also added "in review" so we would know how many tickets, on average, were sitting in review; that already gives us something to work from. On top of the system metrics we already had, we added new ones and started measuring:

  • Process efficiency: productivity and planned versus delivered.
  • Process quality: the number of defects, and defects returned from QA.

It really helps you understand what is going well and what isn't.
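
As a rough sketch of the kind of measurement this enables (the data shape is an assumption: a ticket's status-change history exported from the tracker, sorted by time; none of this is Jira's actual API):

// Sketch: sum up how many minutes a ticket spent in each workflow status.
// "history" is assumed to be an array of status changes sorted by time:
// [ { status = "Ready for Dev", enteredAt = <datetime> }, ... ]
function timeInStatus(required array history) {
    var totals = {};
    for (var i = 1; i <= arrayLen(history); i++) {
        // The interval ends when the next status change happens,
        // or right now if the ticket is still sitting in this status.
        var leftAt  = (i < arrayLen(history)) ? history[i + 1].enteredAt : now();
        var minutes = dateDiff("n", history[i].enteredAt, leftAt);

        if ( !structKeyExists(totals, history[i].status) ) {
            totals[history[i].status] = 0;
        }
        totals[history[i].status] += minutes;
    }
    return totals;  // e.g. { "Ready for QA" = 2880, "In Review" = 960 }
}

Aggregated over all tickets, these per-status totals are exactly the "how long does a ticket wait at each step" numbers from the value stream map.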

Day fifty

All of this is, of course, good and interesting, but toward the end of the second month something happened that was predictable in principle, though I hadn't expected its scale. People began to leave, because the top had changed: new people had come into leadership and started changing everything, and the old-timers were leaving. And in a company that's been around for several years, everyone is usually friends and knows each other.

The departures were expected; their scale was not. In one week, for instance, two team leads handed in their resignations at the same time. So I had to set the other problems aside and focus on the team. That's a long and difficult problem, but it had to be tackled, because we wanted to keep the people who remained (or most of them). We had to respond somehow to the departures in order to keep up morale in the team.

In theory, this is a good thing: a new person comes in with full carte blanche, able to assess the team's skills and replace people. In practice, you can't simply bring in new people, for any number of reasons. There always has to be a balance.

  • Old and new. We needed to keep the old-timers who were able to change and would support the mission, but at the same time bring in new blood; more on that a little later.
  • Experience. I talked with a lot of good juniors who were fired up and wanted to work with us, but I couldn't take them, because there weren't enough seniors to support and mentor them. We had to recruit senior people first and only then the young ones.
  • Carrot and stick.

I don't have a good answer to what the right balance is, how to maintain it, how many people to keep, and how hard to push. It's a purely individual process.

Day fifty-one

I started looking closely at the team to understand who I had, and was reminded once again:

“Most problems are problems with people.”

I found that the team as a whole, both developers and Ops, had three big problems:

  • Complacency with the current state of affairs.
  • Lack of accountability, because nobody had ever connected the results of people's work to their impact on the business.
  • Fear of change.


Change always pulls you out of your comfort zone, and the younger people are, the more they dislike change, because they don't understand why and they don't understand how. The most common answer I heard was, "We've never done that before." It reached the point of complete absurdity: the slightest change couldn't happen without someone getting indignant. And however little the change had to do with their own work, people would say: "No, why? It won't work."

But you can't get better without changing anything.

I had an absolutely absurd conversation with one employee. I shared my optimization ideas, and he told me:
— Ah, you didn't see what we had last year!
— So what?
— It's much better now than it was.
— So it can't get any better?
— What for?

Good question: why? As in, if it's better now than it used to be, then it's good enough. This leads to a lack of accountability, which under the circumstances is quite natural. As I said, the tech team sat a bit off to the side. The company believed it needed them, but nobody ever set any standards. Tech support had never seen an SLA, so for the group all of this was quite "acceptable" (and this is what struck me most):

  • 12-second page loads;
  • 5-10 minutes of downtime with every release;
  • critical problems taking days or weeks to fix;
  • no 24x7 on-call rotation.

Nobody ever tried asking why we couldn't do better, and nobody ever realized that it shouldn't be this way.

As a bonus, there was one more problem: lack of experience. The seniors had left, and the young team that remained had grown up under the previous regime and been poisoned by it.

On top of all that, people were afraid to fail, to look incompetent. That showed up, first, in the fact that they never asked for help. We talked about it many times, in groups and one on one, and I kept saying: "Ask if you don't know how to do something." I'm confident in myself and know I can solve any problem, but it will take time; so if I can ask someone who can solve it in 10 minutes, I will. The less experience you have, the more afraid you are to ask, because you think you'll be seen as incompetent.

This fear of asking questions takes interesting forms. You ask, "How is that task going?" and hear, "Just a couple of hours left, I'm almost done." The next day you ask again and hear that everything is fine, there was just one problem, and it will definitely be ready by the end of the day. Another day passes, and until you back someone up against the wall and make them talk, it goes on like that. A person wants to solve the problem themselves; they believe that if they don't solve it on their own, it will be a big failure.

That's also why developers padded their estimates. It turned into a running joke: discussing one task, I was given a figure that really surprised me. It turned out that in his estimate the developer included the time for the ticket to come back from QA (because errors would surely be found there), the time the PR would take, and the time the people who needed to review it would be busy; in other words, everything that could possibly be included.

Second, people who are afraid of looking incompetent overanalyze. When you say exactly what needs to be done, it starts: "No, but what if we think about this part?" In this respect our company is not unique; it's a standard problem with junior people.

In response, I introduced the following practices:

  • The 30-minute rule. If you can't solve a problem in half an hour, ask someone for help. It works with varying success, because people still don't ask, but at least the process has started.
  • Strip estimates down to the essence: when estimating a task, count only how long it will take to write the code.
  • Ongoing coaching for those who overanalyze; it's simply constant work with people.

Day sixty

While all this was going on, it was time to deal with the budget. Naturally, I found plenty of interesting things about where we were spending money. For example, we had an entire rack in a separate data center holding a single FTP server used by a single client. It turned out that "...we moved, but it stayed, and we never migrated it." That had been two years earlier.

The bill for cloud services was of particular interest. I'm convinced the main driver of a high cloud bill is developers who, for the first time in their lives, have unlimited access to servers. They don't have to ask, "Can I have a test server, please?"; they can just take one. On top of that, developers always want to build a system so cool that Facebook and Netflix would be jealous.

But developers have no experience buying servers or the skill to size them correctly, because they never needed it before. And they usually don't quite grasp the difference between scalability and performance.

Inventory results:

  • Moved out of one data center.
  • Cancelled contracts with 3 logging services. We had 5 of them, because every developer who started playing with something signed up for a new one.
  • Shut down 7 AWS systems. Again, nobody had ever stopped the dead projects; they all just kept running.
  • Cut software costs by a factor of 6.

Day seventy-five

Time passed, and two and a half months in, it was time to meet with the board of directors. Our board is no better or worse than any other; like every board of directors, it wants to know everything. People invest money and want to understand how what we do fits the KPIs that have been set.

The board of directors receives a lot of information every month: the number of users and their growth, which services they use and how, performance and productivity, and, finally, the average page load time.

The only problem is that I consider averages pure evil, and that is very hard to explain to a board of directors. They are used to working with aggregate numbers, not with, say, the distribution of load times.

This led to some interesting moments. For example, I said we needed to split traffic across separate web servers depending on the type of content.

[Diagram: traffic split by content type]

That is, ColdFusion pages go through Jetty and nginx, while images, JS, and CSS go through a separate nginx with its own configuration. This is a fairly standard practice that I had written about a couple of years earlier. As a result, images started loading much faster and... the average page load time went up by 200 ms.
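
A rough sketch of that split in nginx terms (hostnames, ports, and paths are invented; this illustrates the idea, not our actual configuration):

server {
    listen      80;
    server_name app.example.com;

    # Static assets: served straight from disk with long cache headers,
    # never touching Jetty/ColdFusion
    location ~* \.(js|css|png|jpg|jpeg|gif|svg|woff2?)$ {
        root    /var/www/static;
        expires 7d;
        add_header Cache-Control "public";
    }

    # Everything else: dynamic pages rendered by ColdFusion behind Jetty
    location / {
        proxy_pass       http://127.0.0.1:8080;
        proxy_set_header Host            $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}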

[Chart: average page load time after the traffic split]

This happened because the graph is built from the data coming out of Jetty: the fast static content no longer passes through it, so it dropped out of the calculation, and the average over the remaining, slower requests jumped. We understood this and had a laugh, but how do you explain to the board of directors why we did something and got 12% worse?

Day eighty-five

By the end of the third month I realized there was one thing I hadn't accounted for at all: time. Everything I've talked about takes time.

[Screenshot: my calendar for a typical working week]

This is my real calendar for one week: an ordinary working week, not a particularly busy one. There isn't enough time for everything. So, once again, you need to hire people who will help you deal with the problems.

Conclusion

That's not everything. In this story I haven't even gotten to how we worked with product and tried to get onto the same wavelength, how we integrated technical support, or how we solved other technical problems. For example, I accidentally discovered that on the largest tables in the database we don't use SEQUENCE; we have a custom nextID function, and it isn't called inside a transaction.

There were a million other similar things that could be talked about for a long time. But the most important thing to talk about is culture.

Inheritance of legacy systems and processes or First 90 days as a CTO

It is culture, or lack thereof, that leads to all other problems. We are trying to build a culture where people:

  • are not afraid of failure;
  • learn from mistakes;
  • collaborate with other teams;
  • show initiative;
  • take responsibility;
  • are focused on results;
  • celebrate success.

With that, everything else will come.

Leon Fire on Twitter, Facebook, and Medium.

When it comes to legacy, there are two strategies: avoid working with it at all costs, or bravely overcome the difficulties that come with it. At DevOpsConf we take the second path, changing processes and approaches. Join us on YouTube, the mailing list, and Telegram, and together we will build the DevOps culture.

Source: habr.com
