How did the bank fail?

An unsuccessful migration of IT infrastructure damaged 1.3 billion bank customer records. The cause was insufficient testing and a cavalier attitude towards complex IT systems. Cloud4Y tells the story.

In 2018, the English bank TSB realized that its two-year-old "divorce" from the banking group Lloyds (the two companies had merged in 1995) was costing it too much. TSB was still tied to its former partner through Lloyds' hastily cloned IT systems. Worst of all, the bank had to pay "alimony": an annual license fee of $127 million.

Few people like paying money to their ex, so on April 22, 2018 at 18:00, TSB began the final stage of an 18-month plan that was supposed to change everything: transferring billions of customer records to the IT system of the Spanish company Banco Sabadell, which had bought TSB for $2.2 billion back in 2015.

Banco Sabadell chairman Josep Oliu had announced the upcoming event two weeks before Christmas 2017, during a celebratory staff meeting in a prestigious Barcelona conference hall. The key migration tool was to be a new version of the system developed by Banco Sabadell, Proteo, which was even renamed Proteo4UK specifically for the TSB migration project.

At the presentation of Proteo4UK, Banco Sabadell CEO Jaime Guardiola Romojaro boasted that the new system was a large-scale project with no equivalent in Europe, one that more than 1,000 specialists had worked on, and that its implementation would give a significant boost to Banco Sabadell's growth in the UK.

April 22, 2018 was designated Migration Day. It was a quiet Sunday evening in the middle of spring. The bank's IT systems were shut down while records were transferred from one system to the other. Once public access to accounts was restored late on Sunday evening, the bank could be expected to return to normal operation slowly and smoothly.

But while Oliu and Guardiola Romojaro were happily talking from the stage about the Proteo4UK project, the employees responsible for the migration were very nervous. The project, which had been given 18 months, was seriously behind schedule and over budget, and there was no time for additional testing. Yet transferring all of the company's data (billions of records, remember) to another system is a titanic task.

It turned out that the engineers were nervous for good reason.

The placeholder page that customers saw for far too long

Twenty minutes after TSB reopened access to accounts, fully confident that the migration had gone smoothly, the first reports of problems came in.

People's savings suddenly vanished from their accounts. Small purchases were mistakenly recorded as spending in the thousands. Some people logged into online banking and saw not their own accounts, but those of complete strangers.

At 21:00, TSB representatives told the local financial regulator (the UK Financial Conduct Authority, FCA) that the bank was in trouble. But the FCA had already taken notice: TSB had seriously messed up, and its customers were being let down. And, of course, customers began to complain on social media (these days, firing off a couple of lines on Twitter or Facebook takes no effort at all). At 23:30, the FCA was contacted by another financial regulator, the Prudential Regulation Authority (PRA), which had also sensed something was wrong.

Well after midnight, the regulators managed to get through to one of the bank's representatives and ask the only question that mattered: "What the hell is going on?"

It took time to grasp the scale of the disaster, but we now know that 1.3 billion records belonging to 5.4 million customers were damaged during the migration. For at least a week, customers could not manage their money from computers or mobile devices. Loan payments were missed, and many of the bank's customers were left with late fees and a stain on their credit history.

This is what TSB's online banking looked like to customers

When the failures first appeared, bank representatives almost immediately assured everyone that the problems were "intermittent". Three days later, a statement was issued that all systems were back to normal. But customers kept reporting problems. It was not until April 26, 2018 that the bank's CEO, Paul Pester, acknowledged that TSB was "on its knees": the bank's IT infrastructure still had a "bandwidth problem" that kept about a million customers from accessing online banking services.

Two weeks after the migration began, crashes were still being reported in the online banking application, which was throwing internal errors related to the SQL database. Difficulties with payments, especially for business and mortgage accounts, continued for up to four weeks. Ever-present journalists also found out that TSB had turned down an offer of help from Lloyds Banking Group at the very start of the migration crisis. All in all, problems with logging into online services and transferring money were observed until September 3rd.

A bit of history

The first ATM opened on June 27, 1967 at a Barclays branch in Enfield.

Banking IT systems are becoming more complex as customers' needs and expectations grow. Forty to sixty years ago, we would have been happy to visit the local bank branch during business hours to deposit cash or withdraw it at the cashier's window.

The amount of money in an account corresponded directly to the cash and coins we handed over to the bank. Household bookkeeping could be tracked with pen and paper, and computer systems were out of customers' reach. Bank employees entered data from passbooks and other paper records into machines that tallied the money.

But in 1967, in north London, an ATM was installed for the first time that let customers get at their money without going inside the bank. That event changed banking. User convenience became the benchmark for the development of financial institutions, and it pushed banks to become more sophisticated in how they handle customers and their money. As long as computer systems were available only to bank employees, the old "paper" way of interacting with a client was good enough. It was only when ATMs appeared, and later online banking, that the general public got direct access to banks' IT systems.

ATMs were just the beginning. Soon, people could skip the queue at the counter by simply calling the bank on the phone. This required special cards inserted into a reader capable of decoding the dual-tone multi-frequency (DTMF) signals transmitted when the user pressed the "1" (withdraw) or "2" (deposit) key.

The Internet and mobile banking have brought customers ever closer to the core systems that keep banks running. Despite different restrictions and configurations, all of these systems must communicate effectively with each other and with the central mainframe, checking account balances, executing money transfers, and so on.

Few customers think about the complicated journey their data makes when, for example, they open online banking to view or update information about the money in their account. When you log in, that data passes through a set of servers; when you make a transaction, the system replicates it across the back-end infrastructure, which then does the hard work of moving money from one account to another to pay bills, make payments, and keep subscriptions running.
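To make that journey a little more concrete, here is a deliberately simplified sketch in Python. It is not TSB's (or anyone's) actual architecture, and every name in it is invented for illustration: a core system moves money between two accounts and then replicates the new balances to its back-end stores.

```python
# A simplified, hypothetical sketch of an online-banking transfer.
# Names and structure are invented; a real core banking system involves
# many more layers, queues and reconciliation steps.

from dataclasses import dataclass, field


@dataclass
class Account:
    account_id: str
    balance: int  # minor units (pence) to avoid floating-point surprises


@dataclass
class CoreBank:
    accounts: dict = field(default_factory=dict)
    replicas: list = field(default_factory=list)  # back-end copies to keep in sync

    def transfer(self, src: str, dst: str, amount: int) -> None:
        """Move money, then replicate the new state to every back-end store."""
        source, target = self.accounts[src], self.accounts[dst]
        if source.balance < amount:
            raise ValueError("insufficient funds")
        source.balance -= amount
        target.balance += amount
        # Every replica must end up with the same state; divergence here is
        # exactly the kind of inconsistency TSB customers saw in their accounts.
        for replica in self.replicas:
            replica[src] = source.balance
            replica[dst] = target.balance


bank = CoreBank(
    accounts={"alice": Account("alice", 10_000), "bob": Account("bob", 2_500)},
    replicas=[{}],
)
bank.transfer("alice", "bob", 1_500)
print(bank.accounts["bob"].balance)  # 4000
print(bank.replicas[0])              # {'alice': 8500, 'bob': 4000}
```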

Now multiply this process by several billion. According to data compiled by the World Bank with the help of the Bill & Melinda Gates Foundation, 69 percent of adults around the world have a bank account. Each of these people has bills to pay. Some pay a mortgage or transfer money for their children's after-school clubs, others pay for a Netflix subscription or rent a cloud server. And all these people use more than one bank.

The numerous internal IT systems of a single bank (mobile banking, ATMs, and so on) must do more than just interact with each other. They have to talk to other banking systems in Brazil, China, and Germany. A French ATM must be able to dispense money held on a bank card issued somewhere in Bolivia.

Money has always been global, but the system has never been this complex. The number of ways to use a bank's IT systems keeps growing, while the old ways remain in use. A bank's success depends largely on how "repairable" its IT infrastructure is and how effectively the bank can handle a sudden failure that brings the system to a standstill.

No tests? Prepare for problems

Banco Sabadell CEO Jaime Guardiola (left) was confident that everything would go smoothly. It did not work out.

TSB's computer systems were not very good at recovering from problems quickly. There were software failures, of course, but in reality the bank "broke down" because of the excessive complexity of its IT systems. According to a report prepared in the early days of the massive disruption, "the combination of new applications, increased use of microservices, combined with the use of two active (Active/Active) data centers, has led to a complex risk in production."

Some banks, such as HSBC, operate globally and therefore also run very complex, interconnected systems. But those systems are regularly tested, migrated and updated, according to Lancaster, one of HSBC's CIOs. He sees HSBC as a model for how other banks should manage their IT systems in terms of allocating staff and their time. At the same time, he admits that for a smaller bank, especially one with no migration experience, doing it right is a very difficult task.

The TSB migration was difficult, and, according to experts, the bank's staff may simply not have had the qualifications to handle that level of complexity. On top of that, they did not even bother to test their solution by rehearsing the migration in advance.

During a speech in the British Parliament on banking issues, Andrew Bailey, chief executive of the FCA, confirmed this suspicion. Bad code probably only caused the initial problems at TSB, but the interconnected systems of the global financial network meant that its errors propagated and could not be undone. The bank kept running into unexpected errors elsewhere in its IT architecture, and clients received messages that were meaningless or unrelated to their problems.

Regression testing could have helped prevent the disaster by identifying the bad code before it was released to production, where it caused damage by creating bugs that could not be rolled back. Instead, the bank chose to run across a minefield it did not even know existed. The consequences were predictable. Another problem was cost "optimization". How did it show itself? In the earlier decision to do away with the backups stored at Lloyds, because they "ate up" too much money.
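Purely as an illustration of what such a check could look like (the record layout below is hypothetical, not TSB's schema), a basic post-migration regression test boils down to comparing the old and the new system before letting customers back in:

```python
# Hypothetical post-migration regression check; field names are invented.
# The idea: compare source and target systems before go-live and refuse to
# open the doors if anything has drifted.

def check_migration(old_rows, new_rows):
    """Return a list of human-readable problems; an empty list means 'looks OK'."""
    problems = []
    if len(old_rows) != len(new_rows):
        problems.append(f"row count mismatch: {len(old_rows)} vs {len(new_rows)}")

    old_by_id = {row["customer_id"]: row for row in old_rows}
    for row in new_rows:
        old = old_by_id.get(row["customer_id"])
        if old is None:
            problems.append(f"customer {row['customer_id']} appeared from nowhere")
        elif old["balance"] != row["balance"]:
            problems.append(
                f"balance drift for customer {row['customer_id']}: "
                f"{old['balance']} -> {row['balance']}"
            )
    return problems


old = [{"customer_id": 1, "balance": 10_000}, {"customer_id": 2, "balance": 500}]
new = [{"customer_id": 1, "balance": 10_000}, {"customer_id": 2, "balance": 5_000}]
for problem in check_migration(old, new):
    print("FAIL:", problem)  # FAIL: balance drift for customer 2: 500 -> 5000
```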

British banks (and not only British ones) strive for "four nines" of availability, that is, 99.99%. In practice, this means the IT system must be available practically all the time, with downtime of no more than about 52 minutes per year. A "three nines" system, 99.9%, does not look much different at first glance, but in reality it means up to almost nine hours of downtime per year. For a bank, "four nines" is good; "three nines" is not.
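The arithmetic behind those figures is straightforward: a year has 525,600 minutes, and the allowed downtime is simply the unavailable fraction of it.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for availability in (0.999, 0.9999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} -> {downtime_minutes:.1f} min/year "
          f"(~{downtime_minutes / 60:.1f} hours)")

# 99.90% -> 525.6 min/year (~8.8 hours)
# 99.99% -> 52.6 min/year (~0.9 hours)
```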

But every time a company makes changes to its IT infrastructure, it takes a risk, because something can go wrong. Minimizing changes helps avoid problems, and the changes that are required need to be thoroughly tested. This is exactly where the British regulators focused their attention.

Perhaps the easiest way to avoid downtime is simply to make fewer changes. But every bank, like any other company, has to keep shipping new features for its customers and its own business in order to remain competitive. At the same time, banks are still obliged to take care of their customers: protecting their savings and personal data and making their services comfortable to use. So organizations end up spending a lot of time and money on keeping their IT infrastructure healthy while simultaneously rolling out new services.

Reported technology failures in the UK financial services sector increased by 187 percent between 2017 and 2018, according to data released by the Financial Conduct Authority. Most often, the cause of a failure is a problem in newly released functionality. At the same time, it is critically important for banks to keep all services running without interruption and to report on transactions almost instantly. Customers always get nervous when their money is left hanging somewhere, and a customer who is nervous about their money is a sure sign of trouble.

A few months after the TSB crash (by which time the bank's CEO had resigned), the UK's financial regulators and the Bank of England issued a discussion paper on operational resilience. In it, they tried to raise the question of how far banks had gone in their pursuit of innovation, and whether they could guarantee the stable operation of the systems they already have.

The document also proposed amending the legislation to hold employees within a company accountable for what goes wrong in that company's IT systems. British parliamentarians explained it this way: "When you are personally responsible, and can be bankrupted or sent to prison, it greatly changes your attitude towards the work, including how much time you devote to reliability and security."

Results

Every update and patch comes down to risk management, especially when hundreds of millions of dollars are at stake: if something goes wrong, it can be costly in both money and reputation. These would seem to be obvious things, and the bank's failed migration should have taught it a lot.

It should have. But it didn't. In November 2019, TSB, which had returned to profitability and was slowly repairing its reputation, "delighted" customers with another IT failure. This second blow means the bank will be forced to close 82 branches in 2020 to cut costs. Or it could simply have stopped skimping on IT specialists.

Being miserly with IT comes at a price in the end. TSB reported a loss of $134 million in 2018, compared with a profit of $206 million in 2017. Post-migration expenses, including customer compensation, resolving fraudulent transactions (which spiked during the banking turmoil), and third-party assistance, totaled $419 million. The bank's IT provider was also billed $194 million for its role in the crisis.

However, no matter what lessons are learned from the TSB outage, there will still be outages; they are inevitable. But with testing and good code, crashes and downtime can be greatly reduced. Cloud4Y, which often helps large companies migrate to cloud infrastructure, understands how important it is to move from one system to another quickly. That is why we can perform load testing and use a multi-level backup system, as well as other options that let you check everything possible before starting the migration.
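As a rough illustration of what pre-migration load testing can mean in practice (the endpoint and the numbers below are made up), even a short script that fires concurrent requests at a staging system and records latencies will show whether the new platform keeps up:

```python
# A minimal load-test sketch; the URL and numbers are made up for illustration.
# It fires concurrent requests at a test endpoint and reports latency figures.

import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TEST_URL = "https://staging.example.com/health"  # hypothetical staging endpoint
REQUESTS = 200
CONCURRENCY = 20


def timed_request(_):
    """Perform one request and return how long it took, in seconds."""
    start = time.perf_counter()
    with urlopen(TEST_URL, timeout=10) as response:
        response.read()
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(REQUESTS)))

p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"median: {statistics.median(latencies) * 1000:.0f} ms, "
      f"p95: {p95 * 1000:.0f} ms")
```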

What else you can read on the Cloud4Y blog:

Salt solar energy
Pentesters at the forefront of cybersecurity
The Great Snowflake Theory
Internet in balloons
Do data centers need pillows?

Subscribe to our Telegram channel so you don't miss the next article. We write no more than twice a week and only on business.

Source: habr.com
