Folklore of programmers and engineers (part 1)

This is a collection of stories from the Internet about how bugs sometimes have absolutely incredible manifestations. Perhaps you also have something to say.

Car allergy to vanilla ice cream

A story for engineers who understand that the obvious answer is not always the right one, and that no matter how implausible the facts may seem, they are still facts. The Pontiac Division of General Motors received the following complaint:

I am writing to you for the second time, and I don't blame you for not answering, because it sounds crazy. Our family has a tradition: ice cream every evening after dinner. The flavors vary, and each evening after dinner the whole family picks which ice cream to buy, and then I drive to the store. I recently bought a new Pontiac, and since then my ice cream trips have become a problem. You see, every time I buy vanilla ice cream and come back from the store, the car won't start. If I bring any other flavor, the car starts without a hitch. I want to ask a serious question, however stupid it sounds: what is it about a Pontiac that keeps it from starting when I bring vanilla ice cream, yet lets it start easily when I bring any other flavor?

As you can imagine, the division president was skeptical about the letter. Just in case, though, he sent an engineer to check it out. To his surprise, the engineer was met by a wealthy, well-educated man living in a nice neighborhood. They agreed to meet right after dinner and go to the ice cream store together. It was vanilla that evening, and when they got back in the car, it wouldn't start.

The engineer came back for three more evenings. The first time the ice cream was chocolate. The car started. The second time it was strawberry. The car started. The third evening he asked for vanilla. The car wouldn't start.

Being a man of common sense, the engineer refused to believe the car was allergic to vanilla ice cream. So he arranged with the owner to keep coming back until he found a solution. And along the way he started taking notes: he wrote down everything, the time of day, the type of gasoline, the time of arrival at and return from the store, and so on.

The engineer soon realized that the owner spent less time buying vanilla ice cream. The reason was the store's layout. Vanilla was the most popular flavor and was kept in a separate freezer at the front of the store, where it was easy to grab. All the other flavors were at the back of the store, and finding the right one and paying took much longer.

Now the engineer had a question: why wouldn't the car start when less time had passed since the engine was switched off? Since the problem was time, not vanilla ice cream, he quickly found the answer: vapor lock. It happened every evening, but when the owner spent longer shopping, the engine had time to cool down enough to start easily. When he bought vanilla, the engine was still too hot and the vapor lock hadn't had time to dissipate.

Moral: even completely insane problems are sometimes real.

Crash Bandicoot

It's painful to go through this. As a programmer, you get used to blaming your own code first, second, and third... and somewhere around ten-thousandth place you blame the compiler. Even further down the list, you blame the hardware.

Here is my story about a hardware bug.

For the game Crash Bandicoot I wrote the code for loading from and saving to the memory card. For a smug game developer like me, it looked like a walk in the park: I figured the job would take a few days. In the end I spent six weeks debugging that code. I worked on other problems along the way, but every few days I came back to this code for a few hours. It was agony.

The symptom looked like this: when you save your game progress and the code accesses the memory card, almost always everything goes fine... but occasionally a read or write operation times out for no obvious reason. A write cut short often corrupts the memory card. So when a player tries to save, not only does the save fail, the card gets destroyed too. Crap.

After a while our producer at Sony, Connie Booth, began to panic. We couldn't ship the game with this bug, and after six weeks I still had no idea what was causing it. Through Connie we reached out to other PS1 developers: had anyone seen anything like this? No. Nobody had any problems with the memory card.

When you're out of debugging ideas, practically the only approach left is "divide and conquer": you keep removing code from the buggy program until only a relatively small fragment remains that still causes the problem. In other words, you cut pieces off the program until you're left with the part that contains the bug.

The thing is, it's very hard to cut pieces out of a video game. How do you run it if you've removed the code that simulates gravity? Or the code that draws the characters?

So you have to replace entire modules with stubs that pretend to do something useful but actually do something trivially simple, something that can't contain errors. You have to write these kludges just so the game keeps running at all. It's a slow and painful process.

In short, I did it. I removed more and more pieces of code until only the startup code remained: the code that sets up the system to run the game, initializes the rendering hardware, and so on. Of course, at that point I couldn't bring up the save and load menu, because I would have had to stub out all the graphics code. But I could pretend to be a user driving the (invisible) save and load screen, asking to save and then writing to the memory card.

In the end I was left with a small piece of code that still had the problem, but it still happened only randomly! Most of the time everything worked, and only occasionally did it glitch. I had removed almost all of the game code, yet the bug lived on. That was puzzling: the code that remained didn't really do anything.

At some point, probably around three in the morning, a thought occurred to me. Read and write (I/O) operations involve precise timing. When you work with a hard drive, a memory card, or a Bluetooth module, the low-level code that does the reading and writing does so in step with clock pulses.

The clock keeps a device that is not directly connected to the processor in sync with the code running on the processor. The clock determines the baud rate, the rate at which data is transferred. If the timing gets muddled, either the hardware or the software (or both) gets confused as well. And that's very bad, because data can be corrupted.

What if something in our code was muddling the timing? I checked everything related to timing in the test program's code and noticed that we set the PS1's programmable timer to 1 kHz (1000 ticks per second). That's quite a lot: by default, when the console boots, the timer runs at 100 Hz, and most games use that rate.

Andy, the game's developer, had set the timer to 1 kHz so that motion would be computed more precisely. Andy tends to go over the top: if we're simulating gravity, we're doing it as accurately as possible!

But what if speeding up the timer somehow affected the overall timing of the program, and with it the clock that regulated the baud rate for the memory card?

I commented out the timer code. The error didn't recur. But that didn't mean it was fixed, because the failure happened randomly. What if I was just getting lucky?

A few days later I experimented with the test program again. The bug didn't come back. I returned to the full game code base and changed the save and load code so that the programmable timer was reset to its original value (100 Hz) before accessing the memory card and switched back to 1 kHz afterwards. There were no more failures.
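
Purely for illustration, here is a minimal C sketch of that workaround. The function names are made-up stand-ins, not the actual PS1 SDK calls, which the story does not show:

#include <stdio.h>

/* Sketch of the workaround described above: drop the programmable timer back
   to the console default around memory card access, then restore the game's
   1 kHz rate. set_timer_rate() and write_to_memory_card() are stand-ins. */

#define DEFAULT_TIMER_HZ 100
#define GAME_TIMER_HZ    1000

static void set_timer_rate(int hz)
{
    /* real code would program the hardware timer here */
    printf("timer set to %d Hz\n", hz);
}

static int write_to_memory_card(const void *data, unsigned size)
{
    (void)data;
    printf("writing %u bytes to the memory card\n", size);
    return 0; /* 0 = success in this sketch */
}

static int save_game(const void *data, unsigned size)
{
    set_timer_rate(DEFAULT_TIMER_HZ);   /* slow the timer before touching the card */
    int result = write_to_memory_card(data, size);
    set_timer_rate(GAME_TIMER_HZ);      /* restore the 1 kHz game timer afterwards */
    return result;
}

int main(void)
{
    char slot[128] = {0};               /* pretend save data */
    return save_game(slot, sizeof slot);
}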

But why did this happen?

I went back to the test program again and tried to find some pattern in when the error occurred with the 1 kHz timer. I eventually noticed that the error happened whenever someone was playing with a PS1 controller. Since I rarely did that myself (why would I need a controller while testing the save and load code?), I hadn't noticed the connection. But one day one of our artists was waiting for me to finish testing (I was probably swearing at that moment) and nervously fidgeting with a controller in his hands. An error occurred. "Wait, what?! Come on, do that again!"

Once I realized that the two events were connected, I could reproduce the error easily: start writing to the memory card, wiggle the controller, corrupt the memory card. It looked like a hardware bug to me.

I went to Connie and told her about my discovery. She relayed the information to one of the engineers who had designed the PS1. "Impossible," he replied, "it can't be a hardware problem." I asked Connie to arrange a call for us.

The engineer called me, and we argued in his broken English and my (extremely) broken Japanese. Finally I said, "Let me just send you my 30-line test program, where moving the controller triggers the bug." He agreed. He said it was a waste of time, that he was terribly busy with a new project, but that he would give in because we were a very important developer for Sony. I cleaned up my little test program and sent it to him.

The next evening (we were in LA, he was in Tokyo) he called me and sheepishly apologized. It was a hardware problem.

I don't know exactly what the bug was, but from what I heard at Sony HQ, setting the timer to a high enough rate caused interference with components on the motherboard near the timer crystal. One of those was the baud rate controller for the memory card, which also set the baud rate for the controllers. I'm not a hardware engineer, so I may be muddling the details.

But the bottom line is that there was crosstalk between components on the motherboard. And when data was being transmitted through the controller port and the memory card port at the same time, with the timer running at 1 kHz, bits disappeared, data was lost, and the card got corrupted.

Faulty cows

In the 1980s my mentor Sergey was writing software for the SM-1800, a Soviet clone of the PDP-11. The microcomputer had just been installed at a railway station near Sverdlovsk, an important transport hub in the USSR. The new system was designed for routing wagons and freight traffic, but it had an annoying bug that caused random failures and crashes. The crashes always happened after everyone had gone home for the evening. Yet despite a thorough investigation the next day, the computer passed every manual and automated test and worked correctly. That usually points to a race condition or some other concurrency bug that only shows up under particular conditions. Tired of late-night calls, Sergey decided to get to the bottom of it, starting with figuring out which conditions at the marshalling yard led to the computer breaking down.

First he collected statistics on all the unexplained failures and plotted them by date and time. The pattern was obvious. After observing for a few more days, Sergey realized he could easily predict the times of future failures.

He soon found out that the failures occurred only when the station was sorting wagons of cattle from northern Ukraine and western Russia bound for a nearby slaughterhouse. That in itself was strange, because the slaughterhouse was normally supplied by farms much closer by, in Kazakhstan.

The Chernobyl nuclear power plant had exploded in 1986, and radioactive fallout had made the surrounding areas uninhabitable. Large areas of northern Ukraine, Belarus and western Russia were contaminated. Suspecting a high level of radiation in the arriving wagons, Sergey devised a way to test the theory. Ordinary citizens were forbidden to own dosimeters, so Sergey got friendly with several soldiers stationed at the railway station. After a few shots of vodka he managed to convince one of them to measure the radiation level in one of the suspicious wagons. It turned out to be many times higher than normal.

Not only did the cattle emit a lot of radiation, but its level was so high that it led to random loss of bits in the memory of the SM-1800, which was located in the building next to the station.

There was a food shortage in the USSR, and the authorities decided to mix "Chernobyl" meat with meat from other regions of the country. This lowered the overall level of radioactivity without wasting valuable resources. Upon learning of this, Sergey immediately filled out his emigration paperwork. And the computer crashes stopped by themselves as the radiation level fell over time.

Through the pipes

Once upon a time, Movietech Solutions made software for movie theaters: ticketing, accounting and general management. The DOS version of the flagship application was quite popular among small and medium-sized theater chains in North America. So it's not surprising that when the Windows 95 version was announced, integrated with the latest touch screens and self-service kiosks and equipped with all sorts of reporting tools, it quickly became popular too. Most of the time the upgrade went smoothly. On-site IT staff installed the new hardware, migrated the data, and business carried on. Except when it didn't. When that happened, the company sent in James, nicknamed "the Cleaner."

Although the nickname suggests a sinister character, the Cleaner was just a combination of trainer, installer and jack of all trades. James would spend a few days at a customer's site putting all the components together, then a couple more days teaching the staff to use the new system, fixing whatever hardware problems came up, and generally helping the software through its infancy.

So it's not surprising that one morning during this hectic period, James arrived at the office and, before he could even reach his desk, was greeted by a manager rather more caffeinated than usual.

"I'm afraid you need to go to Annapolis, Nova Scotia, as soon as possible. Their whole system is down, and after a night of working with their engineers we can't figure out what happened. It looks like the network failed on the server. But only after the system has been running for a few minutes."

"They didn't go back to the old system?" James replied quite seriously, though inwardly his eyes widened in surprise.

"Exactly. Their IT guy 'reassessed his priorities' and decided to leave, taking their old server with him. James, they installed the system at six sites and just paid for premium support, and their business is now running like it's the 1950s."

James straightened slightly.

"That's another matter. Okay, let's get started."

When he arrived in Annapolis, the first thing he did was find the first of the client's movie theaters that had had the problem. On the map he picked up at the airport everything looked respectable, but the neighborhood around the address looked suspicious. Not a ghetto, but something out of film noir. As James parked at the curb downtown, a prostitute approached him. Given the size of Annapolis, she was most likely the only one in the whole town. Her appearance immediately brought to mind the famous character who offered sex for money on the big screen. No, not Julia Roberts, Jon Voight [a reference to the film "Midnight Cowboy", translator's note].

Having sent the prostitute on her way, James walked to the theater. The neighborhood got better but still felt run-down. Not that James was particularly worried; he had been in rough places before. Besides, this was Canada, where even muggers are polite enough to say "thank you" after taking your wallet.

The side entrance to the cinema was in a dank alley. James went to the door and knocked. Soon it creaked and opened slightly.

"Are you the Cleaner?" came a hoarse voice from inside.

“Yes, it’s me… I came here to make things right.”

James walked into the foyer of the theater. Having no other option, the staff had started issuing paper tickets to patrons. That made financial reporting difficult, to say nothing of the more interesting details. The employees greeted James with relief and took him straight to the server room.

At first glance everything was in order. James logged into the server and checked the usual suspects. No problems. Still, as a precaution, James shut the server down, replaced the network card and rolled the system back. It immediately came back up and ran at full capacity. The staff started selling tickets again.

James called Mark and briefed him on the situation. It's not hard to imagine that James wanted to stick around and see whether anything unexpected would happen. He went downstairs and started asking the staff what had happened. Obviously, the system had stopped working. They turned it off and on again, and everything worked. But after about ten minutes the system dropped out again.

Just then something similar happened. All of a sudden the ticketing system started throwing errors. The staff sighed and reached for their paper tickets while James hurried to the server room. Everything looked fine on the server.

Then one of the employees came in.

"The system is working again."

James was puzzled because he hadn't done anything. More precisely, nothing that would make the system work. He logged out, picked up his phone, and called his company's help desk. Soon the same employee entered the server room.

"The system is down."

James glanced at the server. An interesting and familiar pattern of multicolored shapes danced across the screen: chaotically twisting and intertwining pipes. We've all seen that screensaver at some point. It was beautifully rendered and literally hypnotic.

James pressed a key and the pattern disappeared. He hurried to the ticket office and on the way met the employee coming back toward him.

"The system is working again."

If it's possible to facepalm mentally, that's exactly what James did. The screensaver. It uses OpenGL, and while it runs it consumes all of the server's CPU. As a result, every request to the server ends in a timeout.

James went back to the server room, logged in, and replaced the beautiful pipes screensaver with a blank screen. That is, in place of a screensaver that ate 100% of the CPU he set one that consumes no resources. Then he waited ten minutes to confirm his guess.

When James arrived at the next cinema, he thought about how to explain to his manager that he had just flown 800 miles to turn off the screensaver.

Failure in a certain phase of the moon

A true story. There once was a software bug that depended on the phase of the moon. There was a small subroutine commonly used in various MIT programs to calculate an approximation of the true phase of the moon. GLS built this subroutine into a LISP program that, when writing a file, would output a timestamp line almost 80 characters long. Very occasionally the first line of the message came out too long and wrapped onto the next line, and when the program later read the file back in, it choked. The length of that first line depended on the exact date and time, as well as on the length of the phase-of-the-moon specification at the moment the timestamp was printed. So the bug literally depended on the phase of the moon!

The first paper edition of the Jargon File (Steele-1983) contained an example of such a line, which triggered the bug described, but the typesetter "fixed" it. The bug has since been known as the "phase of the moon bug."

Be careful with assumptions, though. A few years ago, engineers at CERN (the European Organization for Nuclear Research) ran into errors in experiments at the Large Electron-Positron Collider. Since computers heavily process the gigantic amounts of data the machine generates before showing the results to scientists, many assumed the software was somehow sensitive to the phase of the moon. A few desperate engineers got to the bottom of it. The errors arose from a slight change in the geometry of the 27 km ring caused by the deformation of the Earth as the Moon passed overhead! The story has entered physics folklore as "Newton's revenge on particle physics" and as an example of the connection between the simplest, oldest physical laws and the most advanced scientific concepts.

Flushing the toilet stops the train

The best hardware bug I've heard of was on a high-speed train in France. The bug triggered the train's emergency braking, but only when there were passengers on board. Each time, the train was taken out of service and inspected, but nothing was found. It was then sent back onto the line, and it immediately made another emergency stop.

During one of the inspections, an engineer riding the train went to the toilet. He flushed, and BOOM! Emergency stop.

The engineer contacted the driver and asked:

— What did you do just before braking?

- Well, I slowed down on the descent ...

It was strange, because in normal running the train slows down dozens of times on the descents. The train went on, and on the next descent the driver warned:

— I'm going to slow down.

Nothing happened.

— What did you do during the last braking? the driver asked.

“Well… I was on the toilet…”

— Well, then go to the toilet and do what you did when we hit the next descent!

The engineer went to the toilet, and when the driver warned, "I'm braking," he flushed. Of course, the train stopped immediately.

Now they could reproduce the problem and needed to find the cause.

Two minutes later they noticed that the remote brake-control cable for the engine (the train had an engine at each end) had come loose from the wall of the electrical cabinet and was lying across the relay that controlled the toilet flush solenoid... When the relay energized, it induced interference on the brake cable, and the fail-safe system simply triggered emergency braking.

Gateway that hated FORTRAN

A few months ago we noticed that network connections to the mainland [this was in Hawaii] were getting very, very slow. This could last 10-15 minutes, and then everything would suddenly be fine again. A while later a colleague complained to me that connections to the mainland weren't working at all. He had some FORTRAN code that needed copying to a machine on the mainland, but it wouldn't go through because "the network didn't stay up long enough for the FTP transfer to complete."

Yes, it turned out that the network failures occurred whenever the colleague tried to FTP a FORTRAN source file to a machine on the mainland. We tried compressing the file: then it copied over quietly (but there was no unpacker on the target machine, so the problem wasn't solved). Finally we "split" the FORTRAN code into very small pieces and sent them one by one. Most of the fragments copied without trouble, but a few wouldn't go through at all, or only after numerous attempts.

Examining the problematic fragments, we found they had something in common: they all contained comment blocks that began and ended with lines made up of repeated capital letter C's (that was how the colleague liked to format his FORTRAN comments). We emailed the mainland network specialists and asked for help. Of course, they wanted to see samples of the files that couldn't be transferred via FTP... but our emails never reached them. In the end we simply described, in words, what the untransferable files looked like. It worked :) [Do I dare add an example of one of the problematic FORTRAN comments here? Probably better not!]

In the end we figured it out. A new gateway had recently been installed between our part of the campus and the link to the mainland network. It had HUGE problems transmitting packets containing repeated runs of uppercase C! Just a few of these packets could soak up all of the gateway's resources and keep most other packets from getting through. We complained to the gateway's manufacturer... and they replied: "Ah yes, you've hit the repeated-C bug! We already know about it." In the end we solved the problem by buying a gateway from a different manufacturer (in the first one's defense, for some people the inability to transfer FORTRAN programs might be an advantage!).

Hard times

A few years ago, while working on a Perl ETL system built to reduce the cost of Phase III and IV clinical trials, I needed to process about 40,000 dates. Two of them failed validation. That didn't bother me much, because those dates came from client-provided data that was often, shall we say, surprising. But when I checked the original data, it turned out those dates were January 1, 2011 and January 1, 2007. I assumed the bug was in the program I had just written, but it turned out to be 30 years old. This may sound mysterious to anyone unfamiliar with the software ecosystem. Thanks to another company's long-ago decision about how to make money, my client was paying me to fix a bug that one company had introduced by accident and another had introduced deliberately. For you to understand what I'm talking about, I need to talk about the company that added the feature that eventually became a bug, along with a few other curious events that contributed to the mysterious bug I fixed.

In the good old days, Apple computers would sometimes spontaneously reset their date to January 1, 1904. The reason was simple: the date and time were tracked by a battery-powered "system clock." What happened when the battery died? Apple computers tracked the date as the number of seconds since an epoch. An epoch is simply a reference date, and for the Macintosh it was January 1, 1904. When the battery died, the current date was reset to that epoch. But why did that happen?

At the time, Apple used 32 bits to store the number of seconds since that reference date. One bit can store one of two values, 0 or 1. Two bits can store one of four values: 00, 01, 10, 11. Three bits can store one of eight values: 000, 001, 010, 011, 100, 101, 110, 111, and so on. Thirty-two bits can store one of 2^32 values, that is, 4,294,967,296 seconds. For Apple's dates that comes to roughly 136 years (4,294,967,296 seconds divided by about 31.5 million seconds per year), which is why older Macs can't handle dates after 2040. And when the system battery died, the date was reset to zero seconds since the epoch, and you had to set the date manually every time you turned the computer on (or until you bought a new battery).

However, Apple's decision to store dates as seconds since the epoch meant the system couldn't handle dates before the epoch, which, as we'll see, had far-reaching consequences. What Apple introduced was a feature, not a bug. Among other things, it meant that the Macintosh operating system was immune to the "millennium bug" (which could not be said of the many Mac applications that rolled their own date handling to get around the limitations).

Moving on. Enter Lotus 1-2-3, the IBM PC's "killer app" that helped launch the PC revolution, even though it was VisiCalc on the Apple that first made the personal computer a success. To be fair, if 1-2-3 hadn't appeared, PCs probably wouldn't have taken off, and the history of the personal computer could have gone very differently. Lotus 1-2-3 incorrectly treated 1900 as a leap year. When Microsoft released its first spreadsheet, Multiplan, it captured only a small share of the market. So when the Excel project was launched, they decided not only to copy the row-and-column naming scheme from Lotus 1-2-3, but also to be bug-compatible with it, deliberately treating 1900 as a leap year. That problem persists to this day. So in 1-2-3 it was a bug, but in Excel it was a conscious decision that guaranteed every 1-2-3 user could import their spreadsheets into Excel without the data changing, even when it was wrong.

But there was another wrinkle. Microsoft first released Excel for the Macintosh, which did not recognize dates before January 1, 1904, whereas in Excel the epoch began on January 1, 1900. So the developers added a change: the program detects which epoch a file uses and stores its data internally according to that epoch. Microsoft even wrote an explanatory article about it. And that solution led to my bug.

My ETL system received Excel spreadsheets from customers. They were created on Windows, but they could also have been created on a Mac. So a spreadsheet's epoch could begin on either January 1, 1900, or January 1, 1904. How do you find out which? The Excel file format exposes the necessary information, but the parser I was using didn't (it does now) and assumed you knew the epoch of a given spreadsheet. I probably could have spent more time figuring out the Excel binary format and sending a patch to the parser's author, but I had plenty of other work to do for the client, so I quickly wrote a heuristic to determine the epoch. It was simple.

In Excel, the date July 5, 1998 can be displayed as "07-05-98" (the useless US system), "Jul 5, 98", "July 5, 1998", "5-Jul-98", or some other useless format (ironically, one format my version of Excel did not offer was the ISO 8601 standard). Internally, though, the raw date was stored either as "35981" for the 1900 epoch or as "34519" for the 1904 epoch (the numbers are days since the epoch). I simply used a lightweight parser to extract the year from the formatted date and the Excel parser to extract the year from the unformatted value. If the two differed by 4 years, I knew the spreadsheet used the 1904 epoch.

Why not just use the formatted dates? Because July 5, 1998 could also be formatted as "July, 98", with the day of the month missing. We received spreadsheets from so many companies, created in so many different ways, that it was up to us (in this case, me) to make sense of the dates. Besides, if Excel can get it right, so can we!

Then I ran into 39082. Remember that Lotus 1-2-3 treated 1900 as a leap year, and that this was faithfully copied into Excel. Since that adds one extra day to 1900, many date functions could be off by one on that very day. So 39082 could be January 1, 2011 (on a Mac) or December 31, 2006 (on Windows). If my "year parser" extracted 2011 from the formatted value, all was well. But since the Excel parser doesn't know which epoch is in use, it defaults to the 1900 epoch and returns 2006. My application saw that the difference was 5 years, treated it as an error, logged it, and returned the unformatted value.

To get around this, I wrote this (pseudocode):

diff = formatted_year - parsed_year
if 0 == diff
    assume 1900 date system
if 4 == diff
    assume 1904 date system
if 5 == diff and month is December and day is 31
    assume 1904 date system

And then all 40,000 dates were parsed correctly.

In the midst of large print jobs

In the early 1980s, my father worked at Storage Technology, a now-defunct division that built tape drives and pneumatic systems for high-speed tape feeding.

They had redesigned the drives so that a single central "A" drive could be connected to seven "B" drives, with a small OS in RAM on the "A" drive delegating reads and writes to all the "B" drives.

Each time drive A was started, a floppy disk had to be inserted into the peripheral drive connected to A in order to load the operating system into its memory. It was extremely primitive: computing power was provided by an 8-bit microcontroller.

The target audience for this equipment was companies with very large data warehouses - banks, retail chains, etc. - who needed to print a lot of address labels or bank statements.

One client had a problem. In the middle of a print job, one particular "A" drive would stop working, causing the entire job to stall. To restore the drive, the staff had to reboot everything. And if this happened in the middle of a six-hour task, then a huge amount of expensive computer time was lost and the schedule of the entire operation was disrupted.

Technicians were sent out from Storage Technology. But despite their best efforts, they could not reproduce the bug under test conditions: it seemed to occur only in the middle of large print jobs. It wasn't a hardware problem; they replaced everything they could (the RAM, the microcontroller, the floppy drive, every conceivable part of the tape drive) and the problem persisted.

So the technicians called headquarters and summoned the Expert.

The Expert grabbed a chair and a cup of coffee, sat down in the computer room (in those days there were rooms dedicated to computers) and watched as the staff queued up a large print job. The Expert waited for a failure, and one came. Everyone looked to the Expert, and he had no idea why it had happened. So he ordered the job re-queued, and the staff and technicians went back to work.

The Expert settled back into his chair and waited for the next failure. It took about six hours, but the failure came. Again the Expert had no ideas, other than that it had all happened in a room full of people. He ordered the job restarted, sat back down and waited.

By the third failure, the Expert had noticed something. The failure happened while staff were changing tapes in an unrelated drive. Moreover, it happened the moment one of the employees stepped on a particular tile in the floor.

The raised floor was made of aluminum tiles laid at a height of 6-8 inches. Numerous wires from computers ran under the raised floor so that someone would not accidentally step on an important cable. The tiles were laid very tightly to keep debris out of the raised floor.

The Expert realized that one of the tiles was warped. When an employee stepped on its corner, its edges rubbed against the neighboring tiles. The plastic fittings that joined the tiles rubbed along with them, producing static micro-discharges that generated radio-frequency interference.

Today, RAM is far better shielded from radio-frequency interference. But back then it wasn't. The Expert realized that the interference was corrupting memory, and with it the operation of the operating system. He called support, ordered a new tile, installed it himself, and the problem disappeared.

It's the tide!

The story took place in a server room, on the fourth or fifth floor of an office in Portsmouth (I think), in the dock area.

One day the Unix server holding the main database crashed. It was rebooted, but it happily kept falling over again and again. We decided to call someone from support.

The support guy... I think his name was Mark, but it doesn't matter... I didn't actually know him. It really doesn't matter. Let's go with Mark, shall we? Great.

So, a few hours later Mark arrived (it's no short trip from Leeds to Portsmouth, you know), turned the server on, and everything worked without a hitch. Typical fucking support call, and the customer gets very annoyed about it. Mark looked through the log files and found nothing out of the ordinary. Then Mark got back on the train (or whatever mode of transport he came by; for all I know it could have been a lame cow... anyway, it doesn't matter, okay?) and headed back to Leeds, a day wasted.

That same evening the server crashed again. Same story... the server wouldn't come back up. Mark tried to help remotely, but the customer couldn't get the server started.

Another train, bus, lemon meringue or some other nonsense, and Mark was back in Portsmouth. And look, the server booted without a problem! A miracle. Mark spent several hours checking that everything was in order with the operating system and the software, and headed back to Leeds.

Around midday the server crashed again (bear with me!). This time it seemed reasonable to bring in the hardware support people and replace the server. But no: about 10 hours later it crashed too.

The situation repeated itself for several days. The server would run, crash after about 10 hours, and refuse to start for the next 2 hours. They checked the cooling, checked for memory leaks, checked everything, and found nothing. Then the crashes stopped.

A carefree week went by... everyone was happy. Happy until it all started again. The picture was the same: 10 hours of uptime, 2-3 hours of downtime...

And then someone (I think I was told that this person had nothing to do with IT) said:

"It's the tide!"

The exclamation was met with blank stares, and someone's hand probably hovered over the button to call security.

"It stops working with the tide."

Understandably, this was a completely foreign concept to the IT support staff, who are hardly the sort to read tide tables over their coffee. They explained that it couldn't possibly have anything to do with the tide, because the server had just run for a week without a single failure.

"Last week the tide was low, but this week the tide is high."

A bit of terminology for those without a yacht licence. Tides depend on the lunar cycle. As the Earth rotates, roughly every 12.5 hours the gravitational pull of the Sun and Moon produces a tidal wave. At the start of the 12.5-hour cycle there is a high tide, in the middle of the cycle a low tide, and at the end a high tide again. But as the Moon moves along its orbit, the difference between low and high tide changes. When the Moon is between the Sun and the Earth, or on the opposite side of the Earth from the Sun (new moon or full moon), we get spring (syzygy) tides: the highest high tides and the lowest low tides. At half moon we get neap (quadrature) tides, when the difference between the two extremes is smallest. The lunar cycle lasts about 28 days: spring, neap, spring, neap.

Once the tidal forces had been explained to the techies, their first thought was to call security. Quite logical. But it turned out the guy was right. Two weeks earlier a destroyer had moored near the office. Every time the tide lifted it to a certain height, the ship's radar mast ended up level with the server-room floor. And the radar (or the electronic-warfare gear, or some other military toy) wreaked havoc on the computers.

Flight mission for a rocket

I was assigned to port a large (about 400,000 lines) missile launch command-and-control system to new versions of the operating system, compiler and language. Specifically, from Solaris 2.5.1 to Solaris 7, and from the Verdix Ada Development System (VADS), written for Ada 83, to the Rational Apex Ada system, written for Ada 95. VADS had been bought by Rational, and its product was deprecated, although Rational tried to implement compatible versions of the VADS-specific packages to ease the transition to the Apex compiler.

Three people helped me just to get the code compiling cleanly. That took two weeks. Then I worked on my own to get the system actually running. In short, it was the worst architecture and implementation of a software system I have ever seen, so it took another two months to complete the port. The system was then handed over for testing, which took several more months. I fixed the bugs found in testing as they came in, and their number dropped off quickly (the source code was a production system, so its functionality was quite reliable; I just had to clean up the bugs introduced by adapting to the new compiler). Eventually, when everything worked as it should, I was moved to another project.

And on the Friday before Thanksgiving, the phone rang.

A missile launch test was scheduled about three weeks out, and during a laboratory run of the countdown the command sequence locked up. In real life this would have aborted the test, and if the lock-up had happened within a few seconds of engine start, several irreversible actions would have occurred in the auxiliary systems, making it a long and expensive job to prepare the missile again. The missile wouldn't actually have launched, but a lot of people would have been very upset about the lost time and the very, very large sums of money. Don't let anyone tell you the Department of Defense spends money carelessly: I've never met a contract manager who didn't put budget first or second, with schedule right behind.

In the preceding months this countdown test had been run hundreds of times in many variations, with only a few minor hiccups. So the probability of this happening was very low, but the consequences were very significant. Multiply those two factors together and you'll see why the news promised a ruined holiday week for me and for dozens of engineers and managers.

And attention turned to me, as the person who had ported the system.

As with most safety-critical systems, many parameters were logged, so it was fairly easy to identify the handful of lines of code that had executed just before the system locked up. And of course there was absolutely nothing unusual about them; the same statements had executed successfully literally thousands of times during the same run.

We called in the Apex people at Rational, because they had developed the compiler and some of the routines they had written were called in the suspect code. It was impressed upon them (and everyone else) that the cause of a problem of literally national importance had to be found.

Since there was nothing interesting in the logs, we decided to try to reproduce the problem in the local lab. That was no easy task, given that the event occurred roughly once in 1,000 runs. One hypothesis was that a call to a vendor-supplied mutex function, Unlock (part of the VADS migration package), was not unlocking. The processing thread that called it handled heartbeat messages that nominally arrived every second. We raised the rate to 10 Hz, that is, 10 per second, and started a run. About an hour later the system locked up. In the log we saw that the sequence of recorded messages matched the one from the failed test. We did several more runs; the system reliably locked up 45 to 90 minutes after start, and each time the log showed the same trace. Even though we were technically running different code now (the message rate was different), the behavior was the same, so we were confident this load scenario was triggering the same problem.

Now it was necessary to find out exactly where in the sequence of expressions the blocking occurred.

This implementation of the system used the Ada task system, and it was incredibly poorly used. Tasks are a high-level concurrently executing construct in Ada, sort of like threads of execution, only built into the language itself. When two tasks need to communicate, they “rendezvous”, exchange the necessary data, and then stop rendezvous and return to their independent executions. However, the system was implemented differently. After a target was rendezvoused, that target rendezvoused with another task, which then rendezvoused with a third, and so on until some processing was completed. After that, all these rendezvous ended and each task had to return to its execution. That is, we were dealing with the most expensive function-call system in the world, which stopped the entire "multitasking" process while processing part of the input data. And before, this did not lead to problems just because the throughput was very low.

I described this task mechanism because when a rendezvous was requested or expected to complete, a "task switch" could occur. That is, the processor could start processing another task ready to be executed. It turns out that when one task is ready to rendezvous with another task, a completely different task can start executing, and eventually control returns to the first rendezvous. And there may be other events that lead to a task switch; one such event is a call to a system function, such as printing or executing a mutex.

To understand which line of code was causing the problem, I needed to find a way to record the progress of a sequence of expressions without triggering a task switch, which could prevent a crash from occurring. So I couldn't take advantage Put_Line()to do no I/O. It was possible to set a counter variable or something like that, but how can I see its value if I cannot display it on the screen?

Also, when studying the log, it turned out that, despite the hangup of the processing of heartbeat messages, which blocked all I / O operations of the process and did not allow other processing to be performed, other independent tasks continued to be executed. That is, the work was not blocked entirely, only a (critical) chain of tasks.

This was the hook needed to evaluate the blocking expression.

I made an Ada package that contained a task, an enumerated type, and a global variable of that type. Enumerated literals were tied to specific expressions in the problematic sequence (for example, Incrementing_Buffer_Index, Locking_Mutex, Mutex_Unlocked) and then inserted assignment expressions into it that assigned the corresponding enumeration to the global variable. Because the object code for all of this simply kept a constant in memory, a task switch as a result of its execution was extremely unlikely. We first suspected expressions that could switch tasks, because the blocking occurred on execution, not on return when the task was switched back (for several reasons).

The tracking task simply ran in a loop and periodically checked to see if the value of the global variable had changed. With each change, the value was saved to a file. Then a short wait and a new check. I wrote the variable to a file because the task was only executed when the system selected it for execution when switching tasks in the problem area. Whatever happens in this task would not affect other unrelated blocked tasks.

It was expected that when the system reaches the execution of the problematic code, the global variable will be reset when moving to each next expression. Then something happens that causes a task switch, and since its execution frequency (10 Hz) is lower than that of the monitoring task, the monitor could fix the value of the global variable and write it out. In a normal situation, I could get a repeating sequence of a subset of enums: the last values ​​of a variable at the time of the task switch. When hovering, the global variable should no longer change, and the last value written will indicate which expression did not complete execution.
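
To make the shape of that trick concrete, here is a rough sketch of the same idea in C. The original was an Ada package with a monitoring task; the step names echo the ones mentioned above, while everything else (the thread, the file name, the polling interval) is purely illustrative:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* A volatile global marks which step the suspect code has reached; a monitor
   thread writes every change to a file, so no I/O or locking happens on the
   path being debugged. */

typedef enum {
    STEP_IDLE,
    STEP_INCREMENTING_BUFFER_INDEX,
    STEP_LOCKING_MUTEX,
    STEP_MUTEX_UNLOCKED
} step_t;

static volatile step_t current_step = STEP_IDLE;

static void *monitor(void *arg)
{
    FILE *log = fopen("trace.log", "w");
    if (log == NULL)
        return arg;
    step_t last = current_step;
    for (;;) {
        step_t now = current_step;
        if (now != last) {            /* record only the changes */
            fprintf(log, "%d\n", (int)now);
            fflush(log);
            last = now;
        }
        usleep(10000);                /* poll faster than the 10 Hz workload */
    }
    return arg;
}

static void suspect_sequence(void)
{
    current_step = STEP_INCREMENTING_BUFFER_INDEX;  /* plain store: no switch */
    /* ... increment the buffer index ... */
    current_step = STEP_LOCKING_MUTEX;
    /* ... the mutex Lock/Unlock calls that eventually hang would go here ... */
    current_step = STEP_MUTEX_UNLOCKED;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, monitor, NULL);
    suspect_sequence();
    sleep(1);   /* give the monitor a chance to flush the last value */
    return 0;
}

The last value that appears in trace.log before it stops changing points at the statement that never completed, which is exactly how the hang was localized in the story.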

I ran the code with the tracing in place. It hung. And the monitoring worked like clockwork.

The log contained the expected sequence, and it ended with a value indicating that the mutex Unlock had been called and the task had not completed, unlike the thousands of previous calls.

Meanwhile the Apex engineers were frantically analyzing their code, and they found a place in the mutex where, in theory, it could lock up. But the probability was very small, since only a particular sequence of events with particular timing could lead to the lock-up. Murphy's Law, folks, Murphy's Law.

To protect the piece of code in question, I replaced the calls to the vendor's mutex functions (which were built on top of the OS mutex functionality) with a small native Ada mutex package, and used it to control mutual exclusion for that piece of code.

I put it into the code and ran the test. Seven hours later the code was still running.

My code was taken to Rational, where it was compiled, disassembled, and checked to make sure it didn't use the same approach that was used in the problematic mutex functions.

It was the most crowded code review of my career 🙂 There were about ten engineers and managers in the room with me, a dozen more people connected via conference call - and they all examined about 20 lines of code.

The code was tested, new executables were compiled and submitted for formal regression testing. A couple of weeks later, the countdown tests were successful and the rocket took off.

Okay, that's all well and good, but what's the point of this story?

It was an absolutely disgusting problem. Hundreds of thousands of lines of code, parallel execution, over a dozen interacting processes, poor architecture and poor implementation, interfaces to embedded systems, and millions of dollars wasted. No pressure, right.

I wasn't the only one working on the problem, although as the person who had done the port I was in the spotlight. But having done the port doesn't mean I had gone through all those hundreds of thousands of lines of code, or even skimmed them. Engineers all over the country were analyzing the code and the logs, but when they brought me their hypotheses about the cause of the failure, it took me half a minute to refute them. And when I was asked to analyze theories myself, I passed that on to someone else, because it was obvious to me that those engineers were on the wrong track. Sounds presumptuous? Yes, it does, but I rejected the hypotheses and the requests for another reason.

I understood the nature of the problem. I didn't know exactly where it lived or why, but I knew exactly what was happening.

Over the years I had accumulated a lot of knowledge and experience. I was one of the pioneers of Ada and understood its strengths and weaknesses. I know how the Ada runtime libraries handle tasks and deal with parallel execution. And I understand low-level programming, down at the level of memory, registers and assembler. In other words, I have deep knowledge of my field, and I used it to find the cause of the problem. I didn't just work around the bug; I understood how to hunt it down in a very sensitive runtime environment.

Stories like these, about wrestling with code, aren't very interesting to people who aren't familiar with what that wrestling looks like and the conditions it happens under. But they do help explain what it takes to solve really hard problems.

To solve really hard problems, you need to be more than just a programmer. You need to understand the "fate" of the code, how it interacts with its environment, and how the environment itself works.

And then you'll have your own ruined holiday week.

To be continued.

Source: habr.com
