Dear Google Cloud, Your Lack of Backward Compatibility Is Killing You

God damn it, Google, I didn't want to blog again. I have so many things to do. Blogging takes time, energy and creativity that I could put to good use: my books, music, my game and so on. But you've pissed me off enough to have to write this.

So let's get this over with.

I'll start with a small but instructive story from the time when I first started working at Google. I know I've said a lot of bad things about Google lately, but it frustrates me when my own company regularly makes incompetent business decisions. At the same time, we must give credit where it's due: Google's internal infrastructure is truly extraordinary, and it's safe to say that there is nothing better today. The founders of Google were much better engineers than I'll ever be, and this story only confirms that fact.

First, a little background: Google has a data storage technology called Bigtable. It was a remarkable technical achievement, one of the first (if not the first) “infinitely scalable” key-value (K/V) stores: essentially the beginning of NoSQL. Bigtable still thrives in the rather crowded K/V storage space these days, but at the time (2005) it was amazingly cool.

One funny thing about Bigtable is that it had internal control plane objects (as part of the implementation) called tablet servers, with large indexes, and at some point they became a bottleneck when scaling the system. The Bigtable engineers were wrestling with how to make it scale, and suddenly they realized that they could replace the tablet servers with other Bigtables. So Bigtable is part of the Bigtable implementation. There are Bigtables at every level.

Another interesting detail is that for a while, Bigtables became popular and ubiquitous within Google, and each team had its own. So at one of the Friday meetings, Larry Page casually asked in passing: "Why do we have more than one Bigtable? Why not just one?" In theory, one Bigtable should have been enough for all of Google's storage needs. Of course, they never moved to just one, for practical engineering reasons (such as the consequences of a potential failure), but the theory was interesting. One Bigtable for the entire universe. (By the way, does anyone know if Amazon did this with their Sable?)

Anyway, here's my story.

At the time, I had been working at Google for just over two years, and one day I received an email from the Bigtable engineering team that went something like this:

Dear Steve,

Greetings from the Bigtable team. We want to let you know that you are running a very, very old Bigtable binary at [datacenter name]. This version is no longer supported and we want to help you upgrade to the latest version.

Please let me know if you can schedule some time to work together on this issue.

All the best,
Bigtable team

You get a lot of mail at Google, so at first glance I read it as something like this:

Dear recipient,

Hello from some team. We want to report that blah blah blah blah blah. Blah blah blah blah blah blah, and blah blah blah immediately.

Please let us know if you can schedule some of your precious time for blah blah blah.

All the best,
Some team

I almost deleted it immediately, but at the edge of my consciousness I had a nagging, aching feeling that it didn't quite look like a form letter, even though it had obviously been sent to the wrong person, because I didn't use Bigtable.

But it was strange.

For the rest of the day, I alternated between work and deciding what kind of shark meat to try in the micro-kitchens, of which at least three were close enough to hit from my seat with a well-aimed throw of a biscuit. But the thought of the letter never left me, and I had a growing feeling of mild anxiety.

They had clearly addressed me by name. The email had been sent to my address, not someone else's, and I wasn't cc'd or bcc'd. The tone was very personal and clear. Maybe it was some kind of mistake?

Finally, curiosity got the better of me and I went to look at the Borg console in the data center they mentioned.

And of course, I had a Bigtable instance running under my name. I'm sorry, what? I looked at its contents, and wow! It was from the Codelab that I sat through during my first week at Google in June 2005. The Codelab made you start up a Bigtable so that you could write some values to it, and I apparently never shut it down afterwards. It was still running even though more than two years had passed.

There are several noteworthy aspects to this story. Firstly, running a Bigtable was so insignificant at Google's scale that it took two years before anyone noticed the extra instance, and even then only because the binary version was outdated. By comparison, I once considered using Bigtable on Google Cloud for my online game. At the time, the service cost approximately $16 per year for an empty Bigtable on GCP. I'm not saying they're cheating you, but in my personal opinion, that's a lot of money for an empty fucking database.

Another noteworthy aspect is that the instance was still working two years later. WTF? Data centers come and go; they experience outages, they undergo scheduled maintenance, they change all the time. Hardware is updated, switches are swapped, everything is constantly being improved. How the hell did they manage to keep my program running for two years with all these changes? This may seem like a modest achievement in 2020, but in 2005-2007 it was quite impressive.

And the most remarkable aspect is that an outside engineering team in some other state reached out to me, the owner of some tiny, almost empty Bigtable instance that had had zero traffic for the previous two years, and offered to help me upgrade it.

I thanked them, deleted the instance, and life went on as usual. But thirteen years later, I'm still thinking about that letter. Because sometimes I get similar emails from Google Cloud. They look like this:

Dear Google Cloud User,

As a reminder, we will be deprecating [an important service you use] as of August 2020, after which you will not be able to upgrade your instances. We recommend upgrading to the latest version, which is in beta testing, has no documentation, has no migration path, and has already been deprecated in advance, with our kind help.

We are committed to ensuring that this change will have minimal impact on all users of the Google Cloud platform.

Best friends forever,
Google Cloud Platform

But I hardly ever read such letters, because what they actually say is this:

Dear recipient,

Go to hell. You go, you go, you go. Drop everything you're doing because it doesn't matter. What matters is our time. We spend time and money supporting our shit and we're tired of it so we won't support it anymore. So drop your fucking plans and start digging through our shitty documentation begging for leftovers on the forums and by the way our new shit is completely different from the old shit because we messed up that design pretty bad heh but that's your problem not ours.

We continue to make efforts to ensure that all your developments become unusable within one year.

Please go to hell,
Google Cloud Platform

And the fact is that I receive such letters about once a month. It happens so often and so consistently that they have inevitably pushed me away from GCP and into the anti-cloud camp. I no longer agree to depend on their proprietary offerings, because it is actually easier for a devops engineer to maintain an open source system on a bare virtual machine than to try to keep up with Google and its policy of shutting down "outdated" products.

Before going back to Google Cloud, because I'm not even close to finished criticizing them, let's look at the company's performance in some other areas. Google engineers pride themselves on their software engineering discipline, and that's what really causes problems. Pride is a trap for the unwary, and has led many at Google to think that their decisions are always right, and that being right (by some vague, fuzzy definition) is more important than customer care.

I'll give a few arbitrary examples from other big projects outside of Google, but I hope you see this pattern everywhere. It is as follows: backwards compatibility keeps systems alive and up-to-date for decades.

Backward compatibility is the design goal of all successful systems designed for open use, i.e. implemented with open source and/or open standards. I feel like I'm saying something so obvious that it's embarrassing for everyone, but no. This is a political issue, so examples are needed.

The first system I'll choose is the oldest one: GNU Emacs, which is kind of a hybrid between Windows Notepad, an OS kernel, and the International Space Station. It's a bit hard to explain, but in a nutshell, Emacs is a platform created in 1976 (yes, almost half a century ago) for programming that makes you more productive, masquerading as a text editor.

I use Emacs every single day. Yes, I also use IntelliJ every day; it has become a powerful tooling platform in its own right. But writing extensions for IntelliJ is a much more ambitious and difficult task than writing extensions for Emacs. And more importantly, everything written for Emacs is preserved forever.

I still use software that I wrote for Emacs back in 1995. And I'm sure someone is using modules written for Emacs in the mid 80's, if not earlier. From time to time they may require minor tweaking, but this is really quite rare. I don't know of anything I've ever written for Emacs (and I've written a lot) that would require re-architecting.

Emacs has a feature called make-obsolete for obsolete entities. Emacs terminology for fundamental computer concepts (such as what a "window" is) often differs from industry conventions because Emacs introduced them a long time ago. This is a typical danger for those who were ahead of their time: all your terms are incorrect. But Emacs does have the concept of deprecation, which in their jargon is called obsolescence.

But in the Emacs world, there seems to be a different working definition. Another underlying philosophy, if you will.

In the world of Emacs (and many other areas that we'll explore below), the status of deprecated APIs basically means, "You really shouldn't use this approach, because while it works, it suffers from various drawbacks, which we'll list here. But in the end, it's your choice."
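That "it still works, but here is why you shouldn't" meaning of deprecation can be sketched in a few lines. This is a minimal illustration in Python, not Emacs Lisp, and not how Emacs implements make-obsolete; the `obsolete` decorator and the `greet`/`old_greet` functions are hypothetical names invented for the example. The point is only the contract: warn, name the replacement, and keep working.

```python
import functools
import warnings


def obsolete(replacement, since):
    """Mark a function as obsolete without breaking its callers (hypothetical helper)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is obsolete since {since}; "
                f"use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            # The old behavior is preserved: the call still goes through.
            return func(*args, **kwargs)
        return wrapper
    return decorate


def greet(name, punctuation="!"):
    """The newer, preferred entry point (hypothetical)."""
    return f"Hello, {name}{punctuation}"


@obsolete(replacement="greet", since="2.0")
def old_greet(name):
    """The old entry point, kept alive as a thin wrapper over the new one."""
    return greet(name)


if __name__ == "__main__":
    warnings.simplefilter("always")  # show the warning on every call
    print(old_greet("Steve"))        # warns, then still returns the greeting
```

Under this scheme the platform author pays the cost (one wrapper kept around indefinitely), and the caller decides if and when to migrate.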

In the Google world, deprecated status means "We are breaking our commitment to you." It really does. That is essentially what it means. It means they will regularly make you do some work, maybe a lot of work, as a punishment for believing their colorful advertising: We have the best software. The fastest! You do everything according to the instructions, launch your application or service, and then bam, after a year or two it breaks.

It's like selling a used car that will definitely break down after 1500 km.

These are two completely different philosophical definitions of "obsolescence". Google's definition smells of planned obsolescence. I don't believe it is actually planned obsolescence in the same sense as Apple's. But Google is definitely planning to break your programs, in a roundabout way. I know this because I worked there as a software engineer for over 12 years. They have vague internal guidelines on how much backwards compatibility should be maintained, but it ultimately depends on each individual team or service. There are no company-wide or engineering-wide recommendations, and the boldest recommendation in terms of deprecation cycles is "try to give customers 6-12 months to upgrade before breaking their whole system."

The problem is much bigger than they think, and it will remain for many years to come, because customer care is not in their DNA. More on this below.

At this point, I'm going to make the bold claim that Emacs has been successful largely, and perhaps even primarily, because it takes backwards compatibility so seriously. Actually, this is the thesis of this article. Successful long-lived open systems owe their success to the microcommunities that have lived around their extensions/plugins for decades. That is the ecosystem. I've already talked about the nature of platforms and how important they are, and how Google has never, in its entire corporate history, understood what goes into building a successful open platform other than Android or Chrome.

Actually, I should briefly mention Android, because you must have thought of it.

First, Android is not Google. They have almost nothing in common with each other. Android is a company that was bought by Google in July 2005; it has been allowed to operate more or less autonomously and has in fact remained largely untouched over the years. Android is an infamous tech stack and an equally infamous prickly organization. As one Googler put it, "you can't just go and get into Android."

In a previous article, I discussed how bad some of the early Android design decisions were. Hell, when I wrote that article, they were rolling out crap called "instant apps" that are now (surprise!) deprecated, and I sympathize if you were stupid enough to listen to Google and move your content into these instant apps.

But there is a difference here, a significant difference, which is that the Android people really understand how important platforms are, and they go out of their way to keep old Android apps working. In fact, their efforts to maintain backwards compatibility are so extreme that even I, during my brief stint in the Android division a few years ago, found myself trying to convince them to drop support for some of the oldest devices and APIs. (I was wrong, as I have been about many other things past and present. Sorry, Android guys! Now that I've been to Indonesia, I understand why we need them.)

The people of Android maintain backwards compatibility to almost unimaginable extremes, piling up a huge amount of obsolete tech debt in their systems and tool chains. Oh my god, you should have seen some of the crazy things they have to do in their build system, all in the name of compatibility.

For that, I give Android the coveted "You're not Google" award. They really don't want to become Google, which can't build durable platforms, but Android knows how to do it. And so Google is being very wise in one respect: allowing the people at Android to do things their way.

However, Android Instant Apps were a pretty dumb idea. And do you know why? Because they required you to rewrite and redesign your application! As if people would just go and rewrite two million applications. I'm assuming instant apps were some Googler's idea.

But there is a difference here. Backward compatibility comes at a cost. Android itself bears the burden of these costs, while Google insists that this burden be borne by you, the paying customer.

You can see Android's commitment to backwards compatibility in its APIs. When you have four or five different subsystems to do literally the same thing, that's a sure sign that there's a commitment to backwards compatibility at the core. Which in the world of platforms is synonymous with commitment to your customers and your market.

Google's main problem here is their pride in their engineering hygiene. They don't like it when there are many different ways to do the same thing, with old, less desirable ways sitting next to new, fancier ways. It increases the learning curve for newcomers to the system, it increases the burden of maintaining legacy APIs, it slows down the speed of new features, and the main sin is being ugly. Google is like Lady Ascot from Tim Burton's Alice in Wonderland:

Lady Ascot: Alice, do you know what I'm most afraid of?
Alice: The decline of the aristocracy?
Lady Ascot: I was afraid that I would have ugly grandchildren.

To understand the trade-off between beautiful and practical, let's take a look at the third successful platform (after Emacs and Android) and see how it works: Java itself.

Java has a lot of legacy APIs. Deprecation is very popular among Java programmers, even more so than in most programming languages. APIs are constantly deprecated in Java itself, the core language, and libraries.

To take just one of thousands of examples, killing threads is considered deprecated. It has been deprecated since the release of Java 1.2 in December 1998. It's been 22 years since this was deprecated.

But my real code in production still kills threads every day. Do you really think that's good? Absolutely! I mean, of course, if I were to rewrite the code today, I would implement it differently. But the code for my game, which has made hundreds of thousands of people happy over the past two decades, is written to kill threads that hang too long, and I never had to change it. I know my system better than anyone, I have literally 25 years of experience with it in production, and I can say for sure: in my case, killing these particular worker threads is completely harmless. It's not worth the time and effort to rewrite this code, and kudos to Larry Ellison (probably) that Oracle didn't force me to rewrite it.

Probably, Oracle understands platforms too. Who knows.

The evidence can be found in all key Java APIs, which are riddled with waves of obsolescence like glacier lines in a canyon. You can easily find five or six different keyboard navigation managers (KeyboardFocusManager) in the Java Swing library. It's actually hard to find a Java API that isn't deprecated. But they still work! I think the Java team will only truly remove an API if the interface causes a glaring security issue.

Here's the thing, folks: we software developers are all very busy, and in every area of software, we're faced with competing alternatives. At any given time, programmers using X are looking at Y as a possible replacement. Oh, you don't believe me? Do you want to bring up Swift? Like, everyone migrates to Swift and no one leaves it, right? Wow, how little you know. Companies are counting the costs of dual mobile development teams (iOS and Android), and they are starting to realize that these funny-named cross-platform development systems like Flutter and React Native really work and can help cut the size of their mobile teams in half or, conversely, make them twice as productive. Real money is at stake. Yes, there are compromises, but, on the other hand, mo-o-oney.

Let's hypothetically assume that Apple foolishly took Guido van Rossum's lead and declared that Swift 6.0 is backwards incompatible with Swift 5.0, much like Python 3 is incompatible with Python 2.

I must have told this story ten years ago, but fifteen years ago I went to O'Reilly's Foo Camp with Guido, sat in a tent with Paul Graham and a bunch of big shots. We sat in the sweltering heat and waited for Larry Page to take off in his personal helicopter, while Guido monotonously muttered about "Python 3000", which he named after the number of years it would take everyone to migrate there. We kept asking him why he was breaking compatibility, and he said "Unicode". And we asked, if we have to rewrite our code, what other benefits will we see? And he answered “Yooooooooooooooouuuuuuuniiiiiiicoooooooode”.

If you install the Google Cloud Platform SDK (“gcloud”), you will receive the following notification:

Dear recipient,

We would like to remind you that Python 2 support is deprecated, so pooooooooooooooooooooooooooooooooooo

… and so on. Circle of life.

But the fact is that every developer has a choice. And if you make them rewrite their code often enough, they might start thinking about other options. They are not your hostages, as much as you would like them to be. They are your guests. Python is still a very popular programming language, but damn it, Python 3(000) created such a mess in itself, in its communities, and among the users of its communities that the consequences still haven't been cleaned up fifteen years later.

How many Python programs have been rewritten in Go (or Ruby, or some other alternative) because of this backwards incompatibility? How much new software has been written in something other than Python, even though it could be written in Python if Guido hadn't burned down the whole village? It's hard to say, but Python has clearly suffered. It's a huge mess and everyone loses.

So let's say Apple takes Guido's example and breaks compatibility. What do you think will happen next? Well, maybe 80-90% of developers will rewrite their software if they can. In other words, 10-20% of the user base will automatically go to some competing language like Flutter.

Do this a few times and you'll lose half your user base. As in sports, in the world of programming current form is everything. Anyone who loses half their users in five years will be considered a Big Fat Loser. You have to be trending in the world of platforms. But this is where dropping support for older versions will get you killed over time. Because every time you shed some of your developers, you (a) lose them forever because they are angry with you for breaking the contract, and (b) hand them to your competitors.

Ironically, I also helped Google become such a backwards-compatibility-defying diva when I created Grok, a source code analysis and understanding system that makes it easy to automate and instrument code based on the code itself - similar to an IDE, but here the cloud service stores materialized representations of all the billions of lines of Google source code in a large data warehouse.

Grok provided Googlers with a powerful framework to perform automated refactorings across their entire codebase (literally, across all of Google). The system calculates not only your upstream dependencies (the things you depend on), but also your downstream ones (the things that depend on you), so when you change APIs you know whom you are breaking! This way, when you make changes, you can check that every consumer of your API has updated to the new version, and in reality, often with the Rosie tool they wrote, you can completely automate the process.

This allows Google's codebase to be internally almost supernaturally "clean", as they have these robotic servants scurrying around the house and automatically cleaning everything up if someone renamed SomeDespicablyLongFunctionName to SomeDespicablyLongMethodName because they thought it was an ugly grandchild that needed to be put to sleep.
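To make the upstream/downstream distinction concrete, here is a minimal sketch in Python. It is not Grok's actual design, which the article does not describe at this level; the module names and the `DEPENDS_ON` table are hypothetical. It inverts a "depends on" graph and walks it to find everyone who would break if a given module's API changed.

```python
from collections import defaultdict, deque

# Hypothetical dependency data: each key depends on the listed modules (its upstream).
DEPENDS_ON = {
    "game_server": ["storage_client", "auth_lib"],
    "storage_client": ["rpc_core"],
    "auth_lib": ["rpc_core"],
    "rpc_core": [],
}


def invert(depends_on):
    """Build the reverse graph: module -> modules that depend on it directly."""
    dependents = defaultdict(set)
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)
    return dependents


def downstream(module, depends_on):
    """Return every module that directly or transitively depends on `module`."""
    dependents = invert(depends_on)
    seen = set()
    queue = deque([module])
    while queue:  # breadth-first walk over the reverse edges
        current = queue.popleft()
        for user in dependents[current]:
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


if __name__ == "__main__":
    # Everyone who must be checked (or migrated) before rpc_core's API changes.
    print(sorted(downstream("rpc_core", DEPENDS_ON)))
```

The catch, and the point of the surrounding argument, is that this only works when every consumer lives inside the graph you can see. A cloud provider cannot enumerate or rewrite its customers' code this way.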

And to be honest, it works pretty well for Google… internally. I mean, yes, the Go community at Google is really having a good laugh at the Java community at Google because of their habit of continuous refactoring. If you restart something N times, then not only did you mess it up N-1 times, but after a while it becomes quite clear that you probably messed it up on the Nth try too. But, by and large, they stay above the fuss and keep the code "clean".

The problems start when they try to force this attitude on their cloud clients and users of other APIs.

I've told you a little about Emacs, Android and Java; let's look at the last successful long-lived platform: the Web itself. Can you imagine how many iterations HTTP has gone through since 1995, when we used blink tags and "Under Construction" icons on web pages?

But it still works! And those pages still work! Yes folks, browsers are the world champions in backwards compatibility. Chrome is another example of a rare Google platform that has its head screwed on properly, and you guessed it, Chrome effectively acts as an isolated company apart from the rest of Google.

I also want to thank our friends among operating system developers: Windows, Linux, NOT APPLE FUCK YOU APPLE, FreeBSD and so on, for doing such a great job of backward compatibility on their successful platforms. (Apple gets a C at best, because they break things all the time for no good reason, but somehow the community works around it with every release, and OS X containers still aren't completely obsolete... yet.)

But wait, you say. Aren't we comparing apples and oranges - standalone software systems on a single machine like Emacs/JDK/Android/Chrome to multi-server systems and APIs like cloud services?

Well, I tweeted about it yesterday, but in the Larry Wall style, on a "sucks/rules" basis, I searched for the word deprecated on the developer sites of Google and Amazon. And while AWS has hundreds of times more service offerings than GCP, Google's developer documentation mentions deprecation about seven times more often.

If anyone at Google is reading this, they're probably ready to pull out Donald Trump-style charts showing that they're actually doing everything right, and that I shouldn't be making unfair comparisons like "the number of mentions of the word deprecated versus the number of services".

But after so many years, Google Cloud is still #3 (I never wrote an article about the failed attempt to become #2), and according to insiders, there is some concern that they may soon drop to #4.

I have no good arguments to "prove" my thesis. All I have are colorful examples that I have accumulated over 30 years as a developer. I have already mentioned the deeply philosophical nature of this problem; in a sense, it is politicized in the developer communities. Some people think that the creators of platforms should care about compatibility, while others think it's a concern for the users (the developers themselves). It's one or the other. And really, isn't it a political issue when we decide who should bear the cost of common problems?

So this is politics. And there will certainly be angry responses to my post.

As a user of Google Cloud Platform, and as an AWS user for two years (while working at Grab), I can say that there is a huge difference between the philosophies of Amazon and Google when it comes to priorities. I don't actively develop on AWS, so I don't know very well how often they remove old APIs. But I suspect that it does not happen as often as at Google. And I sincerely believe that this source of constant controversy and frustration in GCP is one of the biggest factors holding back the development of the platform.

I know I didn't name specific examples of GCP systems that are being discontinued. I can say that pretty much everything I've used, from networking (from the oldest to VPCs) to storage (Cloud SQL v1-v2), Firebase (now Firestore with a completely different API), App Engine (let's not even get started), Cloud Endpoints to... I don't know: absolutely all of it forced me to rewrite my code after 2-3 years at most, and they never automated the migration for you, and often there was no documented migration path at all. As if that's how it's supposed to be.

And every time I look at AWS, I ask myself why the hell I'm still on GCP. They obviously don't want customers. They want buyers. Do you understand the difference? Let me explain.

Google Cloud has a Marketplace where people offer their software solutions, and to avoid the empty-restaurant effect, they needed to fill it with some offerings, so they contracted with Bitnami to create a bunch of solutions that are deployed with "one click", or should I write "solutions", because they don't solve a damn thing. They just exist as checkboxes, as marketing filler, and Google never cared whether any of the tools really worked. I know the product managers who were in charge, and I can assure you that these people don't give a damn.

Take, for example, the supposedly "one-click" deployment solution for Percona. I was sick to death of the antics of Google Cloud SQL, so I started looking at building my own Percona cluster as an alternative. And this time, it seemed Google had done a good job: they were going to save me some time and effort at the click of a button!

Well, great, let's go. Follow the link and click the button. Select "Yes" to accept all the default options and deploy the cluster to your Google Cloud project. Haha, it doesn't work. None of this crap works. The tool was never tested and it started to rot from the first minute, and it wouldn't surprise me if more than half of the "solutions" for one-click deployment (now we understand why the quotes) don't work at all. This is absolutely hopeless darkness, and it is better not to enter it.

But Google outright urges you to use them. They want you to buy. For them, it's a transaction. They don't want to support anything. It's not part of Google's DNA. Yes, engineers support each other, as evidenced by my story with Bigtable. But in products and services for ordinary people, they have always been ruthless in shutting down any service that does not meet the bar for profitability, even if it has millions of users.

And this presents a real problem for GCP, because this DNA is behind all of its cloud offerings. They don't seek to support anything; they are well known for refusing to host (as a managed service) any third-party software until AWS does the same and builds a successful business around it, and until customers literally demand the same. Even then, it takes some effort to get Google to support something.

This lack of a support culture, coupled with the "let's break it to make it prettier" attitude, alienates developers.

And this is not very good if you want to build a long-lived platform.

Google, wake up, damn it. It's 2020 now. You are still losing. It's time to take a hard look in the mirror and decide if you really want to stay in the cloud business.

If you want to stay, then stop breaking everything. Guys, you are rich. We developers are not. So when it comes to who bears the burden of compatibility, you need to take it upon yourselves. Not put it on us.

Because there are at least three other really good clouds. They beckon.

And now I'm going to go on fixing all my broken systems. Eh.

Until next time!

PS Update after reading some of the discussions on this article (the discussions are great btw). Firebase is not discontinued and there are no plans for that as far as I know. However, they have a nasty streaming bug that causes the Java client to stall in App Engine. One of their engineers helped me with this problem when I worked at Google, but they never really fixed the bug, so I have a lousy workaround: I have to restart the GAE application every day. And it's been like this for four years! Now they have Firestore. It will take a lot of work to migrate to it, as it is a completely different system, and the Firebase bug will never be fixed. What conclusion can be drawn? You can get help if you work at the company. I'm probably the only one using Firebase on GAE, because I write fewer than 100 keys in a 100% native app and it crashes every couple of days due to a known bug. What can I say other than use it at your own risk. I'm moving to Redis.

I've also seen some more experienced AWS users say that AWS usually never stops supporting any service, and SimpleDB is a great example. My assumption that AWS doesn't have the end-of-support sickness that Google has seems to be justified.

Also, I noticed that 20 days ago the Google App Engine team broke the hosting of a critical Go library by shutting down a GAE application from one of the core Go developers. Indeed, it was stupid.

Finally, I heard that Googlers are already discussing this issue and generally agree with me (love you guys!). But they seem to see the problem as unsolvable because Google's culture has never had a proper incentive structure. I thought it would be nice to take some time to discuss the absolutely amazing experience I had with AWS engineers when I was at Grab. Sometime in the future, I hope!

And yes, in 2005 they did have different types of shark meat at the giant buffet in Building 43, and my favorite was hammerhead shark meat. However, by 2006, Larry and Sergey had cut all unhealthy snacks. So there really weren't any sharks during the Bigtable story in 2007, and I meanly deceived you.

When I looked at Cloud Bigtable four years ago (give or take), the cost was exactly that. It seems to have gone down a bit now, but it's still an awful lot for an empty data store, especially since my first story shows how inconsequential an empty Bigtable is at their scale.

Sorry for offending the Apple community and not saying anything nice about Microsoft etc. You are all right, and I really appreciate all the discussion this article has generated! But sometimes it takes making a few waves to start a conversation, you know?

Thanks for reading.

Update 2, 19.08.2020. Stripe does API updates correctly!

Update 3, 31.08.2020. I was contacted by a Google engineer at Cloud Marketplace, who turned out to be an old friend of mine. He wanted to find out why C2D wasn't working, and in the end we figured out it was because I created my network a few years ago and C2D doesn't work on legacy networks because of a missing subnet parameter in their templates. I think potential GCP users had better make sure they know enough engineers at Google...

Source: habr.com