Big Interview with Cliff Click - the Father of Java JIT Compilation

Cliff Click — CTO of Cratus (IoT sensors for process improvement), founder and co-founder of several startups (including Rocket Realtime School, Neurensic and H2O.ai) with several successful exits. Cliff wrote his first compiler at 15 (Pascal for the TRS-80's Z80)! He is best known for his work on C2 in Java (the Sea of Nodes IR). This compiler showed the world that a JIT can produce quality code, which became one of the factors in Java's rise as one of the main modern software platforms. Cliff then helped Azul Systems build an 864-core mainframe running pure Java software that kept GC pauses on a 500-gigabyte heap within 10 milliseconds. In general, Cliff has managed to work on all aspects of the JVM.

 
This habrapost is a great interview with Cliff. We will talk about the following topics:

  • Transition to low-level optimizations
  • How to do big refactoring
  • Cost Model
  • Learning low-level optimizations
  • Practical Examples of Performance Improvement
  • Why create your own programming language
  • Career as a performance engineer
  • Technical Challenges
  • A little about register allocation and multi-core
  • The biggest challenge in life

The interview is conducted by:

  • Andrey Satarin from Amazon Web Services. Over his career he has worked on very different projects: he tested a distributed NewSQL database at Yandex, a cloud detection system at Kaspersky Lab, a multiplayer game at Mail.ru, and a currency pricing service at Deutsche Bank. He is interested in testing large-scale backend and distributed systems.
  • Vladimir Sitnikov from Netcracker. For ten years he has been working on the performance and scalability of NetCracker OS, software used by telecom operators to automate network and network equipment management processes. He is interested in Java and Oracle Database performance issues, and is the author of over a dozen performance improvements in the official PostgreSQL JDBC driver.

Transition to low-level optimizations

Andrei: You are a famous person in the world of JIT compilation, Java and performance work in general, right? 

Cliff: It's like that!

Andrei: Let's start with general questions about working on performance. What do you think about the choice between high-level and low-level optimizations like working at the CPU level?

Cliff: Yes, it's simple here. The fastest code is the code that never runs. So you always start at the high level and work on algorithms. A better O-notation beats a worse O-notation, unless some sufficiently large constants get in the way. Low-level things come last. Usually, if you've optimized the rest of the stack well enough and there's still something interesting left, that's the low level. But how do you start at a high level? How do you know that enough work has been done at the high level? Well... no way. There are no ready-made recipes. You need to understand the problem, decide what you are going to do (so as not to take unnecessary steps later), and then you can pull out the profiler, which can say something useful. At some point you realize you've gotten rid of the unnecessary things and it's time to start fine-tuning the low level. That is definitely a special kind of art. A lot of people do unnecessary things, but they move so fast that they have no time to worry about performance. But that lasts only until the question is forced on them. Usually, 99% of the time nobody cares what I do, right up until the moment something important lands on the critical path that somebody does care about. And then everyone starts nagging you about why it didn't work perfectly from the very beginning. In general, there is always something to improve in performance. But 99% of the time you have no leads! You're just trying to make something work, and in the process you figure out what's important. You can never know in advance that this particular piece needs to be perfect, so, in effect, you'd have to be perfect in everything. That's impossible, and you don't do it. There is always a bunch of things to fix - and that's perfectly normal.

How to do big refactoring

Andrei: How do you approach performance work? It's a cross-cutting problem. For example, have you had to work on problems that arise at the intersection of a large amount of already existing functionality?

Cliff: I try to avoid it. If I know performance is going to be a problem, I think before I start coding, especially about data structures. But often you discover all this very late. And then you have to go to extremes and do what I call "rewrite and conquer": you need to grab a fairly large piece. Part of the code will have to be rewritten anyway, due to performance issues or something else. Whatever the reason for rewriting code, it is almost always better to rewrite a larger piece than a smaller piece. At this point, everyone starts shaking with fear: "oh my god, you can't touch that much code!" But, in fact, this approach almost always works much better. You need to take on the big problem right away, draw a big circle around it and say: I will rewrite everything inside the circle. The boundary is much smaller than the content inside it that needs to be replaced. And if such a delineation of boundaries allows you to do the work inside perfectly, your hands are untied, do what you want. Once you understand the problem, the process of rewriting becomes much easier, so take a big bite!
At the same time, when you do a rewrite in a large chunk and realize that performance will become an issue, you can immediately start worrying about it. It usually turns into simple things like "don't copy the data, manage the data as simple as possible, make it smaller". In large rewrites, there are standard ways to improve performance. And they almost always revolve around data.

Cost Model

Andrei: In one of your podcasts, you talked about cost models in the context of performance. Can you explain what you mean by that?

Cliff: Certainly. I was born in an era when processor performance was extremely important. And that era is coming back again - fate is not without irony. I began in the days of eight-bit machines; my first computer worked with 256 bytes. Bytes. Everything was very small. We had to count instructions, and as we moved up the stack of programming languages, the languages took on more and more. There was assembler, then Basic, then C, and C took care of a lot of the details, like register allocation and instruction selection. But everything was pretty clear there, and if I made a pointer to an instance of a variable, I would get a load, and that instruction has a known cost. The hardware has a known number of machine cycles, so the execution speed of different things could be calculated simply by adding up all the instructions you were going to run. Every compare/test/branch/call/load/store could be added up and you could say: here's your execution time. As you work on improving performance, you notice exactly which numbers correspond to the small hot loops.
But as soon as you switch to Java, Python and similar things, you very quickly move away from the low-level hardware. What is the cost of calling a getter in Java? If the HotSpot JIT inlined everything correctly, it will be a load; if it didn't, it will be a function call. If that call sits in a hot loop, it will override all other optimizations in that loop, so the real cost will be much higher. And you immediately lose the ability to look at a piece of code and understand how fast it should run in terms of processor cycles, memory used and caches. All of this becomes interesting only if you have really burrowed into performance.
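To make the getter example concrete, here is a minimal Java sketch (the class and the loop are illustrative, not taken from the interview): if the JIT inlines getX(), the loop body compiles down to roughly a load and an add per iteration; if inlining fails, every iteration also pays for a full call, and the loop can no longer be optimized as a whole.

```java
// Illustrative sketch: the cost of a getter depends entirely on whether the JIT inlines it.
final class Point {
    private final int x;
    Point(int x) { this.x = x; }
    int getX() { return x; }   // when inlined by HotSpot this is just a load from the object
}

public class GetterCost {
    // Hot loop: if getX() is inlined, the body is a load + add per iteration;
    // if it is not (e.g. the caller grew past inlining limits), each iteration
    // also pays call/return overhead and blocks further optimization of the loop.
    static long sum(Point[] points) {
        long total = 0;
        for (Point p : points) {
            total += p.getX();
        }
        return total;
    }

    public static void main(String[] args) {
        Point[] pts = new Point[1_000_000];
        for (int i = 0; i < pts.length; i++) pts[i] = new Point(i);
        System.out.println(sum(pts));
    }
}
```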
Now we are in a situation where processor speeds have hardly grown for a decade. The old days are back! You can no longer count on good single-threaded performance. But if you suddenly get into parallel computing, it's insanely difficult, everyone looks at you like you're James Bond. Tenfold speedups here usually occur in places where someone has missed something. Parallelism requires a lot of work. To get that same tenfold speedup, you need to understand the cost model. What costs what, and how much. And for that you need to understand how the language maps onto the underlying hardware.
Martin Thompson picked a great name for his blog: Mechanical Sympathy! You need to understand what the hardware is going to do, how exactly it will do it, and why it does what it does at all. With that, it's pretty easy to start counting instructions and figure out where the execution time is leaking. If you don't have the appropriate training, you're just looking for a black cat in a dark room. I see performance optimization people all the time who have no idea what the hell they're doing. They suffer a lot and don't make much progress. And when I take the same piece of code, put a couple of little hacks in there and get a five or ten times speedup, they're like: well, that's not fair, we already knew you were better. Amazing. What I'm getting at... the cost model is about what kind of code you write and how fast it runs on average in the big picture.

Andrei: And how to keep such a volume in your head? This is achieved by a lot of experience, or? Where is such experience obtained?

Cliff: Well, I did not get my experience in the easiest way. I programmed in assembly language back in the days when you could understand every single instruction. It sounds silly, but since then the Z80 instruction set has always stayed in my head, in memory. I don't remember people's names a minute after a conversation, but I do remember code written 40 years ago. It's funny, it looks like "idiot savant" syndrome.

Learning low-level optimizations

Andrei: Is there any easier way to get in on the action?

Cliff: Yes and no. The hardware we all use hasn't changed that much over time. Everyone uses x86, except for Arm smartphones. Unless you're doing some hardcore embedded work, it's all the same. Okay, next. The instructions haven't changed in ages either. You need to go and write something in assembler. Not much, but enough to begin to understand. You're smiling, but I'm completely serious. You need to understand the correspondence between the language and the hardware. After that, you need to go write a bit of code and make a little toy compiler for a little toy language. "Toy" means that you need to make it in a reasonable time. It can be super simple, but it must generate instructions. The act of generating instructions lets you understand the cost model for the bridge between the high-level code that everyone writes and the machine code that runs on the hardware. That correspondence gets burned into your brain at the moment you write the compiler. Even the simplest compiler. After that, you can start looking at Java and the fact that its semantic gap is much deeper, and building bridges over it is much harder. In Java, it is much more difficult to understand whether our bridge turned out good or bad, what will make it fall apart and what will not. But you need some kind of starting point where you look at the code and understand: "yeah, this getter should be inlined every time." And then it turns out that sometimes that's what happens, except when the method becomes too large and the JIT starts inlining everything. The performance of such places can be predicted instantly. Usually getters work well, but then you look at big hot loops and you realize there are some function calls floating around there and you have no idea what they are doing. That's the problem with the widespread use of getters: the reason they don't get inlined is that it's not clear they are getters. If you have a super small code base, you can just memorize it and then say: this is a getter, and this is a setter. In a large codebase, each function lives its own life, which nobody really knows. The profiler says we lost 24% of the time in some loop, and to understand what that loop does, we have to look at every function inside it. It's impossible to understand this without studying the functions, and that seriously slows down the process of understanding. That's why I don't use getters and setters, I've reached a new level!
Where do you get the cost model? Well, you can read something, of course... But I think the best way is to act. Making a small compiler is the best way to understand the cost model and fit it into your own head. A small compiler that would be good enough to program a microwave is a beginner's task. Well, I mean, if you already have programming skills, they should be enough. All these things - parsing a string containing some algebraic expression, pulling out the instructions for the mathematical operations from it in the right order, taking the right values from registers - all of this is done in one go. And while you do it, it gets imprinted in your brain. I think everyone knows roughly what a compiler does. And that will give you an understanding of the cost model.
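As a rough illustration of the kind of toy compiler being described - a sketch only, with a made-up stack-machine instruction set (PUSH/ADD/MUL), not Cliff's actual exercise - here is a tiny recursive-descent translator from arithmetic expressions to instructions. Writing even this much forces you to see which source constructs become which instructions.

```java
// A minimal sketch of a "toy compiler": arithmetic expressions -> stack-machine instructions.
// The instruction set (PUSH/ADD/MUL) is invented purely for illustration.
import java.util.ArrayList;
import java.util.List;

public class ToyCompiler {
    private final String src;
    private int pos;
    private final List<String> code = new ArrayList<>();

    ToyCompiler(String src) { this.src = src; }

    List<String> compile() { expr(); return code; }

    // expr := term ('+' term)*
    private void expr() {
        term();
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++; term();
            code.add("ADD");          // each source '+' becomes exactly one instruction
        }
    }

    // term := factor ('*' factor)*
    private void term() {
        factor();
        while (pos < src.length() && src.charAt(pos) == '*') {
            pos++; factor();
            code.add("MUL");
        }
    }

    // factor := number | '(' expr ')'
    private void factor() {
        if (src.charAt(pos) == '(') {
            pos++; expr(); pos++;     // skip '(' ... ')'
            return;
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        code.add("PUSH " + src.substring(start, pos));
    }

    public static void main(String[] args) {
        // "2+3*(4+5)" -> PUSH 2, PUSH 3, PUSH 4, PUSH 5, ADD, MUL, ADD
        new ToyCompiler("2+3*(4+5)").compile().forEach(System.out::println);
    }
}
```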

Practical Examples of Performance Improvement

Andrei: What else is worth paying attention to when working on performance?

Cliff: Data structures. By the way, yes, I haven't taught those classes in a long time... Rocket School. It was fun, but it took so much effort, and I have a life too! OK. So, in one of the big and interesting classes, "Where does your performance go", I gave students an example: two and a half gigabytes of fintech data were read from a CSV file, and then the number of products sold had to be calculated. Regular tick market data. UDP packets converted to a text format dating back to the 70s. Chicago Mercantile Exchange - stuff like butter, corn, soybeans, things like that. You had to count these products, the number of trades, the average volume of movement of funds and goods, and so on. It's pretty simple trading math: find the product code (that's 1-2 characters, looked up in a hash table), get the amount, add it to one of the trade sets, add the volume, add the value, and a couple of other things. Very simple math. The toy implementation was very straightforward: everything is in a file, I read the file and move through it, separating individual records into Java strings, looking for the things I need in them and adding them up according to the math above. And it works at some small speed.

With this approach, everything that's happening is obvious, and parallel computing won't help here, right? It turns out that a fivefold increase in performance can be achieved just by choosing the right data structures. And that surprises even experienced programmers! In my particular case, the trick was that you should not do memory allocations in a hot loop. Well, that's not the whole truth; more generally, you shouldn't allocate "once per X" when X is large enough. When X is two and a half gigabytes, you shouldn't allocate anything "once per letter", or "once per line", or "once per field", or anything like that. That is what the time is spent on. How does that even work? Imagine I make a call to String.split() or BufferedReader.readLine(). Readline makes a string out of a set of bytes that arrived over the network, once for every line, for every one of hundreds of millions of lines. I take that line, parse it and throw it away. Why do I throw it away - well, I've already processed it, that's all. So, for every byte read from those 2.7G, two bytes get written into the string (a Java char is two bytes), which is already 5.4G, and I don't need them for anything else, so they get thrown away. If you look at memory bandwidth: we load 2.7G that goes through memory and the memory bus into the processor, then twice as much goes into the string sitting in memory, and all of it gets churned with each new line. But I have to read it, the hardware reads it, even if everything gets churned afterwards. And I have to write it, because I created the string and the caches are full - the cache can't fit 2.7G. So for every byte I read, I read two extra bytes and write two extra bytes, and in the end that's a 4:1 ratio, which is a waste of memory bandwidth. And then it turns out that if I do String.split(), I'm doing this far from the last time; there may be 6-7 more fields inside. So the classic CSV-reading code followed by string parsing results in a memory bandwidth loss of around 14:1 relative to what you would actually like to have. If you throw out those allocations, you can get a fivefold speedup.
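Here is roughly what the "toy implementation" looks like - a hypothetical reconstruction, not the actual class material, and the field layout is assumed. Every readLine() and split() materializes fresh Strings and arrays that die immediately, so the real cost of the hot loop is memory bandwidth, not arithmetic.

```java
// Hypothetical reconstruction of the naive version: correct, simple, and
// bandwidth-bound, because every line allocates a String plus a String[] that die immediately.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class NaiveTicks {
    public static void main(String[] args) throws IOException {
        Map<String, Long> volumeByProduct = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {      // one fresh String per line, hundreds of millions of times
                String[] fields = line.split(",");        // one String[] plus one String per field, also instant garbage
                String product = fields[0];
                long volume = Long.parseLong(fields[2]);  // the field layout here is assumed for the example
                volumeByProduct.merge(product, volume, Long::sum);
            }
        }
        System.out.println(volumeByProduct);
    }
}
```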

And it's not even that hard. If you look at the code from the right angle, it all becomes pretty easy once you grasp the essence of the problem. It's not that you should stop allocating memory altogether: the only problem is that you allocate something and it immediately dies, and along the way it burns an important resource, which in this case is memory bandwidth. And all of that translates into a drop in performance. On x86 you usually have to actively burn processor cycles, but here you burned all the memory bandwidth much earlier. The solution is to reduce the number of allocations.
The other part of the problem is that if you run the profiler when memory bandwidth is exhausted, right at the moment it happens, you're usually waiting for the cache to come back, because it's full of garbage you just produced, all those strings. So every load and store operation becomes slow, because they cause cache misses - the entire cache has become slow, waiting for the garbage to leave it. So the profiler will only show lukewarm random noise smeared across the entire loop - there will be no separate hot instruction or place in the code. Only noise. And if you look at the GC cycles, they are all Young Generation and super fast - microseconds or milliseconds at most. After all, all this memory dies instantly. You allocate billions of gigabytes, and it collects them, and collects, and collects again. All this happens very quickly. It turns out there are cheap GC cycles, lukewarm noise along the entire loop, and yet we want a 5x speedup. At this moment something should click in your head and say: "why is that?!" The memory bandwidth overflow doesn't show up in a classic profiler; you need a profiler that reads the hardware performance counters, and then you can see it yourself, directly. Indirectly, it can be suspected from three symptoms. The third symptom is when you look at what you allocate, ask the profiler, and it answers: "You made a billion strings, but the GC worked for free." As soon as that happens, you realize you produced too many objects and burned the whole memory bandwidth. There is a way to figure this out, but it's not obvious.

The problem is in the data structure: the raw structure underlying everything that happens is too big, it's 2.7G on disk, so making a copy of this thing is very undesirable - you want to load it from the network byte buffer straight into registers, so as not to read and write it to a string back and forth five times. Unfortunately, Java doesn't give you such a library in the JDK by default. But it's trivial, right? Essentially it's 5-10 lines of code that go into implementing your own buffered string loader, one that repeats the behavior of the String class while being a wrapper around the underlying byte buffer. As a result, you work almost as if with strings, but in reality it's pointers into the buffer that move around, and the raw bytes are never copied anywhere; the same buffers get reused over and over, and the operating system happily does the things it was designed for, like covertly double-buffering those byte buffers, and you no longer grind through an endless stream of garbage data. By the way, you do understand that with a GC, every memory allocation is guaranteed not to have been seen by the processor since the last GC cycle? So it can't possibly be in the cache, and a 100% guaranteed miss happens. When working with a pointer, on x86 it takes 1-2 cycles to load a register from memory, but as soon as that happens you pay, and pay, and pay, because the memory isn't in any of the caches - and that's the cost of allocating memory. The real cost.
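The "5-10 lines" might look roughly like this - a minimal sketch under the assumption of ASCII data, not an actual H2O or JDK class: a CharSequence view over a reusable byte[] slice, so you can inspect fields without ever materializing a String.

```java
// Sketch of a String-like view over a reused byte buffer (ASCII data assumed).
// Nothing is copied: the view just carries (buffer, offset, length) into bytes that
// will be overwritten by the next read.
final class ByteSlice implements CharSequence {
    private byte[] buf;
    private int off, len;

    ByteSlice set(byte[] buf, int off, int len) {
        this.buf = buf; this.off = off; this.len = len;
        return this;                      // reuse the same instance for every field and every line
    }

    @Override public int length() { return len; }
    @Override public char charAt(int i) { return (char) (buf[off + i] & 0xFF); }
    @Override public CharSequence subSequence(int s, int e) {
        return new ByteSlice().set(buf, off + s, e - s);
    }
    @Override public String toString() {  // only pay for a real String when you truly need one
        return new String(buf, off, len, java.nio.charset.StandardCharsets.US_ASCII);
    }
}
```

In the hot loop you would call set(...) on the same ByteSlice for each field and only call toString() on the rare path that genuinely needs a String; a real version used as a hash-table key would also need content-based equals() and hashCode(), which are omitted here.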

In other words, data structures are the hardest thing to change. Once you realize you've chosen the wrong data structure, one that will kill performance later on, there's usually a lot of work to do, but if you don't do it, things get worse. First of all, you need to think about data structures; this is important. The main cost here lies in fat data structures that start being used in the style of "I copied data structure X into data structure Y because I like the shape of Y better". But the copy operation (which seems cheap) actually wastes memory bandwidth, and that's where all the lost execution time is buried. If I have a giant JSON string and I want to turn it into a structured DOM tree of POJOs or something, the operation of parsing that string and building the POJOs, and then accessing the POJOs again later, adds up to extra cost - and it's not cheap. Except if you walk over the POJOs far more often than you walk over the string. Offhand, you can instead try to decode the string and pull out only what you need from it, without turning it into any POJOs. If all of this happens on a path that requires maximum performance, no POJOs for you - you need to dig directly into the string somehow.

Why create your own programming language

Andrei: You said that in order to understand the cost model, you need to write your own little language...

Cliff: Not a language, but a compiler. Language and compiler are different things. The biggest difference is in your head. 

Andrei: By the way, as far as I know, you are experimenting with creating your own languages. For what?

Cliff: Because I can! I'm half retired, so this is my hobby. All my life I have been implementing other people's languages. I also worked a lot on coding style. And also because I see problems in other languages. I see that there are better ways to do familiar things. And I would use them. I just got tired of seeing problems in myself, in Java, in Python, in any other language. I currently write in React Native, JavaScript and Elm as a hobby, which is not about retirement but about active work. I also write in Python and will most likely keep working on machine learning for Java backends. There are many popular languages, and all of them have interesting features. Each one is good at something of its own, and you can try to bring all those features together. So I study things that are interesting to me, the behavior of languages, and I try to come up with reasonable semantics. And so far it's working! At the moment I'm struggling with memory semantics, because I want to have it like in C and Java, and get a strong memory model and memory semantics for loads and stores. At the same time, have automatic type inference as in Haskell. Here, I'm trying to mix Haskell-style type inference with memory that works like C and Java. This is what I have been doing for the last 2-3 months, for example.

Andrei: If you build a language that takes better aspects of other languages, do you think that someone will do the opposite: take your ideas and use them?

Cliff: That's exactly how new languages are born! Why is Java similar to C? Because C had a good syntax that everyone understood, and Java was inspired by that syntax, adding type safety, array bounds checks and GC, and they also improved some things from C. They added their own. But they were inspired quite a lot, right? Everyone stands on the shoulders of the giants who came before them - that's how progress is made.

Andrei: As I understand it, your language will be memory-safe. Have you considered implementing something like Rust's borrow checker? Have you looked at it, and what do you think of it?

Cliff: Well, I've been writing C for ages, with all those mallocs and frees, manually managing lifetimes. You know, 90-95% of manually managed lifetimes have the same structure. And it's very, very painful to do it by hand. I would like the compiler to simply tell me what is going on there and what I have achieved with my actions. For some things, the borrow checker does this out of the box. And it should infer the information automatically, understand everything and not even burden me with explaining that understanding. It should do at least a local escape analysis, and only if that fails do you need to add type annotations describing the lifetime - and such a scheme is much more complicated than the borrow checker, or any existing memory checker in general. The choice between "everything is fine" and "I didn't understand anything" - no, there has to be something better.
So, as someone who has written a lot of C code, I think having support for automatic lifetime management is the most important thing. I also got fed up with how much memory Java uses, and the main complaint is the GC. When you allocate memory in Java, you don't get back memory that was in use just recently and is still warm; it only comes back after a GC cycle. In languages with finer-grained memory management this is not the case. If you call malloc, you immediately get memory that was typically just in use. Usually you do some temporary things with the memory and immediately hand it back. And it immediately returns to the malloc pool, and the next malloc cycle pulls it out again. So the real memory usage comes down to the set of live objects at a particular point in time, plus leaks. And if things don't leak in a completely indecent way, most of the memory sits in the caches and the processor, and that works fast. But it requires a lot of manual memory management, with malloc and free called in the right order, in the right place. Rust can handle this properly on its own, and in many cases it can even give better performance, since memory consumption is narrowed down to just the current computation - as opposed to waiting for the next GC cycle to free the memory. As a result, we got a very interesting way to improve performance. And quite a powerful one - I mean, I did such things when processing data for fintech, and it allowed me to get a speedup of about five times. That's a pretty big speedup, especially in a world where processors aren't getting faster and we're still waiting for improvements.

Career as a performance engineer

Andrei: I would also like to ask about the career in general. You rose to fame with your JIT work at HotSpot and then moved to Azul, which is also a JVM company. But they were already doing more hardware than software. And then they suddenly switched to Big Data and Machine Learning, and then to fraud detection. How did it happen? These are very different areas of development.

Cliff: I've been programming for a long time and have managed to leave my mark in a lot of different areas. And when people say, "oh, you're the one who did the JIT for Java!", it's always funny. But before that I was working on a clone of PostScript - the language Apple once used for its laser printers. And before that I did an implementation of the Forth language. I think the common theme for me is tool development. All my life I have been making tools that other people use to write their cool programs. But I also developed operating systems, drivers, kernel-level debuggers, languages for OS development that started out trivial and got more and more complicated over time. But the main theme, after all, is tool development. A big chunk of my life passed between Azul and Sun, and it was about Java. But when I got into Big Data and Machine Learning, I put my hat back on and said: "Oh, now we have a non-trivial problem, there are a lot of interesting things going on here and people actually doing things." This is a great development path to follow.

Yes, I really love distributed computing. My first job, as a student, was in C, on an advertising project. It was distributed computing on Zilog Z80 chips that collected data for analog OCR, produced by a real analog analyzer. It was a cool and completely crazy topic. But there were problems: some part would not be recognized correctly, so you had to pull up the image and show it to a person who would read it with their own eyes and report what it said, so there were jobs with data, and those jobs had their own language. There was a backend that handled all of it - Z80s running in parallel, with vt100 terminals running, one per person, and there was a parallel programming model on the Z80. Some common chunk of memory shared by all the Z80s inside a star configuration; the backplane was shared too, and half of the RAM was shared within the network, and the other half was private or went to something else. A meaningfully complex parallel distributed system with shared... semi-shared memory. When was that... I don't even remember, somewhere in the mid-80s. Quite a long time ago.
Yes, let's agree that 30 years ago counts as a long time ago. Problems related to distributed computing have existed for a long time; people have long been at war with Beowulf clusters. Such clusters look like... For example: there is Ethernet, your fast x86 machines are connected to this Ethernet, and now you want to get fake shared memory, because nobody could write distributed computing code back then, it was too difficult. So there was fake shared memory built on protection pages on x86: if you wrote to such a page, we told the rest of the processors that if they accessed that same shared memory, it would have to be loaded from you - and thus something like a cache coherence protocol appeared, along with software for it. An interesting concept. The real problem, of course, was elsewhere. All of this worked, but you quickly ran into performance problems, because nobody understood the performance models well enough - what the memory access patterns were, how to make sure the nodes didn't endlessly ping each other, and so on.

What I came up with in H2O is that the developers themselves are responsible for determining where the parallelism is hiding and where it isn't. I came up with a coding model that made writing high-performance code easy and simple. But writing slow code in it is hard; it looks bad. You have to seriously try to write slow code, you have to use non-standard methods. Slow code is visible at a glance. As a result, you usually write code that runs fast, but you have to figure out what to do about shared memory. All of this is tied to big arrays, and the behavior there is similar to non-volatile big arrays in parallel Java. In a sense, imagine two threads writing to a parallel array; one of them wins and the other, accordingly, loses, and you don't know which is which. If they are not volatile, the order can be anything - and it really does work well. People genuinely care about the order of operations, they put volatile in the right places, and they expect memory-related performance problems in the right places. Otherwise they would simply write code in the form of loops from 1 to N, where N is some trillions, hoping that all the complex cases would automatically become parallel - and that doesn't work there. But in H2O it's neither Java nor Scala; you can call it "Java minus minus" if you like. It's a very straightforward programming style, similar to writing simple C or Java code with loops and arrays. But at the same time, memory can be processed by the terabyte. I still use H2O. From time to time I use it in different projects - and it is still the fastest thing, dozens of times ahead of its competitors. If you're doing Big Data with columnar data, it's very hard to beat H2O.
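This is not the H2O API - just a hedged sketch of the flavor of that coding model in plain Java: straightforward loops over one big primitive array, split by index range across threads, with no per-element objects and nothing to synchronize on inside the hot loop.

```java
// Not the H2O API - just the flavor of the model: plain loops over a big
// primitive array, split by disjoint index ranges across threads.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        double[] data = new double[50_000_000];
        java.util.Arrays.fill(data, 1.0);

        int nThreads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        List<Future<Double>> parts = new ArrayList<>();
        int chunk = data.length / nThreads;

        for (int t = 0; t < nThreads; t++) {
            final int lo = t * chunk;
            final int hi = (t == nThreads - 1) ? data.length : lo + chunk;
            // Each task owns a disjoint index range of the array, so the hot loop
            // has no locks, no volatile reads, and no per-element allocation.
            parts.add(pool.submit(() -> {
                double s = 0;
                for (int i = lo; i < hi; i++) s += data[i];
                return s;
            }));
        }
        double total = 0;
        for (Future<Double> f : parts) total += f.get();
        pool.shutdown();
        System.out.println(total);
    }
}
```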

Technical Challenges

Andrei: What was your biggest challenge in your entire career?

Cliff: Are we discussing the technical or non-technical part of the question? I would say the biggest challenges are non-technical. 
As for technical challenges... I just defeated them. I don't even know which one was the biggest, but there were some pretty interesting ones that took quite a lot of time and mental struggle. When I went to Sun, I was sure I would make a fast compiler, and a lot of the seniors said I would never succeed. But I went down this path, wrote a compiler up to the register allocator, and quite a fast one. It was as fast as the modern C1, but the allocator was much slower back then, and in retrospect it was a big data-structure problem. I set out to write a graph-coloring register allocator, and I didn't understand the dilemma between code expressiveness and speed that existed in that era and was very important. It turned out that the data structure usually exceeded the cache size of the x86s of that time, and so, while I initially assumed the register allocator would take 5-10 percent of the total JIT time, in reality it turned out to be 50 percent.

As time went on, the compiler got cleaner and more performant, it generated nasty code less and less, and the performance started to resemble what a C compiler produces more and more. Unless, of course, you write some rubbish that even C can't speed up. If you write code like you would in C, you get performance like C in more and more cases. And the further you go, the more often you get code that asymptotically matches C, and the register allocator becomes the thing that decides whether your code is fast or slow. I kept working on the allocator so that it would make better choices. It got slower and slower, but it delivered better and better performance in cases where nobody else could. I could dive into the register allocator, bury a month of work in there, and suddenly all the code would start executing 5% faster. This happened time after time, and the register allocator became something of a work of art - everybody either loved it or hated it, and people from academia asked questions like "why is everything done this way", why not linear scan, and what the difference is. The answer is still the same: an allocator based on graph coloring plus very careful handling of the spill code equals a winning weapon, the best combination nobody can beat. And it's a rather non-obvious thing. Everything else the compiler does is pretty well-studied stuff, even though those parts have also been taken to the level of art. I always did things that were supposed to turn the compiler into a work of art. But none of it was anything out of the ordinary - with the exception of the register allocator. The trick is to get it to degrade gracefully under pressure, and when it does (I can explain in more detail if you're interested), it means you can inline more aggressively without the risk of falling off a cliff in the performance graph. In those days there were a bunch of full-scale compilers, hung with bells and whistles, which had register allocators, but nobody else could do this.
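For readers unfamiliar with the term, here is the core idea of graph-coloring allocation in a minimal Java sketch - not C2's allocator, just a greedy coloring pass over a made-up toy interference graph: values that are live at the same time "interfere" and must get different registers, and anything that cannot be colored has to be spilled to memory.

```java
// Not C2's allocator - just the core idea: build an interference graph over live ranges
// and greedily assign each virtual register a color (physical register) not used by
// its neighbors; anything that can't be colored would have to be spilled.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GraphColoringSketch {
    public static void main(String[] args) {
        int k = 2;                                    // pretend we only have 2 physical registers
        // Interference edges between virtual registers v0..v3 (a made-up toy example):
        // v0, v1, v2 are all live at the same time; v3 interferes with nothing.
        Map<Integer, Set<Integer>> interferes = Map.of(
            0, Set.of(1, 2),
            1, Set.of(0, 2),
            2, Set.of(0, 1),
            3, Set.of());

        Map<Integer, Integer> color = new HashMap<>();
        List<Integer> spilled = new ArrayList<>();
        for (int v : interferes.keySet()) {
            boolean[] used = new boolean[k];
            for (int n : interferes.get(v)) {
                Integer c = color.get(n);
                if (c != null) used[c] = true;        // a neighbor already holds that register
            }
            int c = 0;
            while (c < k && used[c]) c++;
            if (c < k) color.put(v, c);
            else spilled.add(v);                      // no free register: this value goes to memory
        }
        System.out.println("assignment: " + color + ", spilled: " + spilled);
    }
}
```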

The problem is that if you add methods to be inlined, growing and growing the inlining region, the set of live values instantly exceeds the number of registers, and you have to spill. The critical level usually comes when the allocator gives up, when one candidate for spilling is as good as another, and you start spilling completely wild things. The value of inlining is that you lose part of the overhead, the overhead of the call and the saves, you can see the values inside and you can optimize them further. The cost of inlining is that a large number of live values appear, and if your register allocator spills more than it should, you lose immediately. So most allocators have a problem: when inlining crosses a certain line, everything in the world starts to spill and performance can be flushed down the toilet. Compiler implementers add heuristics: for example, stop inlining starting at some large enough size, since the spills will ruin everything. This is how a cliff forms in the performance graph - you inline, inline, performance slowly grows - and then boom! It drops off sharply because you inlined too much. This is how things worked before Java came along. Java requires a lot more inlining, so I had to make my allocator much more aggressive, so that it levels off instead of falling off a cliff, and if you inline too much it starts to spill, but then the "no more spilling" moment still comes. This is an interesting observation, and it just came to me out of nowhere, not obvious, but it paid off well. I took up aggressive inlining and it took me to places where Java and C performance go side by side. They're really close - I can write Java code that's significantly faster than C code, and that kind of thing - but on average, in the big picture, they're roughly comparable. Part of that merit, it seems, belongs to the register allocator, which lets me inline as stupidly as possible. I just inline everything I see. The question is whether the allocator works well, whether the result is reasonably working code. That was a big challenge: to understand all of this and make it work.

A little about register allocation and multi-core

Vladimir: Problems like register allocation seem to be some kind of eternal endless topic. I wonder if there was an idea that seemed promising, and then failed in practice?

Cliff: Certainly! Register allocation is an area where you try to find heuristics for an NP-complete problem. And you can never get a perfect solution, right? It's simply not possible. Look, Ahead-of-Time compilation - it doesn't work well either. The conversation here is about average cases. About typical performance, so you can go and measure what you think is good typical performance - after all, you're working on improving it! Register allocation is a topic that's all about performance. Once you have the first prototype, it works and colors what it needs to, and then the performance work begins. You need to learn to measure well. Why is that important? If you have clear data, you can look at different areas and see: yes, it helped here, but everything broke over there! Some good ideas come along, you add new heuristics, and suddenly everything starts working a little better on average. Or it doesn't. I had a lot of cases where we fought over the five percent of performance that separated our development from the previous allocator. And every time it looks like this: won somewhere, lost somewhere. If you have good performance analysis tools, you can find the losing ideas and understand why they fail. Maybe it's worth leaving everything as it is, or maybe it's worth getting more serious about fine-tuning, or going and fixing something else. It's a whole bunch of things! I made this cool hack, but I also need this one, and this one, and this one - and only their combination together gives some improvement. On their own they can fail. This is the nature of performance work on NP-complete problems.

Vladimir: One gets the feeling that things like coloring in allocators are an already solved problem. Well, solved for you, judging by what you say - so is it even worth it then...

Cliff: It's not solved as such. It's up to you to turn it into "solved". There are difficult problems and they need to be solved. Once that's done, it's time to work on performance. That work should be approached accordingly - doing benchmarks, collecting metrics, explaining situations where, when you roll back to the previous version, your old hack starts working again (or, conversely, stops). And don't back off until you achieve something. As I said, there were cool ideas that didn't work out, but in the field of register allocation the supply of ideas is roughly endless. You can, for example, read scientific publications. Although this area has started moving much more slowly now and has become clearer than in the days of its youth. However, an infinite number of people work in this field, and all their ideas are worth trying; they are all waiting in the wings. And you can't tell how good they are unless you try them. How well do they integrate with everything else in your allocator? Because the allocator does a lot of things, and some ideas won't work in your particular allocator, but in another allocator - easily. The main way for the allocator to win is to pull the slow stuff off the main path and force it to split along the boundaries of the slow paths. If you want to run the GC, take the slow path, deoptimize, throw an exception, all that stuff - you know these things are relatively rare. And they really are rare, I checked. You do extra work, and it removes a lot of the restrictions on those slow paths, but that doesn't really matter, because they are slow and rarely taken. For example, a null pointer - it never happens, right? You need to have several paths for different things, but they should not get in the way of the main one.
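The same fast-path/slow-path split shows up at the plain Java level too. Here is a minimal sketch (the growable buffer is just an illustrative example, not something from C2): the common case stays tiny and easy for the JIT to inline and optimize, while the rare, expensive case lives in its own method off the hot path.

```java
// Illustration of keeping rare work off the hot path: the common case stays tiny,
// the rare case lives in its own method and does not bloat the hot code.
public class FastSlowPath {
    private int[] buf = new int[16];
    private int size;

    // Hot path: a bounds check and a store. The rare "grow" case is a separate call.
    void add(int value) {
        if (size == buf.length) grow();   // uncommon branch jumps to the slow path
        buf[size++] = value;
    }

    // Slow path: allocation and copying happen here, rarely, and do not add
    // code size or register pressure to the hot method.
    private void grow() {
        buf = java.util.Arrays.copyOf(buf, buf.length * 2);
    }

    public static void main(String[] args) {
        FastSlowPath v = new FastSlowPath();
        for (int i = 0; i < 1_000_000; i++) v.add(i);
        System.out.println(v.size);
    }
}
```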

Vladimir: What do you think about multi-core, when there are thousands of cores at once? Is this a useful thing?

Cliff: The success of GPUs shows that it's quite useful!

Vladimir: They are quite specialized. What about general-purpose processors?

Cliff: Well, that was Azul's business model. The answer came back in an era when people really wanted predictable performance. It was hard to write parallel code back then. The H2O coding model scales well, but it is not a general-purpose model. Though it is slightly more general than a GPU. Are we talking about the difficulty of developing such a thing or the difficulty of using it? For example, Azul taught me an interesting, rather non-obvious lesson: small caches are normal.

The biggest challenge in life

Vladimir: What about non-technical challenges?

Cliff: The biggest challenge was that I was not... kind and nice to people. As a result, I constantly found myself in extremely conflict-ridden situations. The ones where I knew things were going wrong, but didn't know how to move forward with those problems and couldn't deal with them. A lot of long-term problems, lasting for decades, arose that way. The fact that Java has the C1 and C2 compilers is a direct consequence of this. The fact that Java didn't have tiered compilation for ten years in a row is also a direct consequence. It was obvious we needed such a system, but it's not obvious why it wasn't there. I had problems with one engineer... or a group of engineers. Once upon a time, when I started working at Sun, I was... Well, not just then, I have always had my own opinion on everything. And I considered it true that you can simply take that truth of yours and say it head-on. Especially since I was shockingly right most of the time. And if you don't like that approach... especially if you are obviously wrong and doing nonsense... In general, few people could tolerate that form of communication. Although some could, like me. I built my whole life on meritocratic principles. If you show me something wrong, I will immediately turn around and say: you said nonsense. At the same time, of course, I apologize and all that, I note the merits if there are any, and I take other correct actions. On the other hand, I am shockingly right about a shockingly large percentage of the total time. And that doesn't work very well with people. I'm not trying to be nice, I put the question bluntly: "This will never work, because one, two and three." And they're like, "Oh!" There were other consequences that are probably better left out: for example, the ones that led to a divorce from my wife and a decade of depression after that.

Challenge is a fight with people, with their perception of what you can or cannot do, of what is important and what is not. There were many challenges about coding style. I still write a lot of code, and in those days I even had to slow down because I was doing too many parallel tasks and doing them badly, instead of focusing on one. Looking back, I wrote half of the code of the Java JIT team, the C2 team. The next fastest coder wrote half as much, the next one half of that, and it was an exponential drop-off. The seventh person in that row was very, very slow - that's how it always is! I touched a lot of code. I watched who wrote what, no exceptions, I stared at their code, I reviewed each of them, and I still kept writing more myself than any of them. That approach doesn't work very well with people. Some people don't like it. And when they can't handle it, all sorts of complaints begin. For example, one day I was told to stop writing code because I wrote too much code and it endangered the team, and it all sounded like a joke to me: dude, if the rest of the team disappears and I keep writing code, you only lose half the team. On the other hand, if I keep writing code and you lose half the team, that sounds like very bad management. I never really thought about it, never talked about it, but it was still somewhere in my head. In the back of my mind the thought kept spinning: "Are you all joking or what?" So the biggest problem was me and my relationships with people. Now I understand myself much better; I was a team lead over programmers for a long time, and now I tell people directly: you know, I am who I am, and you will have to deal with me - is it okay if I stand here? And once they started to deal with it, everything worked. In fact, I am neither bad nor good, I have no bad intentions or selfish aspirations, this is just who I am, and I need to live with it somehow.

Andrei: More recently, everyone has been talking about self-awareness for introverts, and generally about soft skills. What can be said about this?

Cliff: Yes, that was an insight and a lesson I learned from the divorce from my wife. What I learned from the divorce was self-understanding. That's how I began to understand other people. To understand how that interaction works. This led to discoveries one after another. There was an awareness of who I am and what I represent. What I'm doing: either I'm preoccupied with the task, or I'm avoiding conflict, or something else - and this level of self-awareness really helps to keep myself in check. After that, everything goes much easier. One thing I have found not only in myself, but also in other programmers, is the inability to verbalize thoughts when you are in a state of emotional stress. For example, you are sitting there coding, in a state of flow, and then people come running to you and start screaming in hysterics that something has broken and that now extreme measures will be taken against you. And you can't say a word, because you're in a state of emotional stress. The knowledge I gained allows you to prepare for that moment, survive it and move on to a retreat plan, after which you can actually do something. So yes, when you start to realize how it all works, it's a huge life-changing event.
I myself couldn't find the right words, but I remembered the sequence of actions. The bottom line is that this reaction is as much physical as it is verbal, and you need space. Space in the Zen sense. That's exactly what you need to explain, and then immediately step aside - physically step away. When I am verbally silent, I can process the situation emotionally. As the adrenaline reaches your brain and puts you into fight-or-flight mode, you can no longer say anything, no - now you are an idiot, a whipping-boy engineer, incapable of a decent answer or even of stopping the attack, and the attacker is free to attack again and again. First you need to become yourself again, take back control, get out of "fight or flight" mode.

And that's what verbal space is for. Just free space. If you say anything at all, then you can say just that, and then go and really find yourself a "space": go for a walk in the park, lock yourself in the shower - it doesn't matter. The main thing is to temporarily disconnect from that situation. As soon as you switch off for at least a few seconds, control returns and you begin to think soberly. "Okay, I'm not some kind of idiot, I don't do stupid things, I'm a pretty useful person." Once you've been able to convince yourself of that, it's time to move on to the next step: understanding what happened. You were attacked, the attack came from where it wasn't expected, it was a dishonest, vile ambush. That's bad. The next step is to understand why the attacker needed it. Really, why? Maybe because he's furious? Why is he furious? For example, because he screwed up himself and can't accept responsibility? That's how you need to carefully handle the whole situation. But this requires room for maneuver, verbal space. The very first step is to break off verbal contact. Avoid a verbal argument. Cancel it, get away as quickly as possible. If it's a phone call, just hang up - it's a skill I learned from my ex-wife. If the conversation isn't leading anywhere good, just say goodbye and hang up. On the other end of the line: "blah blah blah", and you answer: "yeah, bye!" and hang up. You just end the conversation. Five minutes later, when the ability to think sensibly returns to you and you've cooled down a bit, it becomes possible to think about everything: what actually happened and what will happen next. And to start formulating a thoughtful response, rather than just reacting to emotions. For me, the breakthrough in self-awareness was precisely realizing that under emotional stress I cannot speak. Getting out of that state, thinking and planning how to respond and compensate for the problems - those are the right steps when you cannot speak. The easiest way is to escape from the situation in which the emotional stress manifests itself and simply stop participating in that stress. After that, you regain the ability to think, and when you can think, it becomes possible to speak, and so on.

By the way, in court, the opposing side's lawyer tries to do this to you - now it's clear why. Because he has the ability to suppress you into such a state that you can't even pronounce your own name, for example. In the truest sense, you can't speak. If this happens to you, and if you know you'll end up in a place where verbal battles rage, a place like a court, then you can come with your lawyer. The lawyer will stand up for you and stop the verbal attack, and do it in a completely legal way, and you get your lost Zen space back. For example, I had to call my family a couple of times; the judge was quite friendly about it, but the opposing side's lawyer shouted and shouted at me, and I couldn't even get a word in. In such cases, using an intermediary works best for me. The mediator stops all the pressure pouring down on you in a continuous stream, you find the necessary Zen space, and with it the ability to speak returns. This is a whole field of knowledge in which there is a lot to learn, a lot to discover inside yourself, and all of it turns into high-level strategic decisions, different for different people. Some people do not have the problems described above; usually people who professionally work in sales do not have them. All those people who make their living with words - famous singers, poets, religious figures and politicians - they always have something to say. They don't have those problems, but I do.

Andrei: It was… unexpected. Great, we've already talked a lot and it's time to end this interview. We will certainly meet at the conference and we will be able to continue this dialogue. See you at Hydra!

It will be possible to continue communication with Cliff at the Hydra 2019 conference, which will be held on July 11-12, 2019 in St. Petersburg. He will come with a report "The Azul Hardware Transactional Memory experience". Tickets can be purchased on the official website.

Source: habr.com
