QEMU.js: now in a serious way and with WASM

Once upon a time, for the sake of a laugh, I decided to prove that the process is reversible and to learn how to generate JavaScript (or rather, Asm.js) from machine code. QEMU was chosen for the experiment, and some time later an article about it appeared on Habr. In the comments I was advised to redo the project with WebAssembly, but I somehow didn't feel like abandoning the almost-finished project... The work went on, but very slowly, and recently a comment appeared under that article along the lines of "So, how did it end?". In response to my detailed answer I heard "This deserves an article." Well, if it deserves one, there will be one. Maybe it will come in handy for someone. From it the reader will learn some facts about QEMU's code generation backends, as well as how to write a Just-in-Time compiler for a web application.

Tasks

Since I had already learned how to port QEMU to JavaScript "somehow or other", this time it was decided to do it wisely and not repeat old mistakes.

Mistake number one: branching off from a point release

My first mistake was to fork my version from upstream version 2.4.1. Back then it seemed like a good idea: if a point release exists, it is probably more stable than plain 2.4, and all the more so than the master branch. And since I planned to add a fair number of bugs of my own, I had no need for anyone else's. That is probably how it turned out. But here's the problem: QEMU does not stand still, and at some point an optimization of the generated code by 10 percent was even announced. "Yeah, now I'll get stuck," I thought, and broke off. Here a digression is needed: due to the single-threaded nature of QEMU.js and the fact that the original QEMU assumes multithreading is available (that is, the ability to run several unrelated code paths at once is critical for it, not just "use all cores"), the main thread functions had to be "turned inside out" so they could be called from outside. This created some natural problems during merges. And the fact that some of the changes from the master branch, with which I tried to merge my code, had also been cherry-picked into the point release (and therefore into my branch), probably did not add convenience either.

In general, I decided that it still made sense to throw out the prototype, take it apart for spare parts, and build a new version from scratch based on something fresher, this time from master.

Mistake number two: the TLP (slapdash programming) methodology

In essence, this is not a mistake at all; it is just a feature of building a project in conditions of complete incomprehension of both "where and how to move?" and, in general, "will we get there?". In those conditions slapdash programming was a justified option, but, of course, I did not want to repeat it unnecessarily. This time I wanted to do it smartly: atomic commits, conscious code changes (and not "stringing random characters together until it compiles (with warnings)", as Linus Torvalds once said about someone, according to Wikiquote), etc.

Mistake number three: wading into the water without knowing the ford

Even now I have not completely gotten rid of this, but this time I decided not to take the path of least resistance and to do it "like an adult", namely, to write my own TCG backend from scratch, so that later I would not have to say, "Yes, this is slow, of course, but I can't control everything - that's just how TCI is written...". At first this also seemed like an obvious solution, since I generate binary code. As the saying goes, "right church, wrong pew": the code is, of course, binary, but control cannot simply be transferred to it - it must be explicitly pushed into the browser for compilation, which produces some object from the JS world that still has to be saved somewhere. However, on normal RISC architectures, as far as I understand, a typical situation is the need to explicitly flush the instruction cache for regenerated code - if that is not exactly what we need, it is at least close. In addition, from my last attempt I learned that control does not seem to be transferred into the middle of a translation block, so we do not really need bytecode interpretable from any offset, and we can simply generate one function per TB.

Came and kicked

Although I had started rewriting the code back in July, the magic kick crept up unnoticed: usually letters from GitHub arrive as notifications of replies to issues and pull requests, but here, suddenly, there was a mention in a Binaryen thread about a QEMU backend, in the spirit of "Here's someone who did something like that, maybe he'll have something to say." It was about using Binaryen, a library related to Emscripten, to create a WASM JIT. Well, I said that they have an Apache 2.0 license there, while QEMU as a whole is distributed under GPLv2, and they are not very compatible. It suddenly turned out that the license could somehow be fixed (I don't know: maybe changed, maybe dual licensing, maybe something else...). This, of course, made me happy, because by that time I had already taken a close look at the binary format of WebAssembly, and it left me somehow sad and confused. But here was a library that would gobble up the basic blocks with their transition graph, produce the bytecode, and even run it itself in an interpreter if necessary.

Then there was a post on the QEMU mailing list, but that was more about the question "Who needs this at all?". And it suddenly turned out to be necessary. At a minimum, the following possible uses can be scraped together, if it works more or less briskly:

  • launching something educational without any installation at all
  • virtualization on iOS, where, according to rumors, the only application with the right to generate code on the fly is the JS engine (is that true?)
  • demonstrating mini-OSes: single-floppy ones, embedded ones, all sorts of firmware, etc...

Browser runtime features

As I said, QEMU is tied to multithreading, but the browser does not have it. Well, that is, not exactly... At first it did not exist at all, then WebWorkers appeared - as far as I understand, this is multithreading based on message passing without shared variables. Naturally, this creates significant problems when porting existing code based on the shared memory model. Then, under public pressure, shared memory was also implemented, under the name SharedArrayBuffer. It was introduced gradually, its launch in different browsers was celebrated, then the new year was celebrated, and then came Meltdown... After which it was concluded that, coarsen the time measurement or not, with shared memory and a thread incrementing a counter you can still measure time quite accurately. So multithreading with shared memory was turned off. It seems it was later turned on again, but, as the first experiment showed, there is life without it, and if so, we will try to do it without relying on multithreading.

The second feature is the impossibility of low-level stack manipulation: you cannot just save the current context and switch to a new one with a new stack. The call stack is managed by the JS virtual machine. It would seem, what's the problem, given that we decided to manage the former threads completely manually anyway? The fact is that block I/O in QEMU is implemented through coroutines, and this is exactly where low-level stack manipulation would come in handy. Fortunately, Emscripten already contains a mechanism for asynchronous operations - two, in fact: Asyncify and the Emterpreter. The former works through significant bloat of the generated JavaScript code and is no longer supported. The latter is the current "correct way" and works through bytecode generation for a built-in interpreter. It works slowly, of course, but it does not bloat the code. True, coroutine support for this mechanism had to be contributed independently (there were already coroutines written for Asyncify, and there was an implementation of roughly the same API for the Emterpreter; they just had to be connected).

At the moment I have not yet managed to split the code into a part compiled into WASM and a part interpreted with the Emterpreter, so block devices do not work yet (see the next episode, as they say...). That is, in the end you should get this funny layered thing:

  • interpreted block I/O. Well, did you really expect an emulated NVMe with native performance? 🙂
  • statically compiled QEMU core code (translator, other emulated devices, etc.)
  • dynamically compiled into WASM guest code

Features of QEMU sources

As you have probably already guessed, the code for emulating guest architectures and the code for generating host machine instructions are separated in QEMU. In fact, it is even a bit trickier:

  • there are guest architectures
  • there are accelerators, namely KVM for hardware virtualization on Linux (for compatible guest and host systems) and TCG for JIT code generation anywhere; starting with QEMU 2.9, there is also support for the HAXM hardware virtualization standard on Windows
  • if TCG is used rather than hardware virtualization, it has separate code generation support for each host architecture, as well as a universal interpreter
  • ...and around all this: emulated peripherals, the user interface, migration, record-replay, etc.

By the way, did you know that QEMU can emulate not only an entire computer, but also a processor for a single user process on the host kernel? This is used, for example, by the AFL fuzzer to instrument binaries. Perhaps someone would like to port this QEMU mode of operation to JS? 😉

Like most long-lived free software, QEMU is built by calling configure and make. Suppose you decide to add something: a TCG backend, a thread implementation, something else. Do not rush to rejoice at / be horrified by (underline as appropriate) the prospect of dealing with Autoconf - in fact, QEMU's configure is apparently hand-written and is not generated from anything.

WebAssembly

So what is this thing, WebAssembly (aka WASM)? It is a replacement for Asm.js that no longer pretends to be valid JavaScript code. On the contrary, it is purely binary and optimized, and even just writing an integer into it is not entirely simple: for compactness, it is stored in the LEB128 format.
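To get a feel for the format, here is a minimal unsigned LEB128 encoder. This is a generic sketch of the encoding itself, not code from QEMU.js or Binaryen:

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal unsigned LEB128 encoder: each output byte carries 7 bits of
   payload, and the high bit says "more bytes follow".
   Returns the number of bytes written to out. */
static size_t uleb128_encode(uint64_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;   /* continuation bit */
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```

So, for example, the value 624485 becomes the three bytes 0xE5 0x8E 0x26, while small values like type indices fit in a single byte.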

You may have heard of the relooping algorithm for Asm.js: the reconstruction of "high-level" control flow constructs (that is, if-then-else, loops, etc.), which JS engines are designed for, from a low-level LLVM IR that is closer to the machine code executed by the processor. Naturally, the intermediate representation of QEMU is closer to the latter. It would seem, here it is, bytecode, the end of torment... And then come blocks, if-then-else and loops!..

And this is another reason why Binaryen is useful: it can naturally accept high-level blocks close to what will be stored in WASM. But it can also produce code from a graph of basic blocks and the transitions between them. And I have already mentioned that it hides the WebAssembly storage format behind a convenient C/C++ API.

TCG (Tiny Code Generator)

TCG was originally a backend for a C compiler. Later it apparently could not stand up to the competition with GCC, but in the end it found its place within QEMU as a code generation mechanism for the host platform. There is also a TCG backend that generates some abstract bytecode which is immediately executed by an interpreter, but I decided to avoid using it this time. However, the fact that QEMU already provides the ability to jump into a generated TB through the tcg_qemu_tb_exec function came in handy for me.

To add a new TCG backend to QEMU, you need to create a subdirectory tcg/<architecture name> (in this case, tcg/binaryen) containing two files, tcg-target.h and tcg-target.inc.c, and register it all in configure. You can put other files there, but, as you can guess from the names of these two, they will both be included somewhere: one as a regular header file (it is included in tcg/tcg.h, which in turn is included in other files in the tcg and accel directories and beyond), the other only as a code snippet in tcg/tcg.c, but with access to its static functions.

Deciding that I would spend too much time on a detailed investigation of how it all works, I simply copied the "skeletons" of these two files from another backend implementation, honestly stating this in the license header.

The file tcg-target.h mainly contains settings in the form of #defines:

  • how many registers there are and how wide they are on the target architecture (we have as many as we want; the question is rather what the browser on the "fully target" architecture will compile into more efficient code...)
  • alignment of host instructions: on x86, and even in TCI, instructions are not aligned at all, but I am going to put into the code buffer not instructions but pointers to Binaryen library structures, so I will say: 4 bytes
  • which optional instructions the backend can generate: we include everything we find in Binaryen, and let the accelerator break the rest down into simpler ones
  • the approximate size of the TLB cache requested by the backend. The thing is, everything is serious in QEMU: although there are helper functions that perform load/store taking the guest MMU into account (and where would we be without it now?), they save their translation cache in the form of a structure whose processing is convenient to embed directly into the translation blocks. The question is which offset into this structure can be processed most efficiently by a small and fast sequence of commands
  • here you can also tweak the purpose of one or two reserved registers, enable TB calls through a function, and optionally describe a couple of small inline functions like flush_icache_range (but that is not our case)
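As a rough illustration, a fragment of such a header might look like this. The macro names follow TCG's backend conventions, but the concrete values are guesses for illustration, not the actual QEMU.js sources:

```c
/* Hypothetical fragment of tcg/binaryen/tcg-target.h; values are
   illustrative only. */

#define TCG_TARGET_REG_BITS 32            /* "register" width */
#define TCG_TARGET_NB_REGS  16            /* as many as we care to have */
#define TCG_TARGET_INSN_UNIT_SIZE 4       /* we store 4-byte pointers, not instructions */
#define TCG_TARGET_TLB_DISPLACEMENT_BITS 16  /* requested TLB cache reach */

/* optional ops: advertise only what maps onto Binaryen */
#define TCG_TARGET_HAS_div_i32   1
#define TCG_TARGET_HAS_rot_i32   1
#define TCG_TARGET_HAS_ext8s_i32 1

/* nothing to flush: the "code" is just pointers for Binaryen */
static inline void flush_icache_range(uintptr_t start, uintptr_t stop)
{
}
```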

The file tcg-target.inc.c is, of course, usually much larger and contains several essential things:

  • initialization, specifying, among other things, restrictions on which instruction can work with which operands. Brazenly copied by me from another backend
  • a function that takes one internal bytecode instruction
  • auxiliary functions can also go here, and they can use static functions from tcg/tcg.c

For myself, I chose the following strategy: in the first words of each translation block I wrote down four pointers: a start marker (a certain value in the vicinity of 0xFFFFFFFF, by which the current state of the TB was determined), the context, the generated module, and a magic number for debugging. At first the marker was set to 0xFFFFFFFF - n, where n is a small positive number, and it was incremented by 1 with each execution through the interpreter. When it reached 0xFFFFFFFE, compilation took place, the module was stored in a function table imported into a small "launcher" to which execution was transferred from tcg_qemu_tb_exec, and the module was removed from QEMU memory.
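The life cycle of that marker word can be sketched roughly like this. The struct and helper names here are made up for illustration, not taken from the actual QEMU.js sources:

```c
#include <stdint.h>

/* Once the marker reaches this value, the TB is "hot" and gets
   compiled to a WASM module. */
#define TB_COMPILE_THRESHOLD 0xFFFFFFFEu

struct tb_header {
    uint32_t marker;   /* starts at 0xFFFFFFFF - n, counts up */
    void *context;
    void *module;      /* BinaryenModuleRef in the real code */
};

/* Called on each entry into the translation block: returns 0 while the
   block should still run through the interpreter, 1 once the marker
   has reached the threshold and the block should be compiled. */
static int tb_should_compile(struct tb_header *tb)
{
    if (tb->marker < TB_COMPILE_THRESHOLD) {
        tb->marker++;   /* one more interpreted execution */
        return 0;
    }
    return 1;           /* hot enough: compile and dispatch */
}
```

So with n = 3, the block runs twice through the interpreter and is compiled on the third entry.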

To paraphrase the classics, "Crutch, how much in this sound is intertwined for a programmer's heart...". However, the memory was leaking somewhere. And it was memory managed by QEMU! I had code that, when writing the next instruction (well, that is, a pointer), deleted the one referenced at that place earlier, but it did not help. In fact, in the simplest case QEMU allocates memory at startup and writes the generated code there. When the buffer runs out, the code is discarded, and the next portion is written in its place.

Having studied the code, I realized that the magic-number crutch allowed me not to crash on heap corruption by avoiding freeing something wrong in the uninitialized buffer on the first pass. But who was overwriting the buffer around my function later? As the Emscripten developers advise when you run into a problem, I ported the resulting code back into a native application and set Mozilla Record-Replay on it... In the end I understood a simple thing: for each block, a struct TranslationBlock with its description is allocated. Guess where... That's right, right before the block, right in the buffer. Realizing this, I decided to be done with crutches (at least some of them): I simply threw out the magic number and moved the remaining words into struct TranslationBlock, creating a singly linked list that can be quickly traversed when the translation cache is flushed, freeing the memory.
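The singly linked list mentioned above can be sketched like this: each TB record remembers the previously created one, so flushing the translation cache is one walk that disposes every compiled module. Struct and function names are made up for illustration:

```c
#include <stddef.h>

struct tb_wasm_info {
    void *module;                /* compiled module handle */
    struct tb_wasm_info *prev;   /* previously created TB, or NULL */
};

static struct tb_wasm_info *tb_list_head;

/* Register a freshly created translation block. */
static void tb_list_add(struct tb_wasm_info *tb)
{
    tb->prev = tb_list_head;
    tb_list_head = tb;
}

/* Called when QEMU flushes the translation cache; returns the number
   of blocks disposed. dispose may be NULL if the modules are freed
   elsewhere. */
static size_t tb_list_flush(void (*dispose)(void *module))
{
    size_t count = 0;
    while (tb_list_head) {
        struct tb_wasm_info *tb = tb_list_head;
        tb_list_head = tb->prev;
        if (dispose)
            dispose(tb->module);
        count++;
    }
    return count;
}
```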

Some crutches remain: for example, tagged pointers in the code buffer. Some of them are simply BinaryenExpressionRef, that is, they point to expressions that need to be put linearly into the generated basic block; some are transition conditions between basic blocks; some are jump targets. And there are also ready-made blocks for the Relooper that need to be connected according to the conditions. To distinguish between them, the assumption is used that they are all aligned to at least four bytes, so the lower two bits can safely be used for the tag; you just need to remember to strip it when necessary. By the way, such tags are already used in QEMU to indicate the reason for exiting the TCG loop.
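The low-bit tagging trick itself looks roughly like this. The particular tag values and names here are illustrative, not taken from the actual sources:

```c
#include <stdint.h>
#include <assert.h>

/* Everything in the code buffer is at least 4-byte aligned, so the two
   low bits are free to say what kind of pointer this is. */
enum ptr_kind {
    PTR_EXPR   = 0,   /* BinaryenExpressionRef to emit linearly */
    PTR_COND   = 1,   /* transition condition between basic blocks */
    PTR_TARGET = 2,   /* where to jump */
    PTR_BLOCK  = 3,   /* ready-made Relooper block */
};

static uintptr_t ptr_tag(void *p, enum ptr_kind kind)
{
    assert(((uintptr_t)p & 3) == 0);   /* relies on 4-byte alignment */
    return (uintptr_t)p | (uintptr_t)kind;
}

static enum ptr_kind ptr_kind_of(uintptr_t tagged)
{
    return (enum ptr_kind)(tagged & 3);
}

static void *ptr_untag(uintptr_t tagged)
{
    return (void *)(tagged & ~(uintptr_t)3);   /* strip the tag before use */
}
```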

Using Binaryen

Modules in WebAssembly contain functions, each of which contains a body, which is an expression. An expression can be a unary or binary operation, a block consisting of a list of other expressions, control flow, etc. As I already said, control flow here is organized precisely as high-level branches, loops, function calls, etc. Arguments are not passed to functions on the stack but explicitly, as in JS. There are also global variables, but I did not use them, so I will not talk about them.

Functions also have local variables, numbered from zero, of type int32 / int64 / float / double. The first n local variables are the arguments passed to the function. Please note that although everything here is not entirely low-level in terms of control flow, integers still do not carry a "signed/unsigned" attribute: how a number behaves depends on the operation code.

Generally speaking, Binaryen provides a simple C API: you create a module, and in it you create expressions - unary, binary, blocks of other expressions, control flow, etc. Then you create a function whose body should be an expression. If, like me, you have a low-level transition graph, the relooper component will help you. As far as I understand, high-level control flow can be used inside a block as long as it does not go beyond the block - that is, you can make an internal fast path / slow path branch inside the inlined TLB cache processing code, but you cannot interfere with the "outer" control flow. When you free a relooper, its blocks are freed; when you free a module, the expressions, functions, etc. allocated in its arena disappear.

However, if you want to interpret code on the fly without needlessly creating and deleting interpreter instances, it may make sense to move this logic into a C++ file and drive the entire C++ API of the library directly from there, bypassing the ready-made wrappers.

So, to generate code you need to:

// set up global parameters (they can be changed later)
BinaryenSetAPITracing(0);

BinaryenSetOptimizeLevel(3);
BinaryenSetShrinkLevel(2);

// create a module
BinaryenModuleRef MODULE = BinaryenModuleCreate();

// describe the function types (both created and called)
helper_type = BinaryenAddFunctionType(MODULE, "helper-func", BinaryenTypeInt32(), int32_helper_args, ARRAY_SIZE(int32_helper_args));
// (int32_helper_args are purch^W created separately)

// construct the super-mega expression
// ... well, that part is up to you :)

// then create the function
BinaryenAddFunction(MODULE, "tb_fun", tb_func_type, func_locals, FUNC_LOCALS_COUNT, expr);
BinaryenAddFunctionExport(MODULE, "tb_fun", "tb_fun");
...
BinaryenSetMemory(MODULE, (1 << 15) - 1, -1, NULL, NULL, NULL, NULL, NULL, 0, 0);
BinaryenAddMemoryImport(MODULE, NULL, "env", "memory", 0);
BinaryenAddTableImport(MODULE, NULL, "env", "tb_funcs");

// request validation and, if desired, optimization
assert(BinaryenModuleValidate(MODULE));
BinaryenModuleOptimize(MODULE);

... if I forgot something, sorry; this is just to convey the scale, and the details are in the documentation.

And now the crex-fex-pex (hocus-pocus) begins, something like this:

static char buf[1 << 20];
BinaryenModuleOptimize(MODULE);
BinaryenSetMemory(MODULE, 0, -1, NULL, NULL, NULL, NULL, NULL, 0, 0);
int sz = BinaryenModuleWrite(MODULE, buf, sizeof(buf));
BinaryenModuleDispose(MODULE);
EM_ASM({
  var module = new WebAssembly.Module(new Uint8Array(wasmMemory.buffer, $0, $1));
  var fptr = $2;
  var instance = new WebAssembly.Instance(module, {
      'env': {
          'memory': wasmMemory,
          // ...
      }
  });
  // and now you have an instance!
}, buf, sz);

In order to somehow connect the QEMU world and the JS world and at the same time enter the compiled functions quickly, an array was created (a table of functions for import into the launcher), and the generated functions were placed in it. To calculate the index quickly, the index of the translation block's zeroth word was used at first, but then the index computed by that formula began to simply be stored in a field in struct TranslationBlock.

Incidentally, the demo (still with a murky license) only works properly in Firefox. The Chrome developers were somehow not ready for someone wanting to create more than a thousand instances of WebAssembly modules, so they simply allocated a gigabyte of virtual address space to each...

That's all for now. Perhaps there will be another article if anyone is interested. Namely, there remains at least making block devices work. It may also make sense to make the compilation of WebAssembly modules asynchronous, as is customary in the JS world, since there is still an interpreter that can do all the work until the native module is ready.

One last riddle: you have compiled a binary for a 32-bit architecture, but the code generated by Binaryen, through memory operations, reaches somewhere onto the stack or somewhere else in the upper 2 GB of the 32-bit address space. The problem is that from Binaryen's point of view this is an access to a resulting address that is too large. How do you get around it?

The admin's way

I did not end up testing it, but my first thought was "What if I install 32-bit Linux?" Then the upper part of the address space would be occupied by the kernel. The only question is how much would be occupied: 1 or 2 GB.

The programmer's way (an option for practitioners)

Blow up a bubble at the top of the address space. I do not understand why it works: the stack should already be there. But "we are practitioners: everything works for us, but nobody knows why..."

// 2gbubble.c
// Usage: LD_PRELOAD=2gbubble.so <program>

#include <sys/mman.h>
#include <assert.h>

void __attribute__((constructor)) constr(void)
{
  // reserve [2 GiB, 4 GiB - 1 MiB) with an inaccessible mapping
  assert(MAP_FAILED != mmap((void *)(1u << 31), (1u << 31) - (1u << 20),
                            PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0));
}

... it is not compatible with Valgrind, though, but fortunately Valgrind itself very effectively evicts everyone from there 🙂

Perhaps someone will give a better explanation of how this code of mine works...

Source: habr.com
