A simple hash table for the GPU

I have posted a new project on GitHub: A Simple GPU Hash Table.

This is a simple GPU hash table capable of processing hundreds of millions of inserts per second. On my NVIDIA GTX 1060 laptop, the code inserts 64 million randomly generated key-value pairs in about 210ms and removes 32 million pairs in about 64ms.

That works out to roughly 300 million inserts per second and 500 million deletions per second on a laptop.

The table is written in CUDA, although the same technique could equally be applied to HLSL or GLSL. The implementation accepts several limitations that keep performance high on the GPU:

  • Only 32-bit keys and 32-bit values are handled.
  • The hash table has a fixed size.
  • That size must be a power of two.

One value must be reserved as an empty sentinel for both keys and values (in the code below, this is 0xffffffff).
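For concreteness, here is roughly what those reserved constants look like. This is only a sketch, and the exact capacity is just an example, but the names kEmpty and kHashTableCapacity are the ones used by the code below.

// Number of slots in the table; must be a power of two (the value here is illustrative).
const uint32_t kHashTableCapacity = 128 * 1024 * 1024;

// Reserved sentinel meaning "empty", for both keys and values.
const uint32_t kEmpty = 0xffffffff;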

A lock-free hash table

The hash table uses open addressing with linear probing, so it is just an array of key-value pairs stored contiguously in memory, which gives excellent cache behavior. That is not true of chaining, which has to chase pointers through a linked list. The hash table is simply an array of KeyValue elements:

struct KeyValue
{
    uint32_t key;
    uint32_t value;
};

The size of the table is a power of two rather than a prime number, because wrapping an index then takes a single fast AND with the pow2 mask, whereas the modulo operator is much slower. This matters for linear probing, because the slot index has to be wrapped around the table capacity at every probed slot, so with a prime-sized table the cost of a modulo would be paid at every step.
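As a small illustration of the difference (a sketch, not code from the repository), wrapping the probe index with a power-of-two mask is a single AND, while an arbitrary capacity forces a modulo on every probe step:

// Power-of-two capacity: advance and wrap with one AND instruction.
__device__ uint32_t next_slot_pow2(uint32_t slot, uint32_t capacity_pow2)
{
    return (slot + 1) & (capacity_pow2 - 1);
}

// Arbitrary (for example, prime) capacity: every probe step pays for a modulo.
__device__ uint32_t next_slot_generic(uint32_t slot, uint32_t capacity)
{
    return (slot + 1) % capacity;
}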

The table stores only the key and value for each element, not the hash of the key. Since the keys are only 32 bits, the hash is recomputed very quickly. The code uses a Murmur3 hash, which is just a few shifts, XORs, and multiplications.
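A sketch of such a hash, the standard 32-bit Murmur3 finalizer masked down to the table capacity (the repository's version may differ in minor details):

// 32-bit Murmur3 finalizer ("fmix32"), masked to the power-of-two capacity.
__device__ uint32_t hash(uint32_t k)
{
    k ^= k >> 16;
    k *= 0x85ebca6b;
    k ^= k >> 13;
    k *= 0xc2b2ae35;
    k ^= k >> 16;
    return k & (kHashTableCapacity - 1);
}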

The hash table uses a lock-free technique that does not depend on memory ordering. Even if some writes are reordered relative to others, the hash table still remains in a valid state; we will come back to this below. The technique works well on GPUs, where thousands of threads run concurrently.

The keys and values in the hash table are initialized to the empty sentinel.

The code can be modified to handle 64-bit keys and values as well. Keys require atomic reads, writes, and compare-and-swap operations; values require atomic reads and writes. Fortunately, in CUDA, reads and writes of 32-bit and 64-bit values are atomic as long as they are naturally aligned, and modern GPUs support 64-bit atomic compare-and-swap. Of course, moving to 64 bits costs some performance.
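A minimal sketch of what the 64-bit key path could look like. This is not part of the repository: the KeyValue64 struct, the kEmpty64 sentinel, and try_claim_slot() are hypothetical names, and the code relies on CUDA's 64-bit atomicCAS on unsigned long long.

struct KeyValue64
{
    uint64_t key;
    uint64_t value;
};

const uint64_t kEmpty64 = 0xffffffffffffffffull;

// Claim the slot's key with a 64-bit compare-and-swap, then store the value.
// A naturally aligned 64-bit store is atomic on the GPU, so no CAS is needed for it.
__device__ bool try_claim_slot(KeyValue64* slot, uint64_t key, uint64_t value)
{
    unsigned long long prev = atomicCAS(reinterpret_cast<unsigned long long*>(&slot->key),
                                        (unsigned long long)kEmpty64,
                                        (unsigned long long)key);
    if (prev == (unsigned long long)kEmpty64 || prev == (unsigned long long)key)
    {
        slot->value = value;
        return true;
    }
    return false;
}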

Hash table state

Each key-value pair in a hash table can have one of four states:

  • The key and value are empty. This is the state the hash table is initialized to.
  • The key has been written, but the value has not yet been written. If another thread reads this slot right now, it gets back an empty value. That is fine: the same thing would have happened if the other thread had simply run a little earlier, and this is a concurrent data structure.
  • Both the key and the value have been written.
  • The value is visible to other threads, but the key is not yet. This can happen because CUDA's programming model assumes a weakly ordered memory model. That is also fine: in any event, the key still reads as empty, even though the value no longer does.

An important nuance: once a key has been written to a slot, it never moves, even when the key is later deleted (more on this below).

The hash table code works even under weakly ordered memory models, where the order in which reads and writes become visible is not known. As we walk through insertion, lookup, and deletion, keep in mind that every key-value pair is in one of the four states described above.

Hash table insert

The CUDA function that inserts key-value pairs into a hash table looks like this:

__device__ void gpu_hashtable_insert(KeyValue* hashtable, uint32_t key, uint32_t value)
{
    uint32_t slot = hash(key);

    while (true)
    {
        uint32_t prev = atomicCAS(&hashtable[slot].key, kEmpty, key);
        if (prev == kEmpty || prev == key)
        {
            hashtable[slot].value = value;
            break;
        }
        slot = (slot + 1) & (kHashTableCapacity-1);
    }
}

To insert a key, the code iterates over the hash table array starting at the hash of the key being inserted. At each slot it performs an atomic compare-and-swap that compares the key in that slot against empty; if they match, the slot's key is replaced with the inserted key. atomicCAS returns whatever key was originally in the slot; if that original key was empty or equal to the key being inserted, the code has found a slot to use and writes the inserted value into it.

If a single call to gpu_hashtable_insert() contains several elements with the same key, any one of their values may end up in that key's slot. This is considered correct behavior: one of the key-value writes during the call will win, but since all of them happen in parallel across many threads, we cannot predict which write will land last.

Hash table lookup

Code for finding keys:

__device__ uint32_t gpu_hashtable_lookup(KeyValue* hashtable, uint32_t key)
{
    uint32_t slot = hash(key);

    while (true)
    {
        if (hashtable[slot].key == key)
        {
            return hashtable[slot].value;
        }
        if (hashtable[slot].key == kEmpty)
        {
            return kEmpty;
        }
        slot = (slot + 1) & (kHashTableCapacity - 1);
    }
}

To find the value of a key stored in a table, we iterate over the array starting from the hash of the key we are looking for. In each slot, we check if the key is the one we are looking for, and if so, return its value. We also check if the key is empty, and if so, we abort the search.

If we can't find the key, then the code returns an empty value.

All of these lookups can run concurrently with inserts and deletes. Each pair in the table will be in one of the four states described above with respect to the thread doing the lookup.

Deletion in a hash table

Code to remove keys:

__device__ void gpu_hashtable_delete(KeyValue* hashtable, uint32_t key, uint32_t value)
{
    uint32_t slot = hash(key);

    while (true)
    {
        if (hashtable[slot].key == key)
        {
            hashtable[slot].value = kEmpty;
            return;
        }
        if (hashtable[slot].key == kEmpty)
        {
            return;
        }
        slot = (slot + 1) & (kHashTableCapacity - 1);
    }
}

Removing a key works in an unusual way: we leave the key in the table and mark its value (not the key itself) as empty. The code is very similar to lookup(), except that when it finds a matching key, it sets that key's value to empty.

As mentioned above, once a key has been written to a slot it never moves. Even when an element is deleted, the key stays in place; only its value becomes empty. This also means the value write does not need to be atomic: it doesn't matter whether the current value is empty or not, because either way it ends up empty.

Resizing the hash table

You can resize the hash table by creating a larger table and inserting the non-empty elements of the old table into it. I didn't implement this because I wanted to keep the sample code simple; besides, in CUDA programs memory allocation is usually done in host code rather than inside a CUDA kernel.
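Still, a grow operation could be sketched roughly as follows. This is hypothetical code, not part of the repository; it reuses gpu_hashtable_insert() from above, which as written masks with the fixed kHashTableCapacity, so a real version would pass the new capacity in explicitly. Note that skipping slots with an empty value also drops deleted elements during the rehash.

// Hypothetical rehash kernel: each thread re-inserts one slot of the old table
// into a larger, freshly initialized table. Empty slots and deleted elements
// (key present, value empty) are skipped.
__global__ void gpu_hashtable_rehash_kernel(KeyValue* new_table, const KeyValue* old_table,
                                            unsigned int old_capacity)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < old_capacity)
    {
        uint32_t key = old_table[i].key;
        uint32_t value = old_table[i].value;
        if (key != kEmpty && value != kEmpty)
        {
            gpu_hashtable_insert(new_table, key, value);
        }
    }
}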

The article A Lock-Free Wait-Free Hash Table describes how to grow such a lock-free data structure.

Concurrency

The snippets above show gpu_hashtable_insert(), _lookup(), and _delete() processing one key-value pair at a time. Below, the same operations process an array of pairs in parallel, with each pair handled by its own GPU thread:

// CPU code to invoke the CUDA kernel on the GPU
uint32_t threadblocksize = 1024;
uint32_t gridsize = (numkvs + threadblocksize - 1) / threadblocksize;
gpu_hashtable_insert_kernel<<<gridsize, threadblocksize>>>(hashtable, kvs, numkvs);

// GPU code to process numkvs key/values in parallel
__global__ void gpu_hashtable_insert_kernel(KeyValue* hashtable, const KeyValue* kvs, unsigned int numkvs)
{
    unsigned int threadid = blockIdx.x*blockDim.x + threadIdx.x;
    if (threadid < numkvs)
    {
        gpu_hashtable_insert(hashtable, kvs[threadid].key, kvs[threadid].value);
    }
}
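The lookup and delete kernels follow the same pattern. Their exact code isn't shown in this post, so the following is only a sketch of what the lookup kernel could look like; here each thread writes its result back into the value field of its own input pair.

// Sketch of a companion lookup kernel: one GPU thread per key.
__global__ void gpu_hashtable_lookup_kernel(KeyValue* hashtable, KeyValue* kvs, unsigned int numkvs)
{
    unsigned int threadid = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadid < numkvs)
    {
        kvs[threadid].value = gpu_hashtable_lookup(hashtable, kvs[threadid].key);
    }
}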

The lock-free hash table supports concurrent inserts, lookups, and deletes. Because key-value pairs are always in one of the four states and keys never move, the table guarantees correctness even when different kinds of operations run at the same time.

However, if we process a batch of inserts and deletes in parallel and the input array contains duplicate keys, we cannot predict which of those pairs will "win", that is, be written to the hash table last. Say we call the insert code with the input pairs A/0 B/1 A/2 C/3 A/4. When the code completes, B/1 and C/3 are guaranteed to be present, but only one of A/0, A/2, or A/4 will be. Depending on the application this may or may not be a problem: you might know in advance that the input contains no duplicate keys, or you might not care which value is written last.

If this is a problem for you, the duplicates need to be split across separate CUDA kernel launches. In CUDA, everything in one kernel launch completes before the next kernel launch begins (at least within a single stream; kernels in different streams run in parallel). In the example above, if one kernel is launched with A/0 B/1 A/2 C/3 and a second one with A/4, the key A is guaranteed to end up with the value 4.
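A minimal sketch of that idea, assuming the duplicates have already been split into two batches on the CPU (this helper and its parameters are hypothetical):

// Launching both kernels into the same (default) stream guarantees that the
// first batch completes before the second starts, so duplicates in the second
// batch overwrite values written by the first.
void insert_in_two_ordered_batches(KeyValue* hashtable,
                                   const KeyValue* batch1, uint32_t n1,
                                   const KeyValue* batch2, uint32_t n2)
{
    const uint32_t threadblocksize = 1024;
    gpu_hashtable_insert_kernel<<<(n1 + threadblocksize - 1) / threadblocksize, threadblocksize>>>(hashtable, batch1, n1);
    gpu_hashtable_insert_kernel<<<(n2 + threadblocksize - 1) / threadblocksize, threadblocksize>>>(hashtable, batch2, n2);
}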

Now, should lookup() and delete() use a plain or a volatile pointer to the array of key-value pairs? The CUDA documentation states:

The compiler is free to optimize reads and writes to global or shared memory... These optimizations can be disabled using the volatile keyword: ... any reference to this variable is compiled to an actual memory read or write instruction.

Correctness does not require volatile. If a thread uses a cached value from an earlier read, it is working with slightly stale information, but it is still information from a valid state of the hash table at some point during the kernel launch. If you need the very latest information, you can use the volatile keyword, at a slight cost in performance: in my tests, deleting 32 million elements dropped from 500 million to 450 million deletions per second.
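For reference, a sketch of what forcing uncached reads would look like (an assumption about the change, not the repository's code): cast the table pointer to a volatile-qualified pointer before reading, so the compiler emits a real load on every probe.

// Read a slot's key through a volatile pointer so the compiler cannot reuse
// a previously loaded value.
__device__ uint32_t load_key_uncached(const KeyValue* hashtable, uint32_t slot)
{
    const volatile KeyValue* vtable = hashtable;
    return vtable[slot].key;
}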

Performance

In a test that inserts 64 million elements and then deletes 32 million of them, std::unordered_map is not really a competitor for the GPU hash table:

std::unordered_map spent 70 ms inserting and deleting the elements and then freeing the unordered_map (freeing millions of elements takes a while, because unordered_map performs many separate allocations). To be fair, std::unordered_map operates under very different constraints: it runs on a single CPU thread, supports keys and values of any size, performs well at high load factors, and shows consistent performance after many deletions.

The total time for the GPU hash table, including all communication with the GPU, was 984 ms. This includes the time spent allocating and freeing the table (a one-time 1 GB allocation, which takes a while in CUDA), inserting and deleting the elements, and iterating over them. All copies to and from the video card's memory are also counted.

The hash table itself accounted for 271 ms. This covers the time the GPU spent inserting and deleting elements, and excludes the time spent on memory copies and iterating over the resulting table. If the GPU table lives for a long time, or if the hash table resides entirely in GPU memory (for example, a hash table used by other GPU code rather than the CPU), then this 271 ms is the relevant measurement.
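For reference, the GPU-only number can be measured with CUDA events around the kernel launches; this is a sketch of the usual pattern, not the exact harness from the repository.

// Time only the insert kernel, excluding host<->device copies.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
gpu_hashtable_insert_kernel<<<gridsize, threadblocksize>>>(hashtable, kvs, numkvs);
cudaEventRecord(stop);

cudaEventSynchronize(stop);              // wait until the kernel has finished
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);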

The GPU hash table achieves its high performance thanks to the graphics card's enormous memory bandwidth and massive parallelism.

Disadvantages

The hash table architecture has several issues to keep in mind:

  • Linear probing suffers from clustering, which places keys far from their ideal slots.
  • Keys are not removed by the delete function and clutter up the table over time.

As a result, the performance of the hash table can gradually degrade, especially if it lives for a long time and experiences many inserts and deletes. One way to mitigate this is to rehash into a new table with a reasonably low load factor, filtering out the deleted keys during the rehash.

To illustrate these problems, I used the code above to create a table with 128 million slots and inserted elements in batches of 4 million until 124 million slots were filled (a load factor of about 0.96). Here is the resulting table; each row is one CUDA kernel launch that inserts 4 million new elements into the same hash table:

Load factor    Time to insert 4M elements
0.00           11.608448 ms (361.314798 million keys/sec)
0.03           11.751424 ms (356.918799 million keys/sec)
0.06           11.942592 ms (351.205515 million keys/sec)
0.09           12.081120 ms (347.178429 million keys/sec)
0.12           12.242560 ms (342.600233 million keys/sec)
0.16           12.396448 ms (338.347235 million keys/sec)
0.19           12.533024 ms (334.660176 million keys/sec)
0.22           12.703328 ms (330.173626 million keys/sec)
0.25           12.884512 ms (325.530693 million keys/sec)
0.28           13.033472 ms (321.810182 million keys/sec)
0.31           13.239296 ms (316.807174 million keys/sec)
0.34           13.392448 ms (313.184256 million keys/sec)
0.37           13.624000 ms (307.861434 million keys/sec)
0.41           13.875520 ms (302.280855 million keys/sec)
0.44           14.126528 ms (296.909756 million keys/sec)
0.47           14.399328 ms (291.284699 million keys/sec)
0.50           14.690304 ms (285.515123 million keys/sec)
0.53           15.039136 ms (278.892623 million keys/sec)
0.56           15.478656 ms (270.973402 million keys/sec)
0.59           15.985664 ms (262.379092 million keys/sec)
0.62           16.668673 ms (251.627968 million keys/sec)
0.66           17.587200 ms (238.486174 million keys/sec)
0.69           18.690048 ms (224.413765 million keys/sec)
0.72           20.278816 ms (206.831789 million keys/sec)
0.75           22.545408 ms (186.038058 million keys/sec)
0.78           26.053312 ms (160.989275 million keys/sec)
0.81           31.895008 ms (131.503463 million keys/sec)
0.84           42.103294 ms (99.619378 million keys/sec)
0.87           61.849056 ms (67.815164 million keys/sec)
0.90           105.695999 ms (39.682713 million keys/sec)
0.94           240.204636 ms (17.461378 million keys/sec)

As the load factor increases, performance drops, which is undesirable in most cases. If an application inserts elements into a table and then discards it (for example, when counting the words in a book), this is not a problem. But if the application uses a long-lived hash table (for example, a graphics editor storing the non-empty parts of images, where the user frequently inserts and deletes data), this behavior can be troublesome.

I also measured the probing depth after 64 million inserts (a load factor of 0.5). The average probe length was 0.4774, so most keys were either in their best possible slot or one slot away from it. The maximum probe length was 60.

I then measured the probing depth in a table with 124 million inserts (a load factor of 0.97). The average probe length was already 10.1757, and the maximum was 6474 (!!). Linear probing performance drops sharply at high load factors.

It is best to keep this hash table at a low load factor, but then we are trading memory for performance. Fortunately, with 32-bit keys and values this can be justified. If, in the example above, we keep the 128-million-slot table at a load factor of 0.25, we can store at most 32 million elements, and the remaining 96 million slots are wasted: at 8 bytes per pair, that is 768 MB of wasted memory.

Note that this is GPU memory, which is a more precious resource than system memory. Although most modern desktop GPUs that run CUDA have at least 4 GB of memory (at the time of writing, the NVIDIA RTX 2080 Ti has 11 GB), wasting that much of it is still not the wisest decision.

Later I will write more about building GPU hash tables that don't suffer from probing-depth problems, along with ways to reuse deleted slots.

Probing depth measurement

To determine how far a key has been probed from its ideal slot, we can subtract the hash of the key (its ideal table index) from its actual table index:

// get_key_index() -> index of key in hash table
uint32_t probelength = (get_key_index(key) - hash(key)) & (hashtablecapacity-1);

Thanks to the magic of two's complement arithmetic and the fact that the table's capacity is a power of two, this works even when the key's index has wrapped around to the beginning of the table. Take a key that hashed to 1 but was inserted into slot 3: for a table of capacity 4 we get (3 - 1) & 3, which is 2. In the wrapped case, a key that hashed to slot 3 but landed in slot 1 gives (1 - 3) & 3, which is also 2.
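A sketch of how such statistics can be gathered, under the assumption that the table has been copied back to the host and that hash_cpu() is a host-side copy of the hash function (both names are hypothetical; the repository may do this differently):

#include <stdint.h>
#include <stdio.h>

// Walk a host-side copy of the table and accumulate probe-length statistics.
void measure_probe_lengths(const KeyValue* host_table, uint32_t capacity)
{
    uint64_t total = 0, count = 0, maxlen = 0;
    for (uint32_t i = 0; i < capacity; i++)
    {
        uint32_t key = host_table[i].key;
        if (key == kEmpty)
            continue;
        // Distance from the key's ideal slot, wrapped with the power-of-two mask.
        uint32_t probelength = (i - hash_cpu(key)) & (capacity - 1);
        total += probelength;
        count += 1;
        if (probelength > maxlen)
            maxlen = probelength;
    }
    printf("average probe length %f, maximum %llu over %llu keys\n",
           count ? (double)total / (double)count : 0.0,
           (unsigned long long)maxlen, (unsigned long long)count);
}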

Conclusion

If you have any questions or comments, please contact me on Twitter or open an issue in the repository.

This code was inspired by several great articles.

In the future, I will continue to write about hash table implementations for GPUs and analyze their performance. My plans include chaining, Robin Hood hashing, and cuckoo hashing using atomic operations, built on GPU-friendly data structures.

Source: habr.com
