Efficient storage of hundreds of millions of small files. Self-hosted solution

Dear community, this article is devoted to the efficient storage and delivery of hundreds of millions of small files. It proposes a complete solution for POSIX-compatible file systems with full lock support, including cluster locks, and seemingly without any crutches.

For this purpose, I wrote my own specialized server.
While implementing this task, I managed to solve the main problem and, at the same time, save the disk space and RAM that our cluster file system was mercilessly consuming. Frankly, such a number of files is harmful to any clustered file system.

The idea is this:

In simple terms: small files uploaded through the server are saved directly into an archive and read back from it, while large files are stored alongside. The scheme is 1 folder = 1 archive, so in total we end up with several million archives of small files instead of several hundred million individual files. All of this works natively, without any scripts or packing and unpacking into tar/zip archives.

I will try to keep it short; I apologize in advance if the post still turns out lengthy.

It all started when I could not find any existing server that could save data received over HTTP directly into archives without the drawbacks inherent in conventional archives and object stores. The reason for the search was an Origin cluster of 10 servers that had grown to a large scale: 250,000,000 small files had already accumulated, and the growth trend was not going to stop.

For those who prefer a bit of documentation to reading articles:

here and here.

And a Docker image as well; for now the only option ships with nginx inside, just in case:

docker run -d --restart=always -e host=localhost -e root=/var/storage \
-v /var/storage:/var/storage --name wzd -p 80:80 eltaline/wzd

Next:

When there are a lot of files, significant resources are needed and, most annoyingly, some of them are wasted. For example, with a clustered file system (in this case MooseFS), a file always occupies at least 64 KB regardless of its actual size. That is, files of 3, 10, or 30 KB each take 64 KB on disk. With a quarter of a billion files, we waste 2 to 10 terabytes. It is also impossible to create new files indefinitely, since MooseFS has a limit of no more than 1 billion files with one replica of each file.
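
As a rough back-of-the-envelope sketch of where a range like 2-10 TB comes from (the average per-file waste values below are assumptions, not measurements):

package main

import "fmt"

func main() {
	const files = 250_000_000 // files in the Origin cluster
	// MooseFS allocates at least one 64 KB chunk per file, so a file smaller
	// than 64 KB wastes the difference. The real total depends on the size
	// distribution; the per-file waste averages below are assumed.
	for _, avgWasteKB := range []int{8, 40} {
		wasteTB := float64(files) * float64(avgWasteKB) * 1024 / 1e12
		fmt.Printf("average waste %2d KB per file -> ~%.1f TB in total\n", avgWasteKB, wasteTB)
	}
}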

As the number of files grows, a lot of RAM is needed for metadata, and frequent large metadata dumps accelerate the wear of SSD drives.

The wZD server: putting the disks in order.

The server is written in Go. First of all, I needed to reduce the number of files. How? By archiving, but in this case without compression, since my files are already compressed pictures. BoltDB came to the rescue, although it still had to be relieved of some shortcomings; this is reflected in the documentation.

As a result, instead of a quarter of a billion files, in my case only 10 million Bolt archives remain. If I had the opportunity to change the current directory layout, this could be reduced to about 1 million.

All small files are packed into Bolt archives, which automatically take the names of the directories they live in, while all large files stay next to the archives; there is no point in packing them, and this is configurable. Small files are archived, large ones are left unchanged, and the server works transparently with both.
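
As an illustration of the scheme only (not the actual wZD code: the bucket name, the archive naming, and the 1 MB threshold standing in for fmaxsize are assumptions), a per-directory Bolt archive can be filled roughly like this with the go.etcd.io/bbolt package:

package main

import (
	"os"
	"path/filepath"

	bolt "go.etcd.io/bbolt"
)

// maxArchiveSize is an illustrative threshold playing the role of fmaxsize.
const maxArchiveSize = 1 << 20 // 1 MB

// store puts a small file into the directory's Bolt archive and writes a
// large file next to it as a regular file.
func store(dir, name string, data []byte) error {
	if len(data) > maxArchiveSize {
		// Large file: keep it as a regular file next to the archive.
		return os.WriteFile(filepath.Join(dir, name), data, 0o644)
	}
	// Small file: the archive is named after the directory it lives in.
	archive := filepath.Join(dir, filepath.Base(dir)+".bolt")
	db, err := bolt.Open(archive, 0o644, nil)
	if err != nil {
		return err
	}
	defer db.Close()
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("files"))
		if err != nil {
			return err
		}
		return b.Put([]byte(name), data)
	})
}

func main() {
	if err := store("/var/storage/test", "test.jpg", []byte("...image bytes...")); err != nil {
		panic(err)
	}
}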

Architecture and features of the wZD server.

[Diagram: wZD server architecture]

The server runs on Linux, BSD, Solaris, and OSX. I have only tested the AMD64 architecture under Linux, but it should also work on ARM64, PPC64, and MIPS64.

Main features:

  • Multithreading;
  • Multiserver, providing fault tolerance and load balancing;
  • Maximum transparency for the user or developer;
  • Supported HTTP methods: GET, HEAD, PUT and DELETE;
  • Managing read and write behavior through client-side headers;
  • Support for highly configurable virtual hosts;
  • Support for CRC data integrity when writing / reading;
  • Semi-dynamic buffers for minimal memory consumption and optimal network performance tuning;
  • Delayed data compaction;
  • In addition, a multi-threaded wZA archiver is offered to migrate files without stopping the service.

Real Experience:

I have been developing and testing the server and archiver on live data for quite a long time. It is now successfully operating on a cluster holding 250,000,000 small files (pictures) spread across 15,000,000 directories on separate SATA disks. This cluster of 10 servers is an Origin installed behind a CDN network and is served by 2 nginx servers + 2 wZD servers.

If you decide to use this server, it makes sense to plan the directory structure in advance, where applicable. Let me note right away that the server is not designed to cram everything into a single Bolt archive.

Performance testing:

The smaller the archived file, the faster GET and PUT operations on it. Let's compare the total time an HTTP client needs to write to regular files and to Bolt archives, and likewise to read from them. The comparison covers files of 32 KB, 256 KB, 1024 KB, 4096 KB, and 32768 KB.

When working with Bolt archives, the data integrity of each file is verified with a CRC: it is calculated before writing, and after writing the data is read back on the fly and the checksum recalculated. This naturally adds latency, but the main thing is data safety.
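
The article does not specify the exact CRC variant, so here is a minimal sketch assuming CRC-32 (IEEE) of what the check on write and re-read amounts to:

package main

import (
	"fmt"
	"hash/crc32"
)

// verify recomputes the checksum of data read back after writing and
// compares it with the checksum calculated before the write.
func verify(data []byte, want uint32) error {
	if got := crc32.ChecksumIEEE(data); got != want {
		return fmt.Errorf("crc mismatch: got %08x, want %08x", got, want)
	}
	return nil
}

func main() {
	payload := []byte("...file contents...")
	sum := crc32.ChecksumIEEE(payload) // calculated before writing

	// ... write payload to the archive, then read it back on the fly ...
	readBack := payload

	if err := verify(readBack, sum); err != nil {
		panic(err)
	}
	fmt.Println("integrity ok")
}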

I ran performance tests on SSDs, as tests don't show a clear difference on SATA drives.

Graphs based on test results:

[Graphs: read and write times for 32 KB - 4096 KB files, regular files vs. Bolt archives]

As you can see, for small files, the difference in read and write times between archived and non-archived files is small.

We get a completely different picture when testing reads and writes of 32 MB files:

[Graph: read and write times for 32 MB files, regular files vs. Bolt archives]

The difference in read time is within 5-25 ms. Writing is worse: the difference is about 150 ms. But in this case there is no need to put large files into archives anyway; it simply makes no sense, since they can live separately from the archives.

*Technically, this server can also be used for tasks requiring NoSQL.

Basic methods of working with the wZD server:

Loading a regular file:

curl -X PUT --data-binary @test.jpg http://localhost/test/test.jpg

Uploading a file into a Bolt archive (provided the fmaxsize server parameter, which defines the maximum file size allowed into an archive, is not exceeded; if it is exceeded, the file is stored as usual next to the archive):

curl -X PUT -H "Archive: 1" --data-binary @test.jpg http://localhost/test/test.jpg

Downloading a file (if files with the same name exist both on disk and in the archive, the unarchived file takes priority by default):

curl -o test.jpg http://localhost/test/test.jpg

Downloading a file from a Bolt archive (forced):

curl -o test.jpg -H "FromArchive: 1" http://localhost/test/test.jpg
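
The same pair of operations from Go, as a minimal sketch using the standard net/http client against the endpoints and headers from the curl examples above:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Upload test.jpg into the Bolt archive of the /test directory.
	data, err := os.ReadFile("test.jpg")
	if err != nil {
		panic(err)
	}
	put, _ := http.NewRequest(http.MethodPut, "http://localhost/test/test.jpg", bytes.NewReader(data))
	put.Header.Set("Archive", "1")
	resp, err := http.DefaultClient.Do(put)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	// Read the file back, forcing the copy stored in the archive.
	get, _ := http.NewRequest(http.MethodGet, "http://localhost/test/test.jpg", nil)
	get.Header.Set("FromArchive", "1")
	resp, err = http.DefaultClient.Do(get)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, len(body), "bytes")
}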

Other methods are described in the documentation.

wZD Documentation
wZA Documentation

So far the server only supports plain HTTP; it does not work with HTTPS yet. The POST method is also not supported (it has not yet been decided whether it is needed).

Whoever digs into the source code will find a little treat there; not everyone will like it, but I did not tie the main code to the functions of the web framework, except for the interrupt handler, so in the future I can quickly port it to almost any engine.

To do:

  • Development of my own replicator and distributor + geo distribution, to allow use in large systems without a clustered FS (everything done properly)
  • Ability to fully restore metadata in case of its complete loss (when using a distributor)
  • A native protocol for persistent network connections, plus drivers for different programming languages
  • Extended capabilities of the NoSQL component
  • Compression of different types (gzip, zstd, snappy) for files or values inside Bolt archives and for regular files
  • Encryption of different types for files or values inside Bolt archives and for regular files
  • Delayed server-side video conversion, including on the GPU

That's all from me. I hope this server proves useful to someone. It is under the BSD-3 license with dual copyright, because without the company I work for the server would not have been written. I am the only developer; I would be grateful for bug reports and feature requests.

Source: habr.com
