Architecture for storing and serving photos at Badoo

Artem Denisov (bo0rsh201, Badoo)

Badoo is the world's largest dating site. We currently have about 330 million registered users worldwide. But what matters much more in the context of today's talk is that we store about 3 petabytes of user photos. Every day our users upload about 3.5 million new photos, and the read load is about 80 thousand requests per second. This is quite a lot for our backend, and it sometimes causes difficulties.

I'll talk about the overall design of this system, which stores and serves photos, and look at it from a developer's point of view. There will be a brief retrospective of how it evolved, where I outline the main milestones, but I'll go into detail only about the solutions we are using right now.

Now let's get started.


As I said, this will be a retrospective, and in order to start it somewhere, let's take the most common example.

We have a common task: we need to accept, store and serve user photos. Stated this way, the task is generic, and we could use almost anything:

  • modern cloud storage;
  • one of the many boxed solutions available now;
  • a few machines in our own data center with large hard drives that store the photos.

Badoo has historically lived on its own servers inside its own data centers, both now and back then, when the project was just getting started. So the last option was the best fit for us.

We just took several machines, called them "photos", and got a cluster that stores photos. But something seems to be missing: for all this to work, we need to decide on which machine we store which photos. And here, too, there is no need to reinvent the wheel.

We add a field to the user record in our storage; this is the sharding key. In our case we called it place_id, and it points to the place where the user's photos are stored. On top of it we maintain a mapping.

At the first stage this can even be done manually: we say that photos of users with such-and-such a place land on such-and-such a server. Thanks to this map, whenever a user uploads a photo we always know where to save it, and we know where to serve it from.

This is an absolutely trivial scheme, but it has quite significant advantages. The first is that it is simple, as I said, and the second is that with this approach we can easily scale horizontally by simply adding new machines and putting them on the map. You don't need to do anything else.
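To make this concrete, the map can be as simple as a dictionary keyed by place_id. The sketch below is only an illustration; the hostnames and helper names are made up, not our actual code:

```python
# Minimal sketch of the place_id mapping idea (names are illustrative):
# the map says which storage server owns which "place", and every
# upload or read consults it.

PLACE_MAP = {
    1: "photos-1.example.local",
    2: "photos-2.example.local",
    3: "photos-3.example.local",
}

def storage_host_for(place_id: int) -> str:
    """Return the storage server that holds photos for this place_id."""
    return PLACE_MAP[place_id]

def add_server(place_id: int, host: str) -> None:
    """Horizontal scaling: a new machine is just a new entry in the map."""
    PLACE_MAP[place_id] = host
```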

That's how it was for us for some time.

This was around 2009. We kept adding machines and adding machines...

And at some point we began to notice that this scheme has certain disadvantages. What are the disadvantages?

First of all, there is limited capacity: we cannot cram as many hard drives into one physical server as we would like, and over time, as the dataset grew, this became a problem.

Second, it is an atypical machine configuration: such machines are hard to reuse in other clusters, since they are quite specific, i.e. low on CPU but with large hard drives.

That was 2009, but in principle these constraints are still relevant today. And since this is a retrospective: in 2009 things were really bad in this respect.

And the last point is the price.

The price was very steep at that time, and we needed to look for alternatives, i.e. to make better use of both the space in the data centers and the physical servers everything runs on. Our system engineers started a large study in which they reviewed a bunch of different options. They looked at clustered file systems such as Ceph and Lustre, but there were performance problems and operation was quite difficult, so those were rejected. We tried mounting the entire dataset via NFS on every machine to scale it somehow, but reads performed poorly there too. We tried different solutions from different vendors.

And in the end, we settled on using the so-called Storage Area Network.

These are large storage arrays specifically designed for storing large amounts of data. They are shelves of disks that are mounted over optics onto the machines that actually serve the content. So we have a fairly small pool of machines plus these storage arrays, which are transparent to our serving logic, i.e. to nginx or whatever else handles requests for these photos.

This solution had obvious advantages. The storage array is built for storing large volumes of data, and it works out cheaper than simply stuffing machines with hard drives.

The second plus is that capacity became much larger, i.e. we can fit much more storage into a much smaller physical footprint.

But there were also disadvantages, and they emerged quite quickly. As the number of users and the load on this system grew, performance problems began to appear. The problem here is quite obvious: any storage array designed to pack many photos into a small volume usually suffers under intensive reads. This is actually true for any cloud storage as well. Today there is no ideal storage that is infinitely scalable, lets you stuff anything into it, and still tolerates reads very well. Especially random reads.

And that is exactly our case, because photos are requested in a non-sequential pattern, and that has a big impact on performance.

Even by today's figures, once we get more than about 500 RPS for photos on a machine with attached storage, problems begin. And that was bad enough for us: the number of users keeps growing, so things would only get worse. This had to be optimized somehow.

To optimize, we decided at that time, naturally, to look at the load profile first: what is actually going on and what needs optimizing.

And here everything plays into our hands.

I already said on the first slide: we have 80 thousand read requests per second with only 3.5 million uploads per day. That is a difference of about three orders of magnitude. It is obvious that reading needs to be optimized, and it is practically clear how.

There is one more small point. The nature of the service is such that a person registers, uploads a photo, then starts actively looking at other people, liking them, and being actively shown to other people. Then he finds a match or doesn't, depending on how it turns out, and stops using the service for a while. While he is active, his photos are very hot: they are in demand and viewed by a lot of people. As soon as he stops, he quickly drops out of other people's feeds, and his photos are almost never requested.

In other words, we have a very small hot dataset, but a lot of requests for it. The obvious solution here is to add a cache.

A cache with LRU eviction will solve all our problems. So what do we do?

In front of our large storage cluster we add another, relatively small cluster called photo caches. It is essentially just a caching proxy.

How does it work from the inside? Here is our user, here is storage. Everything is the same as before. What do we add in between?

It is just a machine with a fast local physical disk, an SSD for example. And a local cache is stored on this disk.

What does it look like? The user sends a request for a photo. NGINX looks for it in the local cache first. If it is not there, it simply does a proxy_pass to our storage, downloads the photo from there and serves it to the user.

But that description is very superficial and does not explain what happens inside. It works something like this.

The cache is logically divided into three layers. When I say "three layers", this does not mean there is some complex system; essentially, these are just three directories in the file system:

  1. A buffer, where photos that have just been downloaded from the upstream storage land.
  2. A hot cache, which stores photos that are actively being requested right now.
  3. A cold cache, where photos are gradually pushed out of the hot cache as requests for them become rarer.

For this to work, we need to somehow manage this cache, we need to rearrange the photos in it, etc. This is also a very primitive process.

Nginx simply writes an access.log to a RAM disk; for each request it records the path of the photo it has just served (a relative path, of course) and the partition it was served from. I.e. it may say "photo 1" and then either buffer, hot cache, cold cache, or proxy.

Depending on this, we need to somehow decide what to do with the photo.

We have a small daemon running on each machine that constantly reads this log and stores statistics on the use of certain photographs in its memory.

It simply keeps counters and periodically does the following: it moves actively requested photos, the ones with many requests, into the hot cache, wherever they currently are.

Photos that are requested rarely, or have started to be requested less frequently, are gradually pushed out of the hot cache into the cold one.

And when we run out of space in the cache, we simply start deleting everything from the cold cache indiscriminately. And by the way, this works well.

To save a photo into the buffer right away while proxying it, we use the proxy_store directive; the buffer is also a RAM disk, so for the user it works very fast. So much for the internals of the caching server itself.
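Roughly speaking, the daemon's behaviour can be modelled with the Python sketch below. The paths, thresholds and log format are assumptions for illustration; the production daemon is different, but the idea is the same:

```python
import os
import shutil
import time
from collections import Counter

# Assumed layout and log format ("<photo_path> <partition>"); the real daemon,
# paths and thresholds differ. This only models the behaviour described above.
ACCESS_LOG = "/var/ramdisk/access.log"
BUFFER, HOT, COLD = "/cache/buffer", "/cache/hot", "/cache/cold"
HOT_THRESHOLD = 100          # requests per period to count as "hot"
MAX_COLD_BYTES = 500 << 30   # start evicting when the cold cache grows past this

def move(photo: str, src: str, dst: str) -> None:
    src_path, dst_path = os.path.join(src, photo), os.path.join(dst, photo)
    if os.path.exists(src_path):
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        shutil.move(src_path, dst_path)

def evict_if_needed() -> None:
    # when space runs out, delete from the cold cache indiscriminately
    paths = [os.path.join(root, f) for root, _, files in os.walk(COLD) for f in files]
    total = sum(os.path.getsize(p) for p in paths)
    for path in paths:
        if total <= MAX_COLD_BYTES:
            break
        total -= os.path.getsize(path)
        os.remove(path)

def rebalance(counters: Counter) -> None:
    for photo, hits in counters.items():
        if hits >= HOT_THRESHOLD:
            for layer in (BUFFER, COLD):      # promote, wherever the photo is now
                move(photo, layer, HOT)
        else:
            move(photo, HOT, COLD)            # demote rarely requested photos
    evict_if_needed()

def main() -> None:
    counters: Counter = Counter()
    with open(ACCESS_LOG) as log:
        log.seek(0, os.SEEK_END)              # tail the log from its current end
        while True:
            line = log.readline()
            if not line:
                rebalance(counters)
                counters.clear()
                time.sleep(5)
                continue
            parts = line.split()
            if len(parts) >= 2:
                counters[parts[0]] += 1       # parts[0] = photo path, parts[1] = partition

if __name__ == "__main__":
    main()
```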

The remaining question is how to distribute requests across these servers.

Let's say there is a cluster of twenty storage machines and three caching servers (this is how it happened).

We need to somehow determine which requests are for which photos and where to land them.

The most obvious option is Round Robin. Or just picking a server at random?

This obviously has a number of disadvantages, because in such a scheme we would use the cache very inefficiently. Requests land on random machines: a photo is cached on one, but the next request lands on another where it is not. If this works at all, it will work very badly, even with a small number of machines in the cluster.

We need to somehow unambiguously determine which server to land which request.

There is a trivial way: take a hash of the URL, or of our sharding key, which is part of the URL, and take it modulo the number of servers. Will it work? It will.

I.e. a request for some "example_url", for example, will always land on the server with index 2, and the cache will be utilized as well as possible.

But such a scheme has a problem with resharding. By resharding I mean changing the number of servers.

Let's assume that our caching cluster can no longer cope and we decide to add another machine.

Let's add.

Now everything is taken modulo four instead of three. As a result, almost all the keys we used to have, almost all the URLs, now live on other servers. The entire cache is effectively invalidated in an instant, all requests fall onto our storage cluster, it gets sick, the service degrades and users are unhappy. We don't want that.

This option doesn't suit us either.
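To make the problem concrete, here is a tiny illustration (with made-up URLs) of how many keys change servers when a naive hash-modulo scheme goes from three machines to four:

```python
from hashlib import md5

def server_for(url: str, n_servers: int) -> int:
    # naive scheme: hash of the URL modulo the number of servers
    return int(md5(url.encode()).hexdigest(), 16) % n_servers

urls = [f"/photos/{i}.jpg" for i in range(10_000)]
moved = sum(server_for(u, 3) != server_for(u, 4) for u in urls)
print(f"{moved / len(urls):.0%} of keys moved")   # typically ~75%, i.e. the cache is gone
```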

So what should we do? We need to use the cache efficiently, landing the same request on the same server over and over again, while being resilient to resharding. And there is such a solution, and it is not that complicated: consistent hashing.

What does it look like?

We take some hash function of the sharding key and spread all its values around a circle, i.e. at point 0 its minimum and maximum values meet. Then we place all our servers on the same circle, roughly like this:

Each server is defined by one point, and the sector that leads clockwise up to it is served by that host. When requests come in, we immediately see that, for example, request A (by its hash) is served by server 2, request B by server 3, and so on.

What happens in this situation during resharding?

We do not invalidate the entire cache as before, and we do not move all the keys; instead we shift each sector a short distance, so that, relatively speaking, the sixth server we want to add fits into the freed space, and we add it there.

Of course, keys still move in this situation, but far fewer of them than before. We see that our first two keys stayed on their servers, and the caching server changed only for the last key. This works quite efficiently, and if you add new hosts incrementally there is no big problem: you add a little at a time, wait until the cache warms up again, and everything works well.
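For comparison, a bare-bones consistent hashing ring might look like the sketch below. Real implementations also place many virtual points per server to even out the sectors; this is only meant to show the idea:

```python
import bisect
from hashlib import md5

def _hash(key: str) -> int:
    return int(md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        self._points = sorted((_hash(s), s) for s in servers)
        self._keys = [p for p, _ in self._points]

    def add(self, server: str) -> None:
        # only the sector adjacent to the new point changes owners
        point = _hash(server)
        idx = bisect.bisect(self._keys, point)
        self._keys.insert(idx, point)
        self._points.insert(idx, (point, server))

    def server_for(self, url: str) -> str:
        # first server point clockwise from the key's position on the circle
        idx = bisect.bisect(self._keys, _hash(url)) % len(self._keys)
        return self._points[idx][1]

ring = Ring(["cache1", "cache2", "cache3"])
before = {u: ring.server_for(u) for u in (f"/photos/{i}.jpg" for i in range(10_000))}
ring.add("cache4")
moved = sum(ring.server_for(u) != s for u, s in before.items())
print(f"{moved / len(before):.0%} of keys moved")   # far less than with hash % N
```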

The only remaining question is failures. Let's assume some machine goes out of order.

And we wouldn't really want to regenerate this map at that moment, invalidate part of the cache and so on if, for example, the machine was simply rebooted, but we still need to serve requests somehow. We simply keep one backup photo cache at each site, which acts as a replacement for any machine that is currently down. If one of our servers suddenly becomes unavailable, traffic goes there. Naturally, there is no cache there, i.e. it is cold, but at least user requests are being served. If the outage is short, we live through it completely calmly: there is just a bit more load on storage. If it is long, we can then decide whether to remove the server from the map, or perhaps replace it with another one.

This is about the caching system. Let's look at the results.

It would seem there is nothing complicated here, but this way of managing the cache gives us a hit rate of about 98%. I.e. out of those 80 thousand requests per second only 1,600 reach storage, which is a completely normal load; the storage handles it calmly and we always have headroom.

We placed these servers in three of our DCs and got three points of presence: Prague, Miami and Hong Kong.

So they are located more or less close to each of our target markets.

As a nice bonus, we got a caching proxy whose CPU is mostly idle, because serving content does not need much of it. So there, using NGINX + Lua, we implemented a lot of utility logic.

For example, we can experiment with WebP or progressive JPEG (efficient modern formats), see how that affects traffic, make decisions, enable it for certain countries, and so on; or resize and crop photos dynamically on the fly.

This is a good use case when, for example, you have a mobile application that displays photos and does not want to waste the client's CPU on requesting a large photo and then resizing it to fit the view. We can simply pass some parameters in the URL, relatively speaking, and the photo cache will resize the photo itself. As a rule, it picks the size we physically have on disk that is closest to the requested one and downscales it to the exact dimensions.
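As an illustration of that resize logic, here is a hedged sketch using Pillow. The stored size list and the parameters are made up, and the real logic runs inside NGINX + Lua rather than in Python:

```python
from io import BytesIO
from PIL import Image  # pip install Pillow

# Sizes we keep physically on disk (illustrative values, not the real ones).
STORED_WIDTHS = [160, 320, 1280]

def pick_source_width(requested: int) -> int:
    """Pick the smallest stored size that is still >= the requested one."""
    for w in STORED_WIDTHS:
        if w >= requested:
            return w
    return STORED_WIDTHS[-1]

def resize_on_the_fly(stored_bytes: bytes, width: int, height: int) -> bytes:
    """Downscale a stored rendition to the exact dimensions asked for in the URL."""
    img = Image.open(BytesIO(stored_bytes))
    img = img.resize((width, height), Image.LANCZOS)
    out = BytesIO()
    img.save(out, format="JPEG", quality=85)
    return out.getvalue()
```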

We can also add a lot of product logic there. For example, via URL parameters we can add different watermarks, or blur or pixelate photos. This is for when we want to show a person's photo but not their face; it works well, and it is all implemented here.

What did we get? Three points of presence, a good hit rate, and the CPU on these machines no longer sits idle. It has, of course, become more important than before: we have to provision more powerful machines, but it is worth it.

That covers serving photos. Everything here is quite clear and obvious; I don't think I invented anything new, almost any CDN works this way.

Most likely, an attentive listener will ask: why not just switch everything to a CDN? It would be about the same, all modern CDNs can do this. But there are a number of reasons not to.

The first reason is the photos themselves.

They are one of the key parts of our infrastructure, and we need as much control over them as possible. If it is some third-party vendor's solution over which you have no power, it is quite hard to live with when you have a large dataset and a very large flow of user requests.

Let me give you an example. With our own infrastructure, if there are problems or strange, unexplained glitches, we can go to the machine and poke around there, relatively speaking. We can add collection of metrics that only we need, experiment and see how this affects the graphs, and so on. A lot of statistics are collected on this caching cluster; we look at them periodically and sometimes spend a long time investigating anomalies. If this were on the CDN side, it would be much harder to control. Or, if some kind of incident occurs, we know what happened, how to live with it and how to overcome it. This is the first argument.

The second argument is largely historical: the system has been evolving for a long time, there were many different business requirements at different stages, and they do not always fit into the CDN concept.

And the next point follows from the previous one:

we have a lot of specific logic on the photo caches that cannot always be added on demand. It is unlikely that a CDN will add custom features just because you ask. For example, signing or encrypting URLs so that the client cannot change anything: you want to build and encrypt the URL on the server and then pass dynamic parameters in it.
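For example, URL signing so that the client cannot tamper with the parameters might look roughly like this. This is an illustrative sketch; the actual scheme and parameter names are not ours:

```python
import hashlib
import hmac

SECRET = b"shared-secret-known-to-backend-and-photo-cache"  # illustrative value

def sign_photo_url(path: str, width: int, height: int) -> str:
    """Built on the application server: the parameters plus an HMAC over them."""
    payload = f"{path}?w={width}&h={height}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{payload}&sig={sig}"

def verify_photo_url(payload: str, sig: str) -> bool:
    """Checked on the photo cache before doing any dynamic resize or crop."""
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(expected, sig)
```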

What conclusion does this suggest? In our case, a CDN is not a very good alternative.

In your case, if you have specific business requirements, you can quite easily implement what I have shown yourself, and it will work perfectly with a similar load profile.

But if your task is generic and not very specific, you can safely take a CDN. Or if time and resources matter more to you than control.

Modern CDNs have almost everything I have just told you about, give or take a few features.

That is it for serving photos.

Let's now move forward a little in our retrospective and talk about storage.

It was around 2013.

We added caching servers, the performance problems went away, and everything was fine, but the dataset kept growing. As of 2013 we had about 80 servers connected to storage and about 40 caching servers in each DC. That is 560 terabytes of data per DC, i.e. about a petabyte in total.

And with the growth of the dataset, operating costs began to rise significantly. What did this mean?

In the diagram drawn here, with a SAN and the machines and caches connected to it, there are a lot of points of failure. We had already dealt with failures of caching servers, where everything was more or less predictable and understandable, but on the storage side everything was much worse.

Firstly, the Storage Area Network (SAN) itself, which can fail.

Secondly, it is connected via optics to the end machines. There can be problems with the optical cards and the fiber links themselves.

Of course, there are not as many of them as with the SAN itself, but, nevertheless, these are also points of failure.

Next is the machine itself, which is connected to the storage. It can also fail.

In total, we have three points of failure.

Besides the points of failure, there is also the heavy maintenance of the storage itself.

This is a complex multi-component system, and systems engineers can have a hard time working with it.

And the last, most important point. If a failure occurs at any of these three points, we have a non-zero chance of losing user data, because the file system may become corrupted.

Let's say the file system gets corrupted. First, recovery takes a long time; with a large amount of data it can take a week. Second, in the end we will most likely be left with a bunch of unidentifiable files that somehow need to be matched back to user photos. So we risk losing data, and the risk is quite high: the more often such situations happen, and the more problems arise anywhere in this chain, the higher it gets.

Something had to be done about this. And we decided that we just need to backup the data. This is actually an obvious solution and a good one. What have we done?

This is what a server connected to storage looked like before: one main partition, which is just a block device that is actually a mount of remote storage over optics.

We just added a second partition.

We placed a second storage array next to it (fortunately, it is not that expensive in terms of money) and called it the backup partition. It is also connected via optics and sits on the same machine. But we need to synchronize the data between them somehow.

For that we simply set up an asynchronous queue alongside.

It is not very busy; as we know, we do not have many writes. The queue is just a table in MySQL into which rows like "this photo needs to be backed up" are written. On any change or upload, we copy the file from the main partition to the backup one with an asynchronous, or simply background, worker.

Thus we always have two consistent partitions. Even if one part of this system fails, we can always swap the main partition with the backup, and everything keeps working.
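A background worker for such a queue can be as simple as the sketch below. The table and column names are invented for illustration, and PyMySQL is used just as an example client:

```python
import os
import shutil
import time

import pymysql  # pip install PyMySQL; any MySQL client would do

MAIN, BACKUP = "/storage/main", "/storage/backup"   # illustrative mount points

conn = pymysql.connect(host="localhost", user="photos", password="secret", db="photos")

def process_batch() -> None:
    with conn.cursor() as cur:
        # rows of the form "this photo needs to be backed up"
        cur.execute("SELECT id, photo_path FROM photo_backup_queue LIMIT 100")
        for row_id, photo_path in cur.fetchall():
            dst = os.path.join(BACKUP, photo_path)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.copy2(os.path.join(MAIN, photo_path), dst)   # main -> backup
            cur.execute("DELETE FROM photo_backup_queue WHERE id = %s", (row_id,))
    conn.commit()

while True:
    process_batch()
    time.sleep(1)
```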

But this greatly increases the read load on the main partition: besides the clients, who read from the main partition first (the photo is more recent there) and only fall back to the backup if they don't find it (NGINX just does this for us), the backup process itself now also reads from the main partition. It's not that this was a bottleneck, but we didn't want to increase the load for no good reason.

So we added a third disk, a small SSD, and called it a buffer.

How it works now.

The user uploads a photo to the buffer, then an event is put into the queue saying that it needs to be copied to both partitions. It is copied, the photo lives on the buffer for some time (say, a day), and only then is purged from there. This greatly improves the user experience: as a rule, requests for a photo start coming right after it is uploaded, or the user himself refreshes the page, although it all depends on the application doing the upload.

Or, for example, other people to whom he has started being shown immediately request this photo. It is not yet in the cache, but the first request is served very quickly, essentially as fast as from a photo cache; slow storage is not involved at all. By the time it is purged a day later, it is either already cached on our caching layer or, most likely, nobody needs it anymore. So the user experience improved a lot thanks to these simple manipulations.
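Putting the buffer and the queue together, the upload path looks roughly like this simplified sketch (the paths, the retention period and the enqueue helper are invented for illustration):

```python
import os
import time

BUFFER = "/buffer"           # small fast SSD partition (illustrative path)
BUFFER_TTL = 24 * 3600       # keep freshly uploaded photos here for about a day

def handle_upload(photo_name: str, data: bytes, enqueue) -> None:
    """Write the photo to the fast buffer and schedule async copies to both partitions."""
    path = os.path.join(BUFFER, photo_name)
    with open(path, "wb") as f:
        f.write(data)
    # `enqueue` stands in for inserting a row into the MySQL queue table
    enqueue(photo_name, targets=("main", "backup"))

def purge_old_buffer_files() -> None:
    """Periodic job: by now the photo is either in the caching layer or no longer needed."""
    now = time.time()
    for name in os.listdir(BUFFER):
        path = os.path.join(BUFFER, name)
        if now - os.path.getmtime(path) > BUFFER_TTL:
            os.remove(path)
```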

Well, and most importantly: we stopped losing data.

Let's say, rather, that we stopped potentially losing data, because we never actually lost any; but the danger was there. This solution is good, of course, but it is a bit like smoothing out the symptoms of the problem instead of solving it completely. Some problems remain.

Firstly, this is a point of failure in the form of the physical host itself on which all this machinery runs; it has not gone away.

Secondly, the problems with SANs, their heavy maintenance and so on, remain. It wasn't a critical factor, but we wanted to try to live without them somehow.

So we made the third version (the second one, in fact) of the redundancy scheme. What did it look like?

Here is what we had.

Our main problem is that all of this runs on a single physical host.

First, we remove the SAN, because we want to experiment: we want to try plain local hard drives.

By now it is 2014-2015, and by that time the situation with disks and how much capacity fits into one host had become much better. We decided: why not try it.

And then we simply take our backup partition and physically transfer it to a separate machine.

This gives us the following scheme: two machines storing the same dataset. They back each other up completely and synchronize data over the network through an asynchronous queue in the same MySQL.

The reason this works well is that we have few writes. If writes were comparable to reads, we would probably run into network overhead and other problems. But with little writing and a lot of reading, this method works well: we copy photos between these two servers relatively rarely.

Here is how it works in a bit more detail.

Upload: the balancer simply picks a random host of the pair and uploads to it. Naturally, it does health checks and makes sure the machine hasn't dropped out, i.e. it uploads photos only to a live server, and then everything is copied to its neighbor through the asynchronous queue. With upload everything is extremely simple.

Reading is a little trickier.

Lua helped us here, because such logic is hard to express in vanilla NGINX. We first make a request to the first server and check whether the photo is there, because it could potentially have been uploaded to the neighbor and not yet arrived here. If the photo is there, that's good: we immediately serve it to the client and possibly cache it.

If it is not there, we simply make a request to our neighbor and are guaranteed to receive it from there.
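The pair logic, which in production lives in NGINX + Lua, boils down to something like the following Python sketch; the hostnames and the health check endpoint are illustrative:

```python
import random

import requests  # pip install requests

PAIR = ["photo-a.example.local", "photo-b.example.local"]  # two mirrored machines

def alive(host: str) -> bool:
    try:
        return requests.get(f"http://{host}/ping", timeout=0.2).ok
    except requests.RequestException:
        return False

def upload(path: str, data: bytes) -> str:
    """Pick a random live host of the pair; the async queue copies to its neighbour."""
    host = random.choice([h for h in PAIR if alive(h)])
    requests.put(f"http://{host}{path}", data=data, timeout=2).raise_for_status()
    return host

def fetch(path: str) -> bytes:
    """Try one host first; if the photo has not replicated there yet, ask the neighbour."""
    first, second = random.sample(PAIR, 2)
    resp = requests.get(f"http://{first}{path}", timeout=2)
    if resp.status_code == 404:
        resp = requests.get(f"http://{second}{path}", timeout=2)
    resp.raise_for_status()
    return resp.content
```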

Again, you could say there might be performance problems because of the constant round trips: the photo was uploaded, it is not here yet, we make two requests instead of one, so it should be slow.

In our situation, this does not work slowly.

We collect a bunch of metrics on this system, and the hit rate of this mechanism, so to speak, is about 95%. I.e. the replication lag is small, so after a photo has been uploaded we almost always find it on the first try and don't have to make the second request.

So what else have we got that's really cool?

Previously we had the main and backup partitions and read from them sequentially: we always looked on the main one first and only then on the backup, so effectively we had a single read path.

Now we read from both machines at once, distributing requests via Round Robin. In a small percentage of cases we make two requests, but overall we now have twice the read headroom we had before. And the load dropped a lot, both on the serving machines and directly on the storage arrays we still had at that time.

As for fault tolerance, which is what we were mainly fighting for, everything turned out great.

One machine breaks down.

No problem! A system engineer may not even wake up at night, he’ll wait until the morning, nothing bad will happen.

Even if the queue on the failed machine is also out of order, there is no problem either: the log of changes simply accumulates on the live machine first, then it is added to the queue and applied to the machine that comes back into operation some time later.

The same goes for maintenance. We simply switch off one of the machines, manually pull it out of all the pools, it stops receiving traffic, we do the maintenance, edit something, then return it to service, and the backup catches up quite quickly: a day's worth of downtime on one machine is caught up within a couple of minutes. That is really very little. With fault tolerance, I repeat, everything is great here.

What conclusions can be drawn from this redundancy scheme?

We got fault tolerance.

Easy to use. Since the machines have local hard drives, this is much more convenient from an operational point of view for the engineers who work with it.

We got double the read headroom.

This is a very good bonus in addition to fault tolerance.

But there are also problems. Developing certain features has become much more complicated, because the system is now 100% eventually consistent.

In some background job, say, we must constantly ask: "Which server are we running against now?", "Is the current version of the photo really here?" and so on. Of course, all of this is wrapped up, and for the programmer writing business logic it is transparent; nevertheless, this large and complex layer has appeared. But we are ready to put up with it in exchange for the benefits we got.

And here again some conflict arises.

I said at the beginning that storing everything on local hard drives is bad. And now I say that we liked it.

Yes, indeed, over time the situation has changed a lot, and now this approach has many advantages. Firstly, we get much simpler operation.

Secondly, it performs better, because there are no extra controllers or connections to disk shelves.

Over there, there is a huge amount of machinery; here, it is just a few disks assembled into a RAID right on the machine.

But there are also disadvantages.

It is roughly 1.5 times more expensive than using SANs, even at today's prices. So we decided not to boldly convert our entire large cluster to machines with local hard drives and went with a hybrid solution instead.

Part of our machines work with local hard drives (not half, more like 30 percent). The rest are the old machines that used to run the first redundancy scheme. We simply remounted them: since we did not need to migrate any data, we just moved the mounts from one physical host to two.

And now we have a large read headroom, which we have expanded: where we used to mount one storage array per machine, we now mount four, for example, onto one pair. And it works fine.

Let's briefly sum up what we accomplished, what we were fighting for, and whether we succeeded.

Results

We have users - as many as 33 million.

We have three points of presence - Prague, Miami, Hong Kong.

They contain the caching layer, which consists of machines with fast local disks (SSDs), running simple machinery: NGINX, its access.log, and Python daemons that process all this and manage the cache.

In your own project, if photos are not as critical for you as they are for us, or if the trade-off between control and development speed or resource cost leans the other way for you, you can safely replace this layer with a CDN; modern CDNs do this well.

Next comes the storage layer, where we have clusters of machine pairs that back each other up; files are asynchronously copied from one to the other whenever they change.

Moreover, some of these machines work with local hard drives.

Some of these machines are connected to SANs.

On the one hand, the former are more convenient to operate and a little faster; on the other hand, the latter win on placement density and price per gigabyte.

That is a brief overview of the architecture we ended up with and how it all developed.

A few more Captain Obvious tips, very simple ones.

First, if you suddenly decide that you urgently need to improve everything in your photo infrastructure, measure first, because perhaps nothing needs to be improved.

Let me give you an example. We have a cluster of machines that serves photos attached in chats; the scheme there has been working since 2009, and nobody suffers from it. Everyone is fine with it.

To measure, first add a bunch of metrics, look at them, and then decide what you are unhappy with and what needs improving. For measuring this we have a great tool called Pinba.

It lets you collect very detailed statistics from NGINX per request: response codes, timing distributions, whatever you want. It has bindings to all sorts of analytics systems, so you can then view all of it nicely.

First we measured it, then we improved it.

Next: we optimize reads with a cache and writes with sharding, but this is an obvious point.

Next: if you are only now starting to build your system, it is much better to treat photos as immutable files, because you immediately get rid of a whole class of problems with cache invalidation, with logic for finding the correct version of a photo, and so on.

Let's say a user uploaded a photo and then rotated it: make that a physically different file. I.e. don't think "I'll save a bit of space, write to the same file and bump the version." That never works well and causes a lot of headaches later.
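One simple way to keep photos immutable is to derive the file name from the content (or put a version into the name), so that any edit produces a physically different file. A small sketch:

```python
import hashlib

def photo_file_name(user_id: int, content: bytes) -> str:
    """Any edit (rotate, crop, ...) changes the bytes, hence the name and the URL."""
    digest = hashlib.sha1(content).hexdigest()[:16]
    return f"{user_id}/{digest}.jpg"

original = photo_file_name(42, b"...jpeg bytes...")
rotated  = photo_file_name(42, b"...rotated jpeg bytes...")
# original != rotated: caches never need invalidating, old URLs stay valid
```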

Next point. About resize on the fly.

Previously, when users uploaded a photo, we immediately cut a whole bunch of sizes for all occasions, for different clients, and they were all on the disk. Now we have abandoned this.

We kept only three main sizes: small, medium and large. Everything else we simply downscale on the fly from the stored size just above the one requested in the URL, and serve it to the user.

The CPU spent on the caching layer turns out to be much cheaper than constantly regenerating these sizes on every storage server. Say we want to add a new size: that would take a month of running a script everywhere that does it all carefully without killing the cluster. So if you have the choice now, it is better to keep as few physical sizes as possible, just enough to cover the range, say three, and resize everything else on the fly using ready-made modules. That is all very easy and accessible nowadays.

And incremental asynchronous backup is good.

As our practice has shown, this scheme works great with delayed copying of changed files.

The last point is also obvious: if your infrastructure does not have such problems now but there is something that can break, it will definitely break once things grow a little more. So it is better to think about it in advance and avoid the problems. That's all I wanted to say.

Contacts

» bo0rsh201
» Badoo Blog

This report is a transcript of one of the best talks at the HighLoad++ conference for developers of high-load systems.

Source: habr.com