Optimizing the distribution of servers across racks

In one of the chats I was asked a question:

- Is there something to read about how to properly pack servers into racks?

I realized that I don’t know such a text, so I wrote my own.

Firstly, this text is about physical servers in physical data centers (DCs). Secondly, we assume there are quite a lot of servers, hundreds or thousands; for fewer, this text makes no sense. Thirdly, we assume three constraints: physical space in the racks, the power budget per rack, and the network: the racks stand in rows, so one ToR switch can connect servers in neighboring racks.

The answer depends a lot on which parameter we are optimizing and what we are free to vary to get the best result. For example, maybe we just need to occupy as little space as possible to leave more room for further growth. Or maybe we are free to choose the rack height, the power per rack, the sockets in the PDU, the number of racks per switch group (one switch for 1, 2 or 3 racks), the length of the cables and the cost of pulling them (this is critical at the ends of the rows: with 10 racks in a row and 3 racks per switch, you will have to pull cables to another row or underuse ports in the switch), etc., etc. Server selection and DC selection are separate stories; here we assume both are already settled.

It helps to understand some of the nuances and details, in particular the average / maximum consumption of the servers and how electricity is delivered to us. With a standard Russian 230V supply and one phase per rack, a 32A breaker gives us ~7kW per rack. Say we nominally pay for 6kW per rack. If the provider meters our consumption only per row of 10 racks rather than per rack, and the breaker sits at its nominal ~7kW cutoff, then technically we can gobble up 6.9kW in one rack and 5.1kW in another and everything will be fine: not punishable.
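
For the arithmetic-minded, here is that check as a tiny sketch (the 230V / 32A / 6kW / 10-rack figures are just the example values from above):

```python
# Back-of-the-envelope check of the numbers above.
VOLTAGE_V = 230          # one phase per rack
BREAKER_A = 32           # rack breaker rating
PAID_PER_RACK_W = 6000   # nominal paid-for power per rack
RACKS_PER_ROW = 10

breaker_limit_w = VOLTAGE_V * BREAKER_A          # 7360 W, the "~7 kW" cutoff
row_budget_w = PAID_PER_RACK_W * RACKS_PER_ROW   # 60 kW for the whole row

# With per-row metering, individual racks may exceed 6 kW as long as
# every breaker and the row total are respected.
racks_w = [6900, 5100] + [6000] * 8
assert all(w <= breaker_limit_w for w in racks_w)
assert sum(racks_w) <= row_budget_w
print(breaker_limit_w, sum(racks_w), row_budget_w)
```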

Usually our main goal is to minimize costs, and the best yardstick is TCO (total cost of ownership). It consists of the following pieces:

  • CAPEX: procurement of DC infrastructure, servers, network hardware and cabling
  • OPEX: DC rent, consumed electricity, maintenance. OPEX depends on the service life; it is reasonable to assume 3 years.

Depending on how big each piece of the overall pie is, we should optimize the most expensive one and let the rest use the remaining resources as efficiently as possible.

Let's say we have an existing DC, the rack height is H units (for example, H=47), the power per rack is Prack (Prack=6kW), and we decided to use two-unit servers, h=2U. Let's remove 2..4 units from the rack for switches, patch panels and cable organizers. So physically we fit Sh = rounddown((H-2..4)/h) servers per rack (i.e. Sh = rounddown((47-4)/2) = 21 servers per rack). Let's remember this Sh.

In the simple case, all servers in a rack are the same. So if we fill the rack with servers, then on average we can spend Pserv = Prack / Sh per server (Pserv = 6000W / 21 ≈ 286W). For simplicity, we ignore the consumption of the switch here.
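
The same two formulas as a minimal sketch, using the example numbers (H=47, h=2U, 4 units of overhead, Prack=6kW):

```python
import math

H = 47           # rack height, units
h = 2            # server height, units
overhead_u = 4   # reserved for switches, patch panels, organizers (2..4)
P_rack = 6000    # W available per rack

S_h = math.floor((H - overhead_u) / h)   # 21 servers fit physically
P_serv = P_rack / S_h                    # ~285.7 W average budget per server
print(S_h, round(P_serv))                # -> 21 286
```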

Let's take a step aside and figure out what the maximum server consumption Pmax is. The very simple, very inefficient and completely safe way: read what is written on the server's power supply unit, and that's it.

A slightly more complicated, more efficient way: take the TDP (thermal design power) of all the components and sum them up (this is not quite correct, but it works).

Usually we don't know the TDP of the components (except the CPU), so we take the most correct but also the most difficult approach (it requires a lab): take an experimental server of the desired configuration and load it, for example, with Linpack (CPU and memory) and fio (disks), measuring consumption as we go. To do it seriously, you also need to recreate the warmest allowed cold-aisle conditions during the tests, because temperature affects both fan consumption and CPU consumption. What we get is the maximum consumption of a specific server with a specific configuration under these specific conditions under this specific load. Just keep in mind that new system firmware, a different software version or other conditions may affect the result.
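
As an illustration only, here is a rough sketch of such a measurement loop. It assumes the BMC supports DCMI power readings via ipmitool (the exact output format varies between BMCs) and uses stress-ng as a stand-in for the Linpack + fio load:

```python
# Sketch: run a load generator and poll the BMC for instantaneous power.
import re
import subprocess
import time

def read_power_w() -> int:
    # Assumes "ipmitool dcmi power reading" works on this host and prints
    # a line like "Instantaneous power reading: 234 Watts".
    out = subprocess.run(
        ["ipmitool", "dcmi", "power", "reading"],
        capture_output=True, text=True, check=True,
    ).stdout
    match = re.search(r"Instantaneous power reading:\s+(\d+)\s+Watts", out)
    return int(match.group(1))

# stress-ng here stands in for your real Linpack + fio workload.
load = subprocess.Popen(["stress-ng", "--cpu", "0", "--vm", "4",
                         "--timeout", "600s"])
peak_w = 0
while load.poll() is None:
    peak_w = max(peak_w, read_power_w())
    time.sleep(5)
print(f"observed peak: {peak_w} W")
```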

So, back to Pserv and how to compare it with Pmax. It's a matter of understanding how the services work and how strong your technical director's nerves are.

If we take no risks at all, then we assume that all servers can start consuming their maximum simultaneously, and at that same moment the DC may be running on a single power feed. The infrastructure must provide service even under these conditions, therefore Pserv = Pmax. This is the approach where reliability is absolutely critical.

If the technical director thinks not only about ideal safety, but also about the company's money, and is brave enough, then we can decide that:

  • we start managing our vendors, in particular, we forbid scheduled maintenance at the moments of planned peak load, to minimize the chance of losing one power feed at the worst time;
  • and/or our architecture tolerates losing a rack / row / DC while the services keep working;
  • and/or we spread the load well horizontally across the racks, so our services will never all jump to maximum consumption within one rack at the same time.

It is very useful here not just to guess, but to monitor consumption and know how the servers actually draw power under normal and peak conditions. So, after some analysis, the technical director weighs everything he knows and says: "we make a strong-willed decision that the maximum achievable average of the servers' maximum consumption per rack is **this much** lower than the maximum consumption", conditionally Pserv = 0.8 * Pmax.
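
A hedged sketch of how such a derating factor could be derived from monitoring data; the power matrix here is random placeholder data standing in for whatever your monitoring actually exports:

```python
import numpy as np

P_MAX_W = 375.0   # measured single-server maximum

# power_w[i, t]: power of server i in one rack at time t.
# Placeholder random data; in reality this comes from your monitoring.
power_w = np.random.uniform(150, 320, size=(20, 10_000))

rack_peak_w = power_w.sum(axis=0).max()            # worst simultaneous draw
derating = rack_peak_w / (power_w.shape[0] * P_MAX_W)
print(f"observed rack peak / theoretical max = {derating:.2f}")
# If this stays well below 0.8 across months of data, Pserv = 0.8 * Pmax
# is a defensible decision rather than a guess.
```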

And then a 6kW rack fits not 16 servers with Pmax = 375W, but 20 servers with Pserv = 375W * 0.8 = 300W. That is 25% more servers per rack, which is a very big saving: we need about 20% fewer racks (and we also save on PDUs, switches and cables). A serious disadvantage of this decision is that we must constantly verify that our assumptions still hold: that a new firmware version does not significantly change fan behavior and consumption, that development has not suddenly started using the servers much more efficiently with a new release (read: achieved more load and more consumption per server). If that happens, both our initial assumptions and our conclusions immediately become wrong. This is a risk that must be taken responsibly (or avoided, and then you pay for obviously underutilized racks).

An important note: where possible, it is worth spreading servers of different services across racks horizontally. This prevents the situation where a batch of servers for one service arrives and the racks are filled with it vertically to increase "density" (because it's easier). In reality, one rack ends up crammed with identical low-loaded servers of one service, and another with equally high-loaded ones. The probability of the second one tripping is significantly higher, because the load profile is the same and all the servers in that rack ramp up their consumption together under increased load.
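
Here is a minimal sketch of that horizontal placement idea: interleave the services first, then deal servers round-robin across racks. The service names and counts are made up:

```python
from itertools import chain, zip_longest

RACKS = 5
# Hypothetical batches: three services with different load profiles.
batches = {"web": 40, "db": 30, "cache": 30}

# Interleave services first, then deal servers round-robin over racks,
# so no rack ends up filled with a single service.
interleaved = [s for s in chain.from_iterable(
    zip_longest(*[[f"{svc}-{i}" for i in range(n)]
                  for svc, n in batches.items()])
) if s is not None]

racks = {r: [] for r in range(RACKS)}
for idx, server in enumerate(interleaved):
    racks[idx % RACKS].append(server)

for r, hosts in racks.items():
    print(r, hosts[:4], f"... ({len(hosts)} total)")
```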

Let's return to distributing servers across racks. We've covered the physical space limit and the power limit; now let's look at the network. Switches come with N = 24 / 32 / 48 ports (we'll use 48-port ToR switches as the example). Fortunately, there are not many options, if you don't consider break-out cables. We consider scenarios with one switch per group of Rnet = 1, 2 or 3 racks. More than three racks per group seems too much to me, because the inter-rack cabling problem becomes much harder.

So, for each network scenario (1, 2 or 3 racks in a group), we distribute the servers among the racks:

Srack = min(Sh, rounddown(Prack/Pserv), rounddown(N/Rnet))

Thus, for a variant with 2 racks in a group:

Srack2 = min(21, rounddown(6000/300), rounddown(48/2)) = min(21, 20, 24) = 20 servers per rack.

Similarly, we consider the remaining options:

Srack1 = 20
Srack3 = 16
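
The same Srack calculation as a sketch, with the example numbers from above:

```python
import math

def servers_per_rack(S_h: int, P_rack: float, P_serv: float,
                     N: int, R_net: int) -> int:
    """Minimum of the space, power and switch-port constraints."""
    return min(S_h,
               math.floor(P_rack / P_serv),
               math.floor(N / R_net))

for r_net in (1, 2, 3):
    print(r_net, servers_per_rack(S_h=21, P_rack=6000, P_serv=300,
                                  N=48, R_net=r_net))
# -> 20, 20 and 16 servers per rack
```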

And we are almost there. We count the number of racks needed to place all S of our servers (let S = 1000):

R = roundup(S / (Srack * Rnet)) * Rnet

R1 = roundup(1000 / (20 * 1)) * 1 = 50 * 1 = 50 racks

R2 = roundup(1000 / (20 * 2)) * 2 = 25 * 2 = 50 racks

R3 = roundup(1000 / (16 * 3)) * 3 = 21 * 3 = 63 racks
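
And the rack-count formula as a sketch, reusing the Srack values just computed:

```python
import math

S = 1000  # total servers to place

def racks_needed(S: int, S_rack: int, R_net: int) -> int:
    # A switch group is indivisible, so round up to whole groups of R_net racks.
    return math.ceil(S / (S_rack * R_net)) * R_net

for r_net, s_rack in [(1, 20), (2, 20), (3, 16)]:
    print(f"R{r_net} = {racks_needed(S, s_rack, r_net)} racks")
# -> R1 = 50, R2 = 50, R3 = 63
```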

Next, we calculate the TCO for each option based on the number of racks, the required number of switches, cabling, etc. We choose the option with the lowest TCO. Profit!

Note that although options 1 and 2 require the same number of racks, their cost will differ, because option 2 needs half as many switches but longer cables.
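
To make the comparison concrete, a sketch of the TCO step; every price here is an invented placeholder, only the structure of the comparison matters:

```python
import math

S = 1000
options = {1: 20, 2: 20, 3: 16}   # R_net -> S_rack from above

# Placeholder unit costs; substitute your real quotes.
RACK_TCO = 10_000                              # rent + power per rack, 3 years
SWITCH_COST = 8_000                            # one ToR switch
CABLE_COST_PER_SERVER = {1: 5, 2: 8, 3: 12}    # longer runs for bigger groups

for r_net, s_rack in options.items():
    racks = math.ceil(S / (s_rack * r_net)) * r_net
    switches = racks // r_net
    tco = (racks * RACK_TCO + switches * SWITCH_COST
           + S * CABLE_COST_PER_SERVER[r_net])
    print(f"R_net={r_net}: {racks} racks, {switches} switches, TCO={tco:,}")
```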

PS If you also have the freedom to play with the power per rack and the rack height, the variability increases. But the process reduces to the one above, just with more options to enumerate. There will be more combinations, but still a very limited number: the power per rack can be stepped in 1kW increments for the calculation, and typical racks come in a limited number of sizes: 42U, 45U, 47U, 48U, 52U. Excel's What-If analysis in Data Table mode can help with the calculation. We look at the resulting tables and pick the minimum.
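
The same enumeration as a sketch, if you prefer code to a Data Table; the cost model at the bottom is a deliberately crude placeholder:

```python
import math
from itertools import product

S, h, overhead_u, N = 1000, 2, 4, 48
P_serv = 300  # W, from the 0.8 * Pmax decision

best = None
for H, P_rack, R_net in product((42, 45, 47, 48, 52),        # rack heights
                                range(4000, 10001, 1000),    # 1 kW power steps
                                (1, 2, 3)):                  # racks per switch
    S_rack = min(math.floor((H - overhead_u) / h),
                 math.floor(P_rack / P_serv),
                 math.floor(N / R_net))
    if S_rack == 0:
        continue
    racks = math.ceil(S / (S_rack * R_net)) * R_net
    cost = racks * (6000 + P_rack // 2)  # toy TCO model, replace with yours
    if best is None or cost < best[0]:
        best = (cost, H, P_rack, R_net, racks)

print(best)  # cheapest (cost, H, P_rack, R_net, racks) under the toy model
```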

Source: habr.com
