Optimize rack distribution of servers

In one of the chats I was asked:

- Is there anything to read on how to pack servers into racks properly?

I realized that I did not know of such a text, so I wrote my own.

First, this text is about physical servers in physical data centers (DCs). Second, we assume there are a lot of servers: hundreds or thousands. For smaller numbers this text makes little sense. Third, we assume three constraints: physical space in the racks, power per rack, and the racks standing in rows, so that one ToR switch can connect servers in neighboring racks.

The answer depends heavily on which parameter we optimize and what we can vary to get the best result. For example, we may simply need to occupy the minimum space to leave more room for future growth. Or we may be free to choose the rack height, the power per rack, the sockets in the PDU, the number of racks per switch group (one switch per 1, 2 or 3 racks), the cable lengths and the cabling work (this is critical at the ends of rows: with 10 racks in a row and 3 racks per switch, you either have to pull cables into another row or leave switch ports unused), and so on. Server selection and DC selection are separate stories; we assume both have already been chosen.

It helps to understand some nuances and details, in particular the average and maximum server consumption and how power is delivered to us. For example, with a Russian 230 V supply and one phase per rack, a 32 A circuit breaker can carry about 7 kW (230 V * 32 A ≈ 7.4 kW). Suppose we nominally pay for 6 kW per rack. If the provider meters our consumption only per row of 10 racks rather than per rack, and the breaker trips at about 7 kW, then technically we can draw 6.9 kW in one rack and 5.1 kW in another, and everything will be fine: no penalty.
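The arithmetic in this example can be checked with a small Python sketch. The function names and the two rules it checks (no rack reaches its breaker limit, and the row total stays within what we pay for) are my own simplification for illustration; real contracts and breaker trip curves are more nuanced.

```python
def breaker_limit_w(voltage_v, current_a):
    """Nominal single-phase power a breaker can carry, in watts."""
    return voltage_v * current_a


def row_is_ok(rack_loads_w, paid_per_rack_w, breaker_w):
    """Illustrative check for a row metered as a whole.

    Assumption: the provider only compares the row total with the paid
    total, and each rack must stay below its breaker cut-off.
    """
    within_breaker = all(load < breaker_w for load in rack_loads_w)
    within_budget = sum(rack_loads_w) <= paid_per_rack_w * len(rack_loads_w)
    return within_breaker and within_budget


print(breaker_limit_w(230, 32))             # 7360
print(row_is_ok([6900, 5100], 6000, 7000))  # True
```

Here 7000 W stands for the conservative cut-off used in the text, slightly below the nominal 7360 W.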

Usually the main goal is to minimize costs. The best criterion is the reduction in TCO (total cost of ownership). It consists of the following pieces:

- CAPEX: purchase of DC infrastructure, servers, network hardware and cabling.
- OPEX: DC rent, electricity consumed, maintenance. OPEX depends on the service life; it is reasonable to assume 3 years.

Depending on how large each piece is in the overall pie, we optimize the most expensive one and make everything else use the remaining resources as efficiently as possible.

Suppose we have an existing DC with racks H units high (for example, H = 47) and power per rack P_rack (P_rack = 6 kW), and we have decided to use two-unit servers, h = 2U. We set aside 2 to 4 units in each rack for switches, patch panels and cable organizers. That is, physically S_h = rounddown((H - 2..4) / h) servers fit in a rack (here S_h = rounddown((47 - 4) / 2) = 21 servers per rack). Remember this S_h.

In the simple case all servers in the rack are identical. If we fill the rack completely, each server can draw on average P_serv = P_rack / S_h (P_serv = 6000 W / 21 ≈ 286 W). For simplicity we ignore switch consumption here.
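The two numbers above are easy to reproduce. Below is a minimal Python sketch; the function and parameter names are mine, and the 4 reserved units are the upper end of the 2..4 range from the text.

```python
def servers_by_height(rack_units, reserved_units, server_units):
    """S_h: how many servers physically fit into one rack."""
    return (rack_units - reserved_units) // server_units


def power_per_server(rack_power_w, servers):
    """P_serv: average power budget per server in a full rack."""
    return rack_power_w / servers


s_h = servers_by_height(47, 4, 2)
print(s_h)                                    # 21
print(round(power_per_server(6000, s_h), 1))  # 285.7
```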

Let us digress and work out the maximum server consumption P_max. The very simple, very inefficient and completely safe way is to read what is written on the server's power supply, and that is it.

A more complicated and more efficient way is to take the TDP (thermal design power) of all components and add them up (this is not quite accurate, but it will do).

Usually we do not know the TDP of the components (except for the CPU), so we take the most correct but also the most difficult approach (it needs a lab): we take a test server in the required configuration, load it with, for example, Linpack (CPU and memory) and fio (disks), and measure the consumption. To do it properly, you also need to make the cold aisle as warm as it can get during the tests, because this affects both fan consumption and CPU consumption. We get the maximum consumption of a specific server in a specific configuration under these specific conditions and this specific load. Keep in mind that new system firmware, a different software version or different conditions may change the result.

So, back to P_serv and how it compares with P_max. This comes down to understanding how your services behave and how strong your technical director's nerves are.

If we take no risk at all, we assume that all servers can start consuming their maximum simultaneously. At the same moment one power feed to the DC may fail. The infrastructure must still provide service under these conditions, therefore P_serv ≡ P_max. This is the approach for when reliability is absolutely critical.

If the technical director thinks not only about perfect safety but also about the company's money, and is brave enough, then we can decide that:

- we start managing our vendors, in particular we prohibit scheduled maintenance during planned peak load, to minimize the chance of losing one feed;
- and/or our architecture tolerates the loss of a rack / row / DC while the services keep working;
- and/or we spread the load well horizontally across racks, so our services never jump to maximum consumption in one rack all at once.

It is very useful here not just to guess but to monitor consumption and know how much power the servers actually draw under normal and peak conditions. So, after some analysis, the technical director braces himself and says: "we make a deliberate decision that the maximum achievable average of per-server peak consumption in a rack is **this much** lower than the maximum consumption", say P_serv = 0.8 * P_max.

And then a 6 kW rack holds not 16 servers with P_max = 375 W but 20 servers with P_serv = 375 W * 0.8 = 300 W. That is, 25% more servers per rack. This is a very big saving: we immediately need 20% fewer racks (and we also save on PDUs, switches and cables). A serious downside of this decision is that we must constantly monitor that our assumptions still hold: that a new firmware version does not significantly change fan behavior and consumption, that a new software release has not suddenly started using the servers much more efficiently (read: more load and more consumption per server). If that happens, both our initial assumptions and our conclusions immediately become wrong. This is a risk that must be taken responsibly (or avoided, and then we pay for obviously underloaded racks).
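To see what the factor buys us, here is a small Python sketch of this calculation. The names are mine; the factor is passed as an integer percentage only to keep the arithmetic exact, and the fleet of 1000 servers is the example size used later in the text.

```python
import math


def servers_by_power(rack_power_w, p_max_w, derate_percent=100):
    """Servers per rack by power, with P_serv = P_max * derate_percent / 100."""
    p_serv_w = p_max_w * derate_percent / 100
    return math.floor(rack_power_w / p_serv_w)


def racks_for(total_servers, servers_per_rack):
    """Racks needed if power were the only constraint."""
    return math.ceil(total_servers / servers_per_rack)


safe = servers_by_power(6000, 375)      # P_serv = P_max
bold = servers_by_power(6000, 375, 80)  # P_serv = 0.8 * P_max
print(safe, bold)                                    # 16 20
print(racks_for(1000, safe), racks_for(1000, bold))  # 63 50
```

For 1000 servers this gives 63 racks against 50, which is roughly the 20% saving in racks.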

An important note: try to distribute servers of different services across racks horizontally, if possible. This avoids the situation where a batch of servers for one service arrives and the racks are filled with it vertically, one after another, to increase "density" (because it is easier). In practice you end up with one rack full of identical lightly loaded servers of one service and another rack full of equally heavily loaded ones. The second rack is much more likely to trip, because the load profile is the same and all servers in that rack start drawing more power at the same time as the load grows.
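One simple way to get such a horizontal layout is round-robin placement: each batch is dealt out across the racks like cards instead of filling racks one by one. The Python sketch below is only an illustration with hypothetical service names; it ignores per-service power profiles and any placement rules you may have.

```python
def spread_horizontally(batches, n_racks, rack_capacity):
    """Deal servers of each service across racks in round-robin order.

    batches: dict mapping service name -> number of servers.
    Returns a list of racks, each a list of service names.
    """
    racks = [[] for _ in range(n_racks)]
    cursor = 0  # rack where the next server is tried first
    for service, count in batches.items():
        for _ in range(count):
            for attempt in range(n_racks):
                idx = (cursor + attempt) % n_racks
                if len(racks[idx]) < rack_capacity:
                    racks[idx].append(service)
                    cursor = idx + 1
                    break
            else:
                raise ValueError("not enough rack space")
    return racks


layout = spread_horizontally({"web": 4, "db": 4}, n_racks=4, rack_capacity=2)
print(layout)  # every rack gets one "web" and one "db" server
```

Filling racks vertically would instead give two racks of only "web" and two racks of only "db".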

Back to distributing servers across racks. We have covered the physical limits of rack space and power; now let us look at the network. You can use switches with N = 24/32/48 ports (for example, we have 48-port ToR switches). Fortunately there are not many options, if we leave break-out cables aside. We consider scenarios with one switch per rack and one switch per two or three racks, in a group of R_net racks. It seems to me that more than three racks per group is too many, because cabling between racks becomes a much bigger problem.

So, for each network scenario (1, 2 or 3 racks in a group) we work out how many servers go into each rack:

S_rack = min(S_h, rounddown(P_rack / P_serv), rounddown(N / R_net))

Thus, for the option with 2 racks in a group (R_net = 2):

S_rack = min(21, rounddown(6000 / 300), rounddown(48 / 2)) = min(21, 20, 24) = 20 servers per rack.

Similarly, for the remaining options:

R_net = 1: S_rack = min(21, 20, 48) = 20

R_net = 3: S_rack = min(21, 20, 16) = 16
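The same formula as a short Python sketch. The function name is mine, and it assumes, as the formula does, one switch port per server and integer watts.

```python
def servers_per_rack(s_h, p_rack_w, p_serv_w, switch_ports, r_net):
    """S_rack: the tightest of the space, power and switch-port limits."""
    by_power = p_rack_w // p_serv_w
    by_ports = switch_ports // r_net
    return min(s_h, by_power, by_ports)


for r_net in (1, 2, 3):
    print(r_net, servers_per_rack(21, 6000, 300, 48, r_net))
# 1 20
# 2 20
# 3 16
```

With one or two racks per switch the rack is limited by power; with three racks per switch the ports run out first.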

We are almost there. We count the number of racks needed to place all our S servers (let S be 1000):

R = roundup(S / (S_rack * R_net)) * R_net

R₁ = roundup(1000 / (20 * 1)) * 1 = 50 * 1 = 50 racks

R₂ = roundup(1000 / (20 * 2)) * 2 = 25 * 2 = 50 racks

R₃ = roundup(1000 / (16 * 3)) * 3 = 21 * 3 = 63 racks
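The rack count can be sketched in Python like this. The function name is mine; rounding is done per group of R_net racks, so the result is always a whole number of switch groups.

```python
import math


def racks_needed(total_servers, s_rack, r_net):
    """R: racks required, rounded up to whole groups of r_net racks."""
    groups = math.ceil(total_servers / (s_rack * r_net))
    return groups * r_net


# S_rack per scenario, taken from the calculation above
s_rack_by_scenario = {1: 20, 2: 20, 3: 16}
for r_net, s_rack in s_rack_by_scenario.items():
    print(r_net, racks_needed(1000, s_rack, r_net))
# 1 50
# 2 50
# 3 63
```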

Next, we calculate the TCO for each option based on the number of racks, the required number of switches, cabling and so on, and choose the option with the lowest TCO. Profit!

Note that although options 1 and 2 need the same number of racks, their cost will differ: the second option needs half as many switches but longer cables.
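A rough sketch of such a comparison in Python is below. All prices in it are made-up placeholders chosen only to show the mechanics, not real quotes, and the cost model is my simplification: rack rent over 36 months, one switch per group, and a per-server cabling cost that grows with the group size. Server purchase cost is left out because it is the same for every option.

```python
# Hypothetical prices, for illustration only
PRICES = {
    "rack_rent_per_month": 1000,
    "switch": 5000,
    "cable_per_server": {1: 10, 2: 15, 3: 25},  # longer runs in bigger groups
}


def switches_needed(racks, r_net):
    """One switch per group of r_net racks (racks is a multiple of r_net)."""
    return racks // r_net


def tco(racks, r_net, servers, prices, months=36):
    """Simplified TCO: switches and cabling (CAPEX) plus rack rent (OPEX)."""
    capex = (switches_needed(racks, r_net) * prices["switch"]
             + servers * prices["cable_per_server"][r_net])
    opex = racks * prices["rack_rent_per_month"] * months
    return capex + opex


racks_by_scenario = {1: 50, 2: 50, 3: 63}
totals = {r_net: tco(racks, r_net, 1000, PRICES)
          for r_net, racks in racks_by_scenario.items()}
best = min(totals, key=totals.get)
print(totals)
print("cheapest scenario with these placeholder prices:", best)
```

With real prices the winner may well be different; the point is only that the same number of racks does not mean the same TCO.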

PS: If you can also vary the power per rack and the rack height, there are more degrees of freedom. But the process still reduces to the one above: simply enumerating the options. Yes, there will be more combinations, but the number is still very limited: the power per rack can be increased in 1 kW steps for the calculation, and typical racks come in a limited number of sizes: 42U, 45U, 47U, 48U, 52U. Here Excel's What-If Analysis in Data Table mode can help with the calculation. We look at the resulting tables and pick the minimum.
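The same enumeration can be done outside Excel. Here is a minimal Python sketch under my own assumptions: the 5..10 kW power range, 4 reserved units, 2U servers, P_serv = 300 W and 48-port switches are example values, and the table contains only rack counts, to which you would attach your own prices to get the TCO.

```python
import itertools
import math

RACK_HEIGHTS_U = (42, 45, 47, 48, 52)
RACK_POWERS_W = range(5000, 11000, 1000)  # assumed range: 5..10 kW, 1 kW step
RACKS_PER_SWITCH = (1, 2, 3)


def racks_needed(total, rack_u, p_rack_w, r_net,
                 reserved_u=4, server_u=2, p_serv_w=300, ports=48):
    """Racks for one combination, using the formulas from the article."""
    s_rack = min((rack_u - reserved_u) // server_u,
                 p_rack_w // p_serv_w,
                 ports // r_net)
    return math.ceil(total / (s_rack * r_net)) * r_net


def enumerate_options(total):
    """One row per (rack height, rack power, racks per switch) combination."""
    combos = itertools.product(RACK_HEIGHTS_U, RACK_POWERS_W, RACKS_PER_SWITCH)
    return [(u, p, r, racks_needed(total, u, p, r)) for u, p, r in combos]


rows = enumerate_options(1000)
print(len(rows))  # 90 combinations
for row in sorted(rows, key=lambda r: r[3])[:5]:
    print(row)    # (height U, power W, racks per switch, racks)
```

Sorting by rack count is only a first look; the final choice should still be made on TCO, since taller racks and more power per rack cost more.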
