Mini ITX Cluster Turing Pi 2 with 32 GB RAM


Greetings to the Habr community! I recently wrote about the first version of our cluster board [V1]. Today I want to tell you how we developed the Turing Pi V2 with 32 GB of RAM.

We are fond of mini servers that can be used both for local development and for local hosting. Unlike desktop computers or laptops, our servers are designed to run 24/7 and can be quickly federated: for example, a cluster had 4 processors, and 5 minutes later it has 16 (with no additional network equipment), all in a compact form factor, silent and energy-efficient.

Our servers are built around a cluster architecture: we make cluster boards that connect several compute modules (processors) through an on-board Ethernet network. To keep things simple, we do not make our own compute modules yet; we use Raspberry Pi Compute Modules, and we had high hopes for the new CM4 module. But everything went against the plans with its new form factor, and I think many are disappointed.

Under the cut: how we went from V1 to V2, and how we dealt with the new Raspberry Pi CM4 form factor.

So, after building a 7-node cluster, the question was: what next? How do we increase the product's value? 8, 10, or 16 nodes? Which module manufacturers? Thinking about the product as a whole, we realized that the main thing is not the number of nodes or who manufactures them, but the very essence of a cluster as a building block. We needed to find the minimum building block that:

First, is a cluster in itself while still being able to connect disks and expansion boards. The cluster block should be a self-sufficient base node with a wide range of expansion options.

Second, can be connected to other blocks to build larger clusters, efficiently in terms of both budget and scaling speed. Scaling must be faster than connecting ordinary computers to a network and much cheaper than server hardware.

Third, is sufficiently compact, mobile, energy-efficient, cost-effective, and undemanding about operating conditions. This is one of the key differences from server racks and everything that goes with them.

We started by determining the number of nodes.

Number of nodes

Simple reasoning led us to 4 nodes as the best option for the minimum cluster block: 1 node is not a cluster; 2 nodes are not enough (1 master + 1 worker, with no room to scale within a block, especially for heterogeneous setups); 3 nodes looks OK, but it is not a power of 2 and scaling within a block is limited; 6 nodes costs almost as much as 7 (from our experience, already a high unit cost); 8 is too many — it does not fit the mini ITX form factor and makes an even more expensive proof-of-concept.

Four nodes per block turned out to be the golden mean:

  • less material per cluster board, hence cheaper to manufacture
  • a power of 2: four blocks give 16 physical processors
  • a stable scheme of 1 master and 3 workers
  • more heterogeneous variations: general-compute + accelerated-compute modules
  • fits the mini ITX form factor with SSD drives and expansion cards

Compute modules

The second version is based on the CM4; we expected it to be released in the SODIMM form factor. But…
We decided to make a SODIMM daughterboard and assemble CM4s directly into modules, so that users don't have to think about the CM4 form factor.

Turing Pi Compute Module Supporting Raspberry Pi CM4

While searching for modules, we discovered a whole market of compute modules, from small ones with 128 MB of RAM up to 8 GB, with 16 GB and larger modules on the way. For hosting edge applications built on cloud-native technologies, 1 GB of RAM is already not enough, and the recent appearance of 2, 4, and even 8 GB modules leaves good room for growth. We even considered FPGA modules for machine-learning applications, but postponed their support because the software ecosystem is not mature yet. While studying the module market, we came up with the idea of a universal module interface, and in V2 we begin to unify the compute-module interface. This will let owners of the V2 version connect modules from other manufacturers and mix them for specific tasks.

V2 supports the entire Raspberry Pi Compute Module 4 (CM4) line, including the Lite versions and 8 GB RAM modules.


Peripherals

After settling on the module vendor and the number of nodes, we turned to the PCI bus, which carries the peripherals. PCIe is the standard for peripherals and is found in almost all compute modules. We have several nodes, and ideally each node should be able to share PCI devices with concurrent access: for example, a disk connected to the bus would be available to all nodes. We looked for PCI switches with multi-host support and found that none met our requirements — they were limited to a single host, or supported multiple hosts but without concurrent requests to endpoints. The second problem is the high cost: $50 or more per chip. In V2 we decided to postpone experiments with PCI switches (we will return to them as we develop) and instead assigned a role to each node: the first two nodes each expose a mini PCIe port, and the third node exposes a 2-port 6 Gbps SATA controller. To access the disks from the other nodes, you can use a network file system within the cluster. Why not?
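Since only the third node carries the SATA controller, its disks can be shared with the other nodes over the on-board Ethernet switch, for example via NFS. A minimal sketch in the form of config fragments; the mount path, hostname, and subnet are illustrative assumptions, not part of the board spec:

```shell
# /etc/exports on node 3 (the node with the SATA controller);
# the path and subnet here are assumed for illustration:
/mnt/ssd  192.168.1.0/24(rw,sync,no_subtree_check)

# /etc/fstab on nodes 1, 2, and 4 — mount the share at boot;
# "node3.local" is an assumed hostname:
node3.local:/mnt/ssd  /mnt/shared  nfs  defaults,_netdev  0  0
```

After editing /etc/exports on the storage node, `exportfs -ra` reloads the export table without restarting the NFS server.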

Sneak peek

We decided to share some sketches of how the minimum cluster block evolved through discussion and reflection.


As a result, we arrived at a cluster block with four 260-pin nodes, 2 mini PCIe (Gen 2) ports, and 2 SATA (Gen 3) ports. The board has a Layer-2 managed switch with VLAN support. A mini PCIe port is brought out from the first node; you can install a network card there to get another Ethernet port, or a 5G modem, and turn the first node into a router for the cluster network and its Ethernet ports.


The cluster bus offers more features, including the ability to flash modules directly through any slot and, of course, FAN connectors with speed control on each node.

Applications

Edge infrastructure for self-hosted applications & services

We designed V2 as the minimum building block for consumer/commercial-grade edge infrastructure. With V2 it is cheap to start a proof of concept and scale as you grow, gradually porting the applications that are more cost-effective and practical to host at the edge. Cluster blocks can be connected together to build larger clusters, and this can be done gradually without much risk to established
processes. Already today there is a huge number of business applications that can be hosted locally.

ARM Workstation

With up to 32 GB of RAM per cluster, the first node can run a desktop OS (for example, Ubuntu Desktop 20.04 LTS), while the remaining 3 nodes handle compilation, testing, and debugging for developing cloud-native solutions on ARM clusters — or serve as CI/CD nodes for ARM edge infrastructure in production.
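For example, the 1-master / 3-workers scheme can be brought up with a lightweight Kubernetes distribution such as k3s. A sketch, assuming the nodes reach each other by the illustrative hostname `node1.local`; the `<token-from-master>` placeholder must be filled in by hand:

```shell
# On node 1 (the master/server):
curl -sfL https://get.k3s.io | sh -
# k3s writes the join token to /var/lib/rancher/k3s/server/node-token

# On nodes 2-4 (workers), join using the master's address and token:
curl -sfL https://get.k3s.io | \
    K3S_URL=https://node1.local:6443 K3S_TOKEN=<token-from-master> sh -

# Back on the master: all four nodes should appear.
sudo k3s kubectl get nodes
```

This is an infrastructure sketch, not a hardened setup; in practice you would also pin a k3s version and restrict the token's distribution.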

A Turing V2 cluster with CM4 modules is architecturally almost identical (differing only in minor ARMv8 revisions) to a cluster of AWS Graviton instances. The CM4's processor uses the ARMv8 architecture, so you can build images and applications for AWS Graviton 1 and 2 instances, which are known to be much cheaper than x86 instances.
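Concretely, because the CM4 nodes are native arm64, an image built on the cluster runs unmodified on a Graviton instance. A sketch; the image name and registry are illustrative assumptions:

```shell
# On a CM4 node (native arm64), a plain build already produces an
# arm64 image suitable for Graviton; "myapp" is a hypothetical name:
docker build -t registry.example.com/myapp:arm64 .
docker push registry.example.com/myapp:arm64

# Equivalently, from an x86 machine, cross-build with docker buildx:
docker buildx build --platform linux/arm64 \
    -t registry.example.com/myapp:arm64 --push .
```

Building natively on the cluster avoids the QEMU emulation overhead that buildx incurs when cross-building from x86.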

Source: habr.com