How many TPS are on your blockchain?

A favorite question from a non-technical person about any distributed system is "How many TPS does your blockchain do?". However, the number given in response usually has little to do with what the questioner actually wants to know. What they really mean is "is your blockchain suitable for my business requirements?", and those requirements are not a single number but many conditions: the fault tolerance of the network, finality requirements, the size and nature of transactions, and many other parameters. So the answer to "how many TPS" is unlikely to be simple, and it will almost never be complete. A distributed system with dozens or hundreds of nodes performing fairly complex computations can be in an enormous number of different states related to the state of the network, the contents of the blockchain, technical failures, economic problems, attacks on the network, and many other causes. The stages at which performance problems can appear differ from those of traditional services: a blockchain node is a network service that combines the functionality of a database, a web server and a torrent client, which makes its load profile extremely complex across all subsystems: processor, memory, network, storage.

Decentralized networks and blockchains are rather specific and unusual software for developers of centralized systems. So I would like to highlight the important aspects of the performance and sustainability of decentralized networks, approaches to measuring them, and ways of finding bottlenecks. We will look at various performance issues that limit the speed at which a service is provided to blockchain users and note the features specific to this type of software.

Steps for requesting a service by a blockchain client

To talk honestly about the quality of any reasonably complex service, you need to look not only at average values but also at maxima, minima, medians and percentiles. Theoretically we can talk about 1000 TPS in some blockchain, but if 900 transactions completed at tremendous speed while 100 "froze" for several seconds, then the average time over all transactions is not an entirely honest metric for a client who was unable to complete their transaction within those few seconds. Temporary pitfalls caused by missed consensus rounds or network splits can severely hurt a service that performed well on test benches.
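
As a quick illustration (synthetic numbers only, not taken from any real network), here is a minimal sketch of how the same set of latencies gives a reassuring average but painful tail percentiles:

```python
import math
import statistics

fast = [0.05] * 900           # 900 transactions confirmed in 50 ms
frozen = [5.0] * 100          # 100 transactions "frozen" for 5 s

latencies = sorted(fast + frozen)

def percentile(sorted_data, p):
    """Nearest-rank percentile of an already sorted list."""
    k = math.ceil(p / 100 * len(sorted_data)) - 1
    return sorted_data[max(0, min(k, len(sorted_data) - 1))]

print(f"mean   : {statistics.mean(latencies):.2f} s")    # ~0.55 s, looks acceptable
print(f"median : {statistics.median(latencies):.2f} s")  # 0.05 s
print(f"p95    : {percentile(latencies, 95):.2f} s")     # 5.00 s, the real pain
print(f"p99    : {percentile(latencies, 99):.2f} s")     # 5.00 s
```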

To identify such bottlenecks, you need a good understanding of the stages at which a real blockchain can have difficulty serving users. Let's describe the cycle of delivering and processing a transaction, up to the point where the client receives the new blockchain state from which it can verify that its transaction was processed and accounted for (a minimal client-side sketch of timing these stages follows the list).

  1. the transaction is generated on the client
  2. the transaction is signed on the client
  3. the client selects one of the nodes and sends the transaction to it
  4. the client subscribes to updates of the node's state database, waiting for the result of executing its transaction
  5. the node propagates the transaction over the p2p network
  6. one or several block producers (BPs) process the accumulated transactions, updating the state database
  7. a BP forms a new block after processing the required number of transactions
  8. the BP distributes the new block over the p2p network
  9. the new block is delivered to the node the client is connected to
  10. the node updates its state database
  11. the node sees the update relevant to the client and sends it a transaction notification
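
As promised, here is a minimal, blockchain-agnostic sketch of how a client might record a timestamp for each of these stages; the stage names and the record structure are illustrative assumptions, not any particular SDK's API:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TxTrace:
    tx_hash: str
    stamps: dict = field(default_factory=dict)   # stage name -> monotonic time

    def mark(self, stage: str) -> None:
        self.stamps[stage] = time.monotonic()

    def duration(self, start: str, end: str) -> float:
        return self.stamps[end] - self.stamps[start]

trace = TxTrace(tx_hash="0xabc...")
trace.mark("generated")          # stage 1: transaction generated
trace.mark("signed")             # stage 2: transaction signed
trace.mark("sent")               # stage 3: sent to a node
# ... the client now waits for its subscription to fire (stages 4-10) ...
trace.mark("confirmed")          # stage 11: notification received

print("prepare   :", trace.duration("generated", "sent"))
print("end-to-end:", trace.duration("generated", "confirmed"))
```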

Now let's take a closer look at these stages and describe the potential performance issues at each one. Unlike with centralized systems, we will also look at code execution on network clients. Quite often, when measuring TPS, transaction processing time is collected from the nodes rather than from the client, which is not entirely fair. The client does not care how quickly a node processed its transaction; what matters is the moment when reliable information about that transaction being included in the blockchain becomes available to the client. This metric is what transaction execution time essentially is. It means that different clients, even sending the same transaction, can get completely different times, depending on the channel, the load and proximity of the node, and so on. So it is absolutely necessary to measure this time on clients, since this is the parameter that needs to be optimized.

Preparing a transaction on the client side

Let's start with the first two points: the transaction is formed and signed by the client. Oddly enough, this can also be a blockchain performance bottleneck from the client's point of view. This is unusual for centralized services, which take on all the computation and data operations themselves, while the client simply prepares a short request that may ask for a large amount of data or computation and gets back a finished result. In blockchains, client code becomes more and more powerful while the blockchain core becomes more and more lightweight, and it is customary to offload heavy computing tasks to the client software. There are blockchains whose clients can take quite a long time to prepare a single transaction (think of various Merkle proofs, succinct proofs, threshold signatures and other complex operations on the client side). A good example of cheap on-chain verification combined with heavy transaction preparation on the client is a Merkle-tree-based list membership proof; see the article linked here.
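
For illustration, here is a minimal sketch of such a proof, not tied to any particular blockchain's tree layout: building the proof is the heavier, client-side part, while verification only folds log2(N) hashes:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Return all tree levels, from leaf hashes up to the root."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last node on odd levels
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def make_proof(levels, index):
    """Collect the sibling hash at every level: this is the client-side work."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1                     # neighbour within the pair
        proof.append((level[sibling], sibling % 2 == 0))
        index //= 2
    return proof

def verify(root, leaf, proof):
    """Cheap, on-chain-style check: fold the proof back up to the root."""
    acc = h(leaf)
    for sibling, sibling_is_left in proof:
        acc = h(sibling + acc) if sibling_is_left else h(acc + sibling)
    return acc == root

leaves = [f"tx-{i}".encode() for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = make_proof(levels, 5)
print(verify(root, leaves[5], proof))           # True
```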

Also, do not forget that client code does not just send transactions to the blockchain; it first requests the blockchain's state, and this activity also adds load to the network and to the nodes. So when taking measurements, it's wise to emulate the behavior of the client code as closely as possible. Even if your blockchain has ordinary light clients that put an ordinary digital signature on the simplest asset transfer transaction, every year there is more heavy computation on the client, crypto algorithms keep getting stronger, and this part of the processing can turn into a significant bottleneck in the future. Be careful not to miss the situation where, in a transaction that takes 3.5 s, 2.5 s is spent preparing and signing it and only 1.0 s on sending it to the network and waiting for the response. To assess the risk of this bottleneck, you need to collect metrics from client machines, not just from blockchain nodes.

Sending a transaction and monitoring its status

The next step is sending the transaction to the selected blockchain node and getting the status of its acceptance into the transaction pool. This step is similar to an ordinary database write: the node must put the transaction into the pool and start distributing information about it through the p2p network. The approach to performance assessment here is similar to assessing traditional Web API microservices, except that transactions in blockchains can be updated and actively change their status. In general, the information about a transaction can change several times in some blockchains, for example when switching between forks of the chain or when BPs announce their intention to include the transaction in a block. Limits on the size of this pool and the number of transactions in it can affect blockchain performance: if the transaction pool fills up to its maximum size, or no longer fits in RAM, network performance can drop sharply. Blockchains have no centralized protection against floods of junk messages, and if a blockchain supports high transaction volumes with low fees, this can overflow the transaction pool, which is another potential performance bottleneck.
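
As a rough illustration of why such limits matter, here is a minimal sketch of a bounded transaction pool with a purely fee-based eviction policy (an illustrative policy; real nodes combine fee, size and age rules):

```python
import heapq

class TxPool:
    def __init__(self, max_size: int):
        self.max_size = max_size
        self.heap = []            # min-heap of (fee, tx_hash): cheapest on top
        self.txs = {}             # tx_hash -> tx payload

    def add(self, tx_hash: str, fee: int, payload: bytes) -> bool:
        if tx_hash in self.txs:
            return False                      # duplicate spam, ignore
        if len(self.txs) >= self.max_size:
            lowest_fee, lowest_hash = self.heap[0]
            if fee <= lowest_fee:
                return False                  # pool full, fee too low to get in
            heapq.heappop(self.heap)          # evict the cheapest transaction
            del self.txs[lowest_hash]
        heapq.heappush(self.heap, (fee, tx_hash))
        self.txs[tx_hash] = payload
        return True

pool = TxPool(max_size=2)
print(pool.add("a", fee=10, payload=b"..."))  # True
print(pool.add("b", fee=1,  payload=b"..."))  # True
print(pool.add("c", fee=5,  payload=b"..."))  # True, evicts "b"
print(pool.add("d", fee=1,  payload=b"..."))  # False, fee too low
```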

In blockchains, the client sends a transaction to whichever node it likes; the hash of the transaction is usually known to the client before sending, so all it needs to do is establish a connection and, after transmission, wait for the blockchain to change its state to include the transaction. Note that when measuring "TPS" you can get completely different results for different ways of connecting to a blockchain node. It can be a regular HTTP RPC or a WebSocket that allows implementing the "subscribe" pattern. In the second case, the client receives the notification earlier, and the node spends fewer resources (mainly memory and traffic) on transaction status responses. So when measuring "TPS", you need to take into account how clients connect to the nodes. Therefore, to assess the risk of this bottleneck, a blockchain benchmark must be able to emulate clients using both WebSocket and HTTP RPC requests, in proportions corresponding to real networks, and also vary the nature and size of transactions.
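
The difference between the two connection styles can be sketched like this; the node interface used here (get_tx_status, subscribe_tx) is a hypothetical stand-in for whatever RPC a concrete blockchain exposes, and the fake node simply pretends the transaction gets included after 0.7 s:

```python
import asyncio
import time

class FakeNode:
    """Pretends the transaction is included ~0.7 s after the client starts waiting."""
    def __init__(self):
        self.included_at = time.monotonic() + 0.7

    async def get_tx_status(self, tx_hash: str) -> str:
        return "included" if time.monotonic() >= self.included_at else "pending"

    async def subscribe_tx(self, tx_hash: str):
        await asyncio.sleep(max(0.0, self.included_at - time.monotonic()))
        yield {"status": "included"}          # the node pushes the event itself

async def wait_by_polling(node, tx_hash, interval=0.5):
    """Repeatedly ask the node: simple, but wasteful for the node, and the
    client learns about inclusion up to `interval` seconds late."""
    while (await node.get_tx_status(tx_hash)) != "included":
        await asyncio.sleep(interval)

async def wait_by_subscription(node, tx_hash):
    """Subscribe once and wait for the push notification."""
    async for event in node.subscribe_tx(tx_hash):
        if event["status"] == "included":
            return

async def main():
    start = time.monotonic()
    await wait_by_subscription(FakeNode(), "0xabc")
    print("via subscription:", round(time.monotonic() - start, 2), "s")  # ~0.7 s
    start = time.monotonic()
    await wait_by_polling(FakeNode(), "0xabc")
    print("via polling:     ", round(time.monotonic() - start, 2), "s")  # ~1.0 s

asyncio.run(main())
```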

To assess the risks of this bottleneck, you also need to collect metrics from client machines, and not just from blockchain nodes.

Transfer of transactions and blocks over the p2p network

Blockchains use peer-to-peer (p2p) networking to transfer transactions and blocks between participants. Transactions propagate through the network, starting from one of the nodes, until they reach the block producer peers, which pack transactions into blocks and, using the same p2p network, distribute new blocks to all nodes. The basis of most modern p2p networks is some modification of the Kademlia protocol. Here is a good overview of that protocol, and here is an article with various measurements in the BitTorrent network, from which one can see that this kind of network is more complex and less predictable than the rigidly configured network of a centralized service. Also, here is an article about measuring various interesting metrics of Ethereum nodes.
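
For reference, here is a minimal sketch of the XOR metric at the heart of Kademlia, with node IDs shortened to 32 bits for readability (real networks use 160- or 256-bit identifiers):

```python
def xor_distance(a: int, b: int) -> int:
    """The Kademlia "distance" between two node IDs is their bitwise XOR."""
    return a ^ b

def bucket_index(own_id: int, peer_id: int) -> int:
    """Index of the k-bucket this peer belongs to (position of the highest differing bit)."""
    return xor_distance(own_id, peer_id).bit_length() - 1

own = 0b1011_0000_0000_0000_0000_0000_0000_0001
near = own ^ 0b0000_0100                       # differs only in a low bit
far = own ^ (1 << 31)                          # differs in the highest bit

print(bucket_index(own, near))                 # 2: a "close" peer
print(bucket_index(own, far))                  # 31: a "far" peer
```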

In short, each peer in such a network maintains its own dynamic list of other peers from which it requests blocks of information, addressable by content. On receiving a request, a peer either returns the needed information or passes the request on to the next pseudo-random peer from its list; after receiving the answer, it passes it back to the requester and caches it for a while, so it can serve that piece of information faster next time. As a result, popular information ends up in a large number of caches on a large number of peers, while unpopular information is gradually evicted. Peers keep track of who sent information to whom, and the network tries to reward active distributors by raising their rating and giving them a higher level of service, while automatically excluding inactive participants from peer lists.

So, the transaction now needs to propagate across the network so that block producers can see it and include it in a block. The node actively "distributes" the new transaction to everyone and listens to the network, waiting for a block whose index contains the desired transaction, in order to notify the waiting client. The time it takes for nodes to pass information about new transactions and blocks to each other in a p2p network depends on a very large number of factors: the number of honest nodes working nearby (from a network point of view), how warmed up their caches are, the size of blocks and transactions, the nature of the changes, network geography, the number of nodes, and much more. Comprehensive measurement of performance metrics in such networks is a difficult task: you need to evaluate request processing times simultaneously on clients and on peers (blockchain nodes). Problems in any of the p2p mechanisms, incorrect data eviction and caching, inefficient management of active peer lists, and many other factors can cause delays that affect the efficiency of the whole network, and this bottleneck is the hardest one to analyze, test, and interpret.
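
One possible way to quantify propagation, assuming you can extract "first seen" timestamps from node logs or metrics (the data layout below is a made-up example), is to compare them with the injection time and look at the tail percentiles:

```python
from statistics import quantiles

# first_seen[tx_hash][node_id] = unix time the node first saw the transaction
first_seen = {
    "0xaaa": {"node-1": 10.00, "node-2": 10.08, "node-3": 10.91},
    "0xbbb": {"node-1": 12.50, "node-2": 12.52, "node-3": 12.55},
}
injected_at = {"0xaaa": 9.98, "0xbbb": 12.49}   # when the client sent each tx

delays = sorted(
    seen - injected_at[tx]
    for tx, nodes in first_seen.items()
    for seen in nodes.values()
)

deciles = quantiles(delays, n=10)               # 9 cut points: deciles[4] ~ p50, deciles[8] ~ p90
print(f"p50={deciles[4]:.3f}s  p90={deciles[8]:.3f}s  max={max(delays):.3f}s")
```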

Processing the chain of blocks and updating the state database

The most important part of a blockchain is the consensus algorithm: applying it to new blocks received from the network, and processing transactions with the results recorded in the state database. Adding a new block to the chain and then selecting the main chain should work as quickly as possible. However, in real life "should" does not mean "does": one can, for example, imagine a situation where two long competing chains keep switching places, changing the metadata of thousands of transactions in the pool and forcing constant rollbacks of the state database on every switch. In terms of locating a bottleneck, this stage is simpler than the p2p network layer, because transaction execution and the consensus algorithm are strictly deterministic, which makes everything here easier to measure.
The main thing is not to confuse a random performance degradation at this stage with network problems: nodes become slower at serving blocks and information about the main chain, and to an external client this can look like a slow network even though the problem lies somewhere else entirely.
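
A toy sketch of a naive longest-chain fork choice makes the cost of such switching visible: every switch means rolling back and re-applying blocks, and with them the state database (the block-count "cost" below is a deliberate simplification):

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def switch_cost(current_chain, candidate_chain):
    """(blocks to roll back, blocks to apply) when switching to the candidate."""
    shared = common_prefix_len(current_chain, candidate_chain)
    return len(current_chain) - shared, len(candidate_chain) - shared

current = ["g", "a1", "a2", "a3"]
candidate = ["g", "a1", "b2", "b3", "b4"]      # longer competing fork

if len(candidate) > len(current):              # naive longest-chain rule
    rollback, apply = switch_cost(current, candidate)
    print(f"switch head: roll back {rollback} blocks, apply {apply}")  # 2 and 3
```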

To optimize performance at this stage, it is useful to collect and monitor metrics from the nodes themselves, including ones related to updating the state database: the number of blocks processed on the node, their size, the number of transactions, the number of switches between chain forks, the number of invalid blocks, virtual machine uptime, data commit time, and so on. This helps avoid confusing network problems with errors in chain processing algorithms.
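
As a sketch of what such node-side metrics could look like, here is a minimal example using the prometheus_client library; the metric names and the hook where they are incremented are illustrative, not taken from any particular node implementation:

```python
from prometheus_client import Counter, Histogram, start_http_server

BLOCKS_PROCESSED = Counter("blocks_processed_total", "Blocks applied to the state DB")
INVALID_BLOCKS = Counter("invalid_blocks_total", "Blocks rejected as invalid")
FORK_SWITCHES = Counter("fork_switches_total", "Switches between chain forks")
BLOCK_TXS = Histogram("block_transactions", "Transactions per block",
                      buckets=(1, 10, 100, 1000, 10000))
COMMIT_SECONDS = Histogram("state_commit_seconds", "Time to commit state changes")

def on_block_applied(block, commit_time_s: float) -> None:
    """Call this from the node's block-processing path (an illustrative hook)."""
    BLOCKS_PROCESSED.inc()
    BLOCK_TXS.observe(len(block["txs"]))
    COMMIT_SECONDS.observe(commit_time_s)

if __name__ == "__main__":
    start_http_server(9100)       # expose /metrics for a scraper; a real node keeps running
    on_block_applied({"txs": ["t1", "t2"]}, commit_time_s=0.012)
```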

The virtual machine that processes transactions can be a useful source of information for optimizing the blockchain's operation. The number of memory allocations, the number of read/write instructions, and other metrics of contract execution efficiency can give developers a lot of useful information. At the same time, smart contracts are programs, which means that in theory they can consume any resource: CPU, memory, network, storage. Transaction processing is therefore a rather open-ended stage which, in addition, changes significantly between versions and when contract code changes. That is why metrics related to transaction processing are also needed to optimize blockchain performance effectively.
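
To illustrate the idea, here is a toy stack machine that counts reads, writes, allocations and arithmetic while executing a tiny program; real contract VMs expose far richer counters, but even this level shows which contracts are storage-heavy and which are compute-heavy:

```python
from collections import Counter

def run(program, storage):
    """Execute a tiny (op, arg) program and return per-resource counters."""
    counters = Counter()
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
            counters["alloc"] += 1
        elif op == "SLOAD":                  # read a storage slot
            stack.append(storage.get(arg, 0))
            counters["read"] += 1
        elif op == "SSTORE":                 # write the top of stack to a slot
            storage[arg] = stack.pop()
            counters["write"] += 1
        elif op == "ADD":
            stack.append(stack.pop() + stack.pop())
            counters["cpu"] += 1
    return counters

program = [("SLOAD", "balance"), ("PUSH", 5), ("ADD", None), ("SSTORE", "balance")]
print(run(program, storage={"balance": 10}))   # one read, one alloc, one add, one write
```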

The client receives a notification that the transaction is included in the blockchain

This is the final stage of the blockchain client receiving its service. Compared to the other stages, the overhead here is small, but it is still worth considering the possibility of the client receiving a large response from the node (for example, from a smart contract that returns an array of data). In any case, this is exactly the moment that matters most to the person who asked "how many TPS does your blockchain do?": this is when the time of receiving the service is fixed.

It is here that the full time the client had to spend waiting for a response from the blockchain is recorded. This is the time the user will wait for confirmation in their application, and optimizing it is the developers' main task.

Conclusion

To sum up, the types of operations performed in blockchains can be divided into several categories:

  1. cryptographic transformations, construction of proofs
  2. peer-to-peer networking, transaction and block replication
  3. transaction processing, execution of smart contracts
  4. applying changes in the blockchain to the state database, updating transaction and block data
  5. read-only state database requests, blockchain node APIs, subscription services

In general, the technical requirements for the nodes of modern blockchains are quite demanding: fast CPUs for cryptography, a large amount of RAM to store and quickly access the state database, network stacks handling a large number of simultaneously open connections, and voluminous storage. Such high requirements and the abundance of different types of operations inevitably mean that node resources may run short, and then any of the stages discussed above can become yet another bottleneck for overall network performance.

When designing blockchains and evaluating their performance, you have to take all of these points into account. To do so, you need to collect and analyze metrics from clients and network nodes simultaneously, look for correlations between them, evaluate the time it takes to provide the service to clients, and account for all the main resources (CPU, memory, network, storage), understanding how they are used and how they affect each other. All of this makes comparing the speed of different blockchains by "how many TPS" an extremely thankless task, since there is a huge number of possible configurations and states. In large centralized systems, clusters of hundreds of servers, these problems are also complex and also require collecting a large number of different metrics, but in blockchains, thanks to p2p networks, virtual machines processing contracts, and the internal economy, the number of degrees of freedom is far greater, which makes a test even on several servers unrepresentative, showing only very rough values that have almost no connection to reality.

Therefore, when developing the core of a blockchain, to assess performance and answer the question "has it improved since last time", we use fairly complex software that orchestrates the launch of a blockchain with dozens of nodes, automatically runs a benchmark, and collects metrics; without this information it is extremely difficult to debug protocols that involve multiple participants.

So, when you get the question "how many TPS does your blockchain do?", offer your interlocutor some tea and ask whether they are ready to look at a dozen graphs and listen to a whole heap of blockchain performance problems along with your suggestions for solving them…

Source: habr.com
