Performance of Linux network applications. Introduction

Web applications are now ubiquitous, and among all application protocols, HTTP takes the lion's share. When studying the nuances of developing web applications, most people pay very little attention to the operating system these applications actually run on. The separation of development (Dev) and operations (Ops) only made matters worse. But with the spread of the DevOps culture, developers are becoming responsible for running their applications in the cloud, so it is very useful for them to become familiar with the inner workings of the operating system. This is especially useful if you are trying to deploy a system for thousands or tens of thousands of concurrent connections.

The constraints on web services are very similar to those on other network applications. Whether it's a load balancer or a database server, all of these applications face similar problems in a high-performance environment. Understanding these fundamental limitations and how to overcome them in general will let you evaluate the performance and scalability of your web applications.

I am writing this series of articles in response to questions from young developers who want to become well-informed system architects. It is impossible to clearly understand Linux application optimization techniques without diving into the basics of how they work at the operating system level. Although there are many types of applications, in this series I want to explore network applications rather than desktop applications such as a browser or a text editor. This material is intended for developers and architects who want to understand how Linux or Unix programs work and how to structure them for high performance.

Linux is the operating system of the server room, and most often your applications run on this OS. Although I say "Linux", most of the time you can safely assume I mean all Unix-like operating systems in general. However, I have not tested the accompanying code on other systems, so if you are interested in FreeBSD or OpenBSD, your results may differ. When I try something Linux-specific, I point it out.

While you could use what you learn here to build a superbly optimized application from scratch, it's best not to. If you write a new web server in C or C++ for your organization's business application, it might be your last day on the job. However, knowing the structure of these applications will help in choosing existing programs. You will be able to compare process-based systems with thread-based and event-based ones. You will understand and appreciate why Nginx performs better than Apache httpd, and why a Tornado-based Python application can serve more users than a Django-based one.

ZeroHTTPd: A Learning Tool

ZeroHTTPd is a web server I wrote from scratch in C as a learning tool. It has no external dependencies, including for access to Redis: we implement our own Redis routines (see below for details).

While we could discuss theory at length, there is nothing better than writing code, running it, and comparing all the server architectures. This is the most illustrative method. Therefore, we will write a simple ZeroHTTPd web server using each model: process-based, thread-based, and event-based. Then we will test each of these servers and see how they perform compared to each other. ZeroHTTPd is implemented in a single C file. The event-based server includes uthash, a great hash table implementation that comes in a single header file. In the other cases there are no dependencies, so as not to complicate the project.

There are a lot of comments in the code to help understanding. Being a simple web server in a few lines of code, ZeroHTTPd is also a minimal web development framework. It has limited functionality, but it is capable of serving static files and very simple "dynamic" pages. I must say that ZeroHTTPd is well suited for learning how to build high-performance Linux applications. By and large, most web services wait for requests, validate them, and process them. This is exactly what ZeroHTTPd will do. It is a learning tool, not a production tool. It is not very good at handling errors and is unlikely to boast of best security practices (oh yes, I used strcpy) or clever C tricks. But I hope it does its job well.
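The wait-validate-process cycle starts with parsing a request line such as `GET /guestbook HTTP/1.0`. As a minimal sketch of that step (illustrative only, not the actual ZeroHTTPd code), it can look like this:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch, not the actual ZeroHTTPd code: split an HTTP
 * request line such as "GET /guestbook HTTP/1.0" into a method and a
 * path, with bounded writes to avoid the strcpy-style pitfalls
 * mentioned above. Returns 0 on success, -1 on a malformed line. */
static int parse_request_line(const char *line,
                              char *method, size_t method_len,
                              char *path, size_t path_len)
{
    /* Build a bounded sscanf format like "%7s %63s" at runtime so the
     * buffer sizes stay in one place. */
    char fmt[32];
    snprintf(fmt, sizeof(fmt), "%%%zus %%%zus",
             method_len - 1, path_len - 1);
    return sscanf(line, fmt, method, path) == 2 ? 0 : -1;
}
```

A real server would then validate the method and dispatch on the path; here the point is only how little machinery the happy path needs.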

[Screenshot: ZeroHTTPd main page. It can output different types of files, including images]

Guest book application

Modern web applications are usually not limited to static files. They have complex interactions with various databases, caches, etc. Therefore, we will create a simple web application called "Guestbook", where visitors leave entries under their names. The guest book keeps the entries left earlier. There is also a visitor counter at the bottom of the page.

[Screenshot: ZeroHTTPd Guestbook web application]

The visitor counter and guestbook entries are stored in Redis. For communication with Redis, our own procedures are implemented; they do not depend on an external library. I am not a big fan of rolling out homebrew code when there are public and well-tested solutions. But the goal of ZeroHTTPd is to study Linux performance, and access to external services while serving HTTP requests significantly affects performance. We must have full control over communication with Redis in each of our server architectures: in one architecture we use blocking calls, in others we use event-based procedures. Using an external Redis client library would not give us this control. Also, our little Redis client only performs a few functions (getting, setting, and incrementing a key; getting and appending to an array). In addition, the Redis protocol is extremely elegant and simple; you don't even need to study it specially. The very fact that our client does all its work in about a hundred lines of code speaks to how well thought out the protocol is.
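To give a feel for how simple the protocol is, here is a hedged sketch (not the actual ZeroHTTPd client) of encoding a command such as `INCR visitors` in RESP, the Redis serialization protocol, where a command is an array of bulk strings:

```c
#include <stdio.h>
#include <string.h>

/* Sketch of a RESP command encoder; illustrative only, the real
 * ZeroHTTPd routines also do the socket I/O and reply parsing.
 * A command is an array header ("*<argc>\r\n") followed by one bulk
 * string ("$<len>\r\n<data>\r\n") per argument. Returns the number
 * of bytes written, assuming buf is large enough. */
static int resp_encode_command(char *buf, size_t buflen,
                               int argc, const char **argv)
{
    int n = snprintf(buf, buflen, "*%d\r\n", argc);
    for (int i = 0; i < argc; i++)
        n += snprintf(buf + n, buflen - n, "$%zu\r\n%s\r\n",
                      strlen(argv[i]), argv[i]);
    return n;
}
```

Incrementing the visitor counter then reduces to writing `*2\r\n$4\r\nINCR\r\n$8\r\nvisitors\r\n` to the Redis socket and reading one reply line back.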

The following figure shows the application's actions when a client (browser) requests the /guestbook URL.

[Figure: Mechanism of the guestbook application]

When the guestbook page needs to be served, there is one call to the file system to read the template into memory, and three network calls to Redis. The template file contains most of the HTML content for the page in the screenshot above, with special placeholders for the dynamic parts of the content: the posts and the visitor counter. We get these from Redis, insert them into the page, and serve the fully formed content to the client. The third call to Redis could be avoided, because Redis returns the new value of a key when it is incremented. However, for our server with an asynchronous event-based architecture, many network calls are a good test for educational purposes. So we discard the visitor count that Redis returns and query it with a separate call.
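The template step described above boils down to a substitution pass. Here is a hypothetical helper illustrating it; the placeholder syntax `{{count}}` is an assumption for the sketch, not the actual ZeroHTTPd template syntax:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of the templating step: copy the template into
 * out, replacing the first occurrence of placeholder with value. The
 * real server does this for both the posts and the visitor counter.
 * Returns 0 on success, -1 if the placeholder is missing or the
 * output buffer is too small. */
static int render_template(const char *tmpl, const char *placeholder,
                           const char *value, char *out, size_t outlen)
{
    const char *hit = strstr(tmpl, placeholder);
    if (hit == NULL)
        return -1;
    int n = snprintf(out, outlen, "%.*s%s%s",
                     (int)(hit - tmpl), tmpl,      /* text before */
                     value,                        /* dynamic part */
                     hit + strlen(placeholder));   /* text after  */
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```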

ZeroHTTPd Server Architectures

We are building seven versions of ZeroHTTPd with the same functionality but different architectures:

  • Iterative server
  • Fork server (one child process per request)
  • Pre-fork server (processes forked in advance)
  • Threaded server (one thread per request)
  • Pre-threaded server (threads created in advance)
  • poll()-based architecture
  • epoll-based architecture
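To make the differences concrete before the detailed parts, here is a minimal sketch of the pattern at the heart of the fork server: after accept(), the parent forks and the child handles the connection. The connection handling is stubbed out with a callback so the pattern itself is visible; this is illustrative, not the actual ZeroHTTPd code:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork-per-request skeleton. In a real server the handler would read
 * the request from the accepted socket and write the response; here
 * it just returns a status. The parent waits synchronously only to
 * observe the result; a real fork server loops back to accept() and
 * reaps children asynchronously via SIGCHLD. */
static int demo_handler(void) { return 7; }  /* stand-in for "serve one request" */

static int handle_in_child(int (*handler)(void))
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(handler());         /* child: serve the request, then exit */
    int status = 0;
    waitpid(pid, &status, 0);     /* parent: reap the child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The pre-fork, threaded, and pre-threaded variants change only who runs the handler; the event-based variants eliminate the per-request process or thread entirely.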

We measure the performance of each architecture by loading the server with HTTP requests. For the comparison of architectures with a higher degree of parallelism, the number of concurrent requests increases. We run each test three times and calculate the average.

Testing Methodology

[Figure: ZeroHTTPd load testing setup]

It is important that all the components do not run on the same machine during the tests: otherwise the OS incurs additional scheduling overhead as the components compete for the CPU. Measuring the operating system's overhead under each of the chosen server architectures is one of the most important goals of this exercise, and adding more variables would be detrimental to it. Therefore, the setup in the figure above works best.

What each of these servers does:

  • load.unixism.net: here we run ab, the Apache Benchmark utility. It generates the load needed to test our server architectures.
  • nginx.unixism.net: sometimes we want to run more than one instance of a server program. To do this, an Nginx server with the appropriate settings works as a load balancer, distributing the requests coming from ab across our server processes.
  • zerohttpd.unixism.net: This is where we run our server programs on seven different architectures, one at a time.
  • redis.unixism.net: This server runs the Redis daemon where the guestbook entries and the visitor counter are stored.

All server programs run on the same processor core. The idea is to evaluate the maximum performance of each architecture. Since all server programs are tested on the same hardware, this gives a baseline for comparing them. My test setup consists of virtual servers leased from DigitalOcean.

What are we measuring?

You can measure different indicators. We evaluate the performance of each architecture in a given configuration by loading the servers with requests at different levels of concurrency: the load grows from 20 to 15,000 concurrent users.

Test results

The following charts show the performance of the servers on different architectures at different levels of concurrency: on the y-axis, the number of requests per second; on the x-axis, parallel connections.

[Charts: requests per second vs. parallel connections for each architecture]

Below is a table with the results.

requests per second

parallelism  iterative  fork          pre-fork      threaded      pre-threaded  poll          epoll
20           7          112           2100          1800          2250          1900          2050
50           7          190           2200          1700          2200          2000          2000
100          7          245           2200          1700          2200          2150          2100
200          7          330           2300          1750          2300          2200          2100
300          -          380           2200          1800          2400          2250          2150
400          -          410           2200          1750          2600          2000          2000
500          -          440           2300          1850          2700          1900          2212
600          -          460           2400          1800          2500          1700          2519
700          -          460           2400          1600          2490          1550          2607
800          -          460           2400          1600          2540          1400          2553
900          -          460           2300          1600          2472          1200          2567
1000         -          475           2300          1700          2485          1150          2439
1500         -          490           2400          1550          2620          900           2479
2000         -          350           2400          1400          2396          550           2200
2500         -          280           2100          1300          2453          490           2262
3000         -          280           1900          1250          2502          large spread  2138
5000         -          large spread  1600          1100          2519          -             2235
8000         -          -             1200          large spread  2451          -             2100
10 000       -          -             large spread  -             2200          -             2200
11 000       -          -             -             -             2200          -             2122
12 000       -          -             -             -             970           -             1958
13 000       -          -             -             -             730           -             1897
14 000       -          -             -             -             590           -             1466
15 000       -          -             -             -             532           -             1281

It can be seen from the charts and the table that above 8000 simultaneous requests only two players remain: pre-threaded and epoll. As the load grows, the poll-based server performs worse than the threaded one. The pre-threaded architecture is a worthy competitor to epoll, a testament to how well the Linux kernel schedules large numbers of threads.

ZeroHTTPd Source Code

The ZeroHTTPd source code is here. There is a separate directory for each architecture.

ZeroHTTPd
├── 01_iterative
│   └── main.c
├── 02_forking
│   └── main.c
├── 03_preforking
│   └── main.c
├── 04_threading
│   └── main.c
├── 05_prethreading
│   └── main.c
├── 06_poll
│   └── main.c
├── 07_epoll
│   └── main.c
├── Makefile
├── public
│   ├── index.html
│   └── tux.png
└── templates
    └── guestbook
        └── index.html

In addition to the seven directories for the architectures, there are two more in the top-level directory: public and templates. The first contains the index.html file and the image from the first screenshot. You can put other files and folders there, and ZeroHTTPd should serve those static files without any problems. If the path in the browser corresponds to a directory in the public folder, then ZeroHTTPd looks for an index.html file in that directory. The guestbook content is generated dynamically: it has only a main page, and its content is based on the templates/guestbook/index.html file. ZeroHTTPd is easy to extend with dynamic pages: the idea is that users can add templates to this directory and extend ZeroHTTPd as needed.
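The static-file lookup just described can be sketched as a small path-mapping helper (illustrative, not the actual ZeroHTTPd code; a real server must also reject ".." traversal and check that the file exists):

```c
#include <stdio.h>
#include <string.h>

/* Map a request path to a file under public/: directory-style paths
 * (empty or ending in '/') get index.html appended, as described
 * above. Returns 0 on success, -1 if the result does not fit. */
static int resolve_static_path(const char *url_path,
                               char *out, size_t outlen)
{
    size_t len = strlen(url_path);
    int is_dir = (len == 0 || url_path[len - 1] == '/');
    int n = snprintf(out, outlen, "public%s%s",
                     url_path, is_dir ? "index.html" : "");
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```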

To build all seven servers, run make all from the top-level directory; all seven executables will appear in this directory. The executables look for the public and templates directories in the directory they are run from.

Linux API

You don't need to be well versed in the Linux API to understand the information in this series. However, I recommend reading more on the topic; there are many reference resources on the web. Although we will touch on several categories of the Linux API, our focus will mainly be on processes, threads, events, and the networking stack. In addition to books and articles about the Linux API, I also recommend reading the man pages for the system calls and library functions used.

Performance and scalability

One note about performance and scalability: theoretically, there is no connection between them. You can have a web service that performs very well, with a response time of a few milliseconds, but does not scale at all. Similarly, there might be a poorly performing web application that takes a few seconds to respond but scales out to handle tens of thousands of concurrent users. Still, the combination of high performance and scalability is very powerful. High-performance applications generally use resources sparingly and thus efficiently serve more concurrent users per server, reducing costs.

CPU and I/O Tasks

Finally, in computing there are two possible kinds of tasks: I/O-bound and CPU-bound. Receiving requests over the Internet (network I/O), serving files (network and disk I/O), and communicating with the database (network and disk I/O) are all I/O activities. Some database queries can be somewhat CPU-intensive (sorting, averaging a million results, and so on). Most web applications are I/O-bound, and the processor is rarely used to its full capacity. When you see an I/O-bound task using a lot of CPU, it is most likely a sign of poor application architecture: it may mean that CPU resources are being wasted on process management and context switching, which is not useful work. If you are doing something like image processing, audio file conversion, or machine learning, then the application genuinely requires heavy CPU resources. But for most applications this is not the case.

Learn more about server architectures

  1. Part I. Iterative Architecture
  2. Part II. Fork servers
  3. Part III. Pre-fork servers
  4. Part IV. Servers with threads
  5. Part V. Servers with Pre-Threading
  6. Part VI. poll-based architecture
  7. Part VII. epoll based architecture

Source: habr.com
