SRE: Performance Analysis. A Tuning Method Using a Simple Web Server in Go

Performance analysis and tuning is a powerful tool for verifying that performance matches customer expectations.

Performance analysis can be used to find bottlenecks in a program by applying a scientific approach to tuning experiments. This article defines a general approach to performance analysis and tuning, using a Go web server as an example.

Go is particularly well suited here, as it ships with a profiling tool, pprof, in the standard library.


Strategy

Let's outline the approach for our analysis. We will aim to make decisions based on data rather than on intuition or guesswork. To do that, we will:

  • Define the boundaries of optimization (the requirements);
  • Calculate the transactional load on the system;
  • Execute the test (generate data);
  • Observe;
  • Analyze whether all requirements are met;
  • If not, apply the scientific method: form a hypothesis;
  • Run an experiment to test that hypothesis.


Simple HTTP Server Architecture

For this article, we will be using a small Golang HTTP server. All code from this article can be found here.

The application being analyzed is an HTTP server that queries PostgreSQL on every request. Alongside it run Prometheus, node_exporter, and Grafana for collecting and displaying application and system metrics.


For simplicity (and to make the horizontal-scaling calculations easier), we assume that each service instance is deployed together with its database.


Defining goals

At this step, we define the goal: what are we trying to analyze, and how do we know when we are done? In this article, we will imagine that we have clients and that our service must handle 10,000 requests per second.

The Google SRE Book covers methods of selection and modeling in detail. Let's do the same and build our models:

  • Latency: 99% of requests must complete in under 60ms;
  • Cost: the service should consume the minimum amount of money we consider reasonably possible. To achieve this, we maximize throughput;
  • Capacity planning: understand and document how many application instances will need to run, including overall scaling behavior, and how many instances are required to meet the initial load plus n+1 redundancy.

Latency may require optimization in addition to analysis, but throughput clearly needs to be analyzed. In the SRE SLO process, the latency target comes from the customer and/or the business, represented by the product owner. And, as we will see, our service meets this obligation from the very beginning without any tuning!

Setting up a test environment

The test environment lets us apply a controlled, dosed load to our system; performance data from the web service will be generated for analysis.

Transaction load

This environment uses Vegeta to generate HTTP requests at a configurable rate until stopped:

$ make load-test LOAD_TEST_RATE=50
echo "POST http://localhost:8080" | vegeta attack -body tests/fixtures/age_no_match.json -rate=50 -duration=0 | tee results.bin | vegeta report

Observation

While the test runs, the transactional load is applied. In addition to application metrics (request count, response latency) and operating-system metrics (memory, CPU, IOPS), the application is profiled to understand where its problems are and how its CPU time is consumed.

Profiling

Profiling is a type of measurement that shows where CPU time goes while an application is running. It lets you determine exactly where, and how much, processor time is being spent.


This data can be used during analysis to get a picture of wasted CPU time and unnecessary work being done. Go (pprof) can generate profiles and render them as flame graphs using the standard toolset. Their usage and setup are covered a bit later in the article.

Execution, observation, analysis

Let's run an experiment. We will execute, observe, and analyze until the performance satisfies us. We pick an arbitrarily low load for the first observations, then increase the load at each subsequent step by some chosen scaling factor. Each load-testing run is performed with the request rate adjusted: make load-test LOAD_TEST_RATE=X.

50 requests per second


Notice the top two graphs. The top left shows that our application processes 50 requests per second (by its own count), and the top right shows the duration of each request. Both parameters help us check whether we fit within our performance boundaries. The red line on the HTTP Request Latency chart marks the 60ms SLO, and we are well below the maximum response time.

Let's look at the cost:

10,000 requests per second ÷ 50 requests per server = 200 servers + 1

We can still improve this figure.

500 requests per second

More interesting things start to happen when the load reaches 500 requests per second.


Again, the upper-left graph shows that the application handles the given load; if it did not, there would be a problem on the server running the application. The response-latency graph at the top right shows that 500 requests per second produce latencies of 25-40ms. The 99th percentile still fits comfortably within the 60ms SLO chosen above.

In terms of cost:

10,000 requests per second ÷ 500 requests per server = 20 servers + 1

Still can be improved.

1000 requests per second


A great run! The application reports that it processed 1000 requests per second, but the latency SLO was violated. This is visible from the p99 line in the top-right graph: actual latencies exceed the 60ms maximum, and the p100 line is far higher still. Let's dive into profiling to find out what the application is actually doing.

Profiling

For profiling, we keep the load at 1000 requests per second and use pprof to capture data showing where the application spends its CPU time. This is done by activating pprof's HTTP endpoint and, while under load, saving the results with curl:

$ curl http://localhost:8080/debug/pprof/profile?seconds=29 > cpu.1000_reqs_sec_no_optimizations.prof

The results can be displayed like this:

$ go tool pprof -http=:12345 cpu.1000_reqs_sec_no_optimizations.prof


The flame graph shows where, and how much, CPU time the application spends. From Brendan Gregg's description:

The x-axis shows the stack profile population, sorted alphabetically (it is not the passage of time); the y-axis shows stack depth, counting from zero at the bottom. Each rectangle is a stack frame. The wider a frame, the more often it was present in the stacks. The top edge shows what is on-CPU, and beneath it is its ancestry. Colors are usually not significant and are picked randomly to differentiate frames.

Analysis - hypothesis

For tuning, we will focus on finding wasted CPU time: we look for the largest sources of waste and eliminate them. Since profiling reveals very precisely where the application spends its CPU time, this may take several iterations, and you will also need to change the application source code, rerun the tests, and check that performance is approaching the target.

Following Brendan Gregg's recommendations, we read the chart from top to bottom. Each line displays a stack frame (a function call). The first line is the program's entry point, the parent of all other calls (in other words, all other calls have it on their stack). The next line is already different.


If you hover over a function name on the graph, the total time it was on the stack during profiling is displayed. The HTTPServe function was there 65% of the time; other runtime functions (runtime.mcall, mstart, and gc) took up the rest. Interesting fact: 5% of the total time is spent on DNS queries.


The addresses the program looks up belong to PostgreSQL. Click on FindByAge.


Interestingly, the program shows three main sources of added latency: opening/closing connections, querying the data, and connecting to the database. The graph shows that DNS queries plus opening and closing connections account for about 13% of total execution time.

Hypothesis: reusing connections via a pool should reduce the time of a single HTTP request, allowing higher throughput and lower latency.

Application setup - experiment

We update the source code to remove the per-request connection to PostgreSQL. The first option is to use a connection pool at the application level. In this experiment, we configure connection pooling using Go's sql driver:

db, err := sql.Open("postgres", dbConnectionString)
if err != nil {
   return nil, err
}
db.SetMaxOpenConns(8)

Execution, observation, analysis

After rerunning the test at 1000 requests per second, p99 latency is back within the 60ms SLO!

What's the cost?

10,000 requests per second ÷ 1,000 requests per server = 10 servers + 1

Let's make it even better!

2000 requests per second


Doubling the load shows the same picture: the upper-left graph shows the application handling 2000 requests per second, p100 is below 60ms, and p99 satisfies the SLO.

In terms of cost:

10,000 requests per second ÷ 2,000 requests per server = 5 servers + 1

3000 requests per second


Here the application processes 3000 requests per second with p99 latency under 60ms. The SLO is not violated, and the cost works out as follows:

10,000 requests per second ÷ 3,000 requests per server = 4 servers + 1 (rounded up by the author; translator's note)

Let's try another round of analysis.

Analysis - hypothesis

We capture and display the profiling results for the application at 3000 requests per second.


Still, 6% of the time is spent establishing connections. Tuning the pool has improved performance, but the application evidently continues to create new connections to the database.

Hypothesis: despite the pool, connections are still being dropped and cleaned up, so the application has to re-establish them. Setting the number of idle connections equal to the pool size should help with latency by minimizing the time the application spends creating connections.

Application setup - experiment

Let's try setting MaxIdleConns equal to the pool size (also described here):

db, err := sql.Open("postgres", dbConnectionString)
if err != nil {
   return nil, err
}
db.SetMaxOpenConns(8)
db.SetMaxIdleConns(8)

Execution, observation, analysis

3000 requests per second


p99 is less than 60ms with a much smaller p100!


Checking the flame graph shows that connection establishment is no longer visible! Looking in more detail at pg(*conn).query, we no longer see connection setup here either.


Conclusion

Performance analysis is critical to verifying that customer expectations and non-functional requirements are being met. Comparing observations against customer expectations helps determine what is acceptable and what is not. Go provides powerful tools built into the standard library that make this analysis simple and accessible.

Source: habr.com
