Cloudflare's hours-long outage was the result of improper error handling.

Cloudflare has published a report on one of the largest incidents to affect its infrastructure, which yesterday left much of its content delivery network unavailable for over three hours. The outage followed a change to the structure of a database stored in ClickHouse, which doubled the size of the file containing parameters for the anti-bot system. The change created duplicate tables in the database, and the SQL query used to generate the file simply selected all rows matching the table name, without filtering out duplicates:

SELECT name, type FROM system.columns WHERE table = 'http_requests_features' ORDER BY name;
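The effect of unfiltered duplicates on the generated file can be illustrated with a minimal Rust sketch. It is not Cloudflare's code: the function names and the sample column names and types are hypothetical, and deduplicating through a set is only one possible way to guard the output.

```rust
use std::collections::BTreeSet;

/// Copies every (name, type) row into the output as-is, the way a query
/// without duplicate filtering would. Hypothetical helper for illustration.
fn naive_feature_list(rows: &[(&str, &str)]) -> Vec<(String, String)> {
    rows.iter()
        .map(|(name, ty)| (name.to_string(), ty.to_string()))
        .collect()
}

/// Collects the rows into an ordered set first, so that identical
/// (name, type) pairs appear only once in the output.
fn deduplicated_feature_list(rows: &[(&str, &str)]) -> Vec<(String, String)> {
    let unique: BTreeSet<(String, String)> = rows
        .iter()
        .map(|(name, ty)| (name.to_string(), ty.to_string()))
        .collect();
    unique.into_iter().collect()
}
```

If every row is returned twice, the naive list is twice as long as the deduplicated one, which matches the doubling of the file described above.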


The generated file was distributed to all cluster nodes that process incoming requests. The handler that uses this file to check for bot requests keeps the parameters from the file in RAM, and to guard against excessive memory consumption the code enforced a limit on the maximum file size. Under normal conditions the file was significantly smaller than the limit, but once the tables were duplicated it exceeded the limit.

The problem was that the handler did not deal with the exceeded limit gracefully: instead of continuing to use the previous version of the file and alerting the monitoring system, it crashed, which blocked further traffic forwarding. The crash was caused by calling the unwrap() method on a Result value in the Rust code.
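The behaviour the article describes as correct (reject the oversized file, keep the previous version, raise an alert) might look roughly like the sketch below. Everything in it is an assumption made for illustration: the names FeatureConfig, LoadError, parse_config and reload, the one-parameter-per-line file format, and the 64-byte limit are hypothetical and are not taken from Cloudflare's code.

```rust
use std::fmt;

/// Hypothetical size limit; the real value is not given in the article.
const MAX_FILE_BYTES: usize = 64;

#[derive(Debug, Clone, PartialEq)]
struct FeatureConfig {
    features: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum LoadError {
    TooLarge { size: usize, limit: usize },
}

impl fmt::Display for LoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LoadError::TooLarge { size, limit } => {
                write!(f, "file is {} bytes, limit is {}", size, limit)
            }
        }
    }
}

/// Parses one parameter per line, refusing files above the size limit.
fn parse_config(raw: &str) -> Result<FeatureConfig, LoadError> {
    if raw.len() > MAX_FILE_BYTES {
        return Err(LoadError::TooLarge { size: raw.len(), limit: MAX_FILE_BYTES });
    }
    let features = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    Ok(FeatureConfig { features })
}

/// Returns the config to keep using plus an optional alert message.
/// On failure the previous config is retained instead of panicking.
fn reload(current: FeatureConfig, raw: &str) -> (FeatureConfig, Option<String>) {
    match parse_config(raw) {
        Ok(new_config) => (new_config, None),
        Err(err) => (
            current,
            Some(format!("config rejected, keeping previous version: {}", err)),
        ),
    }
}
```

In this sketch the caller would forward the alert string to whatever monitoring it has; the request path keeps working with the last file that passed the check.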


When a Result holds "Ok", the unwrap() method returns the value it wraps; when it holds "Err", the call terminates abnormally by invoking the "panic!" macro. unwrap() is typically used during debugging or in test code and is not recommended in production code.
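The difference between unwrap() and explicit handling can be shown with a small self-contained sketch. The functions parse_limit, limit_or_panic and limit_or_default are hypothetical examples and have nothing to do with Cloudflare's code; only Result, unwrap() and str::parse come from the Rust standard library.

```rust
use std::num::ParseIntError;

/// Parses a numeric limit from text; returns Err for malformed input.
fn parse_limit(text: &str) -> Result<u32, ParseIntError> {
    text.trim().parse::<u32>()
}

/// Uses unwrap(): returns the value on Ok, panics on Err.
fn limit_or_panic(text: &str) -> u32 {
    parse_limit(text).unwrap()
}

/// Handles both variants explicitly and falls back to a default on Err.
fn limit_or_default(text: &str, default: u32) -> u32 {
    match parse_limit(text) {
        Ok(value) => value,
        Err(_) => default,
    }
}
```

Both functions behave the same on valid input; they differ only when parsing fails, where the first one panics and the second one keeps running with the fallback value.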



Source: opennet.ru