5 Incidents in Distributed Systems and How We Could Avoid Them

Roman Ushakov
16 min read · Oct 20, 2023


When designing and implementing distributed systems, a set of essential patterns comes into play. In this article, we’ll explore three key patterns — timeouts, retries, and circuit breakers — each serving to enhance the system’s robustness and protect it from cascading failures due to individual component issues.

We will look at some real-life examples and see how these incidents could have been avoided by applying software patterns in our code. In all of my examples I describe a system that runs in Kubernetes and is written in PHP, but the practices can be freely applied to any other language as well.

Case 1: When a Simple API Method Was Affected by an Unpredicted Culprit

In one of our projects, we encountered a monolithic application serving as a gateway for specific requests and handling various online store-related logic. This monolith exposed a vast API with complex business logic, numerous dependencies, and even several different authorization methods.

One day, a complaint arrived from one of our partners, reporting an increasing number of requests timing out, taking more than 10 seconds to complete. It was puzzling because the reported method didn’t involve intricate logic; it merely forwarded requests to another of our components. This S2S (Server-to-Server) method simply validated authorization by comparing API keys with hashes from our database.

Upon investigation, we identified the culprit: one of our authorization providers, responsible for authorization of client methods, had started responding slowly. But why did this affect S2S methods?

To grasp the situation, let’s delve into how PHP-FPM operates:

PHP-FPM consists of a master process and a pool of workers, with each worker handling a single request at a time. Requests are distributed among these workers.

When the application handles a client request, it makes an HTTP call to the authorization provider to validate the client’s token.

At some point, possibly due to application bugs or network disruptions, the provider began responding in 5 seconds, instead of the expected 500 milliseconds.

Due to the high load, all workers became occupied while awaiting responses from the remote service. These workers didn’t engage in intensive processing, so they consumed little CPU or memory. Consequently, Kubernetes (K8S) didn’t trigger application scaling (as the resource thresholds weren’t breached), yet the workers remained blocked, rendering the application incapable of handling new requests.

But what happens to new requests? It depends on the configuration, but the default behaviour is to queue them. Consequently, even though authorization was unnecessary for S2S methods, these requests lingered in the queue, leading to escalating latency and SLA violations.
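
To make the “worker pool as a resource” idea concrete, here is a minimal PHP-FPM pool configuration sketch. The directives are real, but the values are illustrative assumptions rather than our production settings:

; www.conf: PHP-FPM pool (illustrative values)
pm = static
; At most 10 requests are processed concurrently; slow upstream calls
; can therefore block the whole pool long before CPU or memory limits are hit.
pm.max_children = 10
; Requests that arrive while all workers are busy wait in the socket backlog.
listen.backlog = 511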

As a result, throughput dropped for almost all of our API methods.

Figure: median response time from the Authorization Provider.
Figure: RPM for different API methods. Note the huge drop in throughput while no PHP-FPM workers were available.

Case 1: Postmortem

The lessons drawn from this scenario are as follows:

  1. Developers should consider not only CPU and memory utilisation but also other resource factors, such as worker pools (e.g., PHP-FPM workers, Nginx workers, RabbitMQ consumers), open ports/sockets, and more.
  2. While implementing K8S scaling based on worker utilisation may offer a quick solution, it’s only a temporary measure that postpones the problem, rather than solving it.
  3. Never wait excessively for a remote service. If your expectations include a 99th percentile response time within 500 milliseconds, set an appropriate timeout instead of relying on chance. In this context, having a 1-second timeout would have resulted in increased latency up to 1–2 seconds, rather than the 5–10 seconds experienced previously.

Timeouts

While we can’t prevent remote services from failing, we can certainly mitigate their impact on our system by making strategic use of timeouts.

When it comes to timeouts, there are two crucial parameters that deserve our attention:

  • connect_timeout: This timeout governs the duration it takes to establish a connection between our application and remote services. It encompasses both TCP and TLS (HTTPS) connection establishment.
  • read_timeout: Once the connection has been successfully established, this timeout kicks in, governing the time allowed for receiving a response from the remote service.

In most cases, configuring these parameters is a straightforward task, often manageable through default HTTP clients. This requires minimal extra effort from the development side, making it a no-brainer addition to your system.

Example using Guzzle for PHP Applications:

// Timeout if the client fails to connect to the server in 3.14 seconds,
// or if no data arrives on the established connection for 10 seconds.
$response = $client->request('GET', '/stream', [
    'stream' => true,
    'read_timeout' => 10,
    'connect_timeout' => 3.14,
]);

$body = $response->getBody();

// Returns false on timeout
$data = $body->read(1024);
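
Guzzle also supports a total timeout option that caps the whole request, connection plus transfer, in one number. A small sketch with illustrative values:

$response = $client->request('GET', '/orders', [
    // Give up on connection establishment after 3.14 seconds.
    'connect_timeout' => 3.14,
    // Abort the whole request (connect + transfer) after 5 seconds.
    'timeout' => 5,
]);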

However, it’s crucial not to limit ourselves to these client-side timeouts alone. Real-life applications often throw non-trivial challenges our way.

Case 2: When Improvement of Security Led to Unforeseen Consequences

Within our system, we facilitate various types of communications, including asynchronous interactions powered by RabbitMQ. This is instrumental for tasks such as sending emails with receipts, notifications, and marketing promotions.

One day, our monitoring system caught our attention with some suspicious logs:

LockException: The code executed for 952.00 seconds. 
But the timeout is 30 seconds. The last 922.00 seconds were executed outside of the lock.

We employ mutexes based on Redis to synchronize the execution of different components. The error above indicated that processing an AMQP (Advanced Message Queuing Protocol) message took longer than our specified 30-second timeout — actually, it took a whopping 15 minutes.

Simultaneously, we began receiving complaints from one of our partners, with their users reporting a 15-minute wait to receive receipts. It appeared that these two issues were interconnected.

Oddly, there had been no changes to the code, and an exhaustive investigation turned up no missed timeouts in the codebase. The only suspect in the picture was an infrastructure update, which also involved updates to our firewalls.

To get to the bottom of this, let’s take a closer look at the order processor’s workflow.

The difference between a PHP-FPM-based deployment and a long-running PHP CLI daemon managed by systemd lies in how they handle resources, especially connections. With PHP-FPM, a worker frees all of a request’s resources once the request has been processed. Consequently, for each new request, a fresh connection must be established (unless persistent connections are in use). In contrast, daemons establish connections only once, on startup or lazily upon first access (with mechanisms for reestablishing closed connections).
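
As a sketch of that lazy, self-healing connection handling (using php-amqplib; the wrapper class itself is hypothetical):

use PhpAmqpLib\Connection\AMQPStreamConnection;

final class LazyAmqpConnection
{
    private ?AMQPStreamConnection $connection = null;

    public function __construct(
        private string $host,
        private int $port,
        private string $user,
        private string $password
    ) {
    }

    public function get(): AMQPStreamConnection
    {
        // Connect on first use and transparently reconnect if the previous
        // connection was closed (by the broker, a firewall, etc.).
        if (null === $this->connection || !$this->connection->isConnected()) {
            $this->connection = new AMQPStreamConnection(
                $this->host,
                $this->port,
                $this->user,
                $this->password
            );
        }

        return $this->connection;
    }
}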

The order processor connects to RabbitMQ and relies on the same connection for publishing multiple events. However, the actual flow is more intricate, with various nodes like switches and firewalls potentially interposed between our daemon and RabbitMQ.

When there’s no activity, connections can be closed, either by the remote server (typical behaviour for databases and other services) or, in our case, by a firewall located between our zones. The firewall used to terminate idle connections by returning a rejection packet (for example, a TCP RST). The daemon would interpret this as a closed connection and promptly reestablish it — a process that worked flawlessly until the recent hardware updates.

At first glance, the sequence diagram might appear identical, but there’s a critical distinction. The firewall no longer returns rejection packets; instead, it simply drops packets originating from our order processor. Given that TCP guarantees delivery, the kernel’s TCP stack keeps retransmitting these packets, by default up to 15 more times on Linux (net.ipv4.tcp_retries2), with exponentially increasing timeouts.

If you examine the default retransmission configuration, you’ll find precisely the same delays that users experienced.
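
As a rough illustration, assuming common Linux defaults of a 200 ms minimum retransmission timeout, a 120 s cap, and net.ipv4.tcp_retries2 = 15 (the measured RTT and your kernel settings will shift the exact numbers):

// Sketch of TCP retransmission backoff with Linux-like defaults (assumed values).
$rto = 0.2;      // initial retransmission timeout, seconds
$total = 0.0;
for ($i = 1; $i <= 15; $i++) {
    $total += $rto;                // wait for the current RTO to expire
    printf("retransmission #%2d at ~%.1f s\n", $i, $total);
    $rto = min($rto * 2, 120.0);   // exponential backoff, capped at 120 s
}
// After the 15th retransmission the kernel waits one final RTO before aborting,
// which lands around 900+ seconds in total: the same order of magnitude as the
// 922-952 second delays from our logs.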

Case 2: Postmortem

Here are some essential takeaways from this case:

  1. Always utilize connect_timeout and read_timeout parameters — they’re indispensable in most scenarios.
  2. Gain a deep understanding of how network protocols operate at a low level; this comprehension helps diagnose unexpected behaviour from its symptoms. Maintain a good relationship with your IT department and be ready to dive deep into the code of third-party libraries, or even to sniff network packets between your servers.
  3. In network communications, be prepared for anything. In high-load systems, anticipate potential issues stemming from each component:
    - TCP retransmission timeouts
    - Time needed to resolve DNS
    - Servers responding slowly by delivering response portions every 1–2 seconds (contributing to overall timeout)
    - And much more.

Case 3: Unlucky Users or Handling Response Time Spikes

In the world of distributed systems, we often encounter network fluctuations that result in occasional “spikes” in response times. For instance, while the majority of our requests to an authorization provider may typically execute in a swift 500 milliseconds, sporadic delays can occur, stretching response times to an inconvenient 15 seconds. The root causes of these delays can be elusive, ranging from DNS resolution hiccups to route resolution challenges.

The most apparent remedy might seem to be reducing the timeout to 500 milliseconds, ensuring users don’t have to endure a 15-second wait only to receive an error. But can we do even better?

Retries

By implementing retries alongside timeouts, we can not only reduce waiting times but also ensure users still receive the expected response, just with slightly higher latency than usual. If the first attempt exceeds its 500-millisecond timeout, we abandon it and retry, and the retry almost certainly completes quickly, so the overall time is typically no more than about 1 second, a significant improvement over the 5–15 seconds we would otherwise wait.
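
To put a rough number on it (treating attempts as independent, which real outages often violate): if a single attempt completes within 500 milliseconds with probability p = 0.99, the chance that both the original attempt and one retry are slow is (1 - p)² = 0.0001, so about 99.99% of requests finish within roughly one second even counting the retry.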

However, there are several important considerations to keep in mind:

  1. Retry Conditions: Retries should not be limited to timeouts alone. Server errors (HTTP status codes in the 5xx range) can also warrant retries. For instance, if your system experiences disruptions during new release deployments, clients might encounter 502 errors. These requests are excellent candidates for retries.
  2. Idempotent Requests: Not all requests are suitable for retries. Non-idempotent requests, such as creating a resource via a POST request in a REST API, can lead to unintended consequences, such as multiple resources being created. To mitigate this risk, you can employ a request_id mechanism — remote backends should reject duplicate requests with the same request_id and return the previous response instead (a minimal server-side sketch follows this list).
  3. Rate Limiting: Be cautious not to overwhelm the remote service with excessive retries. Always impose a maximum limit on the number of retries. Implementing an exponential backoff algorithm with jitter is a good practice to avoid scenarios where multiple clients retry simultaneously.
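
As a minimal server-side sketch of the request_id idea from point 2 (the class and interface names here are hypothetical, not from our codebase):

class IdempotentOrderHandler
{
    public function __construct(
        private CacheInterface $cache,           // hypothetical cache abstraction
        private OrderServiceInterface $service   // hypothetical business service
    ) {
    }

    public function createOrder(string $requestId, array $payload): array
    {
        $cacheKey = 'idempotency:' . $requestId;

        // If this request_id was already handled, replay the stored response
        // instead of creating a second resource.
        $previous = $this->cache->get($cacheKey);
        if (null !== $previous) {
            return $previous;
        }

        $response = $this->service->create($payload);

        // Remember the response long enough to cover the client's retry window.
        // A distributed lock around this block would also close the race between
        // two concurrent requests carrying the same request_id.
        $this->cache->set($cacheKey, $response, 24 * 3600);

        return $response;
    }
}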

In client-server communication, retries can also be managed at the reverse proxy level. For example, in an Nginx configuration, you can specify retries for 5xx errors as follows:

# Give up on connection establishment after 5 seconds.
proxy_connect_timeout 5s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
# Retry on connection errors, timeouts and 502/503/504 responses;
# non_idempotent additionally allows retrying POST-like requests, so use it with care.
proxy_next_upstream error timeout http_502 http_503 http_504 non_idempotent;
# Total time allowed for passing a request to the next upstreams (seconds).
proxy_next_upstream_timeout 1;
# At most 3 attempts in total.
proxy_next_upstream_tries 3;

In PHP, implementing retries can take a form similar to the following code:

class RequestRepeater
{
    private const CACHE_PREFIX = 'x-repeater:';

    public function __construct(
        private RepeaterConfiguration $configuration,
        private CacheInterface $cache
    ) {
    }

    public function withRetries(callable $fn, string $resourceName)
    {
        for ($i = 0; $i <= $this->configuration->retries; $i++) {
            $totalNumberOfRequests = $this->incr($resourceName . '_total');

            try {
                return $fn();
            } catch (\Throwable $exception) {
                $this->checkExceptionRetryable($exception);
            }

            $isLastIteration = $i === $this->configuration->retries;
            if ($isLastIteration) {
                break;
            }

            $retriedNumberOfRequests = $this->incr($resourceName . '_retried');
            if ($this->isRetryBudgetExceeded($retriedNumberOfRequests, $totalNumberOfRequests)) {
                throw new RetryBudgetIsExceededException();
            }

            usleep($this->calculateDelay($i) * 1000);
        }

        throw new MaximumRetriesReachedException();
    }

    private function incr(string $counter): int
    {
        // Implementation of a moving average:
        // for example, you can store the counter in Redis with a 1-minute TTL.
        // It's a good idea to write these counters to your metrics storage
        // for better monitoring.
        return $this->cache->incr(
            self::CACHE_PREFIX . $counter,
            DateProvider::MINUTE
        );
    }

    private function checkExceptionRetryable(\Throwable $exception): void
    {
        // We only want to repeat exceptions related to connection timeouts
        // or server errors; there is no point retrying client errors and the like.
        foreach ($this->configuration->retryExceptionClasses as $retryableException) {
            if ($exception instanceof $retryableException) {
                return;
            }
        }

        throw $exception;
    }

    private function isRetryBudgetExceeded(int $retries, int $total): bool
    {
        // In order not to overwhelm the remote server, and to avoid all workers
        // being blocked by retrying and sleeping, we allow only a certain
        // percentage of requests to be retries.
        return $total > $this->configuration->retryBudgetThreshold &&
            $retries / $total > $this->configuration->retryBudget;
    }

    private function calculateDelay(int $iteration): int
    {
        // Exponential backoff
        $delay = min($this->configuration->maxDelayMs, $this->configuration->baseDelayMs * pow(2, $iteration));

        // Jitter: randomize the delay within +/- jitterShiftPercent of its value
        $jitter = (int) floor($delay * $this->configuration->jitterShiftPercent);

        return mt_rand($delay - $jitter, $delay + $jitter);
    }
}

Incorporating retries intelligently into your system can significantly enhance its robustness, ensuring that sporadic response time spikes do not negatively impact your users’ experience.

Case 4: When We Regretted How Diligently Client Implemented Timeouts

In the world of SaaS providers, Service Level Agreements (SLAs) are a common commitment, encompassing metrics like availability and, significantly, guaranteed response latency for users.

For example, you may have declared that 99.95% of all requests to your API are executed in 1 second.

However, some of your clients may decide that every request should complete within 2 seconds and, as a precaution, set their timeouts to 2 seconds while also implementing retries on their side — an approach that can indeed help them avoid various problems. But the question arises: is this approach beneficial for you as well?

There came a time when we introduced complex personalization logic for one of our stores. Because the implementation was poorly optimized, requests for some users could stretch to a lengthy 10 seconds, a far cry from the SLA’s 1-second target. This deviation resulted in a significant failure.

Let’s delve into how timeouts operate. In the basic flow, the client sends a request, waits up to its timeout, and then aborts (and possibly retries) if no response has arrived in time.

However, this simple picture fails to capture all the important details. In the PHP world we rely on PHP-FPM, which adds complexity to the flow.

The catch is that a retried request is actually handled by another worker (or potentially on another server), while the initial worker remains occupied, even though the client no longer needs the response. Combine this with client-side retries, and all of our workers can quickly end up blocked.

Case 4: Postmortem

Up to this point, we’ve discussed techniques to prevent wasteful resource usage and avoid excessive wait times for remote server responses. However, the roles are now reversed. The client has successfully implemented our techniques, but our server still isn’t operating efficiently.

The truth is, timeouts aren’t exclusive to clients; they can also be applied to our HTTP server, often referred to as a “deadline.” Here are some key considerations:

  • We should not allow a worker to process requests longer than our SLA permits.
  • This becomes especially crucial for short-polling methods, where clients make requests every 1–3 seconds. After the timeout, there’s no point in our server attempting to return a response, as the data would be irrelevant to the client.
  • Implementing backend timeouts requires careful consideration to avoid leaving incomplete operations; all changes should be appropriately rolled back. Achieving this can be particularly challenging in PHP applications.

In PHP applications, given their single-threaded, share-nothing request model, the only options are configuring timeouts in the .ini configuration or in the PHP-FPM pool configuration (e.g., request_terminate_timeout).
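
For example (the values are illustrative; derive yours from the SLA):

; php.ini: caps script execution time, but on Linux it does not count time
; spent waiting on I/O such as slow HTTP or database calls
max_execution_time = 5

; PHP-FPM pool configuration: a hard wall-clock limit after which the worker
; is terminated and the request is aborted
request_terminate_timeout = 5s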

In our codebase, we enclose all requests within a DB transactional middleware to ensure that changes are either committed or rolled back after request handling. If a process is prematurely terminated, the transaction may remain open. Therefore, it’s crucial to register a shutdown function to ensure the proper rollback of changes.
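
A minimal sketch of that safety net using PDO (the surrounding middleware is omitted; only register_shutdown_function matters here):

// In the request bootstrap, right after the transaction is opened.
$pdo->beginTransaction();

register_shutdown_function(static function () use ($pdo): void {
    // If the script ends before the middleware reaches its commit/rollback step
    // (for example on a fatal error or an engine-enforced timeout), make sure
    // the open transaction is rolled back instead of being left dangling.
    if ($pdo->inTransaction()) {
        $pdo->rollBack();
    }
});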

More complex techniques such as deadline propagation can significantly enhance the system’s robustness, particularly when managing a large number of microservices.
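
A sketch of deadline propagation (the X-Request-Deadline header is a made-up convention, not a standard): each service forwards the absolute point in time by which the caller still needs an answer, and sizes its own downstream timeouts from the remaining budget.

// Absolute deadline propagated by the caller (hypothetical header), or our own SLA.
$deadline = isset($_SERVER['HTTP_X_REQUEST_DEADLINE'])
    ? (float) $_SERVER['HTTP_X_REQUEST_DEADLINE']
    : microtime(true) + 1.0;

// Budget left for the downstream call, keeping a small reserve for our own work.
$budget = max(0.05, $deadline - microtime(true) - 0.1);

$response = $client->request('GET', '/tax', [
    'timeout' => $budget,                                      // Guzzle total timeout
    'headers' => ['X-Request-Deadline' => (string) $deadline], // pass the deadline on
]);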

Case 5: Never Hit Someone Who Is Down

During the Christmas sales season, some of our billing components started responding slowly, eventually leading to 5xx errors and, at times, complete unavailability. Even with retries implemented on our side, they proved ineffective when the remote backend became entirely unreachable.

An incident was promptly created, and our Site Reliability Engineering (SRE) team launched an investigation. Attempts to resolve the issue through server reboots yielded no results. While the application would launch, it would inevitably become unresponsive again after a short period.

So, what went wrong?

It turned out that our application employed lazy initialization of the APCu cache upon the first request. This cache stored Dependency Injection (DI) dependencies and commonly used data dictionaries from the database. As the cache needed to hold a substantial amount of data, memory consumption skyrocketed to around 100MB per worker.

With ten PHP workers concurrently handling requests, all attempting to load the cache, memory usage reached a staggering 1GB. Consequently, the container running the application was killed by the Out-Of-Memory (OOM) killer.

Case 5: Postmortem

The problem was clear: clients overwhelmed the application and continued sending requests even though it was already incapacitated and unable to recover. This situation can occur without any programmatic retries, especially during scenarios like users repeatedly clicking a button to claim a free item during a marketing campaign.

To prevent such scenarios, we need a mechanism to halt requests until the application is ready to process them. One effective pattern for this situation is the circuit breaker — a metaphorical emergency seal on a submarine that prevents the entire vessel from sinking due to a leak in one compartment.

Here’s how the circuit breaker pattern works:

  1. After a certain number of failed requests, the circuit breaker returns an error instead of making actual requests to the remote service.
  2. After a timeout, it starts checking whether the remote service has recovered or become responsive again. This can involve running health checks or letting a small fraction of traffic through to the destination. Depending on the results, the circuit breaker transitions between the “Open” and “Closed” states. (Note that in this article, as in the submarine metaphor, “Closed” means the breaker has tripped and is blocking requests; classic descriptions of the pattern call that state “Open”.)

By implementing this pattern, you can reduce unnecessary load on other components. Consider an API method with the following logic:

  1. Check authorization
  2. Retrieve item data from the database
  3. Calculate prices and promotions
  4. Make a request to billing for tax calculation

If the billing service is unavailable, the system would still execute the authorization check, database queries, and complex calculations, even though the final request would ultimately fail. With a circuit breaker in place, you can assess the status of all dependencies before initiating request handling. If the circuit breaker is in the “Closed” state, there’s no point in starting request processing; an error can be returned immediately.

Here’s an example PHP implementation of a circuit breaker:

class CircuitBreaker implements HttpClientInterface
{
    private const CLOSED_TIMEOUT = 5 * DateProvider::MINUTE;
    private const HALF_CLOSED_TIMEOUT = 2 * self::CLOSED_TIMEOUT;

    private const SLIDING_WINDOW_LENGTH = DateProvider::MINUTE;

    private const CLOSE_MINIMUM_REQUESTS = 100;
    private const CLOSE_THRESHOLD = 0.8;

    private ?int $closedAt = null;

    public function __construct(
        private HttpClientInterface $innerClient,
        private DateProviderInterface $dateProvider,
        private \Redis $redis,
        private string $circuitBreakerName
    ) {
    }

    public function doRequest(string $method, string $url, $body): ResponseInterface
    {
        if ($this->isClosed()) {
            return new Response(503);
        }

        if ($this->isHalfClosed()) {
            // Only 5% of requests are actually forwarded to the remote server
            // while the circuit is in the HALF_CLOSED state.
            return mt_rand(1, 100) <= 5
                ? $this->doHalfClosedRequest($method, $url, $body)
                : new Response(503);
        }

        return $this->doOpenRequest($method, $url, $body);
    }

    private function doHalfClosedRequest(string $method, string $url, $body): Response
    {
        $response = $this->innerClient->doRequest($method, $url, $body);
        if ($response->getStatusCode() >= 500) {
            // The probe failed, so keep the circuit closed for another period.
            $this->closedAt = $this->dateProvider->getCurrentTimestamp();
        }

        return $response;
    }

    private function doOpenRequest(string $method, string $url, $body): Response
    {
        $response = $this->innerClient->doRequest($method, $url, $body);
        $totalRequestsNum = $this->incrCounter($this->circuitBreakerName . '_total');
        if ($response->getStatusCode() >= 500) {
            $failedRequestNum = $this->incrCounter($this->circuitBreakerName . '_failed');

            if (
                $totalRequestsNum >= self::CLOSE_MINIMUM_REQUESTS
                && $failedRequestNum / $totalRequestsNum >= self::CLOSE_THRESHOLD
            ) {
                $this->closedAt = $this->dateProvider->getCurrentTimestamp();
            }
        }

        return $response;
    }

    // You can make this method public and check the status of the circuit breaker
    // somewhere in a middleware before executing business logic.
    private function isClosed(): bool
    {
        return null !== $this->closedAt
            && $this->dateProvider->getCurrentTimestamp() - $this->closedAt <= self::CLOSED_TIMEOUT;
    }

    private function isHalfClosed(): bool
    {
        return null !== $this->closedAt
            && $this->dateProvider->getCurrentTimestamp() - $this->closedAt <= self::HALF_CLOSED_TIMEOUT;
    }

    private function incrCounter(string $counter): int
    {
        $value = $this->redis->incr($counter);

        if (1 === $value) {
            $this->redis->expire($counter, self::SLIDING_WINDOW_LENGTH);
        }

        return $value;
    }
}

Other approaches that can help mitigate this issue include:

  • Implementation of a retry budget to significantly reduce the likelihood of clients overwhelming the server with retries.
  • Proper implementation of readiness probes for your Kubernetes (k8s) pods, allowing you to exclude a pod from load balancing until it’s fully ready (a minimal probe sketch follows this list).
  • Avoiding lazy cache initialization in production by preparing all caches and autogenerated classes before enabling traffic to a new release.
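
For the readiness-probe point above, a minimal sketch of such a probe; the /health/ready endpoint is an assumption about your application, not something Kubernetes provides:

# Fragment of a pod template: the pod stays out of the Service endpoints
# until the application reports it is ready (caches warmed, connections open).
containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3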

By applying these strategies, you can enhance the reliability and stability of your system, even during high-stress periods like holiday sales events.

Summary

  • Reliability patterns play a crucial role in enhancing the stability and robustness of distributed systems. They not only reduce the impact of outages but also prevent errors from cascading further.
  • Timeouts are essential for preventing worker blocking and reducing wasteful resource consumption. They ensure that requests do not linger indefinitely and help maintain system responsiveness.
  • Retries, when used in conjunction with timeouts, are effective for handling failed requests and mitigating temporary issues. However, they should be employed cautiously, considering the consequences of repeated retries if the remote service is struggling or unavailable.
  • Circuit breakers are valuable for reducing the load on a failing component and allowing it time to recover. They act as safeguards against overwhelming a struggling service and help maintain system stability.
  • A fundamental understanding of network protocols, as well as the inner workings of the software, languages, and frameworks you employ, is essential, particularly in complex, high-load environments. This knowledge empowers you to make informed decisions and optimize system performance.

The author has used AI tools while writing this article. The initial idea and structure were the author’s own, inspired by real-life experience of working in IT; AI was used to produce grammatically correct text with varied vocabulary.
