Improving Observability With ELK Alerts

Roman Ushakov
13 min read · Jan 4, 2024


This article describes approaches to configuring different alerts in ELK. We will mainly discuss Open Distro Alerting, but the same applies to OpenSearch.

We will cover the basic alert configuration flow as well as some advanced techniques such as anomaly detection and historical data collection.

This knowledge can greatly increase the observability and robustness of your application and reduce the time needed to detect and react to incidents.

ELK Monitor Overview

Monitor execution flow

ELK uses an abstraction called a monitor to check logs and raise alerts.

The basic structure of a monitor is as follows:

  • An extraction query that runs on a certain schedule and finds the logs against which you want to check your conditions
  • A trigger condition written as a painless script. Here you add simple post-processing logic for the query results.
  • A trigger action, which is the destination your alert is sent to. In most cases it is a webhook, but ELK supports sending emails too.
    You can use mustache template syntax to format your message and include any useful information that will allow recipients to quickly analyse the situation and take the necessary actions.

Monitors can be scheduled to run every N minutes, hours, etc., or on more complex schedules using crontab syntax.
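To make the overview more concrete, here is a minimal sketch of what a monitor definition looks like when created through the Open Distro Alerting API. The index pattern, names, and query are illustrative; the same fields can be filled in through the Kibana UI, and the period schedule can be swapped for a cron one.

{
  "type": "monitor",
  "name": "server-errors-monitor",
  "enabled": true,
  "schedule": {
    "period": { "interval": 5, "unit": "MINUTES" }
  },
  "inputs": [
    {
      "search": {
        "indices": ["nginx-access-*"],
        "query": {
          "query": { "match_all": {} }
        }
      }
    }
  ],
  "triggers": [
    {
      "name": "errors-found",
      "severity": "1",
      "condition": {
        "script": {
          "source": "ctx.results[0].hits.total.value > 0",
          "lang": "painless"
        }
      },
      "actions": []
    }
  ]
}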

The Simplest Monitors

The most basic scenario we want our monitoring system to handle is the detection of unexpected bugs or server errors. Such alerts are easy to configure: they just check for the presence of specific error messages in the logs.

In our example, we want to configure alerts for 5xx error codes (server errors) and the 499 error code (the client closed the connection due to its own timeout).

The first step is to define an extraction query.

We use access logs from our nginx for this monitor, but the instructions are the same for any other source of logs, including application-level logs.

We need to filter the logs to leave only the ones related to our application, drop unnecessary OPTIONS requests, and check the response status.

Extraction query with query_string and range parameters

The easiest way for developers to configure such an alert is by utilising query_string syntax.

According to the official documentation, this query type was developed with search boxes in mind. It lets users provide queries by simply entering the whole search condition in the same form you would use in the Kibana UI:

{
  "query": {
    "bool": {
      "filter": [
        {
          "query_string": {
            "query": "HTTP.VHOST: \"example.com\" AND (HTTP.STATUS:>=500 OR HTTP.STATUS: 499) AND NOT HTTP.METHOD: \"OPTIONS\""
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m",
              "lte": "now"
            }
          }
        }
      ]
    }
  }
}

Also note the second component of the query: the date range. Here we use now, which refers to the current time at query execution. Combined with now-5m, this range covers the last 5 minutes.

To find out more about query options, refer to the official documentation on the Search API. It contains information on how to use indices, descriptions of optional parameters, the differences between various query modes, etc.
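For instance, while tuning the query you can run it directly against the Search API from the Kibana Dev Tools console before putting it into a monitor (the index pattern here is just an illustration):

POST /nginx-access-*/_search
{
  "query": {
    "query_string": {
      "query": "HTTP.VHOST: \"example.com\" AND HTTP.STATUS:>=500"
    }
  }
}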

Next comes the trigger condition.

The default trigger condition is the following:

ctx.results[0].hits.total.value > 0

It means that an alert will be triggered if your extraction query finds any logs. The logs are available under the hits property of the trigger context. The UI provides an example of the logs if you want to configure conditions that depend on a specific property of the result.

Example of extraction query response
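As a sketch of what such a condition might look like (the command field and the "billing" value are just illustrations of a property taken from _source), you could alert only when a particular service is affected or when the error count crosses a threshold:

// trigger if there are at least 5 errors in the interval...
if (ctx.results[0].hits.total.value >= 5) {
  return true;
}
// ...or if any of the hits comes from a hypothetical "billing" command
for (def hit : ctx.results[0].hits.hits) {
  if (hit._source.command == "billing") {
    return true;
  }
}
return false;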

The last step is to configure the alert payload and destination.

The configuration of the destination is pretty straightforward, so I won’t explain it here.
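For completeness, a custom webhook destination is essentially a named URL plus optional headers; a minimal sketch of its definition (the URL and header values are placeholders) looks like this:

{
  "name": "team-chat-webhook",
  "type": "custom_webhook",
  "custom_webhook": {
    "url": "https://chat.example.com/hooks/XXXXX",
    "header_params": {
      "Content-Type": "application/json"
    }
  }
}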

For formatting the payload of alert requests, you can use mustache templates, which makes webhooks very flexible.

Example of mustache template for webhook payload

We can access all context information under the ctx variable by utilising template syntax with {{}}.

Here we send a webhook request, providing brief info about the exception and a link to this request in logs for further investigation by the team.

{
  "title": "{{ctx.monitor.name}}",
  "messages": [
    "{{ctx.results.0.hits.hits.0._source.ISODATE}}: {{ctx.results.0.hits.hits.0._source.command}} errors. Please investigate\n {{ctx.results.0.hits.hits.0.context.payload.exception.message}}"
  ],
  "link": "https://us-elk.local/goto/c3df3d0072031a0364664a4cc159fd88?security_tenant=global"
}

Improving Monitors A Little Bit

The previous example is good for monitoring bugs or unexpected errors in your application. Yet it has one flaw, which you have probably already noticed: we access only the first log in our hits array, so if there is more than one error per execution interval, the system will not notify you about the extra logs.

It’s not critical, as the developer will follow the link in the message and notice the other errors, but we can nevertheless improve our notification.

Mustache provides a way to iterate over arrays and render the payload as we need it using the {{#list}}{{/list}} operators:

{
  "title": "{{ctx.monitor.name}}",
  "messages": [
    {{#ctx.results.0.hits.hits}}
    "{{_source.ISODATE}}: {{_source.command}} errors. Please investigate\n {{context.payload.exception.message}}",
    {{/ctx.results.0.hits.hits}}
  ],
  "link": "https://us-elk.local/goto/c3df3d0072031a0364664a4cc159fd88?security_tenant=global"
}

This will work fine for HTML or plain text, but not for JSON. Do you see the problem here?

The JSON rendered with the template above is invalid because of the trailing comma after the last error message. Here is an example of the incorrect payload (note the comma after the last message):

{
  "title": "Server Error Monitor",
  "messages": [
    "2024-01-04: catalog errors. Please investigate\n Unexpected EOF",
    "2024-01-04: email errors. Please investigate\n Return type must be int, null returned",
    "2024-01-04: billing errors. Please investigate\n Unexpected API response",
  ],
  "link": "https://us-elk.local/goto/c3df3d0072031a0364664a4cc159fd88?security_tenant=global"
}

Unfortunately, mustache doesn’t provide any “smart” functionality that would let us elegantly avoid this situation.

But we can use the painless script trigger condition not only to check logs but also to pass additional information into the context, which can later be used by mustache.

if (ctx.results[0].hits.total.value > 0) {
  ctx.results[0].hits.hits[0].first = true;
  return true;
}
return false;

Basically, we set a flag for the first log and check this flag in mustache to understand whether we should render a comma or not.

{
  "title": "{{ctx.monitor.name}}",
  "messages": [
    {{#ctx.results.0.hits.hits}}{{^first}},{{/first}}
    "{{_source.ISODATE}}: {{_source.command}} errors. Please investigate\n {{context.payload.exception.message}}"
    {{/ctx.results.0.hits.hits}}
  ],
  "link": "https://us-elk.local/goto/c3df3d0072031a0364664a4cc159fd88?security_tenant=global"
}

As you can see, we use the construction {{^first}},{{/first}}. The ^ operator renders the comma only if the property first does not exist or is false. Because we have set first to true for the first element, no comma is rendered before it.

Refer to the official mustache documentation for more information on template operators.

Detecting Anomalies

The alerts mentioned above are pretty basic and simply compare the number of logs with a fixed constant.

It works okay with cases like unexpected bugs, server errors, etc.

But what if we want to check client errors or the number of successful requests?

For example, we constantly have some unauthorised requests from clients resulting in 401 status responses. But what if we accidentally deploy a bug in the authorisation middleware and most requests start getting authorisation errors? How can we detect such spikes?

We can raise the threshold in our trigger condition, e.g., set 1000 instead of 1, to remove false positives, but can we do better?

Other monitoring software offers functionality made specifically for this: anomaly detectors.

Here is an example from New Relic:

Incident detected with anomaly monitor

The monitor has detected a sudden increase in traffic from 18k to 24k RPM.

How does it work? The detector checks how far the metric deviates from its average over a certain period. The maximum allowed deviation is defined by the “threshold”, which can be either a constant value or a number of standard deviations. The greater the threshold, the fewer alerts there will be.
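Expressed as a condition, the idea boils down to something like this (a conceptual sketch in painless-style pseudocode, where current, avg, and std_deviation are the statistics of your metric and sigma is the threshold):

// raise an alert when the metric drifts more than `sigma`
// standard deviations away from its average
return Math.abs(current - avg) > sigma * std_deviation;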

Here is a good visualisation from New Relic:

UI of anomaly detector configuration in New Relic
  • The blue line is the actual metric value.
  • The dotted line is the average value.
  • The grey borders are defined by our threshold; the greater the threshold, the wider the borders.
  • The red zones are periods when the actual metric value falls outside the borders.

Open Distro Alerting doesn’t provide any UI for configuring such alerts, but we can achieve similar behaviour using other features it has.

The first idea is to use a painless script trigger condition and calculate the threshold yourself, but there are two main problems:

  1. Calculating statistical functions is painful, even for a painless script (ha-ha).
  2. There is a limit on the number of logs your monitor can process: you can’t have more than 1000 logs in your hits array.

Luckily, there is a simpler way to do this: by using aggregation functions.

You can specify aggregations inside your extraction query to calculate specific metrics over your query result.

There are three types of aggregations:

  • Metric aggregations to calculate metrics based on specific fields of the logs
  • Bucket aggregations that group your logs into a single aggregate based on specific fields
  • Pipeline aggregations that take input from other aggregations and allow you to calculate metrics based on those aggregations
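For example, a simple metric aggregation that computes the average request time over the query result (using the HTTP.REQUEST_TIME field that appears later in this article) might look like this:

"aggregations": {
  "avg_request_time": {
    "avg": { "field": "HTTP.REQUEST_TIME" }
  }
}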

We can utilise extended_stats_bucket to calculate statistical variables for our query result:

{
  "size": 0,
  "query": {...},
  "aggregations": {
    "hits": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m",
        "offset": 0,
        "order": {
          "_key": "asc"
        },
        "keyed": false,
        "min_doc_count": 0,
        "hard_bounds": {
          "min": "now-62m/m",
          "max": "now-3m/m"
        }
      },
      "aggregations": {
        "the_count": {
          "value_count": {
            "field": "@timestamp"
          }
        }
      }
    },
    "last_hit": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m",
        "offset": 0,
        "order": {
          "_key": "asc"
        },
        "keyed": false,
        "min_doc_count": 0,
        "hard_bounds": {
          "min": "now-3m/m",
          "max": "now-2m/m"
        }
      }
    },
    "stats": {
      "extended_stats_bucket": {
        "buckets_path": [
          "hits>the_count"
        ],
        "gap_policy": "insert_zeros",
        "sigma": 6
      }
    }
  }
}

I. date_histogram groups our logs into buckets with a fixed interval of 1 minute.

Notice that we actually use two date_histogram aggregations: one for the last minute (last_hit) and another one for all minutes except the last (hits).

We use a separate aggregation for the last minute so that its value does not affect the results of the extended_stats_bucket function: if there is an anomaly in the last minute, it shouldn’t influence the computed average and standard deviation.

You may also wonder about the strange now-3m/m syntax. The /m suffix rounds the time down to the start of the time unit, i.e., to 0 seconds for minutes. For instance, now may return 19:15:46, but now/m will return 19:15:00 for the same moment in time.

After the application of these aggregations, we will have the following results:

"hits": {
"buckets": [
{
"key_as_string": "2024-01-02T08:56:00.000Z",
"doc_count": 1,
"key": 1704185760000
},

]
}

II. value_count counts the number of logs in each bucket. We use it in combination with the date_histogram aggregation to count the number of logs per minute.

After the application of these aggregations, we will have the following results:

"hits": {
"buckets": [
{
"key_as_string": "2024-01-02T08:56:00.000Z",
"doc_count": 1,
"the_count": {
"value": 1
},
"key": 1704185760000
},

]
}

You may wonder why we use this aggregation only for the first part of the interval and not for the last minute. The number of documents is already available in the doc_count property, which can be accessed from the trigger condition. However, the extended stats aggregation can’t use that field, so we add the additional value_count aggregation here.

III. extended_stats_bucket calculates statistical variables using aggregations provided in the buckets_path property. Here we provide our count aggregation.

An example of the result:

"stats": {
"variance_population": 10.077288941736029,
"std_deviation_bounds": {
"upper_population": 20.52960617881583,
"upper": 20.52960617881583,
"lower": -17.56408893743652,
"lower_population": -17.56408893743652,
"upper_sampling": 20.69595735085029,
"lower_sampling": -17.73044010947098
},
"max": 21,
"count": 58,
"sum": 86,
"std_deviation_population": 3.174474593021029,
"sum_of_squares": 712,
"std_deviation_sampling": 3.2021997883601054,
"min": 0,
"avg": 1.4827586206896552,
"variance": 10.077288941736029,
"std_deviation": 3.174474593021029,
"variance_sampling": 10.254083484573503
}

The sigma parameter that you may have noticed in the extraction query above is the “threshold” we were talking about earlier. It is the number of standard deviations used to calculate the upper and lower bounds, which can later be used in the trigger condition.

However, we won’t use these bounds: sigma is fixed in the extraction query, while we want two separate triggers for the same monitor (one to alert in chat and another, more severe one, to call the person on duty). So we will specify “sigma” in the trigger condition instead.

As we don’t need hits for our monitor, we can specify size: 0, so hits will always be empty.

The overall result of our query looks like this:

{
  "_shards": {
    "total": 90,
    "failed": 0,
    "successful": 90,
    "skipped": 0
  },
  "hits": {
    "hits": [],
    "total": {
      "value": 87,
      "relation": "eq"
    },
    "max_score": null
  },
  "took": 223,
  "timed_out": false,
  "aggregations": {
    "hits": {
      "buckets": [
        {
          "key_as_string": "2024-01-02T08:56:00.000Z",
          "doc_count": 1,
          "the_count": {
            "value": 1
          },
          "key": 1704185760000
        },
        ...
      ]
    },
    "last_hit": {
      "buckets": []
    },
    "stats": {
      "variance_population": 10.077288941736029,
      "std_deviation_bounds": {
        "upper_population": 20.52960617881583,
        "upper": 20.52960617881583,
        "lower": -17.56408893743652,
        "lower_population": -17.56408893743652,
        "upper_sampling": 20.69595735085029,
        "lower_sampling": -17.73044010947098
      },
      "max": 21,
      "count": 58,
      "sum": 86,
      "std_deviation_population": 3.174474593021029,
      "sum_of_squares": 712,
      "std_deviation_sampling": 3.2021997883601054,
      "min": 0,
      "avg": 1.4827586206896552,
      "variance": 10.077288941736029,
      "std_deviation": 3.174474593021029,
      "variance_sampling": 10.254083484573503
    }
  }
}

The trigger condition code may look like this:

// threshold (sigma)
ctx.results[0].maxStd = 4;

// optional: we don’t want to use such complex conditions if there is not enough traffic
if (ctx.results[0].hits.total.value < 100) {
  return false;
}

def aggregations = ctx.results[0].aggregations;
def extendedStats = aggregations.stats;

// determine the average of our metric
def avg = 0;
if (extendedStats != null) {
  avg = extendedStats.avg;
}

// determine the current (last minute) value of our metric
def current = 0;
if (aggregations.last_hit.buckets.size() > 0) {
  current = aggregations.last_hit.buckets[0].doc_count;
}

// optional: we are not interested in breaching the lower boundary
if (current < avg) {
  return false;
}

// determine the standard deviation
def std_deviation = 1;
if (extendedStats.std_deviation != null) {
  std_deviation = extendedStats.std_deviation;
}

// check whether the boundary has been breached
ctx.results[0].curStd = (current - avg) / std_deviation;
return ctx.results[0].curStd > ctx.results[0].maxStd;
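For the second, more severe trigger mentioned earlier (the one that calls the person on duty), you can reuse exactly the same script and only raise the threshold, for example:

// severe (on-call) trigger: identical condition, stricter threshold
ctx.results[0].maxStd = 6;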

You can read further documentation on aggregations and the theory of anomaly detection to create more flexible alerts for your specific business domain. For instance, you may decide to apply moving averages or other approaches to smooth your metric.
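As a sketch of the moving-average idea, the sub-aggregations of the hits histogram above could be extended with a moving_fn pipeline aggregation (the window size here is arbitrary):

"aggregations": {
  "the_count": {
    "value_count": { "field": "@timestamp" }
  },
  "the_count_smoothed": {
    "moving_fn": {
      "buckets_path": "the_count",
      "window": 10,
      "script": "MovingFunctions.unweightedAvg(values)"
    }
  }
}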

As always, such alerts can only be tuned through trial and error to find the right balance between the number of false positives and the observability of the whole system.

One More Unexpected Usage Of Monitors

There is an alternative way to use Open Distro Alerting: collecting historical reports for your system.

The queries and aggregations we discussed earlier are useful not only for creating alerts but also for dashboards or for retrieving data through the API.

There is an Open Distro Reporting module that can be used to create various reports based on dashboards or queries, but there are a couple of problems:

  • You can export dashboard data only in PDF or PNG format, i.e., only visualisation without actual metrics.
  • You can export saved queries in CSV format, but it’s no use if you have a large amount of data.
  • Usually you have limited storage for your ELK, so only the logs for the last 2–4 weeks are available.

The solution that comes to mind is to implement some kind of cron to move aggregated metrics via API to some other storage.

Or you can use the alerting module for that purpose!

You have an extraction query, a schedule, and a very flexible destination configuration. So the only things you need are a trigger that always fires and a backend that accepts your metrics (we will use Google Apps Script for simplicity).

Proposed data flow

Example of the aggregation of error rate and latency metrics:

{
  "size": 0,
  "query": {
    ...
  },
  "version": true,
  "track_total_hits": 2147483647,
  "aggregations": {
    "error_requests": {
      "filter": {
        "query_string": {
          "query": "HTTP.STATUS: 500 OR HTTP.STATUS: 502 OR HTTP.STATUS: 503 OR HTTP.STATUS: 504"
        }
      }
    },
    "slow_requests": {
      "filter": {
        "bool": {
          "filter": [
            {
              "range": {
                "HTTP.REQUEST_TIME": {
                  "from": 1,
                  "to": null,
                  "include_lower": false,
                  "include_upper": true
                }
              }
            },
            {
              "query_string": {
                "query": "HTTP.STATUS: 200 OR HTTP.STATUS: 201 OR HTTP.STATUS: 499"
              }
            }
          ],
          "adjust_pure_negative": true,
          "boost": 1
        }
      }
    }
  }
}

We query all requests to our service and, by using aggregations, separate them into two buckets: requests ending with a server error and slow requests that took more than 1 second to execute.

The trigger condition is simply:

return true;

And lastly, provide the metrics in the payload of the alert webhook. For example, if you want to export your data into Google Sheets, you can configure a simple Apps Script web app to accept your webhook (a sketch of such a script follows the payload example below).

{
  "requestsNum": "{{ctx.results.0.hits.total.value}}",
  "requestsErrors": "{{ctx.results.0.aggregations.error_requests.doc_count}}",
  "requestsSlow": "{{ctx.results.0.aggregations.slow_requests.doc_count}}"
}
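A minimal sketch of such an Apps Script web app, assuming the payload above and a spreadsheet with a metrics sheet (the spreadsheet ID and sheet name are placeholders), might look like this:

// Deploy as a web app and use its URL as the monitor's webhook destination.
function doPost(e) {
  var payload = JSON.parse(e.postData.contents);
  var sheet = SpreadsheetApp
    .openById('YOUR_SPREADSHEET_ID')   // placeholder
    .getSheetByName('metrics');        // placeholder
  sheet.appendRow([
    new Date(),
    Number(payload.requestsNum),
    Number(payload.requestsErrors),
    Number(payload.requestsSlow)
  ]);
  return ContentService.createTextOutput('ok');
}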

In the end, with exported data and Google Sheets functions, you can achieve something like the following:

Diagram with historical daily SLO for latency and error rate

It’s a basic SLO graph that shows the historical error rate and latency per day.

Conclusion

In this article, we have briefly explored the Open Distro Alerting functionality. It’s not the best monitoring tool, but in combination with the basic ELK Search API, it provides rich opportunities to create flexible alerts for your system.

Here are the most popular dashboards and alerts you can consider for your system:

  • Unexpected API responses or server errors
  • Metrics related to SLAs, such as the error rate or latency of the application.
    - Error rate: the percentage of 5xx responses among all of your requests
    - Latency: the percentage of slow requests (e.g., response time greater than 1 second) among all of your requests
  • Checks for anomalies in traffic.
    - Number of successful responses
    - Number of 429 responses (excessive traffic)
    - Number of client errors (sudden spikes may be related to a recent code deployment)
  • etc.

And this stack provides all the necessary tools to cover them.

Special thanks to Alexander Cherepanov and Anastasiia Mosiazh for their patience and help in reviewing my drafts.
