
InfluxDB Out of Memory Solution

InfluxDB Out of Memory: Ways to debug and solve the InfluxDB Out of Memory (OOM) issue

The InfluxDB Out of Memory issue can be tackled in several steps, applied one after another. This article presents the solutions that I’ve tried in order to resolve it.

How do you confirm that InfluxDB is getting killed because of OOM? Execute dmesg -T | grep influx to find out:

$ dmesg -T | grep influx
[Mon Feb 14 00:19:48 2020] [***] ***** ****** ******** ********   *****                   * influxd
[Mon Feb 14 00:19:48 2020] Out of memory: Kill process 15453 (influxd) score 720 or sacrifice child
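If you want to run this check repeatedly, or from a monitoring script, the same filter can be wrapped in a small shell function. This is only a minimal sketch: the function name count_oom_kills is hypothetical, and the pattern assumes the kernel message looks like the one above. Some kernels print "Killed process" instead of "Kill process", so the pattern accepts both; adjust it if your log is worded differently.

```shell
#!/usr/bin/env bash
# Count kernel OOM-kill messages for a given process name.
# Reads kernel log text on stdin, so it works with `dmesg -T` or a saved log.
# NOTE: count_oom_kills is a hypothetical helper, not part of InfluxDB.
count_oom_kills() {
  local process="${1:-influxd}"
  # Accepts "Kill process" (as in the log above) and "Killed process".
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  grep -Ec "Out of memory: Kill(ed)? process [0-9]+ \(${process}\)" || true
}
```

It can then be used like this:

$ dmesg -T | count_oom_kills influxd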

The very first occurrence of this issue was due to the default InfluxDB configuration for index-version.

  # The type of shard index to use for new shards.  The default is an in-memory index that is
  # recreated at startup.  A value of "tsi1" will use a disk based index that supports higher
  # cardinality datasets.
  index-version = "inmem"

The default value of index-version is inmem, which makes InfluxDB keep the shard index (not the data itself) in memory. If you’re getting the Out of Memory error and this setting is inmem, then changing it to tsi1 may resolve the issue.
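The change itself is a one-line edit of the configuration file. Here is a hedged sketch of doing it from the shell: set_tsi1_index is a hypothetical helper, and the path /etc/influxdb/influxdb.conf in the usage line is an assumption about a typical package install, so pass whatever path your installation uses.

```shell
#!/usr/bin/env bash
# Switch index-version from "inmem" to "tsi1" in an InfluxDB 1.x config file.
# NOTE: set_tsi1_index is a hypothetical helper; the config path is passed in
# by the caller because it differs between installations.
set_tsi1_index() {
  local conf="$1"
  local tmp
  tmp="$(mktemp)" || return 1
  # Rewrite an active or commented-out inmem setting; keep its indentation.
  if sed -E 's/^([[:space:]]*)#?[[:space:]]*index-version[[:space:]]*=[[:space:]]*"inmem"/\1index-version = "tsi1"/' \
      "$conf" > "$tmp"; then
    cat "$tmp" > "$conf"   # overwrite in place to keep the file's owner and mode
  fi
  rm -f "$tmp"
}
```

$ set_tsi1_index /etc/influxdb/influxdb.conf

InfluxDB has to be restarted to pick up the new value. As the config comment says, the setting applies to new shards only; for existing shards, look at the influx_inspect buildtsi subcommand in the documentation for your version before converting anything.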

If you’ve already changed index-version to tsi1 and are still facing the issue, then it’s most probably due to the data itself: the high cardinality of the series and tag values present inside the InfluxDB server.

In this article, I’ll present the methods that I’ve used to get the information about the data that is causing the Out of Memory for the InfluxDB process.

The first step is to find the database where the problem lies, and then the measurement and the tags inside that database. I used the influx_inspect utility to get a report for each of the databases that use tsi1 as their index version. This utility is located alongside the influxd binary, and the directory where InfluxDB stores its data is given in the configuration file:

  # The directory where the TSM storage engine stores TSM files.
  dir = "/var/lib/influxdb/data"

Navigate to the data directory:

$ cd /var/lib/influxdb/data
$ ls -ltr 
total 16
drwxrwxrwx 1 x x 1738 Feb 14 18:19 location_data/
drwxrwxrwx 1 x x 4096 Feb 14 18:19 usage_metric/
drwxrwxrwx 1 x x 4096 Feb 14 18:19 pg_metric/

Then you can execute influx_inspect with the reporttsi subcommand to generate a report of the TSI index.

$ influx_inspect reporttsi -db-path location_data

Link to the Influx Inspect disk utility.

The report that it generates has two parts, one for all the measurements inside the database and another for all the shards in the database.

The numbers shown here are dummy values that I’ve created just for this page.

This is the first part:

Database Path: /var/lib/influxdb/data/location_data/
Cardinality (exact): 437993

Measurement     Cardinality (exact)

"india"         218996
"srilanka"      54749
"thailand"      27374
"nepal"         24333
"bhutan"        3041

So the measurement india contributes the most, about 50%, of the total cardinality. That makes it the first suspect.
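With many measurements it is easier to let a script compute these percentages. The following is a minimal sketch, not an official tool: cardinality_share is a hypothetical helper, and it assumes the report layout shown above (a "Cardinality (exact): N" line followed by rows of a quoted measurement name and a count) and measurement names without spaces.

```shell
#!/usr/bin/env bash
# Print each measurement's share of the database-level series cardinality.
# Reads a reporttsi report on stdin and stops at the first shard section.
# NOTE: cardinality_share is a hypothetical helper that assumes the layout
# shown above and measurement names without spaces.
cardinality_share() {
  awk '
    # The per-shard part of the report starts here; ignore the rest.
    /^Shard ID:/ { exit }
    # The first "Cardinality (exact): N" line is the database total.
    /^Cardinality \(exact\):/ && total == 0 { total = $NF; next }
    # Measurement rows start with a double quote.
    /^"/ && total > 0 {
      name = $1
      gsub(/"/, "", name)
      printf "%s %d %.1f%%\n", name, $NF, ($NF * 100) / total
    }
  '
}
```

Assuming the report is written to standard output, it can be piped straight in:

$ influx_inspect reporttsi -db-path location_data | cardinality_share

For the sample numbers above, the first line of output is india 218996 50.0%.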

In the second part of the report, you can see how the cardinality grows from shard to shard:

This is the first shard from the report:

Shard ID: 1537
Path: /var/lib/influxdb/data/location_data/1537
Cardinality (exact): 1080

Measurement     Cardinality (exact)

"india"         579
"srilanka"      233
"thailand"      135
"nepal"         124
"bhutan"        9

And the next one:

Shard ID: 1610
Path: /var/lib/influxdb/data/location_data/1610
Cardinality (exact): 2159

Measurement     Cardinality (exact)

"india"         1181
"srilanka"      312
"thailand"      253
"nepal"         247
"bhutan"        166

You’ll find here that the cardinality has increased for all the measurements, and for india it has roughly doubled (from 579 to 1181), which is also the largest absolute increase. Now, let’s look into the actual tag values of the measurement india that are contributing to the high series cardinality.
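Comparing two shards by eye gets tedious, so here is a hedged sketch that does the subtraction. shard_growth is a hypothetical helper; it assumes you have saved each shard section of the report into its own file (the file names in the usage line are made up) and that measurement names contain no spaces.

```shell
#!/usr/bin/env bash
# Compare per-measurement cardinality between two shard sections of a report.
# Usage: shard_growth OLD_SECTION_FILE NEW_SECTION_FILE
# Output columns: measurement, old cardinality, new cardinality, increase.
# NOTE: shard_growth is a hypothetical helper; each file is assumed to hold
# one shard section in the layout shown above.
shard_growth() {
  awk '
    # Only rows that start with a double quote are measurement rows.
    !/^"/ { next }
    { name = $1; gsub(/"/, "", name) }
    # First file: remember the old cardinality.
    NR == FNR { old[name] = $NF; next }
    # Second file: print old, new, and the absolute increase.
    { printf "%s %d %d %d\n", name, old[name], $NF, $NF - old[name] }
  ' "$1" "$2"
}
```

$ shard_growth shard_1537.txt shard_1610.txt | sort -k4,4nr

Sorting on the last column puts the measurement that added the most series first. I look at the absolute increase rather than the ratio, because a small measurement can grow many times over (bhutan goes from 9 to 166 here) and still add far fewer series than india.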

We’ll use the influx binary/client to get the numbers. The first query gets all the tag keys from the measurement.

$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag keys from india"
name: india

We have 3 tag keys here: city, state and zip.

Now, I’ll execute a query to find the unique values for each of the tags. For readability, I’ll save the output in a file.

$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"city\"" > cities.txt
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"state\"" > states.txt
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"zip\"" > zips.txt 

Now, let’s look into the count of the unique tag values for each of the tags:

$ cat cities.txt | wc -l
$ cat states.txt  | wc -l
$ cat zips.txt | wc -l
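Keep in mind that wc -l also counts the header lines that the influx client prints before the values, so each number is slightly too high. If you want the exact count, a small helper can skip the header. This is a sketch under an assumption: count_tag_values is a hypothetical function, and it expects the client's default column format, where the data rows follow a separator line of dashes.

```shell
#!/usr/bin/env bash
# Count tag values in saved SHOW TAG VALUES output, ignoring header lines.
# NOTE: count_tag_values is a hypothetical helper. It assumes the influx
# client's default column format, where data rows follow a line of dashes,
# and a file that holds the result for a single measurement.
count_tag_values() {
  awk '
    seen_separator && NF { n++ }   # count non-empty rows after the separator
    /^---/ { seen_separator = 1 }  # the dashes line ends the header
    END { print n + 0 }
  ' "$1"
}
```

$ count_tag_values cities.txt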

The output that you get will differ from this, but you will most likely find at least one tag that contains more values than expected.

For this sample data, it’s the city tag, because it contains not only the name of the city but also the name of the town, for example:

$ tail -n 5 cities.txt
Delhi, New Delhi
Andheri, Mumbai
Goregaon, Mumbai
Jogeshwari, Mumbai
Juhu, Mumbai

Now that you’ve found the issue, the next step is to pick an approach to solve it. In my case, the solution was very easy, as I just had to delete these series. In another scenario, I had to downsample the data that was more than 15 days old. That way I kept the high-resolution data only for a very recent period (the last 15 days), and for the older period each hour of data was converted into one InfluxDB point.
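To make both fixes a bit more concrete, here is a hedged sketch of the InfluxQL involved. The two shell functions are hypothetical helpers that only print a statement, so you can review it before passing it to influx -execute. The retention policy name long_term and the continuous query name are assumptions for illustration, and mean(*) assumes the fields are numeric; this is not necessarily the exact query I ran.

```shell
#!/usr/bin/env bash
# Build InfluxQL statements for the two fixes described above.
# NOTE: both functions are hypothetical helpers that only print a statement;
# nothing is sent to InfluxDB until the output is passed to `influx -execute`.

# DROP SERIES for one tag value. Dropped series cannot be restored.
# Assumes the tag value contains no single quotes.
drop_series_query() {
  local measurement="$1" tag="$2" value="$3"
  printf "DROP SERIES FROM \"%s\" WHERE \"%s\" = '%s'\n" \
    "$measurement" "$tag" "$value"
}

# Continuous query that writes one mean point per hour and per series into
# another retention policy. "long_term" is an assumed retention policy name
# and must already exist in the database.
downsample_query() {
  local db="$1" measurement="$2"
  printf 'CREATE CONTINUOUS QUERY "cq_%s_1h" ON "%s" BEGIN SELECT mean(*) INTO "%s"."long_term"."%s" FROM "%s" GROUP BY time(1h), * END\n' \
    "$measurement" "$db" "$db" "$measurement" "$measurement"
}
```

$ drop_series_query india city 'Andheri, Mumbai'
DROP SERIES FROM "india" WHERE "city" = 'Andheri, Mumbai'
$ influx -database=location_data -execute="$(drop_series_query india city 'Andheri, Mumbai')"

Note that GROUP BY time(1h), * keeps all the tags, so downsampling like this reduces the number of points, not the number of series. The 15-day limit on the raw data would come from the duration of the retention policy that the raw data is written to.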

Here’s a link to the InfluxDB downsampling guide: Downsample and retain data | InfluxDB
