The “InfluxDB Out of Memory” issue can be tackled in several steps, applied sequentially. This article presents the solutions that I’ve tried to resolve it.
How do you confirm that InfluxDB is getting killed because of OOM? You can execute dmesg -T | grep influx to find out:

$ dmesg -T | grep influx
[Mon Feb 14 00:19:48 2020] [***] ***** ****** ******** ******** ***** * influxd
[Mon Feb 14 00:19:48 2020] Out of memory: Kill process 15453 (influxd) score 720 or sacrifice child
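If you want to script this check, the PID that the OOM killer chose can be pulled out of that log line. A minimal sed sketch, using a sample line shaped like the dmesg output above:

```shell
# Sample OOM-killer line, shaped like the dmesg output above.
line='[Mon Feb 14 00:19:48 2020] Out of memory: Kill process 15453 (influxd) score 720 or sacrifice child'

# Extract the PID that follows "Kill process".
pid=$(printf '%s\n' "$line" | sed -n 's/.*Kill process \([0-9]*\).*/\1/p')
echo "killed pid: $pid"
```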
The very first occurrence of this issue was due to the default InfluxDB configuration for the shard index:

# The type of shard index to use for new shards. The default is an in-memory index that is
# recreated at startup. A value of "tsi1" will use a disk based index that supports higher
# cardinality datasets.
index-version = "inmem"
The default value of index-version is inmem, which makes InfluxDB keep the shard index in memory. If you’re getting the Out of Memory error and the value of this config is inmem, then the issue may be resolved by changing the value of this configuration from inmem to tsi1.
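The changed setting would look like this in the configuration file. Note that, per the InfluxDB documentation, index-version only applies to newly created shards; existing shards keep their old index unless you rebuild them with the influx_inspect buildtsi subcommand (with InfluxDB stopped):

```toml
# Use the disk-based TSI index instead of the in-memory one.
# Applies only to shards created after this change.
index-version = "tsi1"
```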
If you’ve already changed the index-version to tsi1 and are still facing the issue, then it’s most probably due to the high cardinality of the series and the tag values present inside the InfluxDB server.
In this article, I’ll present the methods that I’ve used to find the data that was causing the Out of Memory condition for the InfluxDB process.
The first step is to find the database where the problem lies, and then to find the measurement and the tags present inside that database. I used the influx_inspect utility to get a report for each of the databases that use tsi1 as their index version. This utility sits at the same location as the influxd binary, and the location where InfluxDB stores its data can be found in the configuration file:
[data]
  # The directory where the TSM storage engine stores TSM files.
  dir = "/var/lib/influxdb/data"
Navigate to the data directory:
$ cd /var/lib/influxdb/data
$ ls -ltr
total 16
drwxrwxrwx 1 x x 1738 Feb 14 18:19 location_data/
drwxrwxrwx 1 x x 4096 Feb 14 18:19 usage_metric/
drwxrwxrwx 1 x x 4096 Feb 14 18:19 pg_metric/
...
...
Then you can execute influx_inspect with the subcommand reporttsi to generate a report of the TSI index:

$ influx_inspect reporttsi -db-path location_data
Link to the Influx Inspect disk utility.
The report that it generates has two parts, one for all the measurements inside the database and another for all the shards in the database.
The numbers that you’re seeing here are dummy values that I’ve created just for this page.
This is the first part:
Summary
Database Path: /var/lib/influxdb/data/location_data/
Cardinality (exact): 437993

Measurement     Cardinality (exact)
"india"         218996
"srilanka"      54749
"thailand"      27374
"nepal"         24333
"bhutan"        3041
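To see each measurement’s share of the total, you can run the reported numbers through awk. A quick sketch over the figures shown above (437993 is the database’s exact cardinality):

```shell
# Compute each measurement's percentage of the total cardinality (437993).
printf '%s\n' \
  'india 218996' \
  'srilanka 54749' \
  'thailand 27374' \
  'nepal 24333' \
  'bhutan 3041' |
awk '{ printf "%-10s %.1f%%\n", $1, 100 * $2 / 437993 }'
```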
So we find here that the measurement india has the highest contribution, about 50%, to the total cardinality. That makes it the first suspect.
In the second part of the report, you can find the growth in cardinality on a per-shard basis.
This is the first shard from the report:
===============
Shard ID: 1537
Path: /var/lib/influxdb/data/location_data/1537
Cardinality (exact): 1080

Measurement     Cardinality (exact)
"india"         579
"srilanka"      233
"thailand"      135
"nepal"         124
"bhutan"        9
===============
the next one:
===============
Shard ID: 1610
Path: /var/lib/influxdb/data/location_data/1610
Cardinality (exact): 2159

Measurement     Cardinality (exact)
"india"         1181
"srilanka"      312
"thailand"      253
"nepal"         247
"bhutan"        166
===============
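The growth for india between the two shards is easy to verify with a one-line awk check on the numbers above:

```shell
# india went from 579 series in shard 1537 to 1181 in shard 1610.
awk 'BEGIN { printf "india growth between shards: %.2fx\n", 1181 / 579 }'
```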
You’ll find here that the cardinality has increased for all the measurements, but for india it has doubled. Now, let’s look into the actual tag values of the india measurement that are contributing to the high series cardinality.
We’ll use the influx binary/client to get the numbers. The first query to execute gets all the tag keys from the measurement:
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag keys from india"
name: india
tagKey
------
city
state
zip
We have 3 tag keys here: city, state, and zip.
Now, I’ll execute a query to find the unique values for each of the tags. For readability, I’ll save the output in a file.
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"city\"" > cities.txt
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"state\"" > states.txt
$ influx -precision=rfc3339 -database=location_data -execute="SHOW tag values from india with key = \"zip\"" > zips.txt
Now, let’s look into the count of the unique tag values for each of the tags:
$ cat cities.txt | wc -l
614
$ cat states.txt | wc -l
29
$ cat zips.txt | wc -l
889
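One caveat: the influx CLI prepends a few header lines (the name: line, column titles, and a separator) to its output, so a raw wc -l slightly overcounts. Counting only the data rows is safer. A sketch on a sample file shaped like the CLI output (the exact header layout here is an assumption):

```shell
# Build a sample file shaped like `SHOW TAG VALUES` output (layout assumed).
cat > cities_sample.txt <<'EOF'
name: india
key  value
---  -----
city Delhi, New Delhi
city Andheri, Mumbai
city Juhu, Mumbai
EOF

# Raw line count includes the 3 header lines...
wc -l < cities_sample.txt
# ...so count only the rows that start with the tag key.
grep -c '^city' cities_sample.txt
```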
The output that you’ll get here will be different from this, but you will definitely find at least one tag that contains more data than expected.
For this sample data, it’s the city tag, because it contains not only the name of the city but also the name of the locality:

$ tail -n 5 cities.txt
Delhi, New Delhi
Andheri, Mumbai
Goregaon, Mumbai
Jogeshwari, Mumbai
Juhu, Mumbai
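Since each value has the shape “locality, city”, stripping everything up to the comma shows how few distinct cities are really there. A small sketch over the sample values above:

```shell
# Keep only the part after the comma (the city), then de-duplicate.
printf '%s\n' \
  'Delhi, New Delhi' \
  'Andheri, Mumbai' \
  'Goregaon, Mumbai' \
  'Jogeshwari, Mumbai' \
  'Juhu, Mumbai' |
sed 's/^[^,]*, *//' | sort -u
```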
Now that you’ve found the issue, the next step is to pick an approach to solve it. In my case, the solution was very easy, as I just had to delete these series. In another scenario, I had to downsample the data that was more than 15 days old; that way, I kept the high-fidelity data only for a very recent period (the last 15 days), and for the older period, every hour’s data was converted into a single InfluxDB point.
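As a sketch of those two fixes in InfluxQL (the measurement, field, and tag values here are illustrative, not from a real schema):

```sql
-- Delete the offending series (illustrative tag value):
DROP SERIES FROM "india" WHERE "city" = 'Juhu, Mumbai'

-- Downsampling sketch: a continuous query that rolls incoming points into
-- one 1h point per series; pair it with a short retention policy on the raw
-- data so that only the recent high-resolution period is kept.
CREATE CONTINUOUS QUERY "cq_india_1h" ON "location_data"
BEGIN
  SELECT mean("temperature") AS "temperature"
  INTO "india_1h"
  FROM "india"
  GROUP BY time(1h), *
END
```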
Here’s a link to the InfluxDB downsampling guide: Downsample and retain data | InfluxDB