
Problem

You want to evaluate the performance of Livestatus and determine whether there are performance problems.

Solution

Debugging via Livestatus

If the OMD <SITENAME> Performance service reports "CRIT - Site currently not running", or Livestatus is permanently unresponsive, go through this guide to find the bottleneck:

Execute this command as the site user to see how long Livestatus needs to respond. It should return immediately; if it does not, Livestatus is busy.

time lq "GET status\nColumns: program_start"

The OMD <SITENAME> Performance service uses this command:

time echo -e "GET status" | waitmax 5 "/omd/sites/<SITENAME>/bin/unixcat" "/omd/sites/<SITENAME>/tmp/run/live"

Now we need to find out why Livestatus is busy.

This command shows some statistics (run it in /omd/sites/<SITENAME>/var/log). It rounds each request's processing time down to the nearest 100 ms and counts how many requests fall into each bucket:

grep "processed request in" cmc.log | cut -d" " -f9 | sed 's,$, 100 / 100 * p,' | dc | sort -n | uniq -c

With this command you will see the processing time for every request, grouped by client:

grep -Po "client.*processed request in [\d]* ms" cmc.log | sort -n | uniq -c

Now you can search for the client with the long-running command. In this example it is "client 22":

grep "client 22" cmc.log | grep "request"

Copy this GET query and execute it with time. What kind of GET command is it?

time lq "GET ........... "
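As an illustration of what such a measurement could look like: the columns and the filter below are example values, not taken from your log, so replace them with the actual query found in cmc.log (the epoch timestamp is computed with GNU date):

```shell
# Measure an illustrative "GET log" query for the last 4 hours.
# Columns and filter are assumptions for demonstration only;
# substitute the real query from cmc.log.
time lq "GET log\nColumns: time host_name service_description state\nFilter: time >= $(date -d '4 hours ago' +%s)"
```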

Usually we mostly see commands like "GET log\nColumns: ...".

In this case you should check:

  • Your views and dashboards
    • Maybe there is a view causing this long-running command
  • Do you have a script accessing data via Livestatus?
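If you suspect an external script, you can check which processes currently hold a connection to the Livestatus socket. A minimal sketch, assuming the default socket path and that `ss` from iproute2 is available (run as root to see process names of other users):

```shell
# List processes connected to the site's Livestatus Unix socket;
# replace <SITENAME> with your actual site name.
ss -xp | grep "/omd/sites/<SITENAME>/tmp/run/live"
```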


Debugging long-running GET log commands

The "GET log" query fetches all kinds of log data for Checkmk views, e.g. "Events of the last 4 hours".

Depending on how many such views you have and how big your history is, this can take a long time and break Livestatus. The limiting factor in log parsing is, of course, the storage speed of the Checkmk server.


How big are the history and archive of Checkmk? Do you really need all this data?

These files store the state changes of hosts and services:
OMD[site]:~/var/check_mk/core$ du -sh history archive/
688K	history
113M	archive/

One quick and dirty solution could be to remove old history files to speed things up.
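A hedged sketch of such a cleanup: first list archive files older than a chosen age (365 days here is an arbitrary example, not a recommendation), review the list, and only then delete. Keep in mind that deleted history is gone from all event and log views:

```shell
# Run as site user. List history archive files older than 365 days:
find ~/var/check_mk/core/archive -type f -mtime +365 -print

# After reviewing the list, remove them:
# find ~/var/check_mk/core/archive -type f -mtime +365 -delete
```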

Settings to improve the log parsing

In Setup → General → Global settings there are several settings to improve log parsing:




Maximum concurrent Livestatus connections

Usually the default value is fine. If you have a larger number of users, views, or distributed monitoring, you can increase this value step by step (50 → 100 → 150).

Maximum number of cached log messages

In order to speed up queries for historic data, the core keeps an in-memory cache of log file messages. The size of this cache can be configured here. A larger number needs more RAM. Note: even if you set this to 0, there may be cases where messages need to be cached anyway.

You can set this to one million if you have enough memory.

History log rotation: Rotate by size (Limit of the size)

A rotation of the log file will be forced whenever its size exceeds that limit. In a big environment you can increase this value to e.g. 200 MB. Checkmk then needs to parse the same amount of data from fewer files.
Maximum number of parsed lines per logfile

In order to avoid large timeouts in case of oversized history logfiles, the core limits the number of lines read from them. The limit applies per file and can be configured here. Lines beyond the limit are simply ignored, and an error is logged in the CMC daemon logfile.


For example:

2021-12-03 09:30:30 [3] [client 1] more than 500000 lines in "/omd/sites/cmk/var/check_mk/core/history", ignoring the rest! 
2021-12-03 09:30:37 [3] [client 1] more than 500000 lines in "/omd/sites/cmk/var/check_mk/core/history", ignoring the rest! 
2021-12-03 09:30:42 [3] [client 2] more than 500000 lines in "/omd/sites/cmk/var/check_mk/core/history", ignoring the rest! 
2021-12-03 09:30:42 [3] [client 1] more than 500000 lines in "/omd/sites/cmk/var/check_mk/core/history", ignoring the rest! 
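To check in advance whether any of your history files would run into this limit, you can count lines per file. A small sketch, using the same paths as in the du example above (the 500000 threshold is simply the limit shown in the example log):

```shell
# Show history files sorted by line count (run as site user);
# wc prints a grand total as the last line.
wc -l ~/var/check_mk/core/history ~/var/check_mk/core/archive/* \
  | sort -n | tail
```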


Settings in a distributed setup

In a distributed setup you are using the Livestatus proxy daemon. It can be tuned for the central site and all remote sites here:

Setup → General → Global settings → Livestatus Proxy → Livestatus Proxy default connection parameters

Specific settings for each remote site can be configured under: Setup → Distributed monitoring → <SITENAME> → Livestatus Proxy → Livestatus Proxy default connection parameters


Change the "Number of channels to keep open" setting from the default of 5 to the number actually required. You can derive this from the central site's state file ~/var/log/liveproxyd.state, which shows the state of all remote sites as well as all pending or waiting connections.