You want to evaluate the performance of Livestatus and find out whether there are performance problems.
Debugging via Livestatus
If the OMD <SITENAME> Performance service reports "CRIT - Site currently not running", or Livestatus appears to be dead all the time, work through this guide to find the bottleneck:
Execute this command as site user to see how long Livestatus needs to respond. The command should return a response immediately. If it does not, Livestatus is busy.
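As an illustration of such a response-time check, here is a minimal Python sketch. It is not the command the service itself uses. It assumes that it runs as the site user and that the Livestatus UNIX socket is at ~/tmp/run/live; the helper names build_query and timed_query are made up for this example.

```python
import socket
import time
from pathlib import Path


def build_query(table, headers=()):
    """Return a Livestatus query: 'GET <table>', header lines, blank line."""
    lines = [f"GET {table}", *headers]
    return "\n".join(lines) + "\n\n"


def timed_query(socket_path, query, timeout=15.0):
    """Send one query to the Livestatus UNIX socket; return (seconds, reply)."""
    start = time.monotonic()
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(str(socket_path))
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell the server the query is complete
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    reply = b"".join(chunks).decode("utf-8", "replace")
    return time.monotonic() - start, reply


if __name__ == "__main__":
    # Assumption: the site user's home directory is /omd/sites/<SITENAME>.
    live = Path.home() / "tmp" / "run" / "live"
    try:
        seconds, reply = timed_query(live, build_query("status"))
        print(f"Livestatus answered in {seconds:.3f} s ({len(reply)} characters)")
    except OSError as exc:
        print(f"No answer from {live}: {exc}")
```

A healthy site should answer a plain "GET status" query within a fraction of a second. A timeout or a delay of several seconds suggests that Livestatus is busy.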
The OMD <SITENAME> Performance service uses this command:
Now we need to find why Livestatus is busy:
This command will show you some statistics (run it in /omd/sites/<SITENAME>/var/log):
This command shows the processing time of every request.
Now you can search for the client with the long-running command. In this example it is "client 22".
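The search for the slowest client can also be scripted. The following sketch rests on an assumption about the log format: that each relevant line names the client as "[client N]" and reports a duration as "in <number> ms". The sample lines are hypothetical, so check a few real lines in your log and adapt the pattern before relying on it.

```python
import re

# Assumption: a log line names the client as "[client N]" and reports the
# duration as "... in <number> ms". Adapt the pattern to your real log lines.
LINE_PATTERN = re.compile(r"\[client (?P<client>\d+)\].*?\bin (?P<ms>\d+) ms")


def slowest_requests(lines, top=5, pattern=LINE_PATTERN):
    """Return up to `top` (milliseconds, client id) pairs, slowest first."""
    hits = []
    for line in lines:
        match = pattern.search(line)
        if match:
            hits.append((int(match.group("ms")), int(match.group("client"))))
    return sorted(hits, reverse=True)[:top]


# Hypothetical sample lines, for illustration only.
sample = [
    "[client 7] processed request in 12 ms",
    "[client 22] processed request in 48213 ms",
    "[client 9] processed request in 340 ms",
]
print(slowest_requests(sample, top=2))
```

To run it against a real file, pass an open file object instead of the sample list, for example slowest_requests(open("cmc.log")). The file name here is only an example.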
Copy this GET query and execute it with time. What kind of GET query is it?
In most cases these are queries like "GET log Columns ...".
In this case you should check:
- Your views and dashboards
- Maybe there is a view causing this long-running command
- Do you have a script accessing data via Livestatus?
Debugging long-running GET log commands
The "GET log" query fetches all kinds of log data for the Checkmk views, e.g. the events of the last 4 hours.
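To make the shape of such a query concrete, the sketch below builds a "GET log" query limited to the last few hours. The function name recent_log_query and the column selection are chosen for illustration. The queries your views actually send will differ and should be taken from the log.

```python
import time


def recent_log_query(hours, now=None,
                     columns=("time", "type", "host_name", "message")):
    """Build a 'GET log' query restricted to the last `hours` hours."""
    now = int(time.time()) if now is None else int(now)
    since = now - int(hours * 3600)
    return (
        "GET log\n"
        f"Columns: {' '.join(columns)}\n"
        f"Filter: time >= {since}\n"
        "\n"
    )


# A fixed 'now' keeps the example reproducible.
print(recent_log_query(4, now=1_700_000_000), end="")
```

When you test a slow query by hand, it is worth checking whether it carries a time filter like this one. A "GET log" query without a time restriction has to consider the whole history.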
Depending on how many such views you have and how big your history is, this can take a long time and make Livestatus unresponsive.
The limiting factor for log parsing is, of course, the storage performance of the Checkmk server.
How big is the history and archive of Checkmk? Do you really need all data?
A quick and dirty solution can be to remove old history files to speed things up.
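Before removing anything, it helps to know how much data there is and how old it is. The sketch below only reports sizes and lists old files; it deletes nothing. The archive path is an assumption for a site running the CMC core, and the 365-day threshold is an arbitrary example value.

```python
import time
from pathlib import Path


def summarize_files(directory, older_than_days, now=None):
    """Return (total_bytes, old_files) for regular files directly in directory."""
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400
    total = 0
    old_files = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        stat = path.stat()
        total += stat.st_size
        if stat.st_mtime < cutoff:
            old_files.append(path)
    return total, old_files


if __name__ == "__main__":
    # Assumption: with the CMC core, rotated history files live below
    # ~/var/check_mk/core/archive; adjust the path to your site.
    archive = Path.home() / "var" / "check_mk" / "core" / "archive"
    if archive.is_dir():
        total, old = summarize_files(archive, older_than_days=365)
        print(f"{total / 1024**2:.1f} MiB in total, "
              f"{len(old)} files older than 365 days")
        for path in old:
            print(path)  # listed only; nothing is deleted
    else:
        print(f"{archive} not found")
```

Review the listed files and back them up if needed before deleting any of them, because removed history is no longer available to views and reports.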
Settings to improve the log parsing
In Setup → Global settings there are several settings to improve the log parsing:
|Maximum concurrent Livestatus connections||Usually the default value should be fine. If you have a large number of users, views, or a distributed monitoring setup, you can increase this value step by step (50 - 100 - 150).|
|Maximum number of cached log messages||In order to speed up queries for historic data, the core keeps an in-memory cache of log file messages. The size of this cache can be configured here. A larger number needs more RAM. Note: even if you set this to 0, there might be some cases where messages need to be cached anyway. You can set this to one million if you have enough memory.|
|History log rotation: Rotate by size (Limit of the size)||A rotation of the log file will be forced whenever its size exceeds that limit. In a big environment you can increase this value, e.g. to 200 MB. Checkmk then has to parse the same amount of data, but spread over fewer files.|
|Maximum number of parsed lines per logfile||In order to avoid long timeouts in case of oversized history logfiles, the core limits the number of lines read from history logfiles. The limit applies per file and can be configured here. Lines exceeding the limit are simply ignored, and an error is logged in the CMC daemon logfile.|
Settings in a distributed setup
In a distributed setup you are using the Livestatus Proxy Daemon. It can be tuned for the central site and all remote sites here:
Setup → General → Global settings → Livestatus Proxy → Livestatus Proxy default connection parameters
Specific settings for each remote site can be done here: Setup → Distributed monitoring →→ Livestatus Proxy → Livestatus Proxy default connection parameters
Change "Number of channels to keep open" from 5 to the number you actually need. You can derive this number from the central site's state file, ~/var/log/liveproxyd.state, which shows the state of all remote sites as well as all pending or waiting connections.