In this how-to, we explain how to debug NagVis performance issues.
Basics
- Check the OMD <SITENAME> performance graphs of the affected site
- The following graphs are important:
  - Livestatus Connects and Requests - localhost - OMD <SITENAME> performance
  - Livestatus Requests per Connection - localhost - OMD <SITENAME> performance
  - Livestatus usage - localhost - OMD <SITENAME> performance
  - Check_MK helper usage - localhost - OMD <SITENAME> performance
- Do you see peaks in these graphs? If so, check the liveproxyd.log in the site user context
- Check the Livestatus proxy settings:
  - "Maximum concurrent Livestatus connections": in the global and site-specific global settings
  - "Livestatus Proxy default connection parameters": in the global and site-specific global settings
  We recommend using the default levels. In some cases it makes sense to configure higher values; please ask support for guidance.
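If you suspect the Livestatus proxy is saturated, a quick scan of the liveproxyd.log can give hints. This is only a rough sketch: the grep patterns below are illustrative assumptions, since the exact warning wording differs between Checkmk versions.

```shell
# check_liveproxyd_log: scan a liveproxyd.log for common saturation hints.
# NOTE: the patterns are illustrative assumptions; exact messages
# differ between Checkmk versions.
check_liveproxyd_log() {
  local log="${1:-$HOME/var/log/liveproxyd.log}"
  if [ ! -f "$log" ]; then
    echo "log not found: $log"
    return 0
  fi
  # print the number of suspicious lines (0 on a healthy site)
  grep -icE "overflow|too many|queue full|timeout" "$log" || true
}

# usage (as site user): check_liveproxyd_log
```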
- Clean up your map:
  - Do you have objects on your map which no longer exist in Checkmk?
  - Do you have a map with nested maps? If so, check those maps as well for objects which no longer exist in Checkmk
- How often does your NagVis map refresh? You can modify this value
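Raising the refresh interval reduces the Livestatus load caused by automatic map reloads. If memory serves, this is configured in NagVis' main configuration file; treat the path and option name below as assumptions to verify against your NagVis version:

```ini
; etc/nagvis/nagvis.ini.php inside the OMD site (path is an assumption)
[global]
; page refresh interval in seconds; a higher value means fewer
; Livestatus bursts from automatic map reloads
refreshtime=120
```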
If the map takes a long time to open, you might need to debug further. In this case, we recommend checking the Livestatus queries while reloading the map.
Network analysis
To see how long the map really takes to load, we recommend using the network analyzer of your web browser: Checkmk profiling#NetworkAnalyzewiththeinternetbrowser
Debugging with Livestatus
Enable the debug log
How to collect troubleshooting data for various issue types#LivestatusProxy
Debug with lq queries
The best way to debug with lq queries is:
- tail -f ~/var/log/liveproxyd.log >/path/to/file.txt
- reload the nagvis map
- analyze the file
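The three steps above can be sketched as two small helper functions (paths as used in this document; the 60-second capture window is an arbitrary choice):

```shell
# capture_lq: record new liveproxyd traffic for N seconds into a file.
# Reload the NagVis map in the browser while this runs.
capture_lq() {
  local out="${1:-/tmp/lq_nagvis.txt}" secs="${2:-60}"
  # -n 0 skips old log content so only the reload traffic is captured
  timeout "$secs" tail -n 0 -f "$HOME/var/log/liveproxyd.log" > "$out" || true
}

# count_requests: how many Livestatus requests were captured?
count_requests() {
  grep -c "Send request" "${1:-/tmp/lq_nagvis.txt}" || true
}

# usage (as site user):
#   capture_lq /tmp/lq_nagvis.txt 60   # reload the map while it runs
#   count_requests /tmp/lq_nagvis.txt
```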
Detect long-running lq queries
Do you see any of the following?
- an unusually large lq query
- a long-running query
- a message that repeats periodically
You can try to execute such a query over the network and see how long it takes:
Livestatus queries#Livestatusqueriesovernetwork
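As a sketch: a Livestatus query is plain text terminated by an empty line, so it can be written to a file and sent with netcat. The hostname is a placeholder, 6557 is only the commonly used Livestatus TCP port, and Livestatus over TCP must be enabled on the remote site; adjust all of these to your setup.

```shell
# Write the query seen in the log to a file; the trailing empty line
# terminates the query.
cat > /tmp/downtimes.lq <<'EOF'
GET downtimes
Columns: author comment start_time end_time
Filter: host_name = random_0955431380
OutputFormat: json
ResponseHeader: fixed16

EOF

# Send it over TCP and time the round trip. HOST is a placeholder and
# 6557 is only the commonly used Livestatus port -- adjust both.
# time nc -w 5 HOST 6557 < /tmp/downtimes.lq
```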
One example
Infrastructure
OS: Ubuntu 20.04
Version: Checkmk 1.6.0p24
Sites: one master and one slave
The map
This is a dynamic map with my slave site as backend. I created the map on the master site and access it from there.
The Debugging
This approach only applies if you are running a distributed setup; in that case, you can run the command on the central site.
If you are running a single site, increase the Livestatus logging to debug (How to collect troubleshooting data for various issue types#Core) and check the cmc.log.
OMD[cme1]:~$ tail -f var/log/liveproxyd.log >/tmp/lq_nagvis.txt
OMD[cme1]:~$ cat /tmp/lq_nagvis.txt | grep "GET downtimes" | more
2021-07-21 13:53:04,645 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,646 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,696 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,697 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,747 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,748 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,798 [10] [cmk.liveproxyd.(1108792).Site(cmes).Client(13)] Send request 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
2021-07-21 13:53:04,798 [10] [cmk.liveproxyd.(1108792).Site(cmes).Thread(Thread-2).Channel(7)] Send: 'GET downtimes\nColumns: author comment start_time end_time\nFilter: host_name = random_0955431380\nOutputFormat: json\nKeepAlive: on\nResponseHeader: fixed16\n\n'
The whole logfile: lq_nagvis.txt
What I noticed in the logfile
- a large number of lq "GET downtimes" queries during the map reload
If I count the "GET downtimes" lines, there are 4836:
OMD[cme1]:~$ cat /tmp/lq_nagvis.txt | grep "GET downtimes" | wc -l
4836
- All the other queries look small and unproblematic
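To see at a glance which query types dominate such a capture, the lines can be grouped and counted. This is a generic shell one-liner wrapped in a function, not a Checkmk-specific tool:

```shell
# count_query_types: group captured Livestatus queries by table and
# sort by frequency, most frequent first.
count_query_types() {
  grep -o "GET [a-z]*" "${1:-/tmp/lq_nagvis.txt}" | sort | uniq -c | sort -rn
}

# usage: count_query_types /tmp/lq_nagvis.txt
```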
Further debugging
The log shows a large number of "GET downtimes" queries. Every time I reload the map, my master sends thousands of these queries via Livestatus.
When I checked my Checkmk site, I found that several host downtimes were set. This explains why my master collects all downtimes before NagVis renders the map.
The Workaround
- Remove all downtimes. The map will open noticeably faster
- Access the map directly via the slave/local site
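Removing many downtimes by hand is tedious; one way to clear them in bulk is via Livestatus external commands (DEL_HOST_DOWNTIME / DEL_SVC_DOWNTIME, which take a downtime id). The sketch below builds the commands from a Livestatus listing; it assumes the site's lq tool reads the query from stdin, and the usage comment shows the actual (destructive) execution, so verify on a test site first.

```shell
# build_delete_commands: turn "id;is_service" lines (as returned by
# Livestatus for "GET downtimes" with Columns: id is_service) into one
# external command per downtime.
build_delete_commands() {
  local now id is_service
  now=$(date +%s)
  while IFS=';' read -r id is_service; do
    if [ "$is_service" = "1" ]; then
      echo "COMMAND [$now] DEL_SVC_DOWNTIME;$id"
    else
      echo "COMMAND [$now] DEL_HOST_DOWNTIME;$id"
    fi
  done
}

# usage (as site user; DESTRUCTIVE -- this deletes all downtimes):
#   printf 'GET downtimes\nColumns: id is_service\n' | lq \
#     | build_delete_commands \
#     | while read -r cmd; do echo "$cmd" | lq; done
```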
This behavior is fixed in Checkmk 2.0: downtimes no longer affect the reload time of the map.
Related articles