This document is for debugging the new Kubernetes special agent introduced with Checkmk 2.1 (Werk #13810).
Installation of the Checkmk Cluster Collectors (a.k.a. the Checkmk Kubernetes agent)
We strongly recommend installing the Checkmk Cluster Collectors using our Helm charts. Only install via the YAML manifests we provide if you are very experienced with Kubernetes. Please keep in mind that we cannot support you with a manifest-based installation.
The Helm chart installs and configures all components required to run the agent, and also exposes several useful configuration options that help you set up complex resources automatically. The prerequisites have to be fulfilled before you proceed with the installation.
Below is an example of deploying the Helm chart using a LoadBalancer (requires that the cluster is able to create a LoadBalancer):
$ helm repo add checkmk-chart https://checkmk.github.io/checkmk_kube_agent
$ helm repo update
$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk checkmk-chart/checkmk --set clusterCollector.service.type="LoadBalancer"
Release "checkmk" does not exist. Installing it now.
NAME: checkmk
LAST DEPLOYED: Tue May 17 22:01:07 2022
NAMESPACE: checkmk-monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You can access the checkmk `cluster-collector` via:
LoadBalancer:
==========
NOTE: It may take a few minutes for the LoadBalancer IP to be available.
You can watch the status of by running 'kubectl get --namespace checkmk-monitoring svc -w checkmk-cluster-collector'
  export SERVICE_IP=$(kubectl get svc --namespace checkmk-monitoring checkmk-cluster-collector --template "{{ range (index .status.loadBalancer.ingress 0) }}{{.}}{{ end }}");
  echo http://$SERVICE_IP:8080
# Cluster-internal DNS of `cluster-collector`:
  checkmk-cluster-collector.checkmk-monitoring
=========================================================================================
With the token of the service account named `checkmk-checkmk` in the namespace `checkmk-monitoring` you can now issue queries against the `cluster-collector`.
Run the following to fetch its token and the ca-certificate of the cluster:
  export TOKEN=$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.token}' | base64 --decode);
  export CA_CRT="$(kubectl get secret checkmk-checkmk -n checkmk-monitoring -o=jsonpath='{.data.ca\.crt}' | base64 --decode)";
  # Note: Quote the variable when echo'ing to preserve proper line breaks: `echo "$CA_CRT"`
To test access you can run:
  curl -H "Authorization: Bearer $TOKEN" http://$SERVICE_IP:8080/metadata | jq
As an example, you can pass further configuration options on the command line to the above helm command (these are just examples; depending on your requirements you can specify multiple or separate values):
Flags | Description |
---|---|
--set clusterCollector.service.type="LoadBalancer" | This sets the cluster collector service type to LoadBalancer. |
--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035 | Here, you can specify again a different service type and also the port. |
--version 1.0.0-beta.2 | Specify a version constraint for the chart version to use. |
However, we recommend using this values.yaml to configure your Helm chart.
To do so, run the command:
$ helm upgrade --install --create-namespace -n checkmk-monitoring myrelease checkmk-chart/checkmk -f values.yaml
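As an illustration, a minimal values.yaml could look like the following. The key structure (`clusterCollector.service.*`) mirrors the `--set` flags shown above; treat anything beyond that as an assumption and consult the chart's shipped values.yaml for the authoritative defaults. The heredoc simply writes the file:

```shell
# Sketch only: write a minimal values.yaml mirroring the --set flags above.
# The keys under clusterCollector.service are taken from the flags in this
# document; verify the full structure against the chart's default values.yaml.
cat > values.yaml <<'EOF'
clusterCollector:
  service:
    type: NodePort
    nodePort: 30035
EOF
```

You can then pass the file to the `helm upgrade` command above via `-f values.yaml`.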
After the chart has been successfully deployed, you will be presented with a set of commands for accessing the cluster-collector from the command line. To display those commands again, run:
helm status checkmk -n checkmk-monitoring
At the same time, you can verify whether all the important resources in the namespace have been deployed successfully. For example, the following command lists some of them:
kubectl get pods,deployments,daemonsets,services -n checkmk-monitoring
Exposing the Checkmk Cluster Collector
By default, the API of the Checkmk Cluster Collector is not exposed to the outside (not to be confused with the Kubernetes API itself). Exposing it is required, however, to gather usage metrics and enrich your monitoring.
Checkmk pulls data from this API, which can be exposed via the service checkmk-cluster-collector. To do so, run the helm command with one of the following flags, or set the corresponding values in a values.yaml.
Flags | Description |
---|---|
--set clusterCollector.service.type="LoadBalancer" | This sets the cluster collector service type to LoadBalancer. |
--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035 | Here, you can specify again a different service type and also the port. |
Debugging K8s special agent
- The first step is to find the complete command of the Kubernetes special agent.
The command can be found under "Type of agent >> Program". It consists of multiple parameters, depending on how the datasource program rule has been configured.
OMD[mysite]:~$ cmk -D k8s | more
k8s
Addresses:              No IP
Tags:                   [address_family:no-ip], [agent:special-agents], [criticality:prod], [networking:lan], [piggyback:auto-piggyback], [site:a21], [snmp_ds:no-snmp], [tcp:tcp]
Labels:                 [cmk/kubernetes/cluster:at], [cmk/kubernetes/object:cluster], [cmk/site:k8s]
Host groups:            check_mk
Contact groups:         all
Agent mode:             No Checkmk agent, all configured special agents
Type of agent:
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'k8s' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT'
  Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/k8s
Services:
...
An easier way would be this command: /bin/sh -c "$(cmk -D k8s | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')"
Please note that if a line matching "^Type of agent:" followed by a line matching "^ Program:" occurs more than once, the extracted output may be garbled.
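To see what the grep/cut pipeline above actually extracts, here is the same pipeline applied to a small, hypothetical snippet of `cmk -D` output (the sample text below is made up for illustration; on a real site the input comes from `cmk -D k8s`):

```shell
# Hypothetical two-line sample mimicking the relevant part of `cmk -D k8s`:
sample='Type of agent:
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube --cluster k8s'

# Same extraction as in the one-liner above: take the line following
# "Type of agent:", keep only the text after the first ":" on "Program:".
cmd=$(printf '%s\n' "$sample" | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')
echo "$cmd"
```

The result is the bare agent command line, which `/bin/sh -c` then executes.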
The special agent provides the following options for debugging purposes:
OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube -h
...
  --debug                Debug mode: raise Python exceptions
  -v / --verbose         Verbose mode (for even more output use -vvv)
  --vcrtrace FILENAME    Enables VCR tracing for the API calls
...
Now, you can modify the above command of the Kubernetes special agent like this:
OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube \
    '--cluster' 'at' \
    '--token' 'xyz' \
    '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' \
    '--api-server-endpoint' 'https://<YOUR-IP>:6443' \
    '--api-server-proxy' 'FROM_ENVIRONMENT' \
    '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' \
    '--cluster-collector-proxy' 'FROM_ENVIRONMENT' \
    --debug -vvv --vcrtrace ~/tmp/vcrtrace.txt > ~/tmp/k8s_with_debug.txt 2>&1
Here, you can also reduce '--monitored-objects' to a few resources to get less output.
Run the special agent without debug options to create an agent output, or download it from the cluster host via the Checkmk web interface:
/omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'at' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT' > ~/tmp/k8s_agent_output.txt 2>&1
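Once the agent output file exists, a quick way to get an overview of what the agent produced is to list the `<<<...>>>` section headers it contains. The snippet below first writes a small, made-up sample file so it is self-contained; on a real system you would run the grep directly on ~/tmp/k8s_agent_output.txt. The section names shown are illustrative assumptions, not an exact reproduction of real agent output:

```shell
# Made-up miniature agent output, just to demonstrate the grep
# (real output comes from the agent_kube command above):
printf '%s\n' '<<<kube_pod_resources_v1:sep(0)>>>' '{"running": 2}' \
    '<<<kube_node_count_v1:sep(0)>>>' '{"worker": 1}' > /tmp/sample_k8s_output.txt

# List the unique section headers contained in the output file:
grep -o '^<<<[^>]*>>>' /tmp/sample_k8s_output.txt | sort -u
```

A missing section can be a first hint as to which of the agent's datasources did not deliver data.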
Please upload the following files to the support ticket.
File | Description |
---|---|
~/tmp/vcrtrace.txt | Trace file |
~/tmp/k8s_with_debug.txt | Debug output |
~/tmp/k8s_agent_output.txt | Agent output |
Common errors
- Context: the Kubernetes special agent is slightly unconventional compared with other special agents, as it handles up to three different datasources (the Kubernetes API, the cluster collector container metrics, and the cluster collector node metrics)
  - The connection to the Kubernetes API server is mandatory, while the connections to the others are optional (and determined by the configured datasource rule)
  - A failed connection to the Kubernetes API server is shown by the Checkmk service (as usual) → the agent crashes
  - A failed connection to the cluster collector is highlighted in the Cluster Collector service → the error is not raised by the agent in production
    - The error is only raised when executing the agent with the --debug flag
- Version: We only support the latest three Kubernetes versions (https://kubernetes.io/releases/#:~:text=The%20Kubernetes%20project%20maintains%20release,9%20months%20of%20patch%20support.)
  - If a customer is on the latest release and that release itself is quite new (less than one month old), ask one of the devs whether we already support it
- Kubernetes API connection error: If the agent fails to connect to the Kubernetes API (e.g. 401 Unauthorized when querying api/v1/core/pods), the output produced with the --debug flag should be sufficient
  - Common causes:
    - The serviceaccount was not properly configured in the Kubernetes cluster
    - A wrong token was configured
    - The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL although --verify-cert-api is enabled
    - Wrong IP or port
    - The proxy was not configured in the datasource rule
- Checkmk Cluster Collector connection error:
  - Common causes:
    - The cluster collector is not exposed via either NodePort or Ingress
    - Important resources such as pods, deployments, daemonsets or replicasets are not running or are frequently restarting
    - The cluster collector IP is blocked by a firewall or a security group
    - Wrong IP or port
    - The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL although --verify-cert-api is enabled
    - The proxy was not configured in the datasource rule
- API processing error: If the agent reports a bug similar to "value ... was not set", ask the user for the vcrtrace file