
This document covers debugging the new Kubernetes special agent introduced with Checkmk 2.1 (Werk #13810).

Installation of the Checkmk Cluster Collectors (a.k.a. the Checkmk Kubernetes agent)

We strongly recommend using our Helm charts for installing the Checkmk Cluster Collectors, unless you are very experienced with Kubernetes and want to install the agent using the manifests in YAML format provided by us. Please keep in mind that we cannot support you with installing the agent via the manifests.

The Helm chart installs and configures all components necessary to run the agent, and also exposes several useful configuration options that help you set up complex resources automatically. The prerequisites have to be fulfilled before you proceed with the installation.

Below is an example of deploying the Helm chart:

$ helm repo add tribe29 https://tribe29.github.io/checkmk_kube_agent
$ helm repo update
$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk tribe29/checkmk
Release "checkmk" does not exist. Installing it now.
NAME: checkmk
LAST DEPLOYED: Tue May 17 22:01:07 2022
NAMESPACE: checkmk-monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
You can access the checkmk `cluster-collector` via:
ClusterIP:
export POD_NAME=$(kubectl get pods --namespace checkmk-monitoring -l "app.kubernetes.io/name=checkmk,app.kubernetes.io/instance=checkmk" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace checkmk-monitoring $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "Use http://127.0.0.1:8080 for access."
kubectl --namespace checkmk-monitoring port-forward $POD_NAME 8080:$CONTAINER_PORT
# Cluster-internal DNS of `cluster-collector`: checkmk-cluster-collector.checkmk-monitoring
With the token of the service account named `checkmk-checkmk` in the namespace `checkmk-monitoring` you can now issue queries against the `cluster-collector`.
Run the following to fetch its token, resp. ca-certificate:
export TOKEN=$(kubectl get secret $(kubectl get serviceaccount checkmk-checkmk -o=jsonpath='{.secrets[*].name}' -n checkmk-monitoring) -n checkmk-monitoring -o=jsonpath='{.data.token}' | base64 --decode)
export CA_CRT="$(kubectl get secret $(kubectl get serviceaccount checkmk-checkmk -o=jsonpath='{.secrets[*].name}' -n checkmk-monitoring) -n checkmk-monitoring -o=jsonpath='{.data.ca\.crt}' | base64 --decode)"
# Note: Quote the variable when echo'ing to preserve proper line breaks: `echo "$CA_CRT"`
To test access then run:
kubectl -n checkmk-monitoring run -it test-access --rm --image wbitt/network-multitool --restart=Never -- env TOKEN=$TOKEN sh -c 'curl -H "Authorization: Bearer $TOKEN" checkmk-cluster-collector.checkmk-monitoring.svc:8080/metadata | jq'
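The quoting hint in the NOTES above can be illustrated with a plain shell snippet (the certificate body below is a made-up stand-in, not a real certificate):

```shell
# Stand-in certificate value with embedded newlines (not a real certificate)
CA_CRT=$'-----BEGIN CERTIFICATE-----\nMIIB...\n-----END CERTIFICATE-----'

# Unquoted: word splitting collapses the line breaks into spaces
echo $CA_CRT

# Quoted: the line breaks are preserved, as a PEM file requires
echo "$CA_CRT"
```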

As an example, you can pass additional configuration options on the command line to the above helm command (these are just examples; depending on your requirements, you can specify multiple or separate values):

Flags and their descriptions:

--set clusterCollector.service.type="LoadBalancer"
    Sets the cluster collector service type to LoadBalancer.

--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035
    Sets the service type to NodePort and also specifies the port.

--version 1.0.0-beta.2
    Specifies a version constraint for the chart version to use.

However, we recommend using this values.yaml to configure your Helm chart.

To apply it, run:

$ helm upgrade --install --create-namespace -n checkmk-monitoring checkmk tribe29/checkmk -f values.yaml
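As a sketch, a minimal values.yaml mirroring the --set flags shown earlier could be created like this (NodePort 30035 is only an example value; adjust it to your cluster):

```shell
# Create a minimal values.yaml overriding the cluster collector service
# (the keys mirror the --set flags above; the port is an example value)
cat > values.yaml <<'EOF'
clusterCollector:
  service:
    type: NodePort
    nodePort: 30035
EOF

# Then deploy the chart with the file:
# helm upgrade --install --create-namespace -n checkmk-monitoring checkmk tribe29/checkmk -f values.yaml
```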

After the chart has been successfully deployed, you will be presented with a set of commands with which you can access the cluster-collector from the command line. To display those commands again later, run:

helm status checkmk -n checkmk-monitoring

At the same time, you can verify whether all the important resources in the namespace have been deployed successfully, for example with `kubectl get all -n checkmk-monitoring`.

Exposing the Checkmk Cluster Collector

By default, the API of the Checkmk Cluster Collector is not exposed to the outside (not to be confused with the Kubernetes API itself). Exposing it is, however, required to gather usage metrics and enrich your monitoring.
Checkmk pulls data from this API, which can be exposed via the service checkmk-cluster-collector. To do so, run the installation with one of the following flags, or set these flags in a values.yaml.

Flags and their descriptions:

--set clusterCollector.service.type="LoadBalancer"
    Sets the cluster collector service type to LoadBalancer.

--set clusterCollector.service.type="NodePort" --set clusterCollector.service.nodePort=30035
    Sets the service type to NodePort and also specifies the port.

Debugging the Kubernetes special agent

  1. The first step is to find the complete command of the Kubernetes special agent.
    • The command can be found under "Type of agent >> Program". It consists of multiple parameters, depending on how the datasource program rule has been configured.

      OMD[mysite]:~$ cmk -D k8s | more

      k8s
      Addresses: No IP
      Tags: [address_family:no-ip], [agent:special-agents], [criticality:prod], [networking:lan],
      [piggyback:auto-piggyback], [site:a21], [snmp_ds:no-snmp], [tcp:tcp]
      Labels: [cmk/kubernetes/cluster:at], [cmk/kubernetes/object:cluster], [cmk/site:k8s]
      Host groups: check_mk
      Contact groups: all
      Agent mode: No Checkmk agent, all configured special agents
      Type of agent:
      Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'k8s' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT'
      Process piggyback data from /omd/sites/mysite/tmp/check_mk/piggyback/k8s
      Services:
      .....

      An easier way would be this command: /bin/sh -c "$(cmk -D k8s | grep -A1 "^Type of agent:" | grep "Program:" | cut -f2- -d':')"

      Please note that if a line matching "^Type of agent:" followed by a line matching "^  Program:" exists more than once, then the output might be messed up.
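The extraction pipeline in that one-liner can be tried on a saved sample of the `cmk -D` output (the sample below is illustrative; real output comes from your site):

```shell
# Illustrative sample of the relevant part of `cmk -D <host>` output
cat > /tmp/cmk_D_sample.txt <<'EOF'
Type of agent:
  Program: /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'k8s'
EOF

# Same grep/cut pipeline as in the one-liner above: take the line after
# "Type of agent:", keep the "Program:" line, and cut everything after the colon
CMD="$(grep -A1 '^Type of agent:' /tmp/cmk_D_sample.txt | grep 'Program:' | cut -f2- -d':')"
echo "$CMD"
```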

    • The special agent has the following options available for debugging purposes:

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube -h
      ...
      --debug                     Debug mode: raise Python exceptions
      -v / --verbose              Verbose mode (for even more output use -vvv)
      --vcrtrace FILENAME         Enables VCR tracing for the API calls
      ...
    • Now, you can modify the above command of the Kubernetes special agent like this:

      OMD[mysite]:~$ /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'at' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT' --debug -vvv --vcrtrace ~/tmp/vcrtrace.txt > ~/tmp/k8s_with_debug.txt 2>&1

      Here, you can also reduce the list after '--monitored-objects' to a few resources to get less output.

    • Run the special agent without debug options to create an agent output, or download it from the cluster host via the Checkmk web interface:

      /omd/sites/mysite/share/check_mk/agents/special/agent_kube '--cluster' 'at' '--token' 'xyz' '--monitored-objects' 'deployments' 'daemonsets' 'statefulsets' 'nodes' 'pods' '--api-server-endpoint' 'https://<YOUR-IP>:6443' '--api-server-proxy' 'FROM_ENVIRONMENT' '--cluster-collector-endpoint' 'https://<YOUR-ENDPOINT>:30035' '--cluster-collector-proxy' 'FROM_ENVIRONMENT' > ~/tmp/k8s_agent_output.txt 2>&1
  2. Please upload the following files to the support ticket:

~/tmp/vcrtrace.txt          Trace file
~/tmp/k8s_with_debug.txt    Debug output
~/tmp/k8s_agent_output.txt  Agent output
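Before uploading, the three artifacts can be bundled into a single archive (a sketch; the touch line only creates placeholders so the snippet is self-contained, and it does not truncate existing files):

```shell
mkdir -p ~/tmp
# Placeholders so the snippet also works outside a real debugging session;
# touch does not truncate files that already exist
touch ~/tmp/vcrtrace.txt ~/tmp/k8s_with_debug.txt ~/tmp/k8s_agent_output.txt

# Bundle all three artifacts for attaching to the support ticket
tar czf ~/tmp/k8s_debug_bundle.tar.gz -C ~/tmp vcrtrace.txt k8s_with_debug.txt k8s_agent_output.txt
```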

Common errors

  • Context: the Kubernetes special agent is slightly unconventional compared to other special agents, as it handles up to three different datasources (the API, the cluster collector container metrics, and the cluster collector node metrics).
    • The connection to the Kubernetes API server is mandatory, while the connections to the others are optional (and decided through the configured datasource rule).
      • Failure to connect to the Kubernetes API server will be shown by the Checkmk service (as usual) → the agent crashes
      • Failure to connect to the cluster collector will be highlighted in the Cluster Collector service → the error is not raised by the agent in production
        • the error is only raised when executing the agent with the --debug flag
  • Version: we only support the latest three Kubernetes versions (https://kubernetes.io/releases/)
    • If a customer is on the latest release and that release itself is quite new (less than one month old), ask one of the devs whether we already support it.
  • Kubernetes API connection error: If the agent fails to make a connection to the Kubernetes API (e.g. 401 Unauthorized to query api/v1/core/pods) then the output based on the --debug flag should be sufficient
    • Common causes:
      • The serviceaccount was not properly configured in the Kubernetes cluster.
      • Wrong token configured.
      • The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL, but --verify-cert-api is enabled.
      • Wrong IP or port.
      • Proxy not configured in the datasource rule.
  • Checkmk Cluster Collector connection error:
    • Common causes:
      • The cluster collector is not exposed via either NodePort or Ingress.
      • Important resources such as pods, deployments, daemonsets, replica sets etc. are not running or are frequently restarting.
      • The cluster collector IP is blocked by a firewall or a security group.
      • Wrong IP or port.
      • The ca.crt was not uploaded under Global settings >> Trusted certificate authorities for SSL, but --verify-cert-collector is enabled.
      • Proxy not configured in the datasource rule.
  • API processing error: If the agent reports a bug similar to "value ... was not set" then the user should be asked for the vcrtrace file.