CheckMK Distributed Monitoring

Summary

The purpose of this guide is to show the strength and flexibility of CheckMK’s distributed monitoring. As you add hosts and services, the resource requirements grow. It is easy to get into the rut of adding more CPU and RAM until you have a monstrosity of a server that you cannot expand anymore.

Monitoring every location from one central server may not even be possible: the central CheckMK server may not have network access to all of the remote devices.

Pre-requisites and Installation

To start off, we will need another CheckMK instance. If you do not already have one, check out the Introduction to CheckMK guide. Once you have CheckMK installed and a new site with a unique name set up, the rest is trivial.

Distributed monitoring also requires the remote (slave) site to listen for Livestatus connections on TCP/6557, so we need to open that port in the firewall.

[root@chckmk2 ~]# firewall-cmd --zone=public --add-port=6557/tcp --permanent
[root@chckmk2 ~]# firewall-cmd --reload
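If you run through this more than once, or script it, a small idempotent wrapper avoids piling up duplicate rules. This is a minimal bash sketch, not something that ships with CheckMK: ensure_port_open is a hypothetical helper name, and it assumes firewalld with the public zone, exactly as in the commands above. It relies on firewall-cmd --query-port, which exits 0 when the port is already open in the running configuration.

```shell
#!/usr/bin/env bash
# ensure_port_open: hypothetical helper, not part of CheckMK or OMD.
# Opens a TCP port in a firewalld zone only if it is not already open.
ensure_port_open() {
  local port="$1" zone="${2:-public}"
  # --query-port exits 0 when the port is open in the runtime configuration
  if firewall-cmd --zone="$zone" --query-port="${port}/tcp" >/dev/null 2>&1; then
    echo "port ${port}/tcp already open in zone ${zone}"
    return 0
  fi
  # --permanent writes the rule to disk; --reload loads it into the runtime
  firewall-cmd --zone="$zone" --add-port="${port}/tcp" --permanent &&
    firewall-cmd --reload
}
```

With the function pasted into a root shell on the second server:

[root@chckmk2 ~]# ensure_port_open 6557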

[root@chckmk2 ~]# su - second

OMD[second]:~$ omd config

We then enable distributed monitoring and Livestatus in the omd config menu.

[Screenshot: omd config, Distributed Monitoring] Select “Distributed Monitoring”.

[Screenshot: omd livestatus] Enable Livestatus, which will listen on port 6557.
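The omd config menu is interactive, but the same settings can also be changed with omd config set, which is handy when scripting a new remote site. A minimal sketch, assuming the setting names LIVESTATUS_TCP and LIVESTATUS_TCP_PORT (omd config show lists the exact names on your version); enable_livestatus_tcp is a hypothetical helper. omd config set will not change settings while the site is running, so the sketch stops the site first.

```shell
#!/usr/bin/env bash
# enable_livestatus_tcp: hypothetical helper, not part of OMD.
# Run as the site user (e.g. after "su - second"). Assumes the OMD setting
# names LIVESTATUS_TCP and LIVESTATUS_TCP_PORT; check with "omd config show".
enable_livestatus_tcp() {
  local port="${1:-6557}"
  # "omd config set" will not change settings while the site is running
  omd stop
  omd config set LIVESTATUS_TCP on || return 1
  omd config set LIVESTATUS_TCP_PORT "$port" || return 1
  omd start
}
```

OMD[second]:~$ enable_livestatus_tcp 6557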

Configure Connection

From the main site (in this series, http://chckmk1.woohoosvcs.com/main), navigate to Distributed Monitoring.

[Screenshot: Distributed Monitoring]

[Screenshot: New Connection] Create a new connection.

[Screenshot: Second connection options] Fill out the appropriate options.

[Screenshot: Login to second site] Select the credentials that have access to log in to the second site.

[Screenshot: Login] Log in again!

[Screenshot: Save changes] Save the changes.

[Screenshot: Activate] Activate!
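Under the hood, this connection is a Livestatus query over TCP/6557, so if the connection test fails it helps to query the second site by hand from the main server. A Livestatus query is plain text: a GET line naming a table, optional header lines such as Columns:, and an empty line to finish it; by default Livestatus closes the connection after answering. This sketch uses bash's built-in /dev/tcp so it needs no extra tools. lql_build and lql_send are hypothetical helper names, and the host name chckmk2.woohoosvcs.com in the usage line is an assumption based on the prompts above.

```shell
#!/usr/bin/env bash
# lql_build / lql_send: hypothetical helpers, not part of CheckMK.

# Print a Livestatus query: "GET <table>", an optional "Columns:" header,
# then the empty line that ends the query.
lql_build() {
  local table="$1"
  shift
  printf 'GET %s\n' "$table"
  if (( $# > 0 )); then
    printf 'Columns: %s\n' "$*"
  fi
  printf '\n'
}

# Send the query read from stdin to host:port and print the reply.
# Livestatus answers and then closes the connection, which ends the read.
lql_send() {
  local host="$1" port="${2:-6557}"
  (
    # Subshell: a refused connection only ends the subshell, not the caller
    exec 3<>"/dev/tcp/${host}/${port}" || exit 1
    cat >&3
    cat <&3
  )
}
```

[root@chckmk1 ~]# lql_build status program_version num_hosts | lql_send chckmk2.woohoosvcs.com

If the firewall rule or the Livestatus setting is missing, the connection is refused instead of returning a line of semicolon-separated values.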

Use Connection

Now that we have the connection, how do we actually use it? One of the easiest, and most likely, ways is to create a folder whose hosts are monitored by that site.

[Screenshot: New Folder] Create a folder with a monitored site.

Next, I added a host to this new folder. CheckMK is smart enough to know that only the “second” site needs to be reloaded, because the changes affect only that site.

[Screenshot: Activate second site changes]

Final Words

This article was mostly pictures, but hopefully the point comes across: distributed monitoring in CheckMK is flexible and easy to set up.

Upgrading CheckMK

Summary

In this article we will discuss the upgrade process using OMD. After upgrading, we will also go over the “werks”: the changes and incompatibilities between versions.

If you have a fully functional environment, such as one installed per the Introduction to CheckMK guide, this should be fairly straightforward.

Prerequisites

CheckMK 1.5 RPM (we downloaded this in the previous article)

[root@chckmk1 ~]# curl -O https://checkmk.com/support/1.5.0p23/check-mk-raw-1.5.0p23-el7-38.x86_64.rpm

[root@chckmk1 ~]# yum install check-mk-raw-1.5.0p23-el7-38.x86_64.rpm 

Wait, what just happened? Was that it? Did it just get upgraded? Yes and no. CheckMK 1.5 was installed alongside 1.4, but our site has not been upgraded to it yet.

[root@chckmk1 ~]# omd sites
SITE             VERSION          COMMENTS
main             1.4.0p38.cre      

[root@chckmk1 ~]# omd versions
1.4.0p38.cre
1.5.0p23.cre (default)
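On a server with several sites, it is easy to lose track of which ones still need updating. This small sketch compares each site's VERSION column with the version marked (default), assuming the text layout shown above. Because omd may leave out the SITE/VERSION header when its output is piped, the header is skipped by matching its content rather than by line number. default_version and outdated_sites are hypothetical helpers.

```shell
#!/usr/bin/env bash
# default_version / outdated_sites: hypothetical helpers, not part of OMD.

# Read "omd versions" output and print the version marked "(default)".
default_version() {
  awk '$2 == "(default)" { print $1 }'
}

# Read "omd sites" output and print every site whose VERSION column differs
# from the version given as $1. The header is skipped by content, not position.
outdated_sites() {
  awk -v want="$1" '
    $1 == "SITE" && $2 == "VERSION" { next }
    NF >= 2 && $2 != want { print $1 " (" $2 ")" }
  '
}
```

[root@chckmk1 ~]# omd sites | outdated_sites "$(omd versions | default_version)"

With the output shown above, this would print main (1.4.0p38.cre).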

Each OMD site runs as its own Linux user, so we will su to main to run commands for that site.

[root@chckmk1 ~]# su - main
OMD[main]:~$ omd update
Please completely stop 'main' before updating it.

# Yes that's a great idea!
OMD[main]:~$ omd stop
OMD[main]:~$ omd update

[Screenshot: “omd update”]
OMD[main]:~$ omd version
OMD - Open Monitoring Distribution Version 1.5.0p23.cre
OMD[main]:~$ omd start
[Screenshot: Version 1.5.0p23 with “57”]
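The whole upgrade boils down to a few omd commands run as the site user, and wrapping them in a function makes it harder to forget the stop. This is a minimal sketch with update_site as a hypothetical name: omd update itself is still interactive, asking for confirmation and, where changed config files conflict, how to resolve them. If the update fails, the function leaves the site stopped so you can look at it before starting it again.

```shell
#!/usr/bin/env bash
# update_site: hypothetical helper, not part of OMD.
# Run as the site user (after "su - main"). "omd update" refuses to run
# while the site is up, so stop it first.
update_site() {
  omd stop
  # Interactive: may ask for confirmation and how to resolve file conflicts
  omd update || { echo "omd update failed; site left stopped" >&2; return 1; }
  # Prints "OMD - Open Monitoring Distribution Version <version>"
  omd version
  omd start
}
```

OMD[main]:~$ update_site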

Are we there yet?

We most definitely are. We are on 1.5.0p23 and that went fairly well. But what is the “57”?

Werks

According to the CheckMK site (https://checkmk.de/check_mk-werks.php), a werk is a bug fix or change that has an effect on the user.

Clicking on the 57 shows the 57 incompatible werks that we should be aware of. Often these are non-issues. Other times, certain metrics have gone away or configuration options have changed.

[Screenshot: “Werks” AKA Release Notes]

This is a huge help when upgrading. Instead of having to dig through text file release notes to see what major changes happened, we have werks! You can see any incompatible changes here and drilling into them will give you details on what to do. Once you have addressed it, you can acknowledge the werk.

Click on “Show unacknowledged incompatible werks”.

[Screenshot: Incompatible werk] Address and acknowledge.

As we can see here, there is a clear description of what changed. Once we have addressed it, we can acknowledge it, and the 57 will decrement to 56.

The release notes page also has an “acknowledge all” option if you really do not care to work through the werks. For this lab, I have gone through these before, so I will just acknowledge all.

If you have a large deployment, most of your time will be spent going through the werks and addressing them. That said, going from 1.4 to 1.5 has been a breeze. Going from 1.2.8 to 1.4 was a bit rougher, with more incompatible werks that caused issues.

Rollback

OMD makes rollbacks fairly easy. It does not care whether an “update” goes forward or backward. The only issue you may have is if you made a configuration change that is only implemented in, or compatible with, 1.5; in that case, 1.4 may have trouble with it. Otherwise, the rollback is the same as the upgrade: 1) stop the site, 2) run omd update, this time targeting the older version.
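The only real difference for a rollback is choosing the target version, since a plain omd update moves the site to the default version. This sketch assumes omd's global -V option, which selects the installed version whose omd runs the command (check omd help on your install), and refuses to proceed if the requested version is not listed by omd versions. rollback_site is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# rollback_site: hypothetical helper, not part of OMD.
# Run as the site user. Assumes omd's global -V option, which picks the
# installed version whose omd performs the command (here: the update).
rollback_site() {
  local target="$1"
  if [[ -z "$target" ]]; then
    echo "usage: rollback_site VERSION (e.g. 1.4.0p38.cre)" >&2
    return 2
  fi
  # Only versions still listed by "omd versions" can be targeted
  if ! omd versions | awk -v t="$target" '$1 == t { found = 1 } END { exit !found }'; then
    echo "version $target is not installed" >&2
    return 1
  fi
  omd stop
  omd -V "$target" update || return 1
  omd start
}
```

OMD[main]:~$ rollback_site 1.4.0p38.cre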

Final Words

Now we have a fairly up-to-date CheckMK. Why not try doing the same with 1.6.0? For production deployments I usually wait for a few patch releases, and 1.6.0 is still very early for my taste.

Introduction to CheckMK

Summary

The purpose of this guide is to provide a high-level overview of CheckMK. CheckMK is a great monitoring tool that has matured considerably over the years. I have depended heavily on it for at least 3 years now.

Background

I came across CheckMK on a project that required a shotgun replacement of the existing monitoring solution. A few solutions were vetted, and Nagios was attempted. Unfortunately, the time needed to tweak and tune it did not fit the project timeline.

About CheckMK

CheckMK is an ecosystem that was originally built around Nagios, and many of the Nagios components are still present. Mathias Kettner is the founder of CheckMK.

There are several editions, but this guide covers the “Raw” edition, which is essentially the free, unlimited tier.

Installation

Enough of the background; let’s get down to the technical installation.

Requirements

For the purposes of this installation, we will be using a vanilla “minimal” install of CentOS 7.0. CentOS 7 is the latest CentOS version that CheckMK supports. The VM will have 1 CPU core, 1 GB of RAM and an 8 GB disk. We will first install 1.4.0 so that the upgrade process can be shown.

curl -O https://checkmk.com/support/1.4.0p38/check-mk-raw-1.4.0p38-el7-85.x86_64.rpm

curl -O https://checkmk.com/support/1.5.0p23/check-mk-raw-1.5.0p23-el7-38.x86_64.rpm

# Always good to update first!
yum update

# Enable EPEL package repo
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

# Then attempt to install
yum install check-mk-raw-1.4.0p38-el7-85.x86_64.rpm

OK, great, we ran through all of that. What’s next?

Configuration

Open Monitoring Distribution

CheckMK builds upon a framework called Open Monitoring Distribution (OMD). You may be asking: why the complexity? OMD makes upgrades much easier, and because it supports multiple independent instances (sites), it lets you run multiple versions of CheckMK on the same machine. The CheckMK RPMs install OMD for you.
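The way OMD keeps several versions side by side is simple: each CheckMK version lives in its own directory under /omd/versions, and each site records which one it uses with a symlink. This sketch assumes the usual layout /omd/sites/<site>/version -> ../../versions/<version> and prints every site with its version by resolving that link. omd_site_versions is a hypothetical helper; omd sites is the supported way to get the same information.

```shell
#!/usr/bin/env bash
# omd_site_versions: hypothetical helper, not part of OMD.
# Prints "<site> <version>" for each site by resolving the site's "version"
# symlink, assuming the layout /omd/sites/<site>/version -> ../../versions/<v>.
omd_site_versions() {
  local root="${1:-/omd}" dir target
  for dir in "$root"/sites/*/; do
    dir="${dir%/}"
    # Skip anything that is not a site (no version symlink)
    [[ -L "$dir/version" ]] || continue
    target="$(readlink "$dir/version")"
    printf '%s %s\n' "${dir##*/}" "${target##*/}"
  done
}
```

Once the site below has been created, [root@chckmk1 ~]# omd_site_versions should print main 1.4.0p38.cre.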

The first step after installing CheckMK is to create an OMD site.

# Here you can see there are no sites
[root@chckmk1 ~]# omd sites
SITE             VERSION          COMMENTS

# We then create a site
[root@chckmk1 ~]# omd create main
Adding /opt/omd/sites/main/tmp to /etc/fstab.
Creating temporary filesystem /omd/sites/main/tmp...OK
Restarting Apache...OK
Created new site main with version 1.4.0p38.cre.

  The site can be started with omd start main.
  The default web UI is available at http://chckmk1.woohoosvcs.com/main/

  The admin user for the web applications is cmkadmin with password: OkWZHNQr
  (It can be changed with 'htpasswd -m ~/etc/htpasswd cmkadmin' as site user.
)
  Please do a su - main for administration of this site.


[root@chckmk1 ~]# omd sites
SITE             VERSION          COMMENTS
main             1.4.0p38.cre     default version 

[root@chckmk1 ~]# omd start main
Starting mkeventd...OK
Starting rrdcached...OK
Starting npcd...OK
Starting nagios...2019-11-02 15:43:45 [6] updating log file index
2019-11-02 15:43:45 [6] updating log file index
OK
Starting dedicated Apache for site main...OK
Initializing Crontab...OK

# Open port 80 with firewalld
[root@chckmk1 ~]# firewall-cmd --zone=public --add-service=http --permanent
success
[root@chckmk1 ~]# firewall-cmd --reload

# SELinux: allow the system Apache to proxy requests to the site's dedicated Apache

setsebool -P httpd_can_network_connect 1

If everything went well, you should be able to browse to the IP or URL and get a login page.
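From another machine, the same check can be done with curl, which exercises the firewall rule and the SELinux boolean in one go. A minimal sketch: check_web_ui is a hypothetical helper, and it assumes the site URL redirects to the login page, so -L follows redirects and only the final HTTP status code is compared with 200.

```shell
#!/usr/bin/env bash
# check_web_ui: hypothetical helper, not part of CheckMK.
# -s hides progress, -L follows redirects to the login page,
# -o discards the body and -w prints only the final HTTP status code.
check_web_ui() {
  local url="$1" code
  code=$(curl -sL -o /dev/null -w '%{http_code}' "$url") || return 1
  if [[ "$code" == "200" ]]; then
    echo "OK: $url answered HTTP $code"
  else
    echo "FAIL: $url answered HTTP $code" >&2
    return 1
  fi
}
```

check_web_ui http://chckmk1.woohoosvcs.com/main/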

[Screenshot: CheckMK 1.4.0 Login Page]

It is really that simple. You now have a working CheckMK instance ready to be configured.

[Screenshot: CheckMK Main Page]

Configuring CheckMK Application

Now that we have a working instance, we want to actually monitor something, so why not the CheckMK server itself? There are a few options for this: SNMP, the CheckMK agent, or both. We will go over installing and configuring the agent.

Installing the CheckMK Agent

The agent requires xinetd, which ties a script (the agent) to TCP port 6556: every connection to that port runs the agent and sends back its output. We will install the agent directly on the CheckMK server.

The agents can be found in the “Monitoring Agents” section. For RPM-based distributions, it is easiest to just install the RPM.

[root@chckmk1 ~]# curl -O http://chckmk1.woohoosvcs.com/main/check_mk/agents/check-mk-agent-1.4.0p38-1.noarch.rpm

[root@chckmk1 ~]# yum install check-mk-agent-1.4.0p38-1.noarch.rpm

[root@chckmk1 ~]# netstat -an | grep 6556
tcp6       0      0 :::6556                 :::*                    LISTEN     
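netstat only shows that something is listening on 6556. To see what CheckMK will actually receive, read the port directly: the agent writes its plain-text sections and then closes the connection, starting with a <<<check_mk>>> section that includes a Version: line. This bash sketch uses /dev/tcp, so it needs no extra tools; agent_version and parse_agent_version are hypothetical helper names.

```shell
#!/usr/bin/env bash
# agent_version / parse_agent_version: hypothetical helpers, not part of CheckMK.

# Read raw agent output on stdin and print the Version from <<<check_mk>>>.
parse_agent_version() {
  awk -F': ' '
    /^<<<check_mk>>>$/ { in_section = 1; next }
    /^<<</             { in_section = 0 }
    in_section && $1 == "Version" { print $2; exit }
  '
}

# Connect to the agent port; xinetd runs the agent, which writes and closes.
agent_version() {
  local host="${1:-localhost}"
  # Subshell so a failed connection only ends the subshell, not the caller
  ( exec 3<"/dev/tcp/${host}/6556" && cat <&3 ) | parse_agent_version
}
```

[root@chckmk1 ~]# agent_version localhost

For the agent installed above, this should print 1.4.0p38.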

Adding Host

Now we get to add the host to CheckMK.

[Screenshot: New hosts] Go to “Hosts” and then “New host”.

[Screenshot: Select host monitoring configuration] Enter the hostname or IP address. The agent type defaults to Check_MK Agent, but I wanted to point it out. Then choose “Save & go to Services”.

[Screenshot: Fix all missing/vanished services] This screen shows all of the services CheckMK can monitor on the host. We want to click “Fix all missing/vanished” and then save; the pending changes button, which reads “1 change” beforehand, should then read “2 changes”.

[Screenshot: Activate new changes made] At this point we need to activate those changes. This typically requires a reload of Nagios, which the activation performs.

[Screenshot: Host added but stale host and services] But wait, what am I seeing now, and why are the host and its services stale?

[Screenshot: Some activity] Here we go, some activity!

At this point, we have added a host and some services to it. It then ran through discovery again and found more. This happens often, because some checks run asynchronously in the background: the first time you check a host, it does not return all of the services, and on the second run they show up. From here you can repeat similar steps: click on the host and add the new services.

In this case, I simply forgot to follow my own instructions and click the “Fix all missing/vanished” button. More services would likely have shown up later anyway, but not as many as in the screenshot.
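That second discovery pass can also be triggered from the command line as the site user. cmk -I adds newly found services for a host without dropping existing ones, and cmk -O regenerates the configuration and reloads the core. A minimal sketch, with discover_host as a hypothetical helper; it bypasses the WATO workflow shown above, so it is best treated as a shortcut for re-checking a host rather than a replacement. The host name in the usage line assumes the host was added under its FQDN.

```shell
#!/usr/bin/env bash
# discover_host: hypothetical helper, not part of CheckMK.
# Run as the site user (after "su - main").
discover_host() {
  local host="$1"
  if [[ -z "$host" ]]; then
    echo "usage: discover_host HOSTNAME" >&2
    return 2
  fi
  # -I adds newly found services to the host, keeping existing ones
  cmk -I "$host" || return 1
  # -O regenerates the monitoring configuration and reloads the core (Nagios)
  cmk -O
}
```

OMD[main]:~$ discover_host chckmk1.woohoosvcs.com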

Other Configurations

Just because we have a monitoring system in place does not mean it is fully configured. We still need notifications, alert thresholds and plenty of other tuning. Those are out of scope for this document, but I will likely cover them in future articles.

Final Words

We stood up a Check_MK server from scratch and are monitoring one host. At the beginning of the article I mentioned upgrading, and I will follow up with another article on it. The process is fairly simple but there