Monitoring systemd services in realtime with Chronograf

That’s my InfluxDB service deactivating for a moment.

We can even capture the timestamp related to the change!

The org.freedesktop.systemd1 service specifies six different states: active, reloading, inactive, failed, activating, deactivating.

We are particularly interested in the failed state, as it tells us that a service has failed on our system.

Now that we have a way to manually capture systemd signals on our system, let’s build our fully automated monitoring system.

Let the fun begin!

III — Architecture & Implementation

In order to monitor systemd services, we are going to use this architecture:

Figure: Our systemd monitoring final architecture

Our architecture is pretty straightforward.

First, we make sure that the dbus-daemon is running on our machine.

From there, we are going to build a simple D-Bus client (in Go!) that will subscribe to signals originating from systemd.

All incoming signals will be parsed and stored in InfluxDB.

Once points are stored in InfluxDB, we will create a Chronograf dashboard showing statistics about our services and gauges reflecting their current state on our machine.

When a service fails, Kapacitor (a stream processing engine) will pick it up and will automatically send an alert to Slack for our system administration group.

Simple! Right? No.

a — Building a D-Bus client in Go

The first step in order to capture signals coming from systemd is to build a simple client that will:

Connect to the bus.

Subscribe to systemd signals.

Parse and send points to InfluxDB.
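One detail of the parsing step is worth a quick sketch. systemd signals arrive with an object path rather than a plain unit name, and systemd escapes every non-alphanumeric character in that path as an underscore followed by two hex digits. The helper below (unitNameFromPath is a name made up for this example, not taken from the actual client) shows one way the service name could be recovered using only the standard library.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// unitPrefix is the object path prefix under which systemd exposes its units.
const unitPrefix = "/org/freedesktop/systemd1/unit/"

// unitNameFromPath converts a unit object path back into a unit name.
// It is a hypothetical helper written for this sketch: systemd escapes each
// non-alphanumeric character as "_" plus two hex digits, so
// "influxdb.service" is exposed as "influxdb_2eservice".
func unitNameFromPath(path string) (string, error) {
	if !strings.HasPrefix(path, unitPrefix) {
		return "", fmt.Errorf("not a systemd unit path: %q", path)
	}
	escaped := strings.TrimPrefix(path, unitPrefix)

	var name strings.Builder
	for i := 0; i < len(escaped); i++ {
		if escaped[i] != '_' {
			name.WriteByte(escaped[i])
			continue
		}
		// An underscore must be followed by exactly two hex digits.
		if i+2 >= len(escaped) {
			return "", fmt.Errorf("truncated escape sequence in %q", path)
		}
		b, err := strconv.ParseUint(escaped[i+1:i+3], 16, 8)
		if err != nil {
			return "", fmt.Errorf("invalid escape sequence in %q: %w", path, err)
		}
		name.WriteByte(byte(b))
		i += 2 // skip the two hex digits we just consumed
	}
	return name.String(), nil
}

func main() {
	name, err := unitNameFromPath("/org/freedesktop/systemd1/unit/influxdb_2eservice")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name) // prints "influxdb.service"
}
```

The decoded name is what would later be used as the service tag in InfluxDB.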

Note: you may be wondering why I chose Go to build my D-Bus client.

Both D-Bus and InfluxDB have client libraries written in Go, making this language a natural candidate for this little experiment.

The client code is too lengthy to be fully displayed in this article, but here’s the main function that does most of the work.

Full code is available on my GitHub.

For every single systemd signal, a point is created in InfluxDB.

I chose this implementation because I wanted to have a full history of all the changes occurring on my different services.

It can be quite useful for investigating recurrent service failures over a given period.
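To make "one point per signal" a bit more concrete, here is a minimal sketch of how a single state change could be encoded in InfluxDB line protocol. The measurement name service_state, the tag key service and the field key value are assumptions made for this example, as is the encodePoint helper; the real client may well rely on the InfluxDB client library instead of formatting lines by hand.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// tagEscaper escapes the characters that are special inside a line protocol
// tag value: commas, equals signs and spaces.
var tagEscaper = strings.NewReplacer(",", `\,`, "=", `\=`, " ", `\ `)

// encodePoint renders one state change as a line protocol entry.
// "service_state", "service" and "value" are names assumed for this sketch.
// The trailing "i" marks the field as an integer, and the timestamp is
// expressed in nanoseconds since the Unix epoch.
func encodePoint(service string, state int, unixNano int64) string {
	return fmt.Sprintf("service_state,service=%s value=%di %d",
		tagEscaper.Replace(service), state, unixNano)
}

func main() {
	// Every signal gets its own timestamped point, which preserves the
	// full history of state changes.
	fmt.Println(encodePoint("influxdb.service", 1, time.Now().UnixNano()))
}
```

Because every point carries its own timestamp, nothing is overwritten and the history table of the dashboard can list each change.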

b — Implementation choices

For my InfluxDB data structure, I chose to have my service name as a tag (for indexing purposes), and the state (failed, active, activating...) as the value.

A simple mapping links a constant value to every single state, because InfluxQL aggregation functions work better with numeric values than with text values.
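Such a mapping could look like the sketch below. Only failed = -1 is taken from this article (the alert rule later on relies on it); the other numbers are placeholders picked for the example, and stateValue is a hypothetical helper name.

```go
package main

import "fmt"

// stateValues links each ActiveState string to a numeric value.
// Only "failed" = -1 comes from the article; the remaining values are
// assumptions made for this sketch.
var stateValues = map[string]int{
	"failed":       -1,
	"inactive":     0,
	"active":       1,
	"reloading":    2,
	"activating":   3,
	"deactivating": 4,
}

// stateValue returns the numeric value for a state and reports whether the
// state is one of the six known ones.
func stateValue(state string) (int, bool) {
	v, ok := stateValues[state]
	return v, ok
}

func main() {
	for _, s := range []string{"active", "failed", "unknown"} {
		if v, ok := stateValue(s); ok {
			fmt.Printf("%s -> %d\n", s, v)
		} else {
			fmt.Printf("%s -> not a known state\n", s)
		}
	}
}
```

Returning a boolean alongside the value lets the caller skip unexpected states instead of silently storing a zero.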

Note: in the snippet above, one can notice that I get many property updates from systemd, but I only extract the ‘ActiveState’ property that we saw in the first section.
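As a rough illustration of that extraction, here is a standard-library-only sketch. A PropertiesChanged signal carries three arguments: the interface name, a dictionary of changed properties and a list of invalidated property names. Plain Go values stand in for what a D-Bus library would deliver (a real library typically wraps each property in its own variant type), and activeState is a name invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// unitInterface is the D-Bus interface that owns the ActiveState property.
const unitInterface = "org.freedesktop.systemd1.Unit"

// errNoActiveState is returned when a signal is valid but irrelevant to us.
var errNoActiveState = errors.New("signal does not carry ActiveState")

// activeState pulls ActiveState out of a PropertiesChanged signal body.
// The body is modelled with plain Go types for this sketch: body[0] is the
// interface name and body[1] the dictionary of changed properties.
func activeState(body []interface{}) (string, error) {
	if len(body) < 2 {
		return "", errors.New("unexpected signal body length")
	}
	iface, ok := body[0].(string)
	if !ok || iface != unitInterface {
		return "", errNoActiveState
	}
	changed, ok := body[1].(map[string]interface{})
	if !ok {
		return "", errors.New("unexpected type for changed properties")
	}
	// Many properties change at once; only ActiveState matters here.
	state, ok := changed["ActiveState"].(string)
	if !ok {
		return "", errNoActiveState
	}
	return state, nil
}

func main() {
	body := []interface{}{
		unitInterface,
		map[string]interface{}{"ActiveState": "deactivating", "SubState": "stop-sigterm"},
		[]string{},
	}
	state, err := activeState(body)
	if err != nil {
		fmt.Println("skipping signal:", err)
		return
	}
	fmt.Println(state) // prints "deactivating"
}
```

Signals coming from other interfaces, or updates that leave ActiveState untouched, are simply reported as irrelevant so the caller can skip them.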

Now that we have our simple Go client, let’s wrap it into a service, run it, and head over to Chronograf.

III — Building a cute dashboard for sysadmins

Now that we have our points in InfluxDB, this is where the fun begins.

We will build a Chronograf dashboard that will show us some statistics related to our services and gauges for important services we want to monitor.

The final dashboard has three main parts:

Count of active, inactive and failed services at a given time.

Table showing a full history of state changes over time for every service.

Twelve gauges displaying the twelve systemd services we want to emphasize.

Disclaimer: this part assumes some preliminary knowledge of Chronograf, namely how to set it up and link it to InfluxDB.

Documentation is available here.

Queries will be provided for each block of this dashboard.

a — Counting active, inactive and failed services

Here’s the way to build the single-stat blocks:

b — Full table history of state changes

In the same fashion, here are the inputs used to build the history table:

c — Gorgeous gauges for specific services

Gauges, gauges everywhere!

Of course, I encourage everyone to toy around with the widgets and build their own dashboards; yours doesn’t have to be an exact copy of the dashboard above.

Now that we have our dashboard, we have a very cool way to monitor our systemd services in real time.

Nice!

But what if we had real-time alerts on Slack when a running service fails? Wouldn’t the DevOps team love this feature?

Let’s get to it.

IV — Raising Alerts On Service Failure

For the last part, we are going to use Kapacitor, a stream processing engine that will be responsible for raising and processing alerts when a service is failing.

InfluxData documentation for InfluxDB, Telegraf, Chronograf, Kapacitor, and Flux: docs.influxdata.com

Once Kapacitor is installed and running on your machine, let’s go back to Chronograf and head over to the alert panel.

When clicking on Manage Tasks, you are presented with two sections: alert rules and TICKscripts.

Let’s create a new alert rule by clicking on the ‘Build Alert Rule’ button.

Build it already!

And here’s the full configuration used for this alert:

This alert is configured to send a message to a Slack webhook when a service is failing (i.e. the state value is equal to minus one) within a fifteen-minute time window.
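To spell out the condition this rule checks, here is a small sketch in plain Go. It is not Kapacitor's API or a TICKscript, just the same logic written out: look at the points of the last fifteen minutes and report every service that has at least one point with a value of minus one. The point type and the failedServices helper are made up for this example.

```go
package main

import (
	"fmt"
	"time"
)

// failedValue is the numeric value stored for the "failed" state.
const failedValue = -1

// point is a simplified stand-in for a stored state change.
type point struct {
	Service string
	State   int
	Unix    int64 // timestamp in seconds since the Unix epoch
}

// failedServices returns, in order of first appearance, every service with
// at least one failed point inside the window that ends at now.
func failedServices(points []point, now, windowSeconds int64) []string {
	seen := make(map[string]bool)
	var failed []string
	for _, p := range points {
		inWindow := p.Unix > now-windowSeconds && p.Unix <= now
		if inWindow && p.State == failedValue && !seen[p.Service] {
			seen[p.Service] = true
			failed = append(failed, p.Service)
		}
	}
	return failed
}

func main() {
	now := time.Now().Unix()
	points := []point{
		{Service: "influxdb.service", State: 1, Unix: now - 60},
		{Service: "nginx.service", State: failedValue, Unix: now - 120},
		{Service: "cron.service", State: failedValue, Unix: now - 3600}, // outside the window
	}
	fmt.Println(failedServices(points, now, 15*60)) // prints "[nginx.service]"
}
```

In the real setup Kapacitor evaluates this condition on the incoming stream and takes care of posting to the Slack webhook.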

On the Slack side, the alerts have this format:

V — Conclusion

I learned many things building this little project.

Having no prior experience with D-Bus, or with Golang whatsoever, this experiment taught me that getting out of your comfort zone (even in programming) is the way to go to build new skills.

The process of building such a dashboard can seem quite arduous, but once deployed, it provides real value to operational teams and system administrators in general.

If you like hand-crafting your own monitoring solutions, you can definitely take some inspiration from this tutorial.

If you’re more into delegating to external tools, I would definitely recommend SignalFx or Telegraf.

They are both robust and efficient solutions for your infrastructure.

SignalFx: www.signalfx.com

Telegraf documentation: docs.influxdata.com

I hope that you had some fun reading this little (well, not so little) tutorial on how to build real-time systemd monitoring dashboards from scratch.

I had a ton of fun on my side building it and writing this article.

If you have any questions about this tutorial, or about software engineering in general, I will be happy to help.

Until next time.

Kindly,

Antoine.
