
Predicting Daemon Restarts: Practical journald Feature Engineering for ML Alerts


By Noman Mohammad



I Got Paged at 3 A.M.—Again

My phone buzzed. Again. nginx had just vanished. Again.

I stared at the screen, half-asleep, half-annoyed. Same story. A daemon died, users yelled, and I scrambled through journalctl -u nginx.service like a detective hunting for fingerprints in the dark.

That night I promised myself: never again.

Fast-forward six months. We now get a Slack ping before nginx hiccups—like a gentle tap on the shoulder, not a 3 A.M. fire drill. Here’s exactly how we did it, step by step.

Why We Switched from “Oops” to “Heads-Up”

83 % of surprise outages in 2024 came from daemon crashes. That’s straight from CISA’s own report.

Think about it. Every crash means:

  • Lost sales (we once lost $12 k in one hour because a small service died)
  • SLA penalties (ouch)
  • And, honestly, tired engineers who’d rather be shipping features

We needed a check-engine light, not a tow truck.

journald Is Your Crystal Ball—If You Know Where to Look

Most teams treat logs like a junk drawer. Stuff gets tossed in, nobody sorts it.

journald is different. It’s already structured, so each log is a tiny data packet instead of a blob of text.

Three little fields changed everything for us:

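  • UNIT – tells us which service (systemd sets it on its own messages about a unit; the service's own log lines carry _SYSTEMD_UNIT instead)
  • EXIT_CODE – tells us how it quit (exited, killed, or dumped), with the numeric status or signal number in EXIT_STATUS
  • MESSAGE – the actual cry for help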

Those three fields are enough to train a model that says, “Hey, nginx is likely to restart in the next ten minutes.”
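
To make those fields concrete, here is a minimal sketch that pulls them out of one line of `journalctl -o json` output. The sample record is invented for illustration, and `extract_fields` is a hypothetical helper name of my own, not part of any library. On systemd's own exit messages, EXIT_CODE says how the process ended and EXIT_STATUS carries the number, so the sketch keeps both.

```python
import json

# A made-up record shaped like one line of `journalctl -o json` output.
# The JSON export represents field values as strings.
SAMPLE_LINE = json.dumps({
    "UNIT": "nginx.service",
    "EXIT_CODE": "exited",
    "EXIT_STATUS": "1",
    "PRIORITY": "4",
    "MESSAGE": "nginx.service: Main process exited, code=exited, status=1/FAILURE",
})

def extract_fields(line):
    """Return the fields used as model inputs; missing fields become None."""
    record = json.loads(line)
    return {
        # systemd's messages about a unit use UNIT; the unit's own
        # log lines carry _SYSTEMD_UNIT, so fall back to that.
        "unit": record.get("UNIT") or record.get("_SYSTEMD_UNIT"),
        "exit_code": record.get("EXIT_CODE"),
        "exit_status": record.get("EXIT_STATUS"),
        "message": record.get("MESSAGE"),
    }

fields = extract_fields(SAMPLE_LINE)
print(fields["unit"], fields["exit_code"], fields["exit_status"])
# nginx.service exited 1
```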

My 5-Step Recipe for ML-Friendly Features

1. Grab the Logs, Skip the Noise

I run this tiny Python snippet every five minutes:

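from datetime import datetime, timedelta
from systemd import journal

j = journal.Reader()
j.add_match(_SYSTEMD_UNIT="nginx.service")
j.seek_realtime(datetime.now() - timedelta(minutes=15))  # start 15 minutes back
for entry in j:
    # PRIORITY uses syslog levels: 3 = err, lower numbers are more severe
    if entry.get('PRIORITY', 6) <= 3:   # only errors or worse
        save(entry)                     # save() is our own helper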

We dump the last 15 minutes into a file, then feed that to the model.

2. Count How Often It Dies

The model’s favorite question: “How many restarts in the last hour?”

In our data, if nginx restarts twice in 60 minutes, the odds of a third jump to 92 %. Crazy, right?
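
Here is a rough sketch of that counter, assuming you already have the restart timestamps in a list. The function name `restarts_in_window` and the sample timestamps are mine, made up for illustration.

```python
from datetime import datetime, timedelta

def restarts_in_window(restart_times, now, window=timedelta(hours=1)):
    """Count restart timestamps in the interval (now - window, now]."""
    start = now - window
    return sum(1 for t in restart_times if start < t <= now)

# Hypothetical restart timestamps for one service.
restarts = [
    datetime(2024, 5, 1, 0, 30),
    datetime(2024, 5, 1, 2, 10),
    datetime(2024, 5, 1, 2, 40),
]
now = datetime(2024, 5, 1, 3, 0)
print(restarts_in_window(restarts, now))  # 2
```

Computing the count at each row's own timestamp, rather than at training time, keeps the feature from leaking future restarts into past rows.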

3. Turn Exit Codes into Plain English

  • 0 – polite shutdown
  • 137 – killed by SIGKILL (128 + 9), usually the kernel's out-of-memory killer
  • 1 – generic crash

We just map each code to a number the model understands, like 0, 1, 2.

4. Let the Message Speak

We don’t need fancy NLP. A quick TF-IDF on the last 20 messages catches phrases like

  • “worker process exited”
  • “bind() failed”

Those phrases alone push the risk score up 30 %.
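
If you want to see what TF-IDF is doing, here is a simplified, hand-rolled version using only the standard library. It is a sketch, not what a production pipeline would use (a library such as scikit-learn's TfidfVectorizer would normally handle this), and the sample messages are invented. Each term's weight is its frequency in the message times a smoothed inverse document frequency, so words that show up in every message count for less.

```python
import math
import re
from collections import Counter

def tokenize(message):
    """Lowercase the message and keep alphabetic words only."""
    return re.findall(r"[a-z]+", message.lower())

def tfidf(messages):
    """Return one {term: weight} dict per message."""
    docs = [Counter(tokenize(m)) for m in messages]
    n_docs = len(docs)
    # Number of messages each term appears in.
    doc_freq = Counter(term for doc in docs for term in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            term: (count / total)
                  * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in doc.items()
        })
    return weights

# Invented sample messages for illustration.
MESSAGES = [
    "worker process 1234 exited on signal 9",
    "bind() to 0.0.0.0:80 failed (98: Address already in use)",
    "worker process 1240 exited with code 0",
]
weights = tfidf(MESSAGES)
# "signal" appears in one message, "worker" in two, so "signal" weighs more.
print(weights[0]["signal"] > weights[0]["worker"])  # True
```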

5. Add “Neighborhood” Data

We also peek at:

  • CPU load one minute before the log
  • Memory usage at the same timestamp
  • Any other unit that restarted in that window

These little clues turn guesswork into science.
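
Here is one way those lookups could be sketched, assuming CPU and memory samples are stored as time-sorted `(timestamp, value)` pairs. The helper names, the 15-minute window for "other restarts", and the sample numbers are all my assumptions for illustration.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def sample_at_or_before(samples, when):
    """samples: list of (timestamp, value) sorted by timestamp.
    Return the latest value recorded at or before `when`, or None."""
    times = [t for t, _ in samples]
    i = bisect_right(times, when)
    return samples[i - 1][1] if i else None

def neighborhood_features(log_time, cpu_samples, mem_samples,
                          other_restarts, window=timedelta(minutes=15)):
    """Build the three context features for one log entry."""
    return {
        "cpu_load_1m_before": sample_at_or_before(
            cpu_samples, log_time - timedelta(minutes=1)),
        "mem_used_at_log": sample_at_or_before(mem_samples, log_time),
        "other_units_restarted": sum(
            1 for t in other_restarts if log_time - window <= t <= log_time),
    }

# Hypothetical metric samples and restart times of other units.
cpu = [(datetime(2024, 5, 1, 2, 0), 0.5),
       (datetime(2024, 5, 1, 2, 1), 0.9),
       (datetime(2024, 5, 1, 2, 2), 3.2)]
mem = [(datetime(2024, 5, 1, 2, 0), 41.0),
       (datetime(2024, 5, 1, 2, 2), 88.0)]
others = [datetime(2024, 5, 1, 1, 30), datetime(2024, 5, 1, 1, 55)]

features = neighborhood_features(datetime(2024, 5, 1, 2, 2), cpu, mem, others)
print(features)
# {'cpu_load_1m_before': 0.9, 'mem_used_at_log': 88.0, 'other_units_restarted': 1}
```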

The 20-Line Script That Saves Our Sleep

Here’s the entire training loop we run on a Monday morning, coffee in hand:

import pandas as pd
from xgboost import XGBClassifier

# 1. Load last month of labeled data
df = pd.read_csv('nginx_features.csv')

# 2. Separate the features from the label
X = df.drop('crashed', axis=1)
y = df['crashed']

# 3. Train
model = XGBClassifier(max_depth=4)
model.fit(X, y)

# 4. Save
model.save_model('nginx_restarter.json')

That’s it. Training takes three minutes on a laptop.

Real-Time Alerting in Eight Lines

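import requests
from xgboost import XGBClassifier

model = XGBClassifier()
model.load_model('nginx_restarter.json')     # saved by the training script
features = build_live_features('nginx')      # our helper
risk = model.predict_proba([features])[0, 1]
if risk > 0.9:                               # webhook_url is your Slack incoming webhook
    requests.post(webhook_url, json={'text': f'nginx restart risk: {risk:.0%}'})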

The alert lands in Slack with a pretty little graph. We get maybe one false alarm a week, and zero 3 A.M. surprises.

Start Small, Win Big

You don’t need a data-science army. Pick one service. Grab 30 days of logs. Try these steps:

  1. Count restarts per hour
  2. Label each row: 1 if a restart happens in the next 10 minutes, else 0
  3. Train any off-the-shelf model
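
The labeling step is the one people tend to get wrong, so here is a small sketch of it, assuming you have one timestamp per feature row and a list of restart times. `label_rows` is a hypothetical helper name and the sample times are made up.

```python
from bisect import bisect_right
from datetime import datetime, timedelta

def label_rows(row_times, restart_times, horizon=timedelta(minutes=10)):
    """Return 1 for each row followed by a restart within `horizon`, else 0."""
    restarts = sorted(restart_times)
    labels = []
    for t in row_times:
        i = bisect_right(restarts, t)  # index of first restart strictly after t
        hit = i < len(restarts) and restarts[i] - t <= horizon
        labels.append(1 if hit else 0)
    return labels

# Hypothetical feature-row timestamps and a single restart at 02:12.
rows = [datetime(2024, 5, 1, 2, 0),
        datetime(2024, 5, 1, 2, 5),
        datetime(2024, 5, 1, 2, 20)]
restarts = [datetime(2024, 5, 1, 2, 12)]
print(label_rows(rows, restarts))  # [0, 1, 0]
```

Only restarts strictly after the row's timestamp count, so a row never gets labeled by a restart it has already seen.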

My first prototype used plain logistic regression and still caught half the crashes before they happened.

So stop reacting. Start predicting. Your future self—and your pager—will thank you.
