M-sasank/kernob


What is Kernob

Kernob is an eBPF-based observability agent written in Go. It fills the gaps that OTel cannot cover without application instrumentation. It does not compete with OTel. It runs alongside it, watches what the kernel sees, and emits everything as OTel metrics so nothing else in your stack has to change.

The name is kernel + observer. Kernob.

Why I built this

OTel is great. SigNoz is great. But they both depend on your application cooperating. You instrument your code, add SDK calls, emit spans and metrics. If something goes wrong before your code even sees a request, you're blind.

For system-level stuff like CPU and RAM per process, OTel's hostmetrics receiver already reads /proc and handles it well. There's no point rebuilding that.

The actual gap is network. OTel has no way to tell you which process sent how many bytes to which destination. It has no way to tell you a TCP connection dropped before your application registered anything. It has no way to tell you your external dependency is flaky at the connection level, not the HTTP level.

There's also no lightweight open source tool that covers this for plain VMs with an OTel-native stack. Datadog does it well but it's expensive and locked into their ecosystem. nethogs and atop exist but they're manual CLI tools with no history, no alerting, no pipeline integration.

Kernob fills that specific gap. Only that gap.

Build Phases

Phase 1 — Per-process network bytes via eBPF

The smallest useful thing Kernob can do. Hook into tcp_sendmsg and tcp_cleanup_rbuf, attribute bytes sent and received to an exact PID and process name, emit as OTel metrics. Configurable scrape interval and aggregation window.
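
The userspace half of this phase can be sketched as a small aggregator. This is only an illustration under assumptions: the ByteEvent shape, the Aggregator type, and the idea that each probe hit arrives as one event are hypothetical, and the eBPF loading and ring buffer reading are left out entirely.

```go
package main

import "fmt"

// Direction says which hook produced an event.
type Direction int

const (
	Sent     Direction = iota // bytes seen at the tcp_sendmsg hook
	Received                  // bytes seen at the tcp_cleanup_rbuf hook
)

// ByteEvent is an assumed shape for one record coming from the kernel side.
type ByteEvent struct {
	PID   uint32
	Comm  string
	Dir   Direction
	Bytes uint64
}

// ProcKey identifies a process in the aggregate.
type ProcKey struct {
	PID  uint32
	Comm string
}

// Totals holds byte counts for one process in one aggregation window.
type Totals struct {
	Sent     uint64
	Received uint64
}

// Aggregator sums bytes per process until the window is flushed.
type Aggregator struct {
	totals map[ProcKey]Totals
}

func NewAggregator() *Aggregator {
	return &Aggregator{totals: make(map[ProcKey]Totals)}
}

// Add attributes one event's bytes to its process.
func (a *Aggregator) Add(e ByteEvent) {
	k := ProcKey{PID: e.PID, Comm: e.Comm}
	t := a.totals[k]
	switch e.Dir {
	case Sent:
		t.Sent += e.Bytes
	case Received:
		t.Received += e.Bytes
	}
	a.totals[k] = t
}

// Flush returns the current window and starts an empty one.
func (a *Aggregator) Flush() map[ProcKey]Totals {
	out := a.totals
	a.totals = make(map[ProcKey]Totals)
	return out
}

func main() {
	agg := NewAggregator()
	agg.Add(ByteEvent{PID: 4242, Comm: "backup", Dir: Sent, Bytes: 1500})
	agg.Add(ByteEvent{PID: 4242, Comm: "backup", Dir: Sent, Bytes: 500})
	agg.Add(ByteEvent{PID: 4242, Comm: "backup", Dir: Received, Bytes: 64})
	for k, t := range agg.Flush() {
		fmt.Printf("pid=%d comm=%s sent=%d received=%d\n", k.PID, k.Comm, t.Sent, t.Received)
	}
}
```

Calling Flush once per scrape interval would give one set of per-process totals per window, which is the shape an exporter needs.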

Visualize in Grafana with VictoriaMetrics as the backend. When network is saturated, you know exactly which process owns it and since when.

This is the phase you build first, ship first, and demo first. Everything else follows from here.

Phase 2 and beyond

Kernob does not reimplement what OTel already does well. CPU, RAM, disk IO per process — OTel hostmetrics handles that. Kernob only does what OTel genuinely cannot.

Per-process network attribution

/proc has no per-process network bytes equivalent. OTel gives you interface totals for eth0, nothing more. Kernob hooks into tcp_sendmsg and tcp_cleanup_rbuf to attribute bytes sent and received to an exact PID in real time. When your network is saturated, you know which process owns it.

TCP connection health per process

New connections, retransmits, RSTs, mid-response drops — all attributed to a specific PID with timestamps. OTel can tell you a span was slow. Kernob can tell you there were three TCP retransmits on that process during that span.

External dependency health scoring

For each destination hostname your processes connect to, Kernob tracks the full syscall chain outcome in a rolling window and produces scores across multiple dimensions.

Individual dimension scores:

  • Reliability — connect success rate. How often does the TCP handshake complete vs ETIMEDOUT or ECONNREFUSED.
  • Latency — connect p95 and read p95. Is the handshake or data transfer getting slower over time.
  • Stability — mid-response drop rate. How often does read return 0 before the response completed. The remote end started responding then went silent.
  • Packet health — TCP retransmit rate per destination. Elevated retransmits mean the network path to that dependency is degraded.
  • Responsiveness — epoll_wait timeout rate. How often did the process wait the full timeout duration and get nothing back.

Composite score — a single weighted score derived from all dimensions. Degrades as failures accumulate in the rolling window. Recovers as failures age out. Recent failures weigh more than old ones.
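
One way to get "recent failures weigh more than old ones" is exponential decay by age. The sketch below is an assumption, not Kernob's actual formula: the window and halfLife constants, the Outcome type, and the weights in main are all hypothetical values picked for illustration.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Assumed tuning values; the README does not specify them.
const (
	window   = 10 * time.Minute // outcomes older than this are ignored
	halfLife = 2 * time.Minute  // an outcome's weight halves every halfLife
)

// Outcome is one observed result for a destination.
type Outcome struct {
	Age    time.Duration // how long ago it was observed
	Failed bool
}

// healthScore returns a value in [0,1], where 1 means no weighted failures.
// Each outcome is weighted by 0.5^(age/halfLife).
func healthScore(outcomes []Outcome) float64 {
	var failW, totalW float64
	for _, o := range outcomes {
		if o.Age < 0 || o.Age > window {
			continue // aged out of the rolling window
		}
		w := math.Exp2(-o.Age.Seconds() / halfLife.Seconds())
		totalW += w
		if o.Failed {
			failW += w
		}
	}
	if totalW == 0 {
		return 1 // no data in the window: assume healthy
	}
	return 1 - failW/totalW
}

// composite is a weighted average of per-dimension scores.
// Dimensions without a score are skipped.
func composite(scores, weights map[string]float64) float64 {
	var sum, wsum float64
	for name, w := range weights {
		s, ok := scores[name]
		if !ok {
			continue
		}
		sum += w * s
		wsum += w
	}
	if wsum == 0 {
		return 1
	}
	return sum / wsum
}

func main() {
	recent := []Outcome{{Age: 0, Failed: true}, {Age: halfLife, Failed: false}}
	old := []Outcome{{Age: halfLife, Failed: true}, {Age: 0, Failed: false}}
	fmt.Printf("failure just now:        %.2f\n", healthScore(recent))
	fmt.Printf("failure one halfLife ago: %.2f\n", healthScore(old))

	scores := map[string]float64{"reliability": healthScore(recent), "stability": 1}
	weights := map[string]float64{"reliability": 3, "stability": 1} // hypothetical weights
	fmt.Printf("composite: %.2f\n", composite(scores, weights))
}
```

With this scheme the same failure drags the score down less as it ages, and stops counting once it leaves the window.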

Both individual scores and the composite score are emitted as OTel metrics. Consumers can build dashboards on whichever granularity they need. Kernob does not force one view.

Textract, RabbitMQ, MongoDB, any internal cross-service call — each hostname gets scored. When a dependency starts degrading, Kernob surfaces it at the TCP level before your application layer registers anything.

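Note on AWS and rotating IPs: Kernob hooks the return of getaddrinfo to capture hostname-to-IP mappings at resolution time. getaddrinfo is a libc function rather than a syscall, so this is a user-space return probe (uretprobe), not a kernel hook. When IPs rotate, the new IP is tagged with the same hostname. Scores are always per hostname, never per raw IP.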
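
The bookkeeping behind per-hostname scoring can be sketched as a map from IP to the hostname it was last resolved from. The Resolver type and its method names are hypothetical, and falling back to the literal IP for unresolved destinations is an assumption.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Resolver remembers which hostname each IP was last resolved from.
type Resolver struct {
	hostByIP map[netip.Addr]string
}

func NewResolver() *Resolver {
	return &Resolver{hostByIP: make(map[netip.Addr]string)}
}

// Observe records the addresses returned for one hostname lookup.
func (r *Resolver) Observe(host string, ips ...string) error {
	for _, s := range ips {
		addr, err := netip.ParseAddr(s)
		if err != nil {
			return fmt.Errorf("observe %q: %w", host, err)
		}
		// Unmap so ::ffff:10.0.0.5 and 10.0.0.5 share one entry.
		r.hostByIP[addr.Unmap()] = host
	}
	return nil
}

// ScoreKey returns the key a connection to ip should be scored under:
// the hostname if one was observed, otherwise the IP itself.
func (r *Resolver) ScoreKey(ip string) string {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return ip
	}
	addr = addr.Unmap()
	if host, ok := r.hostByIP[addr]; ok {
		return host
	}
	return addr.String()
}

func main() {
	r := NewResolver()
	// First lookup, then a later lookup after the IP rotated.
	_ = r.Observe("queue.example.internal", "10.0.0.5")
	_ = r.Observe("queue.example.internal", "10.0.0.9")
	fmt.Println(r.ScoreKey("10.0.0.5"))
	fmt.Println(r.ScoreKey("10.0.0.9"))
	fmt.Println(r.ScoreKey("192.0.2.1"))
}
```

A real agent would also need to expire old entries, since a rotated-away IP can later be reassigned to a different hostname.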

Syscall latency histograms

These show how long a process spends inside read, write, epoll_wait, and futex. Useful when a span is slow and all application-layer operations look normal. A process stuck in futex for 180ms means a locking or concurrency issue. A process with high epoll_wait timeouts means the remote end is not responding. Neither of these is visible from OTel.
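
A common way to keep such histograms cheap is power-of-two buckets. The sketch below assumes that layout; the bucket count, the microsecond unit, and the Histogram type are illustrative choices, not something the design above fixes.

```go
package main

import (
	"fmt"
	"math/bits"
	"time"
)

const numBuckets = 32 // assumed; the last bucket absorbs anything larger

// Histogram counts latencies in power-of-two microsecond buckets.
// Bucket i holds durations in [2^i, 2^(i+1)) microseconds;
// bucket 0 also holds anything under 1 microsecond.
type Histogram struct {
	counts [numBuckets]uint64
}

// bucketIndex maps a latency in microseconds to its bucket.
func bucketIndex(us int64) int {
	if us < 1 {
		return 0
	}
	i := bits.Len64(uint64(us)) - 1 // floor(log2(us))
	if i >= numBuckets {
		i = numBuckets - 1
	}
	return i
}

// Observe records one syscall latency.
func (h *Histogram) Observe(d time.Duration) {
	h.counts[bucketIndex(d.Microseconds())]++
}

func main() {
	var futex Histogram
	futex.Observe(180 * time.Millisecond)
	futex.Observe(40 * time.Microsecond)
	for i, c := range futex.counts {
		if c > 0 {
			fmt.Printf("[%d us, %d us): %d\n", int64(1)<<i, int64(1)<<(i+1), c)
		}
	}
}
```

The 180ms futex wait from the example above lands in the bucket covering 131072 to 262144 microseconds, far from where a healthy process would cluster.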

Process lifecycle events

Exact fork and exit events with timestamps. Useful when a worker crashes and restarts too fast for logs to flush. The crash loop is visible at the kernel level even when your application logs show nothing.
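
Spotting a crash loop from these events comes down to pairing each exit with its fork and checking the lifetime. This is a sketch under assumptions: the LifecycleEvent shape is hypothetical, timestamps are taken to be nanoseconds since boot, and events are assumed to arrive in time order.

```go
package main

import "fmt"

type EventKind int

const (
	Fork EventKind = iota
	Exit
)

// LifecycleEvent is an assumed shape for one fork or exit record.
type LifecycleEvent struct {
	Kind    EventKind
	PID     uint32
	Comm    string
	KtimeNs uint64 // nanoseconds since boot
}

// shortLivedExits counts, per process name, how many processes exited
// within maxLifetimeNs of being forked. Events must be ordered by KtimeNs.
func shortLivedExits(events []LifecycleEvent, maxLifetimeNs uint64) map[string]int {
	type start struct {
		comm string
		at   uint64
	}
	live := make(map[uint32]start)
	counts := make(map[string]int)
	for _, e := range events {
		switch e.Kind {
		case Fork:
			live[e.PID] = start{comm: e.Comm, at: e.KtimeNs}
		case Exit:
			s, ok := live[e.PID]
			if !ok {
				continue // started before we began watching
			}
			delete(live, e.PID)
			if e.KtimeNs-s.at <= maxLifetimeNs {
				counts[s.comm]++
			}
		}
	}
	return counts
}

func main() {
	const second = 1_000_000_000
	events := []LifecycleEvent{
		{Kind: Fork, PID: 100, Comm: "worker", KtimeNs: 0},
		{Kind: Exit, PID: 100, KtimeNs: 400_000_000},
		{Kind: Fork, PID: 101, Comm: "worker", KtimeNs: 500_000_000},
		{Kind: Exit, PID: 101, KtimeNs: 900_000_000},
	}
	fmt.Println(shortLivedExits(events, second))
}
```

A count that keeps climbing for one process name is the crash loop, visible even if nothing reached the logs.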

Service Network Map

Kernob auto-discovers your service topology from observed TCP traffic. No instrumentation, no manual diagramming, no config beyond the network classification file.

Every TCP connection has a source and a destination. The initiating side calls connect. The receiving side calls accept. eBPF captures both sides, attributes them to a process name via /proc/{pid}/comm, and builds a directed graph of who talks to whom over time.
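
The graph itself can be very small. In this sketch each observed connect adds or bumps a directed edge from the initiating process to the destination. The Graph and Edge types are hypothetical, and node naming is assumed to have been resolved already.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is a directed connection: From initiated, To accepted.
type Edge struct {
	From string
	To   string
}

// Graph counts observed connections per directed edge.
type Graph struct {
	conns map[Edge]uint64
}

func NewGraph() *Graph {
	return &Graph{conns: make(map[Edge]uint64)}
}

// ObserveConnect records one connection initiated by from toward to.
func (g *Graph) ObserveConnect(from, to string) {
	g.conns[Edge{From: from, To: to}]++
}

// Count returns how many connections were seen on an edge.
func (g *Graph) Count(from, to string) uint64 {
	return g.conns[Edge{From: from, To: to}]
}

// Edges returns all edges sorted by source, then destination.
func (g *Graph) Edges() []Edge {
	out := make([]Edge, 0, len(g.conns))
	for e := range g.conns {
		out = append(out, e)
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].From != out[j].From {
			return out[i].From < out[j].From
		}
		return out[i].To < out[j].To
	})
	return out
}

func main() {
	g := NewGraph()
	g.ObserveConnect("rmq consumer", "mongodb")
	g.ObserveConnect("rmq consumer", "mongodb")
	g.ObserveConnect("rmq consumer", "textract")
	for _, e := range g.Edges() {
		fmt.Printf("[%s] -> [%s] (%d connections)\n", e.From, e.To, g.Count(e.From, e.To))
	}
}
```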

After running for a period in production, this graph is an honest map of your actual runtime service dependencies — not what you think the architecture is, but what it actually is based on observed traffic.

What the map looks like

Nodes are services. Identified by process name plus port for local processes, or by resolved hostname for external dependencies. Nodes are classified automatically using the network config — external, internal, self-hosted, localhost.
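
Classification could work as first-match over CIDR rules. The rule format below ("<cidr> <class>" per line) is invented for this sketch, since the network config format is not described here, and treating unmatched addresses as external is also an assumption.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// Rule maps a network prefix to a class name.
type Rule struct {
	Prefix netip.Prefix
	Class  string
}

// ParseRules reads lines of the hypothetical form "<cidr> <class>".
func ParseRules(lines []string) ([]Rule, error) {
	rules := make([]Rule, 0, len(lines))
	for _, line := range lines {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("bad rule %q: want \"<cidr> <class>\"", line)
		}
		p, err := netip.ParsePrefix(fields[0])
		if err != nil {
			return nil, fmt.Errorf("bad rule %q: %w", line, err)
		}
		rules = append(rules, Rule{Prefix: p.Masked(), Class: fields[1]})
	}
	return rules, nil
}

// Classify returns the class of ip. Loopback is always "localhost",
// the first matching rule wins, and anything unmatched is "external".
func Classify(ip string, rules []Rule) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	addr = addr.Unmap()
	if addr.IsLoopback() {
		return "localhost", nil
	}
	for _, r := range rules {
		if r.Prefix.Contains(addr) {
			return r.Class, nil
		}
	}
	return "external", nil
}

func main() {
	// More specific prefixes go first because the first match wins.
	rules, err := ParseRules([]string{
		"10.0.5.0/24 self-hosted",
		"10.0.0.0/8 internal",
	})
	if err != nil {
		panic(err)
	}
	for _, ip := range []string{"10.0.5.7", "10.9.9.9", "127.0.0.1", "93.184.216.34"} {
		class, _ := Classify(ip, rules)
		fmt.Printf("%s: %s\n", ip, class)
	}
}
```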

Edges are observed TCP connections, directed by who initiated. Edge color and weight carry the health scores already computed by the dependency scoring feature — reliability, latency, packet health, responsiveness, and composite score.

For BSA, after a few hours of traffic, this emerges automatically:

[rmq consumer] → [mongodb]
[rmq consumer] → [textract]
[rmq consumer] → [rmq publisher]
[rmq publisher] → [worker 2]

Color-coded edges show the health of each connection in real time.

Visualization

Grafana's Node Graph panel renders directed graphs natively. Kernob emits the topology as OTel metrics — node labels, edge pairs, edge health scores — and Grafana renders it. No custom UI needed for v1.

Why this is genuinely new

Service topology maps exist in a few places. Datadog APM has one but requires their agent and instrumentation. Jaeger has one but requires distributed tracing setup. Hubble does auto-discover service topology using eBPF and does it well — but it only works inside Kubernetes with Cilium as the CNI. No K8s, no Cilium, no Hubble.

BSA runs on plain VMs. Most small and mid-size teams do. For that environment, with no service mesh, no K8s, no distributed tracing — there is nothing that auto-discovers service topology and overlays live health on it. You either draw it manually in Confluence and hope it stays accurate, or you don't have it at all.

Kernob runs as a single binary alongside whatever is already there. No cluster, no CNI replacement, no commitment. Discovered truth, not declared truth.

Acknowledging Hubble in the blog post and explaining this distinction clearly is part of the story. It shows understanding of the landscape, not ignorance of it.

One month production run

Deploy Kernob, run it for a month, and at the end you have an honest automatically generated map of your entire service network with real health scores on every connection based on actual observed traffic. No self-reporting, no instrumentation bias, no stale diagrams.

How it integrates

Kernob emits everything as OTel metrics using the OTel Go SDK. These ship to an OTel Collector and forward into SigNoz. Same pipeline I already use for BSA.

The result is application traces and kernel metrics sitting in the same SigNoz dashboard. One place, two layers of visibility.

What gap this actually fills

OTel tells you a span was slow. Kernob tells you why at the kernel level. A few real scenarios:

RabbitMQ connection drops silently every few hours. App logs show a reconnect but no cause. Kernob shows a TCP RST from the broker side at the exact timestamp. That's a broker idle timeout config issue, not a code bug. Without this you spend hours in the wrong place.

Network spikes at 2am. OTel shows nothing. Kernob shows a backup process consuming 80% of bandwidth during that window. Immediately obvious and actionable.

A worker span is consistently 200ms slow. All internal operations look fine. Kernob shows the worker is blocked on futex for 180ms. That's a concurrency or locking problem in the code. Not network, not database.

Latency spikes on a service with no obvious cause. Another process on the same host is doing a heavy backup at that exact time and saturating disk IO. Your application has no way to know this. Kernob shows which process owned the disk at that timestamp.

A process is leaking file descriptors slowly. The climb is gradual. By the time your app crashes with "too many open files" it's already a production incident. Kernob would have shown the trend hours earlier.

A worker crashes and restarts in under a second. Logs show nothing because the logger didn't flush. Process lifecycle events at the kernel level show the exact fork and exit with timestamps. The crash loop is visible even when logs aren't.

A process on your production host starts making connections to an external IP it never talked to before. Could be a compromised dependency or a misconfigured service. Your application logs show nothing because it's happening below your code. Kernel-level connection tracking catches it immediately.
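
Catching a never-seen-before destination only needs a per-process set of known destinations and a baseline period. The DestTracker type and the explicit EndLearning step are hypothetical, meant only to show the idea.

```go
package main

import "fmt"

// DestTracker remembers which destinations each process has talked to.
type DestTracker struct {
	seen     map[string]map[string]struct{}
	learning bool
}

// NewDestTracker starts in learning mode, where nothing is flagged.
func NewDestTracker() *DestTracker {
	return &DestTracker{seen: make(map[string]map[string]struct{}), learning: true}
}

// EndLearning freezes the baseline; later first-time destinations are flagged.
func (t *DestTracker) EndLearning() { t.learning = false }

// Observe records a connection and reports whether it is a destination
// this process had never contacted, seen after the baseline was frozen.
func (t *DestTracker) Observe(comm, dest string) bool {
	dests, ok := t.seen[comm]
	if !ok {
		dests = make(map[string]struct{})
		t.seen[comm] = dests
	}
	if _, known := dests[dest]; known {
		return false
	}
	dests[dest] = struct{}{}
	return !t.learning
}

func main() {
	t := NewDestTracker()
	t.Observe("worker", "mongodb")
	t.EndLearning()
	fmt.Println(t.Observe("worker", "mongodb"))      // known destination
	fmt.Println(t.Observe("worker", "203.0.113.50")) // first contact after baseline
}
```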

Where Kernob is not useful

Bugs in business logic. Wrong data in a database. A misconfigured API call. Most application-layer errors. Kernel visibility is for the gap between "my code looks correct" and "something is still wrong at the infrastructure level." That gap exists in maybe 10 to 15 percent of production issues. But those are usually the hardest ones to debug.

What I am learning by building this

  • eBPF fundamentals. How programs attach to the kernel, how they pass data out via maps and ring buffers.
  • Go. Kernob is my first real Go project.
  • OTel Go SDK. Metric emission, collectors, exporters.
  • How serious observability tooling is architected. Reading Cilium source as a reference point.

Tech stack

  • Language: Go
  • eBPF: cilium/ebpf library for userspace, C for kernel-side programs
  • Metrics export: OTel Go SDK
  • Backend: SigNoz

Project structure

kernob/
  cmd/         main entrypoint
  internal/
    proc/      /proc readers
    ebpf/      eBPF programs and Go loaders
    exporter/  OTel metric emission
  bpf/         C eBPF source files
  README.md

Goals

  • Public GitHub with a clean README that explains what this is and why
  • OTel and SigNoz integration working end to end as a demo
  • dev.to post explaining the gap in open source observability this fills
  • LinkedIn post linking to the article

The one-line pitch

Kernob sits below your application and watches what the kernel sees. Then it speaks OTel so the rest of your stack does not have to change.

About

eBPF-based network stats monitoring tool
