Node Exporter Fusion

Telegen includes a drop-in replacement for Prometheus node_exporter, providing full compatibility with existing dashboards and alerts.

Overview

Node Exporter Fusion provides:

  • 120+ system metrics - Full node_exporter compatibility

  • node_* namespace - Works with existing dashboards

  • Zero configuration - Automatically enabled

  • eBPF enhanced - Additional metrics via eBPF


Compatibility

Telegen replaces node_exporter while maintaining full compatibility:

| Feature | node_exporter | Telegen |
|---|---|---|
| Metric namespace | node_* | node_* |
| Grafana dashboards | ✅ | ✅ |
| Alert rules | ✅ | ✅ |
| Prometheus scraping | /metrics | /metrics |
| Collectors | 50+ | 50+ ✅ |


Collectors

P0 Collectors (Always Enabled)

| Collector | Metrics | Description |
|---|---|---|
| loadavg | 3 | node_load1, node_load5, node_load15 |
| cpu | 15+ per core | CPU time per mode, frequency, info |
| meminfo | 50+ | Memory statistics from /proc/meminfo |
| diskstats | 17+ per device | Disk I/O statistics |
| filesystem | 8 per mount | Filesystem space and inodes |
| netdev | 25+ per interface | Network device statistics |
| stat | 16 | Boot time, context switches, interrupts |

Sample Metrics

# Load averages
node_load1
node_load5
node_load15

# CPU usage per mode
node_cpu_seconds_total{mode="user"}
node_cpu_seconds_total{mode="system"}
node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode="iowait"}

# Memory
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes

# Disk I/O
node_disk_read_bytes_total
node_disk_written_bytes_total
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total

# Filesystem
node_filesystem_size_bytes
node_filesystem_free_bytes
node_filesystem_avail_bytes
node_filesystem_files
node_filesystem_files_free

# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total
node_network_receive_packets_total
node_network_transmit_packets_total
node_network_receive_errs_total
node_network_transmit_errs_total
node_network_receive_drop_total
node_network_transmit_drop_total

# System
node_boot_time_seconds
node_context_switches_total
node_forks_total
node_intr_total
node_procs_running
node_procs_blocked
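
To spot-check that an endpoint exposes the names listed above, one option is to parse the text exposition output and look for them. The following is a minimal sketch, not part of Telegen: `EXPECTED` is a hand-picked subset chosen for illustration, and the parser only handles the simple `name{labels} value` line shape.

```python
import re

# Hand-picked subset of the names listed above (an assumption, not a full list).
EXPECTED = ["node_load1", "node_cpu_seconds_total", "node_memory_MemTotal_bytes"]

# A sample line looks like: name{label="v"} 1.5   or   name 1.5
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{.*\})?\s+\S+")

def metric_names(exposition: str) -> set[str]:
    """Return the set of metric names found in Prometheus text output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if m:
            names.add(m.group(1))
    return names

def missing_metrics(exposition: str, expected=EXPECTED) -> list[str]:
    """List the expected metric names that are absent from the payload."""
    present = metric_names(exposition)
    return [n for n in expected if n not in present]
```

Feeding it the body of a `/metrics` response and getting an empty list back suggests the checked names are all present.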

Configuration

Enable/Disable Collectors

agent:
  nodeexporter:
    enabled: true
    
    # Listen address for /metrics endpoint
    listen_address: ":9100"
    
    # Metric namespace (default: node)
    namespace: "node"
    
    # Collectors to enable
    collectors:
      loadavg: true
      cpu: true
      meminfo: true
      diskstats: true
      filesystem: true
      netdev: true
      stat: true
      
      # P1 collectors
      netstat: true
      sockstat: true
      vmstat: true
      
      # P2 collectors
      hwmon: false      # Hardware monitoring
      thermal: false    # Thermal zones
      pressure: true    # PSI metrics

Device Filtering

Filter which devices to collect metrics from:

agent:
  nodeexporter:
    filesystem:
      # Ignore these filesystem types
      ignored_fs_types:
        - autofs
        - binfmt_misc
        - cgroup
        - configfs
        - debugfs
        - devpts
        - devtmpfs
        - fusectl
        - hugetlbfs
        - mqueue
        - nsfs
        - overlay
        - proc
        - procfs
        - pstore
        - securityfs
        - sysfs
        - tmpfs
        - tracefs
      
      # Ignore these mount points
      ignored_mount_points:
        - "^/(dev|proc|sys|var/lib/docker/.+)($|/)"
    
    diskstats:
      # Only these devices
      device_include:
        - "^sd[a-z]+$"
        - "^nvme[0-9]+n[0-9]+$"
      
      # Ignore these devices
      device_exclude:
        - "^loop[0-9]+$"
        - "^ram[0-9]+$"
    
    netdev:
      # Ignore these interfaces
      device_exclude:
        - "^veth.*"
        - "^docker.*"
        - "^br-.*"
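
The effect of the include and exclude patterns above can be previewed offline before rolling out a config. This sketch assumes a device is kept when it matches at least one include pattern (or no include list is set) and matches no exclude pattern; that evaluation order is an assumption, and `keep_device` is a hypothetical helper, not a Telegen function.

```python
import re

def keep_device(name, include=(), exclude=()):
    """Assumed semantics: must match an include pattern (if any are set)
    and must not match any exclude pattern."""
    if include and not any(re.search(p, name) for p in include):
        return False
    return not any(re.search(p, name) for p in exclude)

# Patterns copied from the configuration above.
DISK_INCLUDE = [r"^sd[a-z]+$", r"^nvme[0-9]+n[0-9]+$"]
DISK_EXCLUDE = [r"^loop[0-9]+$", r"^ram[0-9]+$"]
NET_EXCLUDE = [r"^veth.*", r"^docker.*", r"^br-.*"]

# Made-up device names for illustration.
disks = ["sda", "sda1", "nvme0n1", "loop0", "dm-0"]
nics = ["eth0", "veth1a2b", "docker0", "br-4f2c", "lo"]

kept_disks = [d for d in disks if keep_device(d, DISK_INCLUDE, DISK_EXCLUDE)]
kept_nics = [n for n in nics if keep_device(n, exclude=NET_EXCLUDE)]
print(kept_disks)  # ['sda', 'nvme0n1']
print(kept_nics)   # ['eth0', 'lo']
```

Note that the anchored include pattern `^sd[a-z]+$` drops partitions such as `sda1`, which may or may not be what you want.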

TLS/mTLS Configuration

Secure the metrics endpoint with TLS and optional mutual TLS (mTLS):

agent:
  nodeexporter:
    enabled: true
    listen_address: ":9100"
    
    # TLS configuration
    tls:
      enabled: true
      
      # Server certificate and key
      cert_file: "/etc/telegen/certs/server.crt"
      key_file: "/etc/telegen/certs/server.key"
      
      # Enable mTLS (client certificate verification)
      client_auth: true
      
      # CA certificate for verifying client certs
      client_ca_file: "/etc/telegen/certs/ca.crt"

| Option | Description | Default |
|---|---|---|
| tls.enabled | Enable TLS for metrics endpoint | false |
| tls.cert_file | Path to server certificate | - |
| tls.key_file | Path to server private key | - |
| tls.client_auth | Require client certificates (mTLS) | false |
| tls.client_ca_file | CA for verifying client certificates | - |
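
A client that scrapes an mTLS-protected endpoint has to present its own certificate. The sketch below shows one way to do that with the Python standard library. The client certificate paths are placeholders, and reusing the same CA file to verify the server certificate is an assumption about how the certificates were issued.

```python
import ssl
import urllib.request

def make_tls_context(ca_file=None, cert_file=None, key_file=None):
    """Build a client-side TLS context.

    ca_file   -- CA used to verify the server certificate (system store if None)
    cert_file -- client certificate, needed only when the server requires mTLS
    key_file  -- private key for the client certificate
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx

def scrape(url, ctx, timeout=5.0):
    """Fetch the metrics payload as text over HTTPS."""
    with urllib.request.urlopen(url, context=ctx, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

# Hypothetical usage; the client cert paths are placeholders, not Telegen defaults:
# ctx = make_tls_context("/etc/telegen/certs/ca.crt",
#                        "/path/to/client.crt", "/path/to/client.key")
# print(scrape("https://localhost:9100/metrics", ctx)[:200])
```

When `tls.client_auth` is true, a request without a client certificate should fail during the TLS handshake, which is a quick way to confirm mTLS is enforced.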

Metric Cardinality Controls

Limit which metrics and labels are exported to prevent a series explosion from high-cardinality labels:

agent:
  nodeexporter:
    enabled: true
    
    # Cardinality controls
    cardinality:
      enabled: true
      
      # Maximum number of metric families
      max_metrics: 1000
      
      # Include only these metrics (regex patterns)
      include_metrics:
        - "node_cpu_.*"
        - "node_memory_.*"
        - "node_disk_.*"
        - "node_filesystem_.*"
        - "node_network_.*"
        - "node_load.*"
      
      # Exclude these metrics (regex patterns)
      exclude_metrics:
        - "node_scrape_.*"
        - "go_.*"
      
      # Drop these labels from all metrics
      drop_labels:
        - "id"
        - "name"

| Option | Description | Default |
|---|---|---|
| cardinality.enabled | Enable cardinality filtering | false |
| cardinality.max_metrics | Maximum metric families (0 = unlimited) | 0 |
| cardinality.include_metrics | Regex patterns to include | [] (all) |
| cardinality.exclude_metrics | Regex patterns to exclude | [] |
| cardinality.drop_labels | Labels to remove from all metrics | [] |
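
To reason about how these options interact, it can help to model them on a small sample. The sketch below is an illustration only: the evaluation order (include, then exclude, then the family cap, then label dropping) and the use of fully anchored regex matching are assumptions, and `apply_cardinality` is a hypothetical helper rather than Telegen code.

```python
import re

def apply_cardinality(samples, include=(), exclude=(), drop_labels=(), max_metrics=0):
    """samples: list of (metric_name, labels_dict, value) tuples.

    Assumed order (not confirmed by the Telegen docs): include, then exclude,
    then the metric family cap, then label dropping.
    """
    kept = []
    families = set()
    for name, labels, value in samples:
        if include and not any(re.fullmatch(p, name) for p in include):
            continue  # not on the include list
        if any(re.fullmatch(p, name) for p in exclude):
            continue  # explicitly excluded
        if name not in families:
            if max_metrics and len(families) >= max_metrics:
                continue  # cap reached: ignore new metric families
            families.add(name)
        trimmed = {k: v for k, v in labels.items() if k not in drop_labels}
        kept.append((name, trimmed, value))
    return kept
```

Running a captured scrape through a model like this gives a rough idea of how many series a given policy would remove before it is deployed.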


API Endpoints

The node exporter provides several HTTP endpoints:

| Endpoint | Description |
|---|---|
| /metrics | Prometheus metrics endpoint |
| /metrics/description | JSON documentation of all metrics with OTEL mappings |
| /health | Health check (JSON) |
| /ready | Readiness probe |
| /live | Liveness probe |

Metric Descriptions Endpoint

The /metrics/description endpoint returns JSON documentation for all available metrics, including OTEL semantic convention mappings:

curl http://localhost:9100/metrics/description

Response:

{
  "categories": [
    {
      "category": "CPU",
      "count": 4,
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "description": "Seconds the CPUs spent in each mode",
          "unit": "s",
          "type": "counter",
          "labels": {"cpu": "cpu", "mode": "cpu.mode"},
          "has_otel_mapping": true
        }
      ]
    }
  ],
  "total": 35,
  "otel_info": {
    "version": "v1.38.0",
    "mapped_count": 35,
    "total_metrics": 35,
    "coverage": "100.0%"
  }
}
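
A response shaped like the one above can be turned into a lookup table from Prometheus names to OTEL names. This is a small sketch that relies only on the fields visible in the sample response; any field not shown there is not assumed.

```python
import json

def otel_name_map(description_json: str) -> dict[str, str]:
    """Map each Prometheus metric name to its OTEL name, skipping unmapped ones."""
    doc = json.loads(description_json)
    mapping = {}
    for category in doc.get("categories", []):
        for metric in category.get("metrics", []):
            if metric.get("has_otel_mapping"):
                mapping[metric["name"]] = metric["otel_name"]
    return mapping
```

For the sample response, the result would contain a single entry mapping `node_cpu_seconds_total` to `system.cpu.time`.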

Prometheus Integration

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'host1:9100'
          - 'host2:9100'

Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):10250'
        replacement: '${1}:9100'
        target_label: __address__
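
The relabel rule above rewrites the kubelet port to the exporter port. Its effect can be mimicked in a few lines to check the regex before deploying it. This is an approximation of the Prometheus `replace` action, which anchors the regex at both ends and leaves the label unchanged when there is no match; `rewrite_address` is a hypothetical helper.

```python
import re

def rewrite_address(address, regex=r"(.+):10250", replacement=r"\g<1>:9100"):
    """Mimic a Prometheus `replace` relabel action on __address__.

    Prometheus anchors relabel regexes at both ends, so fullmatch is used;
    a non-matching address is returned unchanged.
    """
    m = re.fullmatch(regex, address)
    return m.expand(replacement) if m else address

print(rewrite_address("10.0.0.5:10250"))  # 10.0.0.5:9100
print(rewrite_address("10.0.0.5:8080"))   # 10.0.0.5:8080
```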

Grafana Dashboards

Because Telegen exposes the same node_* metric names, standard node_exporter dashboards work without modification.

Import Dashboard

  1. Go to Grafana → Dashboards → Import

  2. Enter dashboard ID (e.g., 1860)

  3. Select Prometheus data source

  4. Dashboard works immediately with Telegen


eBPF-Enhanced Metrics

Telegen adds eBPF-based metrics beyond standard node_exporter:

Additional Metrics

| Metric | Description |
|---|---|
| node_tcp_rtt_microseconds | TCP round-trip time |
| node_tcp_retransmits_total | TCP retransmissions |
| node_process_open_fds | Open file descriptors per process |
| node_cgroup_cpu_usage_seconds_total | Per-cgroup CPU usage |
| node_cgroup_memory_usage_bytes | Per-cgroup memory usage |

Enable eBPF Enhancements

agent:
  nodeexporter:
    enabled: true
    
    # Enable eBPF-enhanced metrics
    ebpf_enhanced: true

Migration from node_exporter

Step 1: Deploy Telegen

Deploy Telegen alongside node_exporter:

# Telegen on different port initially
agent:
  nodeexporter:
    listen_address: ":9101"  # Different port

Step 2: Compare Metrics

Verify metric compatibility:

# Compare CPU metrics (the two targets are distinguished by the instance label)
node_cpu_seconds_total{instance="host:9100"}  # node_exporter
node_cpu_seconds_total{instance="host:9101"}  # Telegen
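
Beyond comparing individual series, it can be useful to diff the full set of metric families exposed by the two endpoints. The sketch below does this from two saved `/metrics` payloads by reading their `# TYPE` lines. It assumes both exporters emit `# TYPE` metadata for every family, and the function names are illustrative.

```python
def family_names(exposition):
    """Collect metric family names from the '# TYPE <name> <type>' lines."""
    names = set()
    for line in exposition.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names

def compare(old_payload, new_payload):
    """Return (missing_in_new, extra_in_new) as sorted lists of family names."""
    old, new = family_names(old_payload), family_names(new_payload)
    return sorted(old - new), sorted(new - old)
```

An empty first list means every family from node_exporter is also present in the Telegen payload; the second list shows what Telegen adds.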

Step 3: Switch Scrape Targets

Update Prometheus to scrape Telegen:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['host:9100']  # Now points to Telegen

Step 4: Remove node_exporter

Once verified, remove node_exporter.


Textfile Collector

Import custom metrics from files:

agent:
  nodeexporter:
    textfile:
      enabled: true
      directory: "/var/lib/node_exporter/textfile_collector"

Create Custom Metrics

# /var/lib/node_exporter/textfile_collector/custom.prom
# HELP node_custom_metric A custom metric
# TYPE node_custom_metric gauge
node_custom_metric{label="value"} 42
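
Scripts that generate these files should avoid exposing a half-written file to the collector. A common approach is to write to a temporary file in the same directory and rename it into place. The following sketch takes that approach; `write_textfile_metric` is a hypothetical helper, and it assumes the collector only reads files ending in `.prom`, as upstream node_exporter does.

```python
import os
import tempfile

def write_textfile_metric(directory, filename, name, value, help_text, labels=None):
    """Write one gauge to <directory>/<filename> via a temp file plus rename."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )
    # The temp file lives in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; let the agent read it
    os.replace(tmp_path, os.path.join(directory, filename))
    return body
```

Calling it with `"custom.prom"`, `"node_custom_metric"`, `42`, and `{"label": "value"}` produces the same three lines as the example file above.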

Performance

Resource Usage

| Metric | Value |
|---|---|
| CPU overhead | < 0.5% |
| Memory | ~20MB |
| Scrape time | < 100ms |

Optimization

For large systems (many disks, interfaces):

agent:
  nodeexporter:
    # Increase scrape timeout
    timeout: 10s
    
    # Reduce collection frequency
    collector_interval: 30s
    
    # Limit concurrent collectors
    max_procs: 2

Common Queries

CPU Usage

# CPU utilization percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-CPU utilization (fraction between 0 and 1; multiply by 100 for percent)
1 - avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
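
The arithmetic behind the utilization query can be checked by hand from two readings of the idle counter. This sketch uses made-up numbers and a plain average rate over the interval, which approximates `rate()` rather than the last-two-samples behaviour of `irate()`.

```python
def cpu_utilization_percent(idle_t0, idle_t1, seconds_elapsed, num_cpus):
    """Approximate `100 - avg(rate(idle)) * 100` from two idle-counter readings.

    idle_t0, idle_t1 -- node_cpu_seconds_total{mode="idle"} summed over all CPUs
    """
    idle_rate_per_cpu = (idle_t1 - idle_t0) / seconds_elapsed / num_cpus
    return 100.0 - idle_rate_per_cpu * 100.0

# 4 CPUs accumulated 180 idle seconds over a 60 second window:
# 180 / 60 / 4 = 0.75 idle per CPU, so utilization is 25%.
print(cpu_utilization_percent(1000.0, 1180.0, 60.0, 4))  # 25.0
```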

Memory Usage

# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Used memory in bytes (excluding buffers and page cache)
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes

Disk I/O

# Disk read/write rate
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Disk I/O utilization (percent of time the device was busy)
rate(node_disk_io_time_seconds_total[5m]) * 100

Network

# Network throughput in bits per second (bytes * 8)
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

Filesystem

# Filesystem usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Inode usage
(1 - node_filesystem_files_free / node_filesystem_files) * 100
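
The usage formula can be cross-checked locally against the values the operating system reports for a mount point. The sketch below uses `os.statvfs`; treating its fields as equivalents of `node_filesystem_size_bytes` and `node_filesystem_avail_bytes` is an approximation, so small differences from the exported values are possible.

```python
import os

def filesystem_usage_percent(path="/"):
    """Compute (1 - avail / size) * 100 for the filesystem containing `path`."""
    st = os.statvfs(path)
    size_bytes = st.f_blocks * st.f_frsize   # roughly node_filesystem_size_bytes
    avail_bytes = st.f_bavail * st.f_frsize  # roughly node_filesystem_avail_bytes
    if size_bytes == 0:
        return 0.0  # pseudo filesystems may report zero size
    return (1 - avail_bytes / size_bytes) * 100

print(f"{filesystem_usage_percent('/'):.1f}% used")
```

Because the formula uses available rather than free bytes, space reserved for root counts as used, which matches what an unprivileged process can actually write.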

Alerting Examples

groups:
  - name: node
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
      
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is running out of memory"
      
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is below 10%"
      
      - alert: HostHighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
