Node Exporter Fusion

Telegen includes a drop-in replacement for Prometheus node_exporter, providing full compatibility with existing dashboards and alerts.

Overview

Node Exporter Fusion provides:

120+ system metrics - Full node_exporter compatibility
node_* namespace - Works with existing dashboards
Zero configuration - Automatically enabled
eBPF-enhanced - Additional metrics via eBPF

Compatibility

Telegen replaces node_exporter while maintaining full compatibility:

Feature	node_exporter	Telegen
Metric namespace	`node_*`	`node_*` ✅
Grafana dashboards	✅	✅
Alert rules	✅	✅
Prometheus scraping	`/metrics`	`/metrics` ✅
Collectors	50+	50+ ✅

Collectors

P0 Collectors (Always Enabled)

Collector	Metrics	Description
loadavg	3	`node_load1`, `node_load5`, `node_load15`
cpu	15+ per core	CPU time per mode, frequency, info
meminfo	50+	Memory statistics from `/proc/meminfo`
diskstats	17+ per device	Disk I/O statistics
filesystem	8 per mount	Filesystem space and inodes
netdev	25+ per interface	Network device statistics
stat	16	Boot time, context switches, interrupts

Sample Metrics

# Load averages
node_load1
node_load5
node_load15

# CPU usage per mode
node_cpu_seconds_total{mode="user"}
node_cpu_seconds_total{mode="system"}
node_cpu_seconds_total{mode="idle"}
node_cpu_seconds_total{mode="iowait"}

# Memory
node_memory_MemTotal_bytes
node_memory_MemFree_bytes
node_memory_MemAvailable_bytes
node_memory_Buffers_bytes
node_memory_Cached_bytes
node_memory_SwapTotal_bytes
node_memory_SwapFree_bytes

# Disk I/O
node_disk_read_bytes_total
node_disk_written_bytes_total
node_disk_reads_completed_total
node_disk_writes_completed_total
node_disk_io_time_seconds_total
node_disk_read_time_seconds_total
node_disk_write_time_seconds_total

# Filesystem
node_filesystem_size_bytes
node_filesystem_free_bytes
node_filesystem_avail_bytes
node_filesystem_files
node_filesystem_files_free

# Network
node_network_receive_bytes_total
node_network_transmit_bytes_total
node_network_receive_packets_total
node_network_transmit_packets_total
node_network_receive_errs_total
node_network_transmit_errs_total
node_network_receive_drop_total
node_network_transmit_drop_total

# System
node_boot_time_seconds
node_context_switches_total
node_forks_total
node_intr_total
node_procs_running
node_procs_blocked

Configuration

Enable/Disable Collectors

agent:
  nodeexporter:
    enabled: true
    
    # Listen address for /metrics endpoint
    listen_address: ":9100"
    
    # Metric namespace (default: node)
    namespace: "node"
    
    # Collectors to enable
    collectors:
      loadavg: true
      cpu: true
      meminfo: true
      diskstats: true
      filesystem: true
      netdev: true
      stat: true
      
      # P1 collectors
      netstat: true
      sockstat: true
      vmstat: true
      
      # P2 collectors
      hwmon: false      # Hardware monitoring
      thermal: false    # Thermal zones
      pressure: true    # PSI metrics

Device Filtering

Filter which devices to collect metrics from:

agent:
  nodeexporter:
    filesystem:
      # Ignore these filesystem types
      ignored_fs_types:
        - autofs
        - binfmt_misc
        - cgroup
        - configfs
        - debugfs
        - devpts
        - devtmpfs
        - fusectl
        - hugetlbfs
        - mqueue
        - nsfs
        - overlay
        - proc
        - procfs
        - pstore
        - securityfs
        - sysfs
        - tmpfs
        - tracefs
      
      # Ignore these mount points
      ignored_mount_points:
        - "^/(dev|proc|sys|var/lib/docker/.+)($|/)"
    
    diskstats:
      # Only these devices
      device_include:
        - "^sd[a-z]+$"
        - "^nvme[0-9]+n[0-9]+$"
      
      # Ignore these devices
      device_exclude:
        - "^loop[0-9]+$"
        - "^ram[0-9]+$"
    
    netdev:
      # Ignore these interfaces
      device_exclude:
        - "^veth.*"
        - "^docker.*"
        - "^br-.*"
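
The include and exclude lists above are regular expressions matched against device names. The following Python sketch shows one plausible way such filters combine. The evaluation order (include first, then exclude, with exclude winning) is an assumption for illustration, not confirmed Telegen behavior, and `keep_device` is a hypothetical helper.

```python
import re

# Patterns copied from the diskstats example above.
DEVICE_INCLUDE = [r"^sd[a-z]+$", r"^nvme[0-9]+n[0-9]+$"]
DEVICE_EXCLUDE = [r"^loop[0-9]+$", r"^ram[0-9]+$"]


def keep_device(name, include=DEVICE_INCLUDE, exclude=DEVICE_EXCLUDE):
    """Return True if a device would be collected.

    Assumed semantics: an empty include list means "all devices",
    and an exclude match always overrides an include match.
    """
    if include and not any(re.search(p, name) for p in include):
        return False
    return not any(re.search(p, name) for p in exclude)


devices = ["sda", "sda1", "nvme0n1", "loop0", "ram3", "dm-0"]
kept = [d for d in devices if keep_device(d)]
print(kept)
```

Note that `^sd[a-z]+$` matches whole disks such as `sda` but not partitions such as `sda1`, so testing patterns against real device names before deploying them is worthwhile.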

TLS/mTLS Configuration

Secure the metrics endpoint with TLS and optional mutual TLS (mTLS):

agent:
  nodeexporter:
    enabled: true
    listen_address: ":9100"
    
    # TLS configuration
    tls:
      enabled: true
      
      # Server certificate and key
      cert_file: "/etc/telegen/certs/server.crt"
      key_file: "/etc/telegen/certs/server.key"
      
      # Enable mTLS (client certificate verification)
      client_auth: true
      
      # CA certificate for verifying client certs
      client_ca_file: "/etc/telegen/certs/ca.crt"

Option	Description	Default
`tls.enabled`	Enable TLS for metrics endpoint	`false`
`tls.cert_file`	Path to server certificate	-
`tls.key_file`	Path to server private key	-
`tls.client_auth`	Require client certificates (mTLS)	`false`
`tls.client_ca_file`	CA for verifying client certificates	-
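
With `tls.client_auth` enabled, a scraping client has to present its own certificate. The sketch below builds a client-side TLS context with Python's standard library. The client certificate and key paths in the comments are hypothetical, since the configuration above only names the server-side files.

```python
import ssl
import urllib.request


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Create a TLS context that verifies the server and, optionally,
    presents a client certificate for mTLS."""
    # cafile=None falls back to the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    if cert_file and key_file:
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context


def scrape(url, context, timeout=5.0):
    """Fetch the metrics page over HTTPS and return it as text."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (hypothetical client certificate paths):
# ctx = build_client_context(
#     ca_file="/etc/telegen/certs/ca.crt",
#     cert_file="/etc/telegen/certs/client.crt",
#     key_file="/etc/telegen/certs/client.key",
# )
# print(scrape("https://localhost:9100/metrics", ctx)[:200])
```

Prometheus itself would use the `tls_config` block of a scrape job for the same purpose. This script is only a quick way to confirm that the endpoint accepts the certificate.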

Metric Cardinality Controls

Control metric cardinality to prevent explosion from high-cardinality labels:

agent:
  nodeexporter:
    enabled: true
    
    # Cardinality controls
    cardinality:
      enabled: true
      
      # Maximum number of metric families
      max_metrics: 1000
      
      # Include only these metrics (regex patterns)
      include_metrics:
        - "node_cpu_.*"
        - "node_memory_.*"
        - "node_disk_.*"
        - "node_filesystem_.*"
        - "node_network_.*"
        - "node_load.*"
      
      # Exclude these metrics (regex patterns)
      exclude_metrics:
        - "node_scrape_.*"
        - "go_.*"
      
      # Drop these labels from all metrics
      drop_labels:
        - "id"
        - "name"

Option	Description	Default
`cardinality.enabled`	Enable cardinality filtering	`false`
`cardinality.max_metrics`	Maximum metric families (0 = unlimited)	`0`
`cardinality.include_metrics`	Regex patterns to include	`[]` (all)
`cardinality.exclude_metrics`	Regex patterns to exclude	`[]`
`cardinality.drop_labels`	Labels to remove from all metrics	`[]`
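
To make the interaction of these options concrete, here is a minimal Python sketch of an include/exclude/drop-labels filter. It assumes patterns must match the whole metric name and that exclusion is applied after inclusion. Both are assumptions for illustration, and `filter_sample` is a hypothetical name.

```python
import re

# Subset of the patterns from the configuration above.
INCLUDE = [r"node_cpu_.*", r"node_memory_.*", r"node_load.*"]
EXCLUDE = [r"node_scrape_.*", r"go_.*"]
DROP_LABELS = {"id", "name"}


def filter_sample(name, labels):
    """Return (name, labels) after filtering, or None if the metric is dropped."""
    # Assumption: an empty include list admits every metric.
    if INCLUDE and not any(re.fullmatch(p, name) for p in INCLUDE):
        return None
    if any(re.fullmatch(p, name) for p in EXCLUDE):
        return None
    kept = {k: v for k, v in labels.items() if k not in DROP_LABELS}
    return name, kept


print(filter_sample("node_cpu_seconds_total", {"cpu": "0", "mode": "idle", "id": "x"}))
print(filter_sample("go_goroutines", {}))
```

One thing to watch with `drop_labels`: if two series differ only in a dropped label, they become identical after filtering, so it is safest to drop only labels that do not distinguish series.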

API Endpoints

The node exporter provides several HTTP endpoints:

Endpoint	Description
`/metrics`	Prometheus metrics endpoint
`/metrics/description`	JSON documentation of all metrics with OTEL mappings
`/health`	Health check (JSON)
`/ready`	Readiness probe
`/live`	Liveness probe

Metric Descriptions Endpoint

The /metrics/description endpoint returns JSON documentation for all available metrics, including OTEL semantic convention mappings:

curl http://localhost:9100/metrics/description

Response:

{
  "categories": [
    {
      "category": "CPU",
      "count": 4,
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "description": "Seconds the CPUs spent in each mode",
          "unit": "s",
          "type": "counter",
          "labels": {"cpu": "cpu", "mode": "cpu.mode"},
          "has_otel_mapping": true
        }
      ]
    }
  ],
  "total": 35,
  "otel_info": {
    "version": "v1.38.0",
    "mapped_count": 35,
    "total_metrics": 35,
    "coverage": "100.0%"
  }
}
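
As a usage sketch, the response can be turned into a lookup table from Prometheus names to OTEL names. The Python below works on a trimmed copy of the sample response above; `otel_mappings` is a hypothetical helper, and the field names are taken from that sample only.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "categories": [
    {
      "category": "CPU",
      "count": 4,
      "metrics": [
        {
          "name": "node_cpu_seconds_total",
          "otel_name": "system.cpu.time",
          "unit": "s",
          "type": "counter",
          "labels": {"cpu": "cpu", "mode": "cpu.mode"},
          "has_otel_mapping": true
        }
      ]
    }
  ],
  "total": 35
}
"""


def otel_mappings(doc):
    """Map each Prometheus metric name to its OTEL name, where one exists."""
    mapping = {}
    for category in doc.get("categories", []):
        for metric in category.get("metrics", []):
            if metric.get("has_otel_mapping"):
                mapping[metric["name"]] = metric["otel_name"]
    return mapping


doc = json.loads(SAMPLE)
print(otel_mappings(doc))
```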

Prometheus Integration

Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets:
          - 'host1:9100'
          - 'host2:9100'

Service Discovery (Kubernetes)

scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):10250'
        replacement: '${1}:9100'
        target_label: __address__
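
The relabel rule above rewrites the kubelet port (10250) to the exporter port (9100). The Python sketch below mimics that rewrite so the regex can be checked offline. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` approximates; `rewrite_address` is a hypothetical helper, not part of Prometheus.

```python
import re


def rewrite_address(address, regex=r"(.+):10250", replacement=r"\1:9100"):
    """Apply a replace-style relabel rule to a single address string."""
    match = re.fullmatch(regex, address)
    if match is None:
        # As with a relabel "replace" action, a non-matching
        # value is left unchanged.
        return address
    return match.expand(replacement)


print(rewrite_address("10.0.0.5:10250"))
print(rewrite_address("10.0.0.5:9090"))
```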

Grafana Dashboards

Telegen is compatible with standard node_exporter dashboards:

Recommended Dashboards

Dashboard	Grafana ID	Description
Node Exporter Full	1860	Comprehensive system metrics
Node Exporter for Prometheus	11074	Clean, modern layout
Linux Server Metrics	180	Classic dashboard

Import Dashboard

1. Go to Grafana → Dashboards → Import
2. Enter the dashboard ID (e.g., 1860)
3. Select the Prometheus data source
4. The dashboard works immediately with Telegen

eBPF-Enhanced Metrics

Telegen adds eBPF-based metrics beyond standard node_exporter:

Additional Metrics

Metric	Description
`node_tcp_rtt_microseconds`	TCP round-trip time
`node_tcp_retransmits_total`	TCP retransmissions
`node_process_open_fds`	Open file descriptors per process
`node_cgroup_cpu_usage_seconds_total`	Per-cgroup CPU usage
`node_cgroup_memory_usage_bytes`	Per-cgroup memory usage

Enable eBPF Enhancements

agent:
  nodeexporter:
    enabled: true
    
    # Enable eBPF-enhanced metrics
    ebpf_enhanced: true

Migration from node_exporter

Step 1: Deploy Telegen

Deploy Telegen alongside node_exporter:

# Telegen on different port initially
agent:
  nodeexporter:
    listen_address: ":9101"  # Different port

Step 2: Compare Metrics

Verify metric compatibility:

# Compare CPU metrics
node_cpu_seconds_total{instance="host:9100"}  # node_exporter
node_cpu_seconds_total{instance="host:9101"}  # Telegen
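
Compatibility can also be checked outside Prometheus by comparing the metric names each endpoint exposes. The following Python sketch diffs two text-format payloads. The inline samples are made up for illustration; in practice the two strings would come from fetching `/metrics` on ports 9100 and 9101.

```python
def metric_names(exposition_text):
    """Collect the set of metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def compare(reference_text, candidate_text):
    """Return (missing, extra) metric names of the candidate vs. the reference."""
    ref = metric_names(reference_text)
    cand = metric_names(candidate_text)
    return sorted(ref - cand), sorted(cand - ref)


# Illustrative payloads, not real scrape output.
node_exporter_out = """\
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""
telegen_out = """\
node_load1 0.40
node_cpu_seconds_total{cpu="0",mode="idle"} 1230.1
node_tcp_retransmits_total 7
"""

missing, extra = compare(node_exporter_out, telegen_out)
print("missing from Telegen:", missing)
print("only in Telegen:", extra)
```

An empty `missing` list suggests that existing dashboards and alerts will find every series they expect. Names that appear only in Telegen are additions and should not affect them.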

Step 3: Switch Scrape Targets

Update Prometheus to scrape Telegen:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['host:9101']  # Telegen's port from Step 1

Step 4: Remove node_exporter

Once verified, remove node_exporter.

Textfile Collector

Import custom metrics from files:

agent:
  nodeexporter:
    textfile:
      enabled: true
      directory: "/var/lib/node_exporter/textfile_collector"

Create Custom Metrics

# /var/lib/node_exporter/textfile_collector/custom.prom
# HELP node_custom_metric A custom metric
# TYPE node_custom_metric gauge
node_custom_metric{label="value"} 42
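
When a script generates these files, a scrape can land in the middle of a write. A common pattern with node_exporter's textfile collector is to write to a temporary file and rename it into place. The sketch below assumes Telegen, like node_exporter, reads only `*.prom` files; `write_prom_file` is a hypothetical helper.

```python
import os
import tempfile


def write_prom_file(directory, filename, lines):
    """Write metrics to a temp file, then rename it into place atomically."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write("\n".join(lines) + "\n")
        # mkstemp creates the file as 0600; make it readable by the agent.
        os.chmod(tmp_path, 0o644)
        # Atomic on POSIX when source and target share a filesystem.
        os.replace(tmp_path, os.path.join(directory, filename))
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demo in a throwaway directory instead of the real collector path.
with tempfile.TemporaryDirectory() as demo_dir:
    write_prom_file(
        demo_dir,
        "custom.prom",
        [
            "# HELP node_custom_metric A custom metric",
            "# TYPE node_custom_metric gauge",
            'node_custom_metric{label="value"} 42',
        ],
    )
    print(sorted(os.listdir(demo_dir)))
```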

Performance

Resource Usage

Metric	Value
CPU overhead	< 0.5%
Memory	~20MB
Scrape time	< 100ms

Optimization

For large systems with many disks or network interfaces, raise the timeout and reduce how often and how concurrently collectors run:

agent:
  nodeexporter:
    # Increase scrape timeout
    timeout: 10s
    
    # Reduce collection frequency
    collector_interval: 30s
    
    # Limit concurrent collectors
    max_procs: 2

Common Queries

CPU Usage

# CPU utilization percentage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Per-CPU utilization (fraction between 0 and 1)
1 - avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
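
As a worked example of the first query: `irate` takes the last two samples in the window and computes the per-second increase of the idle counter. Averaging that over CPUs gives the idle fraction, and subtracting from 100 gives utilization. The Python sketch below reproduces the arithmetic with made-up samples.

```python
def idle_rate(t0, idle0, t1, idle1):
    """Per-second increase of an idle counter between two samples."""
    return (idle1 - idle0) / (t1 - t0)


def cpu_utilization_percent(per_cpu_samples):
    """100 minus the average idle rate across CPUs, as a percentage."""
    rates = [idle_rate(*sample) for sample in per_cpu_samples]
    return 100 - (sum(rates) / len(rates)) * 100


# (t0, idle0, t1, idle1) per CPU; values are illustrative.
samples = [
    (0.0, 1000.0, 20.0, 1015.0),  # cpu0: 15 s idle in 20 s
    (0.0, 2000.0, 20.0, 2005.0),  # cpu1: 5 s idle in 20 s
]
print(cpu_utilization_percent(samples))
```

Here cpu0 is 75% idle and cpu1 is 25% idle, so the host as a whole is about 50% utilized.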

Memory Usage

# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Used memory in bytes, excluding buffers and page cache
node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes
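
The two memory queries answer different questions, which a small worked example makes visible. The readings below are hypothetical values for a 16 GiB host, and the functions mirror the two PromQL expressions.

```python
GIB = 1024 ** 3

# Hypothetical readings for a 16 GiB host.
mem = {
    "MemTotal": 16 * GIB,
    "MemFree": 2 * GIB,
    "MemAvailable": 9 * GIB,
    "Buffers": 1 * GIB,
    "Cached": 5 * GIB,
}


def utilization_percent(m):
    """Mirror of (1 - MemAvailable / MemTotal) * 100."""
    return (1 - m["MemAvailable"] / m["MemTotal"]) * 100


def used_bytes(m):
    """Mirror of MemTotal - MemFree - Buffers - Cached."""
    return m["MemTotal"] - m["MemFree"] - m["Buffers"] - m["Cached"]


print(utilization_percent(mem))
print(used_bytes(mem) / GIB)
```

With these numbers the first expression reports 43.75% utilization while the second reports 8 GiB used (50% of total). The gap arises because `MemAvailable` is the kernel's own estimate of reclaimable memory and need not equal free plus buffers plus cache.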

Disk I/O

# Disk read/write rate
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

# Disk I/O utilization (percentage of time the device was busy)
rate(node_disk_io_time_seconds_total[5m]) * 100

Network

# Network throughput in bits per second (bytes * 8)
rate(node_network_receive_bytes_total[5m]) * 8
rate(node_network_transmit_bytes_total[5m]) * 8

# Packet errors
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])

Filesystem

# Filesystem usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100

# Inode usage
(1 - node_filesystem_files_free / node_filesystem_files) * 100

Alerting Examples

groups:
  - name: node
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
      
      - alert: HostOutOfMemory
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} is running out of memory"
      
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host {{ $labels.instance }} disk space is below 10%"
      
      - alert: HostHighDiskIO
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"

Next Steps

Auto-Discovery - Automatic service detection
Distributed Tracing - Application tracing
Agent Mode Configuration - Full configuration