# Node Exporter Fusion Telegen includes a drop-in replacement for Prometheus node_exporter, providing full compatibility with existing dashboards and alerts. ## Overview Node Exporter Fusion provides: - **120+ system metrics** - Full node_exporter compatibility - **`node_*` namespace** - Works with existing dashboards - **Zero configuration** - Automatically enabled - **eBPF enhanced** - Additional metrics via eBPF --- ## Compatibility Telegen replaces node_exporter while maintaining full compatibility: | Feature | node_exporter | Telegen | |---------|---------------|---------| | Metric namespace | `node_*` | `node_*` ✅ | | Grafana dashboards | ✅ | ✅ | | Alert rules | ✅ | ✅ | | Prometheus scraping | `/metrics` | `/metrics` ✅ | | Collectors | 50+ | 50+ ✅ | --- ## Collectors ### P0 Collectors (Always Enabled) | Collector | Metrics | Description | |-----------|---------|-------------| | **loadavg** | 3 | `node_load1`, `node_load5`, `node_load15` | | **cpu** | 15+ per core | CPU time per mode, frequency, info | | **meminfo** | 50+ | Memory statistics from `/proc/meminfo` | | **diskstats** | 17+ per device | Disk I/O statistics | | **filesystem** | 8 per mount | Filesystem space and inodes | | **netdev** | 25+ per interface | Network device statistics | | **stat** | 16 | Boot time, context switches, interrupts | ### Sample Metrics ```promql # Load averages node_load1 node_load5 node_load15 # CPU usage per mode node_cpu_seconds_total{mode="user"} node_cpu_seconds_total{mode="system"} node_cpu_seconds_total{mode="idle"} node_cpu_seconds_total{mode="iowait"} # Memory node_memory_MemTotal_bytes node_memory_MemFree_bytes node_memory_MemAvailable_bytes node_memory_Buffers_bytes node_memory_Cached_bytes node_memory_SwapTotal_bytes node_memory_SwapFree_bytes # Disk I/O node_disk_read_bytes_total node_disk_written_bytes_total node_disk_reads_completed_total node_disk_writes_completed_total node_disk_io_time_seconds_total node_disk_read_time_seconds_total node_disk_write_time_seconds_total # Filesystem node_filesystem_size_bytes node_filesystem_free_bytes node_filesystem_avail_bytes node_filesystem_files node_filesystem_files_free # Network node_network_receive_bytes_total node_network_transmit_bytes_total node_network_receive_packets_total node_network_transmit_packets_total node_network_receive_errs_total node_network_transmit_errs_total node_network_receive_drop_total node_network_transmit_drop_total # System node_boot_time_seconds node_context_switches_total node_forks_total node_intr_total node_procs_running node_procs_blocked ``` --- ## Configuration ### Enable/Disable Collectors ```yaml agent: nodeexporter: enabled: true # Listen address for /metrics endpoint listen_address: ":9100" # Metric namespace (default: node) namespace: "node" # Collectors to enable collectors: loadavg: true cpu: true meminfo: true diskstats: true filesystem: true netdev: true stat: true # P1 collectors netstat: true sockstat: true vmstat: true # P2 collectors hwmon: false # Hardware monitoring thermal: false # Thermal zones pressure: true # PSI metrics ``` ### Device Filtering Filter which devices to collect metrics from: ```yaml agent: nodeexporter: filesystem: # Ignore these filesystem types ignored_fs_types: - autofs - binfmt_misc - cgroup - configfs - debugfs - devpts - devtmpfs - fusectl - hugetlbfs - mqueue - nsfs - overlay - proc - procfs - pstore - securityfs - sysfs - tmpfs - tracefs # Ignore these mount points ignored_mount_points: - "^/(dev|proc|sys|var/lib/docker/.+)($|/)" diskstats: # Only these devices device_include: - "^sd[a-z]+$" - "^nvme[0-9]+n[0-9]+$" # Ignore these devices device_exclude: - "^loop[0-9]+$" - "^ram[0-9]+$" netdev: # Ignore these interfaces device_exclude: - "^veth.*" - "^docker.*" - "^br-.*" ``` ### TLS/mTLS Configuration Secure the metrics endpoint with TLS and optional mutual TLS (mTLS): ```yaml agent: nodeexporter: enabled: true listen_address: ":9100" # TLS configuration tls: enabled: true # Server certificate and key cert_file: "/etc/telegen/certs/server.crt" key_file: "/etc/telegen/certs/server.key" # Enable mTLS (client certificate verification) client_auth: true # CA certificate for verifying client certs client_ca_file: "/etc/telegen/certs/ca.crt" ``` | Option | Description | Default | |--------|-------------|---------| | `tls.enabled` | Enable TLS for metrics endpoint | `false` | | `tls.cert_file` | Path to server certificate | - | | `tls.key_file` | Path to server private key | - | | `tls.client_auth` | Require client certificates (mTLS) | `false` | | `tls.client_ca_file` | CA for verifying client certificates | - | ### Metric Cardinality Controls Control metric cardinality to prevent explosion from high-cardinality labels: ```yaml agent: nodeexporter: enabled: true # Cardinality controls cardinality: enabled: true # Maximum number of metric families max_metrics: 1000 # Include only these metrics (regex patterns) include_metrics: - "node_cpu_.*" - "node_memory_.*" - "node_disk_.*" - "node_filesystem_.*" - "node_network_.*" - "node_load.*" # Exclude these metrics (regex patterns) exclude_metrics: - "node_scrape_.*" - "go_.*" # Drop these labels from all metrics drop_labels: - "id" - "name" ``` | Option | Description | Default | |--------|-------------|---------| | `cardinality.enabled` | Enable cardinality filtering | `false` | | `cardinality.max_metrics` | Maximum metric families (0 = unlimited) | `0` | | `cardinality.include_metrics` | Regex patterns to include | `[]` (all) | | `cardinality.exclude_metrics` | Regex patterns to exclude | `[]` | | `cardinality.drop_labels` | Labels to remove from all metrics | `[]` | --- ## API Endpoints The node exporter provides several HTTP endpoints: | Endpoint | Description | |----------|-------------| | `/metrics` | Prometheus metrics endpoint | | `/metrics/description` | JSON documentation of all metrics with OTEL mappings | | `/health` | Health check (JSON) | | `/ready` | Readiness probe | | `/live` | Liveness probe | ### Metric Descriptions Endpoint The `/metrics/description` endpoint returns JSON documentation for all available metrics, including OTEL semantic convention mappings: ```bash curl http://localhost:9100/metrics/description ``` Response: ```json { "categories": [ { "category": "CPU", "count": 4, "metrics": [ { "name": "node_cpu_seconds_total", "otel_name": "system.cpu.time", "description": "Seconds the CPUs spent in each mode", "unit": "s", "type": "counter", "labels": {"cpu": "cpu", "mode": "cpu.mode"}, "has_otel_mapping": true } ] } ], "total": 35, "otel_info": { "version": "v1.38.0", "mapped_count": 35, "total_metrics": 35, "coverage": "100.0%" } } ``` --- ## Prometheus Integration ### Scrape Configuration ```yaml # prometheus.yml scrape_configs: - job_name: 'node' static_configs: - targets: - 'host1:9100' - 'host2:9100' ``` ### Service Discovery (Kubernetes) ```yaml scrape_configs: - job_name: 'kubernetes-nodes' kubernetes_sd_configs: - role: node relabel_configs: - source_labels: [__address__] regex: '(.+):10250' replacement: '${1}:9100' target_label: __address__ ``` --- ## Grafana Dashboards Telegen is compatible with standard node_exporter dashboards: ### Recommended Dashboards | Dashboard | Grafana ID | Description | |-----------|------------|-------------| | Node Exporter Full | 1860 | Comprehensive system metrics | | Node Exporter for Prometheus | 11074 | Clean, modern layout | | Linux Server Metrics | 180 | Classic dashboard | ### Import Dashboard 1. Go to Grafana → Dashboards → Import 2. Enter dashboard ID (e.g., `1860`) 3. Select Prometheus data source 4. Dashboard works immediately with Telegen --- ## eBPF-Enhanced Metrics Telegen adds eBPF-based metrics beyond standard node_exporter: ### Additional Metrics | Metric | Description | |--------|-------------| | `node_tcp_rtt_microseconds` | TCP round-trip time | | `node_tcp_retransmits_total` | TCP retransmissions | | `node_process_open_fds` | Open file descriptors per process | | `node_cgroup_cpu_usage_seconds_total` | Per-cgroup CPU usage | | `node_cgroup_memory_usage_bytes` | Per-cgroup memory usage | ### Enable eBPF Enhancements ```yaml agent: nodeexporter: enabled: true # Enable eBPF-enhanced metrics ebpf_enhanced: true ``` --- ## Migration from node_exporter ### Step 1: Deploy Telegen Deploy Telegen alongside node_exporter: ```bash # Telegen on different port initially agent: nodeexporter: listen_address: ":9101" # Different port ``` ### Step 2: Compare Metrics Verify metric compatibility: ```promql # Compare CPU metrics node_cpu_seconds_total{port="9100"} # node_exporter node_cpu_seconds_total{port="9101"} # Telegen ``` ### Step 3: Switch Scrape Targets Update Prometheus to scrape Telegen: ```yaml scrape_configs: - job_name: 'node' static_configs: - targets: ['host:9100'] # Now points to Telegen ``` ### Step 4: Remove node_exporter Once verified, remove node_exporter. --- ## Textfile Collector Import custom metrics from files: ```yaml agent: nodeexporter: textfile: enabled: true directory: "/var/lib/node_exporter/textfile_collector" ``` ### Create Custom Metrics ```bash # /var/lib/node_exporter/textfile_collector/custom.prom # HELP node_custom_metric A custom metric # TYPE node_custom_metric gauge node_custom_metric{label="value"} 42 ``` --- ## Performance ### Resource Usage | Metric | Value | |--------|-------| | CPU overhead | < 0.5% | | Memory | ~20MB | | Scrape time | < 100ms | ### Optimization For large systems (many disks, interfaces): ```yaml agent: nodeexporter: # Increase scrape timeout timeout: 10s # Reduce collection frequency collector_interval: 30s # Limit concurrent collectors max_procs: 2 ``` --- ## Common Queries ### CPU Usage ```promql # CPU utilization percentage 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) # Per-CPU utilization 1 - avg by(instance, cpu) (irate(node_cpu_seconds_total{mode="idle"}[5m])) ``` ### Memory Usage ```promql # Memory utilization percentage (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 # Memory breakdown node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes ``` ### Disk I/O ```promql # Disk read/write rate rate(node_disk_read_bytes_total[5m]) rate(node_disk_written_bytes_total[5m]) # Disk I/O utilization rate(node_disk_io_time_seconds_total[5m]) * 100 ``` ### Network ```promql # Network throughput rate(node_network_receive_bytes_total[5m]) * 8 rate(node_network_transmit_bytes_total[5m]) * 8 # Packet errors rate(node_network_receive_errs_total[5m]) rate(node_network_transmit_errs_total[5m]) ``` ### Filesystem ```promql # Filesystem usage percentage (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 # Inode usage (1 - node_filesystem_files_free / node_filesystem_files) * 100 ``` --- ## Alerting Examples ```yaml groups: - name: node rules: - alert: HostHighCpuLoad expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80 for: 5m labels: severity: warning annotations: summary: "High CPU load on {{ $labels.instance }}" - alert: HostOutOfMemory expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10 for: 5m labels: severity: critical annotations: summary: "Host {{ $labels.instance }} is running out of memory" - alert: HostOutOfDiskSpace expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10 for: 5m labels: severity: critical annotations: summary: "Host {{ $labels.instance }} disk space is below 10%" - alert: HostHighDiskIO expr: rate(node_disk_io_time_seconds_total[5m]) > 0.8 for: 5m labels: severity: warning annotations: summary: "High disk I/O on {{ $labels.instance }}" ``` --- ## Next Steps - {doc}`auto-discovery` - Automatic service detection - {doc}`distributed-tracing` - Application tracing - {doc}`../configuration/agent-mode` - Full configuration