Performance Tuning

Optimize Telegen for your environment and workload.

Resource Guidelines

Default Resource Requirements

Component	CPU	Memory
Agent (minimal)	0.1 cores	128MB
Agent (full features)	0.5 cores	512MB
Agent (high volume)	1.0 cores	1GB
Collector (SNMP)	0.2 cores	256MB
Collector (storage)	0.3 cores	384MB

Kubernetes Resources

resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"

Ring Buffer Tuning

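The ring buffer is the primary channel through which eBPF programs in the kernel deliver events to the user-space agent. If events arrive faster than the agent drains them, the buffer fills and new events are lost.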

Sizing

Buffer Size	Use Case	Event Capacity
4MB	Low traffic, testing	~40K events
16MB	Default, balanced	~160K events
64MB	High traffic	~640K events
256MB	Very high volume	~2.5M events
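The capacities in the table work out to roughly 100 bytes per event (4MB for ~40K events). The sketch below reproduces that arithmetic and adds a rough headroom estimate. The 100-byte average is an assumption inferred from the table, not a documented Telegen figure, and the function names are hypothetical.

```python
# Assumption: ~100 bytes per event, inferred from the sizing table.
ASSUMED_EVENT_BYTES = 100


def event_capacity(buffer_bytes: int, event_bytes: int = ASSUMED_EVENT_BYTES) -> int:
    """Approximate number of events that fit in a buffer of the given size."""
    return buffer_bytes // event_bytes


def seconds_until_full(buffer_bytes: int, events_per_second: float,
                       event_bytes: int = ASSUMED_EVENT_BYTES) -> float:
    """How long the buffer absorbs events if the consumer stalls completely."""
    return event_capacity(buffer_bytes, event_bytes) / events_per_second


if __name__ == "__main__":
    for mb in (4, 16, 64, 256):
        size = mb * 1024 * 1024
        print(f"{mb:>3} MB -> ~{event_capacity(size):,} events, "
              f"{seconds_until_full(size, 50_000):.2f}s headroom at 50K events/s")
```

If your events are larger than the assumed average, capacity shrinks proportionally, so treat the table as an order-of-magnitude guide.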

Configuration

agent:
  ebpf:
    ringbuf_size: 16777216  # 16MB (default)

Signs You Need a Larger Buffer

# High loss rate: more than 100 events lost per second, averaged over 5 minutes
rate(telegen_ebpf_ringbuf_lost_total[5m]) > 100

If events are being lost, increase the buffer size:

agent:
  ebpf:
    ringbuf_size: 67108864  # 64MB
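The `ringbuf_size` values in this guide are byte counts that are powers of two. Linux BPF ring buffer maps require a power-of-two size that is a multiple of the page size. Whether Telegen rounds or rejects other values is not stated here, so the sketch below assumes you want to compute a valid value yourself. The helper names are hypothetical.

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes."""
    return mib * 1024 * 1024


def is_power_of_two(n: int) -> bool:
    """True when n is a positive power of two."""
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


if __name__ == "__main__":
    wanted = mib_to_bytes(100)  # 100 MiB is not a power of two
    print(wanted, is_power_of_two(wanted), next_power_of_two(wanted))
```

For example, a target of 100 MiB rounds up to 134217728 bytes (128 MiB).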

CPU Optimization

Reduce Collection Overhead

Limit traced ports

agent:
  ebpf:
    network:
      include_ports:
        - 80
        - 443
        - 8080
      exclude_ports:
        - 22
        - 2379

Reduce syscall tracing

agent:
  ebpf:
    syscalls:
      enabled: false  # Disable if not needed

Limit profiling frequency

agent:
  profiling:
    sample_rate: 49  # Lower than default 99 Hz

Parallel Processing

agent:
  processing:
    workers: 4  # Match available CPU cores

Memory Optimization

Queue Limits

queues:
  traces:
    mem_limit: "128Mi"
    max_age: "1h"
    batch_size: 256
  
  metrics:
    mem_limit: "64Mi"
    max_age: "5m"
    batch_size: 500
  
  logs:
    mem_limit: "128Mi"
    max_age: "6h"
    batch_size: 500
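To check that the queue limits fit inside the container memory limit, it can help to add them up. The sketch below assumes each `mem_limit` is an independent per-queue cap and that quantities use binary suffixes (`Ki`, `Mi`, `Gi`). The parser is a hypothetical helper, not part of Telegen.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity: str) -> int:
    """Parse a binary-suffixed quantity such as '128Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain byte count


def queue_budget(limits: dict[str, str]) -> int:
    """Sum of all queue memory limits, in bytes."""
    return sum(parse_quantity(value) for value in limits.values())


if __name__ == "__main__":
    limits = {"traces": "128Mi", "metrics": "64Mi", "logs": "128Mi"}
    total = queue_budget(limits)
    container_limit = parse_quantity("1Gi")
    print(f"queues: {total // 1024 ** 2}Mi of {container_limit // 1024 ** 2}Mi")
```

With the values above, the queues can hold up to 320Mi, which leaves the rest of a 1Gi limit for the agent itself.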

Reduce Cardinality

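High-cardinality labels increase memory usage because every distinct combination of label values is tracked as a separate series. Allowlist only the labels you need: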

agent:
  kubernetes:
    # Only essential labels
    label_allowlist:
      - "app"
      - "version"
    # NOT: "*"

Limit Active Connections Tracked

agent:
  ebpf:
    network:
      # Limit tracked connections
      max_connections: 50000  # Default: 100000

Network/Export Optimization

Compression

otlp:
  compression: gzip  # Reduce bandwidth

Batching

queues:
  traces:
    batch_size: 512     # Larger batches = fewer requests
    flush_interval: 5s  # Cap on how long a partial batch waits before it is sent
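The trade-off between `batch_size` and `flush_interval` can be estimated with simple arithmetic. The sketch below assumes a batch is sent when it is full or when the flush interval elapses, whichever comes first. That behavior is an assumption about the exporter, and the function names are hypothetical.

```python
def export_requests_per_second(items_per_second: float, batch_size: int) -> float:
    """Export requests per second when every batch fills completely."""
    return items_per_second / batch_size


def effective_batch_wait(items_per_second: float, batch_size: int,
                         flush_interval_s: float) -> float:
    """Seconds until a batch is sent: time to fill, capped by the flush interval."""
    return min(batch_size / items_per_second, flush_interval_s)


if __name__ == "__main__":
    for rate in (10, 1_000, 10_240):
        wait = effective_batch_wait(rate, batch_size=512, flush_interval_s=5.0)
        print(f"{rate} spans/s -> batch sent after {wait:.2f}s")
```

At low traffic the flush interval dominates, so raising `batch_size` only reduces request count once traffic is high enough to fill batches before the timer fires.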

Connection Pooling

otlp:
  max_connections: 10  # Connection pool size
  idle_timeout: 60s

Sampling

Head-Based Sampling

Head-based sampling makes the keep-or-drop decision at collection time, before the trace has completed:

otlp:
  traces:
    sample_rate: 0.1  # 10% of traces
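One common way to implement head-based sampling is to hash the trace ID and keep the trace when the hash falls below a threshold, so every span of a trace gets the same decision. The sketch below illustrates that general technique. It is not Telegen's implementation, and `keep_trace` is a hypothetical name.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep about sample_rate of all traces.

    The first 8 bytes of a SHA-256 digest are read as a 64-bit integer
    and compared against sample_rate scaled to the 64-bit range.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    threshold = int(sample_rate * 2 ** 64)
    return bucket < threshold


if __name__ == "__main__":
    kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
    print(f"kept {kept} of 10000 traces")
```

Because the decision depends only on the trace ID, independent agents that see spans from the same trace agree without coordination.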

Tail-Based Sampling

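Tail-based sampling decides after a trace has completed, so it can keep every error and slow trace and sample only the rest. A trace is kept if any policy matches. Configure it in your OTel Collector: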

# OTel Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

Per-Feature Tuning

Profiling

agent:
  profiling:
    # Lower sample rate for less overhead
    sample_rate: 49  # Hz
    
    # Longer upload interval
    upload_interval: 120s
    
    # Disable unused profile types
    mutex: false
    block: false
    goroutine: false
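To see how `sample_rate` and `upload_interval` affect profile volume, a rough upper bound is frequency times busy cores times interval. The sketch assumes the profiler takes one stack sample per busy core per tick, which is typical for sampling CPU profilers but is not confirmed for Telegen here. The function name is hypothetical.

```python
def samples_per_upload(sample_hz: int, busy_cores: int, upload_interval_s: int) -> int:
    """Upper bound on stack samples collected between two uploads."""
    return sample_hz * busy_cores * upload_interval_s


if __name__ == "__main__":
    for hz in (99, 49):
        count = samples_per_upload(hz, busy_cores=8, upload_interval_s=120)
        print(f"{hz} Hz -> up to {count} samples per upload")
```

Halving the frequency roughly halves both collection overhead and upload size, at the cost of coarser profiles.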

Security Monitoring

agent:
  security:
    # Focus on critical syscalls only
    syscall_audit:
      syscalls:
        - execve
        - setuid
        - ptrace
      # NOT all syscalls
    
    # Limit file paths
    file_integrity:
      paths:
        - /etc/passwd
        - /etc/shadow
      # NOT: /var/**

Network Monitoring

agent:
  network:
    # Use sampling for high-volume
    tcp:
      sample_rate: 10  # 1 in 10 connections
    
    # XDP sampling
    xdp:
      sample_rate: 1000  # 0.1% of packets
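These `sample_rate` values are "1 in N" ratios, not fractions. The sketch below converts N to a fraction and scales sampled counts back up to an estimated total. Whether Telegen applies this scaling to exported metrics is not stated here, so treat the helper names and the scaling step as assumptions.

```python
def sampled_fraction(one_in_n: int) -> float:
    """Fraction of items observed when sampling 1 in N."""
    return 1 / one_in_n


def estimate_total(sampled_count: int, one_in_n: int) -> int:
    """Estimate the true count from a 1-in-N sampled count."""
    return sampled_count * one_in_n


if __name__ == "__main__":
    print(f"tcp: {sampled_fraction(10):.1%} of connections")
    print(f"xdp: {sampled_fraction(1000):.1%} of packets")
    print(f"42 sampled packets ~ {estimate_total(42, 1000)} actual packets")
```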

High-Volume Environments

Recommended Configuration

For environments with >10K requests/second:

telegen:
  log_level: warn  # Reduce logging

agent:
  ebpf:
    ringbuf_size: 134217728  # 128MB
    perf_buffer_size: 32768  # 32KB per CPU
    
    network:
      exclude_paths:
        - "/health*"
        - "/ready*"
        - "/metrics"
      exclude_ports:
        - 22
        - 2379
        - 2380
        - 10250
  
  resources:
    cpu_limit: 2.0
    memory_limit: "2Gi"
    rate_limit:
      spans_per_second: 100000
      metrics_per_second: 200000

otlp:
  compression: gzip
  
queues:
  traces:
    mem_limit: "512Mi"
    batch_size: 1024

Low-Resource Environments

Minimal Configuration

For resource-constrained environments:

telegen:
  log_level: error

agent:
  ebpf:
    ringbuf_size: 4194304  # 4MB
    
    network:
      enabled: true
      http: true
      grpc: false
      dns: false
    
    syscalls:
      enabled: false
  
  profiling:
    enabled: false
  
  security:
    enabled: false

queues:
  traces:
    mem_limit: "64Mi"
    batch_size: 128

Kubernetes Resources

resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"

Monitoring Performance

Key Metrics to Watch

# CPU usage
rate(telegen_process_cpu_seconds_total[5m])

# Memory usage
telegen_process_resident_memory_bytes

# Event loss rate
rate(telegen_ebpf_ringbuf_lost_total[5m]) / rate(telegen_ebpf_ringbuf_events_total[5m])

# Export latency
histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m]))

# Queue depth
telegen_export_queue_size
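The event loss query divides one counter's rate by another's. The sketch below shows the same calculation from two raw counter readings taken at the start and end of a window. It is a simplification: unlike PromQL's `rate()`, it ignores counter resets and extrapolation, and it returns 0.0 rather than NaN when no events were seen. The function names are hypothetical.

```python
def counter_rate(start: float, end: float, window_s: float) -> float:
    """Average per-second increase of a counter over the window."""
    return (end - start) / window_s


def loss_ratio(lost_start: float, lost_end: float,
               events_start: float, events_end: float,
               window_s: float = 300.0) -> float:
    """Lost-event rate divided by total-event rate over the same window."""
    lost = counter_rate(lost_start, lost_end, window_s)
    total = counter_rate(events_start, events_end, window_s)
    return lost / total if total > 0 else 0.0


if __name__ == "__main__":
    ratio = loss_ratio(lost_start=0, lost_end=300, events_start=0, events_end=30_000)
    print(f"loss ratio: {ratio:.2%}")
```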

Performance Alerts

groups:
  - name: telegen-performance
    rules:
      - alert: TelegenHighCPU
        expr: rate(telegen_process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Telegen using more than 0.8 CPU cores"
      
      - alert: TelegenHighMemory
        expr: telegen_process_resident_memory_bytes > 1.5e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen memory above 1.5GB"
      
      - alert: TelegenExportSlow
        expr: histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen export P99 latency above 5s"

Benchmarking

Test Configuration

Before deploying configuration changes, benchmark them under a reproducible load:

# Generate test load
hey -n 10000 -c 100 http://your-app:8080/api/test

# Monitor Telegen metrics
watch -n 1 'curl -s http://localhost:19090/metrics | grep -E "cpu|memory|lost"'
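To record the same metrics in a script rather than watching them, the Prometheus text output can be filtered programmatically. The sketch below assumes one sample per line with no trailing timestamp and reuses the endpoint from the command above. The function names are hypothetical.

```python
import re
import urllib.request


def filter_metrics(exposition: str, pattern: str) -> dict[str, float]:
    """Return {series: value} for samples whose series text matches pattern."""
    matcher = re.compile(pattern)
    result = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series and matcher.search(series):
            result[series] = float(value)
    return result


def fetch_metrics(url: str = "http://localhost:19090/metrics") -> str:
    """Download the metrics page as text."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling `filter_metrics(fetch_metrics(), r"cpu|memory|lost")` before and after a load test gives two snapshots to compare.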

Compare Before/After

1. Baseline the current configuration
2. Apply changes
3. Run the same load test
4. Compare metrics
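The comparison step can be scripted as a percentage change per metric. In the sketch below the metric names and numbers are placeholders for illustration only, not measured results, and the helper names are hypothetical.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def compare_runs(baseline: dict[str, float],
                 candidate: dict[str, float]) -> dict[str, float]:
    """Percent change for every metric present in both runs.

    Metrics with a zero baseline are skipped to avoid division by zero.
    """
    return {
        name: percent_change(baseline[name], candidate[name])
        for name in baseline
        if name in candidate and baseline[name] != 0
    }


if __name__ == "__main__":
    # Placeholder values, not real measurements.
    baseline = {"cpu_cores": 0.40, "rss_bytes": 600e6, "lost_per_s": 120.0}
    candidate = {"cpu_cores": 0.30, "rss_bytes": 660e6, "lost_per_s": 0.0}
    for name, delta in compare_runs(baseline, candidate).items():
        print(f"{name}: {delta:+.1f}%")
```

Run the load test more than once per configuration if you can, since a single run may vary enough to hide small differences.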

Best Practices Summary

1. Start conservative - Begin with defaults, then tune based on measured need
2. Monitor loss rates - If events are being lost, increase buffer sizes
3. Use sampling - In high-volume environments, sample deliberately rather than let events drop
4. Filter noise - Exclude health checks and internal traffic
5. Batch efficiently - Larger batches reduce export overhead
6. Set limits - Protect against runaway memory usage
7. Test changes - Benchmark before and after tuning

Next Steps

Monitoring Telegen - Set up performance monitoring
Troubleshooting - Diagnose performance issues
Full Configuration Reference - All configuration options