Performance Tuning
Optimize Telegen for your environment and workload.
Resource Guidelines
Default Resource Requirements
| Component | CPU | Memory |
|---|---|---|
| Agent (minimal) | 0.1 cores | 128MB |
| Agent (full features) | 0.5 cores | 512MB |
| Agent (high volume) | 1.0 cores | 1GB |
| Collector (SNMP) | 0.2 cores | 256MB |
| Collector (storage) | 0.3 cores | 384MB |
Kubernetes Resources
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"
Ring Buffer Tuning
The ring buffer is the primary channel for eBPF events.
Sizing
| Buffer Size | Use Case | Event Capacity |
|---|---|---|
| 4MB | Low traffic, testing | ~40K events |
| 16MB | Default, balanced | ~160K events |
| 64MB | High traffic | ~640K events |
| 256MB | Very high volume | ~2.5M events |
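The capacities in the table work out to roughly 100 bytes per event (4MB for about 40K events). As a rough sketch under that assumption, the Python helper below estimates a `ringbuf_size` for an expected burst of events and rounds up to a power of two, which Linux BPF ring buffer maps require. The 100-byte figure and the `suggest_ringbuf_size` helper are illustrative assumptions, not part of Telegen.

```python
# Assumption: ~100 bytes per event, as implied by the sizing table (4MB ~ 40K events).
BYTES_PER_EVENT = 100
MIN_SIZE = 4 * 1024 * 1024  # smallest size listed in the table


def suggest_ringbuf_size(burst_events: int,
                         bytes_per_event: int = BYTES_PER_EVENT,
                         min_size: int = MIN_SIZE) -> int:
    """Return the smallest power-of-two byte size that holds burst_events."""
    needed = max(burst_events * bytes_per_event, min_size)
    size = 1
    while size < needed:
        size *= 2
    return size


if __name__ == "__main__":
    for events in (40_000, 160_000, 500_000):
        size = suggest_ringbuf_size(events)
        print(f"{events} events -> ringbuf_size: {size}  # {size // (1024 * 1024)}MB")
```

For bursts of 40K, 160K, and 500K events this suggests 4MB, 16MB, and 64MB respectively. Treat the result as a starting point and confirm it against the observed loss rate.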
Configuration
agent:
  ebpf:
    ringbuf_size: 16777216  # 16MB (default)
Signs You Need a Larger Buffer
# High loss rate
rate(telegen_ebpf_ringbuf_lost_total[5m]) > 100
If events are being lost, increase buffer size:
agent:
  ebpf:
    ringbuf_size: 67108864  # 64MB
CPU Optimization
Reduce Collection Overhead
Limit traced ports
agent:
  ebpf:
    network:
      include_ports:
        - 80
        - 443
        - 8080
      exclude_ports:
        - 22
        - 2379
Reduce syscall tracing
agent:
  ebpf:
    syscalls:
      enabled: false  # Disable if not needed
Limit profiling frequency
agent:
  profiling:
    sample_rate: 49  # Lower than default 99 Hz
Parallel Processing
agent:
  processing:
    workers: 4  # Match available CPU cores
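To pick a value for `workers`, it can help to check how many CPUs the agent process is actually allowed to run on. The Python sketch below is one way to do that on the host; `suggested_workers` and its `max_workers` cap are hypothetical names for illustration. It does not read cgroup CPU quotas, so under a Kubernetes CPU limit the right value may be lower than what it reports.

```python
import os


def suggested_workers(max_workers: int = 8) -> int:
    """Suggest a worker count from the CPUs this process may use."""
    try:
        # Linux: honours CPU affinity (taskset, cpusets).
        usable = len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total CPU count.
        usable = os.cpu_count() or 1
    return max(1, min(usable, max_workers))


if __name__ == "__main__":
    print(f"workers: {suggested_workers()}")
```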
Memory Optimization
Queue Limits
queues:
  traces:
    mem_limit: "128Mi"
    max_age: "1h"
    batch_size: 256
  metrics:
    mem_limit: "64Mi"
    max_age: "5m"
    batch_size: 500
  logs:
    mem_limit: "128Mi"
    max_age: "6h"
    batch_size: 500
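Queue limits add up, so it is worth checking their sum against the container memory limit. The Python sketch below totals the `mem_limit` values from the example above and compares them with a 1Gi limit. The `parse_quantity` helper is a simplified, hypothetical parser that assumes binary suffixes (`Ki`, `Mi`, `Gi`) only.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(value: str) -> int:
    """Convert a quantity such as "128Mi" to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)  # plain byte count


def queue_budget(queues: dict) -> int:
    """Sum the mem_limit of every queue, in bytes."""
    return sum(parse_quantity(q["mem_limit"]) for q in queues.values())


# Values copied from the queue configuration above.
queues = {
    "traces": {"mem_limit": "128Mi"},
    "metrics": {"mem_limit": "64Mi"},
    "logs": {"mem_limit": "128Mi"},
}

if __name__ == "__main__":
    total = queue_budget(queues)
    limit = parse_quantity("1Gi")
    print(f"queues: {total // 1024 ** 2}Mi of {limit // 1024 ** 2}Mi "
          f"({total / limit:.0%} of the container limit)")
```

With these values the queues can hold up to 320Mi, which leaves the rest of the limit for the ring buffer, connection tracking, and the process itself.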
Reduce Cardinality
High-cardinality labels increase memory usage, so allow only the labels you need:
agent:
  kubernetes:
    # Only essential labels
    label_allowlist:
      - "app"
      - "version"
      # NOT: "*"
Limit Active Connections Tracked
agent:
  ebpf:
    network:
      # Limit tracked connections
      max_connections: 50000  # Default: 100000
Network/Export Optimization
Compression
otlp:
  compression: gzip  # Reduce bandwidth
Batching
queues:
  traces:
    batch_size: 512  # Larger batches = fewer requests
    flush_interval: 5s  # Don't wait too long
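The trade-off between `batch_size` and `flush_interval` can be put into numbers. The Python sketch below assumes a batch is sent when it is full or when the flush interval expires, whichever comes first; that behaviour and both helper names are assumptions for illustration, not a description of Telegen's exporter.

```python
import math


def export_requests_per_second(spans_per_second: float, batch_size: int) -> int:
    """Export requests needed each second if every batch is sent full."""
    return math.ceil(spans_per_second / batch_size)


def seconds_until_flush(spans_per_second: float, batch_size: int,
                        flush_interval: float) -> float:
    """Longest time a span may wait: batch fills or interval expires."""
    fill_time = batch_size / spans_per_second
    return min(fill_time, flush_interval)


if __name__ == "__main__":
    for batch_size in (256, 512):
        reqs = export_requests_per_second(10_000, batch_size)
        print(f"batch_size={batch_size}: {reqs} requests/s at 10K spans/s")
    # On a quiet service the interval, not the batch size, bounds the delay.
    print(seconds_until_flush(10, 512, flush_interval=5.0))
```

At 10K spans per second, doubling the batch size from 256 to 512 halves the request rate from 40 to 20 per second. At 10 spans per second a 512-span batch would take about 51 seconds to fill, so the 5s flush interval is what keeps data from going stale.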
Connection Pooling
otlp:
  max_connections: 10  # Connection pool size
  idle_timeout: 60s
Sampling
Head-Based Sampling
Sample at collection time:
otlp:
  traces:
    sample_rate: 0.1  # 10% of traces
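To illustrate what a head-based decision looks like, the Python sketch below keeps a trace when a hash of its ID falls below the configured rate. Hashing the ID makes the decision deterministic, so every span of a trace gets the same answer. This is a generic technique shown under assumptions; `keep_trace` is a hypothetical function and Telegen's own sampler may work differently.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide at collection time whether to keep a trace.

    The first 8 bytes of the SHA-256 digest are read as an integer in
    [0, 2**64) and compared with sample_rate scaled to the same range.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2 ** 64)


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(keep_trace(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision is made before the trace finishes, it cannot favour errors or slow requests.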
Tail-Based Sampling
Tail-based sampling decides after a trace has completed, so it can always keep errors and slow requests. Configure it in your OTel Collector:
# OTel Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
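The effect of the three policies can be modelled in a few lines. The Python sketch below is a simplified model, not the Collector's implementation: it assumes a trace is kept as soon as any policy matches, and it uses a plain random draw for the probabilistic policy. The `Trace` class and `should_keep` function are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    has_error: bool
    duration_ms: float


def should_keep(trace: Trace, threshold_ms: float = 1000,
                sampling_percentage: float = 10,
                rng=random.random) -> bool:
    """Keep a finished trace if any of the three policies matches."""
    if trace.has_error:                   # policy "errors"
        return True
    if trace.duration_ms > threshold_ms:  # policy "slow"
        return True
    # policy "sample": keep roughly sampling_percentage of the rest
    return rng() * 100 < sampling_percentage


if __name__ == "__main__":
    print(should_keep(Trace("a", has_error=True, duration_ms=12)))     # error: kept
    print(should_keep(Trace("b", has_error=False, duration_ms=2500)))  # slow: kept
    print(should_keep(Trace("c", has_error=False, duration_ms=20)))    # kept ~10% of the time
```

All errors and slow traces survive, while ordinary traffic is thinned to about 10%.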
Per-Feature Tuning
Profiling
agent:
  profiling:
    # Lower sample rate for less overhead
    sample_rate: 49  # Hz
    # Longer upload interval
    upload_interval: 120s
    # Disable unused profile types
    mutex: false
    block: false
    goroutine: false
Security Monitoring
agent:
  security:
    # Focus on critical syscalls only
    syscall_audit:
      syscalls:
        - execve
        - setuid
        - ptrace
        # NOT all syscalls
    # Limit file paths
    file_integrity:
      paths:
        - /etc/passwd
        - /etc/shadow
        # NOT: /var/**
Network Monitoring
agent:
  network:
    # Use sampling for high-volume
    tcp:
      sample_rate: 10  # 1 in 10 connections
    # XDP sampling
    xdp:
      sample_rate: 1000  # 0.1% of packets
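Here `sample_rate` is a 1-in-N value, unlike the fractional trace `sample_rate` used for OTLP. The short Python sketch below shows the conversion and how a sampled count can be scaled back to an estimate of the true total. Both helper names are hypothetical, and whether Telegen rescales sampled counts itself is not stated here.

```python
def sampled_fraction(one_in_n: int) -> float:
    """Fraction of items observed when sampling 1 in N."""
    return 1 / one_in_n


def estimate_total(observed: int, one_in_n: int) -> int:
    """Scale a sampled count back up to an estimate of the real total."""
    return observed * one_in_n


if __name__ == "__main__":
    print(f"tcp: {sampled_fraction(10):.1%} of connections")
    print(f"xdp: {sampled_fraction(1000):.1%} of packets")
    print(f"42 sampled packets ~ {estimate_total(42, 1000)} packets on the wire")
```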
High-Volume Environments
Recommended Configuration
For environments with >10K requests/second:
telegen:
  log_level: warn  # Reduce logging
agent:
  ebpf:
    ringbuf_size: 134217728  # 128MB
    perf_buffer_size: 32768  # 32KB per CPU
    network:
      exclude_paths:
        - "/health*"
        - "/ready*"
        - "/metrics"
      exclude_ports:
        - 22
        - 2379
        - 2380
        - 10250
  resources:
    cpu_limit: 2.0
    memory_limit: "2Gi"
  rate_limit:
    spans_per_second: 100000
    metrics_per_second: 200000
otlp:
  compression: gzip
queues:
  traces:
    mem_limit: "512Mi"
    batch_size: 1024
Low-Resource Environments
Minimal Configuration
For resource-constrained environments:
telegen:
  log_level: error
agent:
  ebpf:
    ringbuf_size: 4194304  # 4MB
    network:
      enabled: true
      http: true
      grpc: false
      dns: false
    syscalls:
      enabled: false
  profiling:
    enabled: false
  security:
    enabled: false
queues:
  traces:
    mem_limit: "64Mi"
    batch_size: 128
Kubernetes Resources
resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
Monitoring Performance
Key Metrics to Watch
# CPU usage
rate(telegen_process_cpu_seconds_total[5m])
# Memory usage
telegen_process_resident_memory_bytes
# Event loss rate
rate(telegen_ebpf_ringbuf_lost_total[5m]) / rate(telegen_ebpf_ringbuf_events_total[5m])
# Export latency
histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m]))
# Queue depth
telegen_export_queue_size
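The event loss rate query is a ratio of two counter rates. Outside Prometheus, the same figure can be approximated from two scrapes of the counters, as in the Python sketch below. It is a simplified illustration: `loss_ratio` is a hypothetical helper and it ignores counter resets, which `rate()` handles for you.

```python
def loss_ratio(lost_start: float, lost_end: float,
               events_start: float, events_end: float) -> float:
    """Lost events divided by total events between two scrapes."""
    delta_events = events_end - events_start
    if delta_events <= 0:
        return 0.0  # no traffic (or a counter reset) in the window
    return (lost_end - lost_start) / delta_events


if __name__ == "__main__":
    # Example readings of telegen_ebpf_ringbuf_lost_total and
    # telegen_ebpf_ringbuf_events_total taken five minutes apart.
    ratio = loss_ratio(100, 150, 10_000, 60_000)
    print(f"loss ratio: {ratio:.2%}")
```

In this example 50 of 50,000 events were lost, a ratio of 0.10%.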
Performance Alerts
groups:
  - name: telegen-performance
    rules:
      - alert: TelegenHighCPU
        expr: rate(telegen_process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Telegen using high CPU"
      - alert: TelegenHighMemory
        expr: telegen_process_resident_memory_bytes > 1.5e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen memory above 1.5GB"
      - alert: TelegenExportSlow
        expr: histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen export P99 latency high"
Benchmarking
Test Configuration
Before deploying changes, benchmark:
# Generate test load
hey -n 10000 -c 100 http://your-app:8080/api/test
# Monitor Telegen metrics
watch -n 1 'curl -s http://localhost:19090/metrics | grep -E "cpu|memory|lost"'
Compare Before/After
1. Record a baseline with the current configuration
2. Apply the changes
3. Run the same load test
4. Compare the metrics
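One way to make the comparison concrete is to save the `/metrics` output before and after a change and diff the two snapshots. The Python sketch below does this for unlabelled or labelled samples in the Prometheus text format. It is a minimal illustration: `parse_metrics` and `compare` are hypothetical helpers, lines with trailing timestamps are not handled, and counters are only comparable if both runs start from a fresh agent.

```python
def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines from Prometheus text output into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, HELP and TYPE comments
        name, _, value = line.rpartition(" ")
        try:
            values[name] = float(value)
        except ValueError:
            continue  # not a sample line
    return values


def compare(before: dict, after: dict) -> dict:
    """Return {name: (before, after, percent_change)} for shared metrics."""
    rows = {}
    for name in sorted(before.keys() & after.keys()):
        b, a = before[name], after[name]
        change = (a - b) / b * 100 if b else None  # undefined for a zero baseline
        rows[name] = (b, a, change)
    return rows


if __name__ == "__main__":
    before = parse_metrics(
        "telegen_process_resident_memory_bytes 400000000\n"
        "telegen_ebpf_ringbuf_lost_total 120\n"
    )
    after = parse_metrics(
        "telegen_process_resident_memory_bytes 300000000\n"
        "telegen_ebpf_ringbuf_lost_total 0\n"
    )
    for name, (b, a, change) in compare(before, after).items():
        print(f"{name}: {b:g} -> {a:g} ({change:+.1f}%)")
```

The sample values here are made up for the example; in practice the two inputs would be files captured with `curl` during the same load test.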
Best Practices Summary
Start conservative - Begin with defaults, tune based on actual needs
Monitor loss rates - If losing events, increase buffers
Use sampling - For high-volume workloads, sample rather than drop
Filter noise - Exclude health checks, internal traffic
Batch efficiently - Larger batches reduce export overhead
Set limits - Protect against runaway memory usage
Test changes - Benchmark before and after tuning
Next Steps
Monitoring Telegen - Set up performance monitoring
Troubleshooting - Diagnose performance issues
Full Configuration Reference - All configuration options