# Performance Tuning

Optimize Telegen for your environment and workload.

## Resource Guidelines

### Default Resource Requirements

| Component | CPU | Memory |
|-----------|-----|--------|
| **Agent (minimal)** | 0.1 cores | 128MB |
| **Agent (full features)** | 0.5 cores | 512MB |
| **Agent (high volume)** | 1.0 cores | 1GB |
| **Collector (SNMP)** | 0.2 cores | 256MB |
| **Collector (storage)** | 0.3 cores | 384MB |

### Kubernetes Resources

```yaml
resources:
  requests:
    cpu: "100m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "1Gi"
```

---

## Ring Buffer Tuning

The ring buffer is the primary channel for eBPF events.

### Sizing

| Buffer Size | Use Case | Event Capacity |
|-------------|----------|----------------|
| 4MB | Low traffic, testing | ~40K events |
| 16MB | Default, balanced | ~160K events |
| 64MB | High traffic | ~640K events |
| 256MB | Very high volume | ~2.5M events |

### Configuration

```yaml
agent:
  ebpf:
    ringbuf_size: 16777216  # 16MB (default)
```

### Signs You Need a Larger Buffer

```promql
# High loss rate
rate(telegen_ebpf_ringbuf_lost_total[5m]) > 100
```

If events are being lost, increase the buffer size:

```yaml
agent:
  ebpf:
    ringbuf_size: 67108864  # 64MB
```

---

## CPU Optimization

### Reduce Collection Overhead

1. **Limit traced ports**

   ```yaml
   agent:
     ebpf:
       network:
         include_ports:
           - 80
           - 443
           - 8080
         exclude_ports:
           - 22
           - 2379
   ```

2. **Reduce syscall tracing**

   ```yaml
   agent:
     ebpf:
       syscalls:
         enabled: false  # Disable if not needed
   ```

3. **Limit profiling frequency**

   ```yaml
   agent:
     profiling:
       sample_rate: 49  # Lower than default 99 Hz
   ```

### Parallel Processing

```yaml
agent:
  processing:
    workers: 4  # Match available CPU cores
```

---

## Memory Optimization

### Queue Limits

```yaml
queues:
  traces:
    mem_limit: "128Mi"
    max_age: "1h"
    batch_size: 256
  metrics:
    mem_limit: "64Mi"
    max_age: "5m"
    batch_size: 500
  logs:
    mem_limit: "128Mi"
    max_age: "6h"
    batch_size: 500
```

### Reduce Cardinality

High-cardinality labels increase memory usage:

```yaml
agent:
  kubernetes:
    # Only essential labels
    label_allowlist:
      - "app"
      - "version"
      # NOT: "*"
```

### Limit Active Connections Tracked

```yaml
agent:
  ebpf:
    network:
      # Limit tracked connections
      max_connections: 50000  # Default: 100000
```

---

## Network/Export Optimization

### Compression

```yaml
otlp:
  compression: gzip  # Reduce bandwidth
```

### Batching

```yaml
queues:
  traces:
    batch_size: 512      # Larger batches = fewer requests
    flush_interval: 5s   # Don't wait too long
```

### Connection Pooling

```yaml
otlp:
  max_connections: 10  # Connection pool size
  idle_timeout: 60s
```

---

## Sampling

### Head-Based Sampling

Sample at collection time:

```yaml
otlp:
  traces:
    sample_rate: 0.1  # 10% of traces
```

### Tail-Based Sampling

For more intelligent sampling, configure your OTel Collector:

```yaml
# OTel Collector config
processors:
  tail_sampling:
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }
```

---

## Per-Feature Tuning

### Profiling

```yaml
agent:
  profiling:
    # Lower sample rate for less overhead
    sample_rate: 49  # Hz

    # Longer upload interval
    upload_interval: 120s

    # Disable unused profile types
    mutex: false
    block: false
    goroutine: false
```

### Security Monitoring

```yaml
agent:
  security:
    # Focus on critical syscalls only
    syscall_audit:
      syscalls:
        - execve
        - setuid
        - ptrace
        # NOT all syscalls

    # Limit file paths
    file_integrity:
      paths:
        - /etc/passwd
        - /etc/shadow
        # NOT: /var/**
```

### Network Monitoring

```yaml
agent:
  network:
    # Use sampling for high-volume traffic
    tcp:
      sample_rate: 10  # 1 in 10 connections

    # XDP sampling
    xdp:
      sample_rate: 1000  # 0.1% of packets
```

---

## High-Volume Environments

### Recommended Configuration

For environments with >10K requests/second:

```yaml
telegen:
  log_level: warn  # Reduce logging

agent:
  ebpf:
    ringbuf_size: 134217728  # 128MB
    perf_buffer_size: 32768  # 32KB per CPU

    network:
      exclude_paths:
        - "/health*"
        - "/ready*"
        - "/metrics"
      exclude_ports:
        - 22
        - 2379
        - 2380
        - 10250

  resources:
    cpu_limit: 2.0
    memory_limit: "2Gi"

  rate_limit:
    spans_per_second: 100000
    metrics_per_second: 200000

otlp:
  compression: gzip

queues:
  traces:
    mem_limit: "512Mi"
    batch_size: 1024
```

---

## Low-Resource Environments

### Minimal Configuration

For resource-constrained environments:

```yaml
telegen:
  log_level: error

agent:
  ebpf:
    ringbuf_size: 4194304  # 4MB

    network:
      enabled: true
      http: true
      grpc: false
      dns: false

    syscalls:
      enabled: false

  profiling:
    enabled: false

  security:
    enabled: false

queues:
  traces:
    mem_limit: "64Mi"
    batch_size: 128
```

### Kubernetes Resources

```yaml
resources:
  requests:
    cpu: "50m"
    memory: "128Mi"
  limits:
    cpu: "200m"
    memory: "256Mi"
```

---

## Monitoring Performance

### Key Metrics to Watch

```promql
# CPU usage
rate(telegen_process_cpu_seconds_total[5m])

# Memory usage
telegen_process_resident_memory_bytes

# Event loss rate
rate(telegen_ebpf_ringbuf_lost_total[5m]) / rate(telegen_ebpf_ringbuf_events_total[5m])

# Export latency
histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m]))

# Queue depth
telegen_export_queue_size
```

### Performance Alerts

```yaml
groups:
  - name: telegen-performance
    rules:
      - alert: TelegenHighCPU
        expr: rate(telegen_process_cpu_seconds_total[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Telegen using high CPU"

      - alert: TelegenHighMemory
        expr: telegen_process_resident_memory_bytes > 1.5e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen memory above 1.5GB"

      - alert: TelegenExportSlow
        expr: histogram_quantile(0.99, rate(telegen_export_latency_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Telegen export P99 latency high"
```

---

## Benchmarking

### Test Configuration

Before deploying changes, benchmark:

```bash
# Generate test load
hey -n 10000 -c 100 http://your-app:8080/api/test

# Monitor Telegen metrics
watch -n 1 'curl -s http://localhost:19090/metrics | grep -E "cpu|memory|lost"'
```

### Compare Before/After

1. Baseline current configuration
2. Apply changes
3. Run same load test
4. Compare metrics

---

## Best Practices Summary

1. **Start conservative** - Begin with defaults, tune based on actual needs
2. **Monitor loss rates** - If losing events, increase buffers
3. **Use sampling** - For high-volume workloads, sample rather than drop
4. **Filter noise** - Exclude health checks, internal traffic
5. **Batch efficiently** - Larger batches reduce export overhead
6. **Set limits** - Protect against runaway memory usage
7. **Test changes** - Benchmark before and after tuning

---

## Next Steps

- {doc}`monitoring` - Set up performance monitoring
- {doc}`troubleshooting` - Diagnose performance issues
- {doc}`../configuration/full-reference` - All configuration options
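
---

## Appendix: Ring Buffer Sizing Sketch

The ring buffer sizing table in this guide implies roughly 100 bytes per event (4MB for ~40K events). Under that assumption, the sketch below estimates a `ringbuf_size` value from an expected event rate and the burst duration the buffer should absorb. It is an illustration only and not part of Telegen: the function name `suggest_ringbuf_size`, the 100-byte figure, and the 4MB/256MB bounds are assumptions taken from the table rather than documented behavior. Sizes are kept at powers of two, as the kernel's BPF ring buffer requires.

```python
# Rough ring buffer sizing helper (illustrative only; not part of Telegen).

# Assumption: ~100 bytes per event, inferred from the sizing table
# (4MB ~ 40K events). Measure your own workload if possible.
ASSUMED_EVENT_BYTES = 100


def suggest_ringbuf_size(
    events_per_second: float,
    burst_seconds: float,
    event_bytes: int = ASSUMED_EVENT_BYTES,
    minimum: int = 4 * 1024 * 1024,     # smallest size in the table (4MB)
    maximum: int = 256 * 1024 * 1024,   # largest size in the table (256MB)
) -> int:
    """Return the smallest power-of-two size (in bytes) that can hold
    `burst_seconds` worth of events, clamped to [minimum, maximum]."""
    needed = int(events_per_second * burst_seconds * event_bytes)
    size = minimum
    # Double until the burst fits or the upper bound is reached.
    while size < needed and size < maximum:
        size *= 2
    return size


if __name__ == "__main__":
    for rate in (10_000, 100_000, 500_000):
        size = suggest_ringbuf_size(rate, burst_seconds=1.0)
        mb = size // (1024 * 1024)
        print(f"{rate:>7} events/s -> ringbuf_size: {size}  # {mb}MB")
```

Treat the result as a starting point: confirm it against `telegen_ebpf_ringbuf_lost_total` under real load and adjust from there.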