Troubleshooting
Common issues and solutions for Telegen.
Quick Diagnostics
Check Telegen Status
# Health check
curl http://localhost:19090/healthz
# Readiness
curl http://localhost:19090/ready
# Full status
curl http://localhost:19090/status
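The three endpoints above can be polled in one pass. The following is a minimal sketch, assuming each endpoint answers with a 2xx status when healthy; the function name `check_endpoints` is illustrative and not part of Telegen.

```shell
# check_endpoints [BASE_URL]
# Requests /healthz, /ready and /status and reports which ones respond.
# Assumption: a healthy endpoint returns a 2xx status (curl -f fails otherwise).
check_endpoints() (
  base=${1:-http://localhost:19090}
  failed=0
  for path in healthz ready status; do
    if curl -fsS --max-time 5 -o /dev/null "$base/$path" 2>/dev/null; then
      echo "ok   /$path"
    else
      echo "FAIL /$path"
      failed=1
    fi
  done
  exit "$failed"
)

# Example: check_endpoints || echo "Telegen is not fully healthy"
```

The function returns a non-zero status when any endpoint fails, so it can gate later steps in a script.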
Check eBPF Programs
# List loaded programs
bpftool prog list | grep -i telegen
# Check if eBPF is working
cat /sys/kernel/debug/tracing/trace_pipe | head -20
Check Logs
# Kubernetes
kubectl logs -l app=telegen -n monitoring --tail=100
# Docker
docker logs telegen --tail=100
# Systemd
journalctl -u telegen -f
Common Issues
eBPF Program Load Failures
Symptom: Telegen starts but shows “eBPF program load failed”
Causes and Solutions:
Kernel too old
Minimum: Linux 4.18
Recommended: Linux 5.8+
uname -r # Check kernel version
Missing capabilities
# Docker
docker run --privileged ...
# Kubernetes
securityContext:
  privileged: true
BPF filesystem not mounted
mount | grep bpf
# Should show: bpffs on /sys/fs/bpf type bpf
# Mount if missing
mount -t bpf bpf /sys/fs/bpf
BTF not available
ls /sys/kernel/btf/vmlinux # Should exist for CO-RE support
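The kernel, bpffs, and BTF checks above can be combined into a small preflight script. This is a sketch only: the helper names are hypothetical, and it assumes the `uname -r` output starts with `major.minor` (for example `5.15.0-91-generic`). It does not check container capabilities.

```shell
# kernel_at_least RELEASE MAJOR MINOR
# Succeeds when RELEASE (as printed by `uname -r`) is MAJOR.MINOR or newer.
kernel_at_least() (
  release=$1 want_major=$2 want_minor=$3
  major=${release%%.*}       # text before the first dot
  rest=${release#*.}         # text after the first dot
  minor=${rest%%[!0-9]*}     # leading digits of the remainder
  [ "$major" -gt "$want_major" ] && exit 0
  [ "$major" -eq "$want_major" ] && [ "${minor:-0}" -ge "$want_minor" ]
)

# check_ebpf_prereqs [MOUNTS_FILE] [BTF_FILE]
# Reports whether bpffs is mounted and kernel BTF is present.
check_ebpf_prereqs() (
  mounts=${1:-/proc/mounts}
  btf=${2:-/sys/kernel/btf/vmlinux}
  status=0
  # /proc/mounts columns: device, mount point, filesystem type, options, ...
  if awk '$2 == "/sys/fs/bpf" && $3 == "bpf" { found = 1 } END { exit !found }' \
      "$mounts" 2>/dev/null; then
    echo "bpffs: mounted"
  else
    echo "bpffs: NOT mounted (mount -t bpf bpf /sys/fs/bpf)"
    status=1
  fi
  if [ -e "$btf" ]; then
    echo "btf: available"
  else
    echo "btf: NOT available (CO-RE needs $btf)"
    status=1
  fi
  exit "$status"
)

release=$(uname -r)
if ! kernel_at_least "$release" 4 18; then
  echo "kernel: $release is below the 4.18 minimum"
elif ! kernel_at_least "$release" 5 8; then
  echo "kernel: $release is supported, 5.8+ is recommended"
else
  echo "kernel: $release meets the 5.8+ recommendation"
fi
check_ebpf_prereqs || echo "fix the items marked NOT before starting Telegen"
```

The file paths are parameters so the checks can also be pointed at a host filesystem mounted inside a container.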
No Traces Being Collected
Symptom: Telegen running but no traces in backend
Diagnostics:
# Check if spans are being collected
curl -s http://localhost:19090/metrics | grep telegen_spans
# Check export status
curl -s http://localhost:19090/metrics | grep telegen_export
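The `grep` output above can be reduced to a single number per metric family, which makes it easier to see whether a counter is moving. A minimal sketch follows, assuming the endpoint serves Prometheus text format without per-sample timestamps; `sum_metric` is a hypothetical helper.

```shell
# sum_metric PREFIX
# Reads Prometheus text exposition on stdin and prints the sum of all samples
# whose metric name starts with PREFIX. Lines starting with "#" are skipped.
# Assumption: the sample value is the last field (no trailing timestamp).
sum_metric() {
  awk -v prefix="$1" '
    /^#/ { next }
    index($1, prefix) == 1 { total += $NF }
    END { printf "%d\n", total }
  '
}

# Example (needs a running Telegen):
#   curl -s http://localhost:19090/metrics | sum_metric telegen_spans
#   curl -s http://localhost:19090/metrics | sum_metric telegen_export_errors_total
```

Running the span query twice a few seconds apart shows whether spans are being collected at all. If that total grows while `telegen_export_errors_total` also grows, the export path is the more likely culprit.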
Solutions:
OTLP endpoint unreachable
# Test connectivity
nc -zv otel-collector 4317
# Check DNS
nslookup otel-collector
Network tracing disabled
agent:
  ebpf:
    network:
      enabled: true # Ensure enabled
Wrong port configuration
agent:
  ebpf:
    network:
      include_ports:
        - 80
        - 443
        - 8080 # Add your app ports
TLS issues
otlp:
  endpoint: "otel-collector:4317"
  insecure: true # Try without TLS first
High Memory Usage
Symptom: Telegen using excessive memory
Diagnostics:
# Check memory metrics
curl -s http://localhost:19090/metrics | grep telegen_process_resident_memory
# Check queue sizes
curl -s http://localhost:19090/metrics | grep telegen_export_queue
Solutions:
Reduce ring buffer size
agent:
  ebpf:
    ringbuf_size: 8388608 # 8MB instead of 16MB
Limit queue memory
queues:
  traces:
    mem_limit: "128Mi"
  metrics:
    mem_limit: "64Mi"
Increase export frequency
Check if backend is slow
Reduce batch sizes
queues:
  traces:
    batch_size: 256 # Smaller batches
Event Loss (Ring Buffer)
Symptom: telegen_ebpf_ringbuf_lost_total increasing
Diagnostics:
# Check loss rate
curl -s http://localhost:19090/metrics | grep ringbuf_lost
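A single reading of the counter does not show how fast events are being lost; two readings a known interval apart do. The sketch below uses hypothetical helper names and assumes the metrics endpoint serves Prometheus text format.

```shell
# counter_rate BEFORE AFTER SECONDS
# Average increase per second between two counter samples (integer division).
counter_rate() {
  echo $(( ($2 - $1) / $3 ))
}

# read_lost: current telegen_ebpf_ringbuf_lost_total, summed over all label sets.
read_lost() {
  curl -s http://localhost:19090/metrics |
    awk '/^telegen_ebpf_ringbuf_lost_total/ { t += $NF } END { printf "%d\n", t }'
}

# Example (needs a running Telegen):
#   before=$(read_lost); sleep 10; after=$(read_lost)
#   echo "lost events/s: $(counter_rate "$before" "$after" 10)"
```

A rate that stays at zero means the counter reflects past loss only; a steady non-zero rate suggests that the buffer is undersized or the consumer is too slow.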
Solutions:
Increase ring buffer size
agent:
  ebpf:
    ringbuf_size: 67108864 # 64MB
Reduce event volume
agent:
  ebpf:
    network:
      exclude_ports:
        - 22 # SSH
        - 2379 # etcd
    syscalls:
      exclude:
        - futex
        - nanosleep
Check CPU bottleneck
Telegen may not be processing fast enough
Increase CPU limits
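The `ringbuf_size` values in this guide are byte counts (8388608 is 8 MiB, 67108864 is 64 MiB). The helpers below are a small sketch for computing other sizes. The power-of-two check reflects the kernel's requirement for BPF ring buffer maps; whether Telegen passes `ringbuf_size` to the kernel unchanged is an assumption.

```shell
# mib_to_bytes N: prints N MiB as a byte count.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# is_power_of_two N: succeeds when N is a positive power of two.
is_power_of_two() {
  [ "$1" -gt 0 ] && [ $(( $1 & ($1 - 1) )) -eq 0 ]
}

mib_to_bytes 8    # 8388608
mib_to_bytes 16   # 16777216
mib_to_bytes 64   # 67108864

size=$(mib_to_bytes 32)
if is_power_of_two "$size"; then
  echo "$size is a valid power-of-two size"
fi
```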
Export Errors
Symptom: telegen_export_errors_total increasing
Diagnostics:
# Check specific errors
curl -s http://localhost:19090/metrics | grep export_errors
# Check logs
grep -i "export" /var/log/telegen.log | tail -20
Solutions:
Connection refused
# Verify endpoint
curl -v http://otel-collector:4317
# Check endpoint config
cat /etc/telegen/config.yaml | grep endpoint
TLS certificate errors
otlp:
  tls:
    ca_file: "/etc/ssl/certs/ca.crt"
    insecure_skip_verify: false # Ensure CA is correct
Authentication failures
otlp:
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"
Backend overloaded
Increase retry backoff
Check backend capacity
backoff:
  initial: "1s"
  max: "60s"
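To get a feel for what an initial delay of 1s and a maximum of 60s mean in practice, the sketch below prints a retry schedule. It assumes the delay doubles after each attempt, which is a common exponential backoff scheme. The source names only `initial` and `max`, so Telegen's actual multiplier and any jitter may differ.

```shell
# backoff_schedule INITIAL MAX ATTEMPTS
# Prints the delay in seconds before each retry, doubling from INITIAL and
# capping at MAX. The doubling factor is an assumption, not Telegen's documented behavior.
backoff_schedule() (
  delay=$1 max=$2 attempts=$3
  out=""
  i=0
  while [ "$i" -lt "$attempts" ]; do
    out="$out${out:+ }$delay"
    delay=$(( delay * 2 ))
    [ "$delay" -gt "$max" ] && delay=$max
    i=$(( i + 1 ))
  done
  echo "$out"
)

backoff_schedule 1 60 8   # 1 2 4 8 16 32 60 60
```

Under this assumption the cap is reached on the seventh retry, after roughly a minute of cumulative waiting.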
Missing Kubernetes Metadata
Symptom: Traces lack k8s.pod.name, k8s.namespace labels
Diagnostics:
# Check if running in K8s
kubectl get pods -l app=telegen -n monitoring
# Check RBAC
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:telegen
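The RBAC check above covers a single verb and resource. The sketch below repeats it for each combination of `get`, `list`, `watch` and `pods`, `nodes`, `services`, which are the permissions the ClusterRole in this section grants. The function name is hypothetical, and the check runs against the current namespace for namespaced resources.

```shell
# check_rbac SERVICE_ACCOUNT
# Runs `kubectl auth can-i` for each verb/resource pair and lists the ones
# that are denied. Returns non-zero if any permission is missing.
check_rbac() (
  sa=$1
  missing=0
  for resource in pods nodes services; do
    for verb in get list watch; do
      answer=$(kubectl auth can-i "$verb" "$resource" --as="$sa" 2>/dev/null)
      if [ "$answer" != "yes" ]; then
        echo "missing: $verb $resource"
        missing=1
      fi
    done
  done
  exit "$missing"
)

# Example:
#   check_rbac system:serviceaccount:monitoring:telegen && echo "RBAC looks complete"
```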
Solutions:
Missing RBAC permissions
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: telegen
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "services"]
    verbs: ["get", "list", "watch"]
Downward API not configured
env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: POD_NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
Discovery disabled
agent:
  discovery:
    detect_kubernetes: true
No GPU Metrics
Symptom: GPU metrics not appearing
Diagnostics:
# Check NVML
nvidia-smi
# Check if device is mounted
ls /dev/nvidia*
Solutions:
NVML not available
Ensure NVIDIA drivers installed
Mount NVIDIA device in container
volumes:
  - /dev/nvidia0:/dev/nvidia0
  - /dev/nvidiactl:/dev/nvidiactl
Container not GPU-enabled
spec:
  runtimeClassName: nvidia
  containers:
    - name: telegen
      resources:
        limits:
          nvidia.com/gpu: 0 # Access without allocating
GPU monitoring disabled
agent:
  gpu:
    enabled: true
    nvidia: true
Profiling Not Working
Symptom: No profiles in backend
Diagnostics:
# Check profiling enabled
curl -s http://localhost:19090/metrics | grep profile
# Check perf_event access
cat /proc/sys/kernel/perf_event_paranoid
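The value printed by the last command is easier to act on with its meaning attached. The sketch below maps the standard upstream levels to a short description. It treats 1 as the highest acceptable level, matching the value this guide recommends; whether Telegen also works at higher levels with the right capability depends on the deployment.

```shell
# describe_paranoid LEVEL: prints what the given perf_event_paranoid level restricts.
describe_paranoid() {
  case "$1" in
    -1) echo "no restrictions for unprivileged users" ;;
    0)  echo "unprivileged users lose raw tracepoint access" ;;
    1)  echo "unprivileged users also lose CPU event access" ;;
    2)  echo "unprivileged users also lose kernel profiling" ;;
    *)  echo "not a standard upstream level (some distributions add stricter ones)" ;;
  esac
}

# paranoid_ok LEVEL: succeeds when LEVEL is 1 or lower.
paranoid_ok() {
  [ "$1" -le 1 ]
}

if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
  level=$(cat /proc/sys/kernel/perf_event_paranoid)
  echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
  paranoid_ok "$level" || echo "consider lowering it to 1 or granting a capability"
fi
```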
Solutions:
perf_event_paranoid too restrictive
# Temporary
sysctl kernel.perf_event_paranoid=1
# Permanent
echo 'kernel.perf_event_paranoid=1' >> /etc/sysctl.conf
Missing capability
securityContext:
  capabilities:
    add:
      - SYS_ADMIN # or PERFMON on newer kernels
Profiling disabled
agent:
  profiling:
    enabled: true
Container Not Starting
Symptom: Container exits immediately
Diagnostics:
# Check exit code
docker inspect telegen --format='{{.State.ExitCode}}'
# Check last logs
docker logs telegen 2>&1 | tail -50
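The exit code narrows down which of the solutions below applies. By shell convention, a code above 128 means the process was terminated by a signal (code minus 128). The helper below is an illustrative sketch of that rule, not a Telegen feature.

```shell
# explain_exit_code CODE: prints a one-line interpretation of a container exit code.
explain_exit_code() (
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "exited with error status $code"
  fi
)

explain_exit_code 137   # killed by signal 9 (SIGKILL, often the OOM killer)
explain_exit_code 1     # exited with error status 1

# Example with a real container:
#   explain_exit_code "$(docker inspect telegen --format='{{.State.ExitCode}}')"
```

For code 137, `docker inspect telegen --format='{{.State.OOMKilled}}'` shows whether the kernel's OOM killer was responsible. A small non-zero code usually points to a configuration or startup error that the logs will describe.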
Solutions:
Config file error
# Validate config
telegen --validate-config /etc/telegen/config.yaml
Required mounts missing
docker run -d \
  -v /sys:/sys:ro \
  -v /proc:/host/proc:ro \
  -v /sys/kernel/debug:/sys/kernel/debug \
  -v /sys/fs/bpf:/sys/fs/bpf \
  ...
Kernel version mismatch
BTF for wrong kernel
Use -fno-BTF builds or a matching kernel
Debug Mode
Enable comprehensive debugging:
telegen:
  log_level: debug
agent:
  ebpf:
    debug: true
Or via environment:
TELEGEN_LOG_LEVEL=debug \
TELEGEN_AGENT_EBPF_DEBUG=true \
telegen
Getting Help
Collect Diagnostics
# Create diagnostic bundle
telegen diagnostics > telegen-diagnostics.tar.gz
Bundle includes:
Configuration (sanitized)
Metrics snapshot
eBPF program list
Kernel info
Recent logs
Log an Issue
When reporting issues, include:
Telegen version: telegen version
Kernel version: uname -a
Distribution: cat /etc/os-release
Diagnostic bundle
Steps to reproduce
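The version details in the list above can be gathered into one text block for pasting into an issue. This is a minimal sketch with a hypothetical function name; it prints "unavailable" where a command or file is missing and does not replace the diagnostic bundle.

```shell
# collect_report_info: prints the version details requested for issue reports.
collect_report_info() {
  echo "Telegen version: $(telegen version 2>/dev/null || echo unavailable)"
  echo "Kernel version: $(uname -a)"
  echo "Distribution:"
  cat /etc/os-release 2>/dev/null || echo "unavailable"
}

# Example: collect_report_info > telegen-report.txt
```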
Next Steps
Monitoring Telegen - Set up monitoring
Performance Tuning - Optimize performance
Full Configuration Reference - Configuration options