Troubleshooting

Common issues and solutions for Telegen.

Quick Diagnostics

Check Telegen Status

# Health check
curl http://localhost:19090/healthz

# Readiness
curl http://localhost:19090/ready

# Full status
curl http://localhost:19090/status

Check eBPF Programs

# List loaded programs
bpftool prog list | grep -i telegen

# Check if eBPF is working
cat /sys/kernel/debug/tracing/trace_pipe | head -20

Check Logs

# Kubernetes
kubectl logs -l app=telegen -n monitoring --tail=100

# Docker
docker logs telegen --tail=100

# Systemd
journalctl -u telegen -f

Common Issues

eBPF Program Load Failures

Symptom: Telegen starts but shows “eBPF program load failed”

Causes and Solutions:

  1. Kernel too old

    • Minimum: Linux 4.18

    • Recommended: Linux 5.8+

    uname -r  # Check kernel version
    
  2. Missing capabilities

    # Docker
    docker run --privileged ...
    
    # Kubernetes
    securityContext:
      privileged: true
    
  3. BPF filesystem not mounted

    mount | grep bpf
    # Should show: bpffs on /sys/fs/bpf type bpf
    
    # Mount if missing
    mount -t bpf bpf /sys/fs/bpf
    
  4. BTF not available

    ls /sys/kernel/btf/vmlinux
    # Should exist for CO-RE support
    

No Traces Being Collected

Symptom: Telegen running but no traces in backend

Diagnostics:

# Check if spans are being collected
curl -s http://localhost:19090/metrics | grep telegen_spans

# Check export status
curl -s http://localhost:19090/metrics | grep telegen_export

Solutions:

  1. OTLP endpoint unreachable

    # Test connectivity
    nc -zv otel-collector 4317
    
    # Check DNS
    nslookup otel-collector
    
  2. Network tracing disabled

    agent:
      ebpf:
        network:
          enabled: true  # Ensure enabled
    
  3. Wrong port configuration

    agent:
      ebpf:
        network:
          include_ports:
            - 80
            - 443
            - 8080  # Add your app ports
    
  4. TLS issues

    otlp:
      endpoint: "otel-collector:4317"
      insecure: true  # Try without TLS first
    

High Memory Usage

Symptom: Telegen using excessive memory

Diagnostics:

# Check memory metrics
curl -s http://localhost:19090/metrics | grep telegen_process_resident_memory

# Check queue sizes
curl -s http://localhost:19090/metrics | grep telegen_export_queue

Solutions:

  1. Reduce ring buffer size

    agent:
      ebpf:
        ringbuf_size: 8388608  # 8MB instead of 16MB
    
  2. Limit queue memory

    queues:
      traces:
        mem_limit: "128Mi"
      metrics:
        mem_limit: "64Mi"
    
  3. Increase export frequency

    • Check if backend is slow

    • Reduce batch sizes

    queues:
      traces:
        batch_size: 256  # Smaller batches
    

Event Loss (Ring Buffer)

Symptom: telegen_ebpf_ringbuf_lost_total increasing

Diagnostics:

# Check loss rate
curl -s http://localhost:19090/metrics | grep ringbuf_lost

Solutions:

  1. Increase ring buffer size

    agent:
      ebpf:
        ringbuf_size: 67108864  # 64MB
    
  2. Reduce event volume

    agent:
      ebpf:
        network:
          exclude_ports:
            - 22     # SSH
            - 2379   # etcd
        syscalls:
          exclude:
            - futex
            - nanosleep
    
  3. Check CPU bottleneck

    • Telegen may not be processing fast enough

    • Increase CPU limits


Export Errors

Symptom: telegen_export_errors_total increasing

Diagnostics:

# Check specific errors
curl -s http://localhost:19090/metrics | grep export_errors

# Check logs
grep -i "export" /var/log/telegen.log | tail -20

Solutions:

  1. Connection refused

    # Verify endpoint
    curl -v http://otel-collector:4317
    
    # Check endpoint config
    cat /etc/telegen/config.yaml | grep endpoint
    
  2. TLS certificate errors

    otlp:
      tls:
        ca_file: "/etc/ssl/certs/ca.crt"
        insecure_skip_verify: false  # Ensure CA is correct
    
  3. Authentication failures

    otlp:
      headers:
        Authorization: "Bearer ${OTEL_TOKEN}"
    
  4. Backend overloaded

    • Increase retry backoff

    • Check backend capacity

    backoff:
      initial: "1s"
      max: "60s"
    

Missing Kubernetes Metadata

Symptom: Traces lack k8s.pod.name, k8s.namespace labels

Diagnostics:

# Check if running in K8s
kubectl get pods -l app=telegen -n monitoring

# Check RBAC
kubectl auth can-i get pods --as=system:serviceaccount:monitoring:telegen

Solutions:

  1. Missing RBAC permissions

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: telegen
    rules:
      - apiGroups: [""]
        resources: ["pods", "nodes", "services"]
        verbs: ["get", "list", "watch"]
    
  2. Downward API not configured

    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
    
  3. Discovery disabled

    agent:
      discovery:
        detect_kubernetes: true
    

No GPU Metrics

Symptom: GPU metrics not appearing

Diagnostics:

# Check NVML
nvidia-smi

# Check if device is mounted
ls /dev/nvidia*

Solutions:

  1. NVML not available

    • Ensure NVIDIA drivers installed

    • Mount NVIDIA device in container

    volumes:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
    
  2. Container not GPU-enabled

    spec:
      runtimeClassName: nvidia
      containers:
        - name: telegen
          resources:
            limits:
              nvidia.com/gpu: 0  # Access without allocating
    
  3. GPU monitoring disabled

    agent:
      gpu:
        enabled: true
        nvidia: true
    

Profiling Not Working

Symptom: No profiles in backend

Diagnostics:

# Check profiling enabled
curl -s http://localhost:19090/metrics | grep profile

# Check perf_event access
cat /proc/sys/kernel/perf_event_paranoid

Solutions:

  1. perf_event_paranoid too restrictive

    # Temporary
    sysctl kernel.perf_event_paranoid=1
    
    # Permanent
    echo 'kernel.perf_event_paranoid=1' >> /etc/sysctl.conf
    
  2. Missing capability

    securityContext:
      capabilities:
        add:
          - SYS_ADMIN  # or PERFMON on newer kernels
    
  3. Profiling disabled

    agent:
      profiling:
        enabled: true
    

Container Not Starting

Symptom: Container exits immediately

Diagnostics:

# Check exit code
docker inspect telegen --format='{{.State.ExitCode}}'

# Check last logs
docker logs telegen 2>&1 | tail -50

Solutions:

  1. Config file error

    # Validate config
    telegen --validate-config /etc/telegen/config.yaml
    
  2. Required mounts missing

    docker run -d \
      -v /sys:/sys:ro \
      -v /proc:/host/proc:ro \
      -v /sys/kernel/debug:/sys/kernel/debug \
      -v /sys/fs/bpf:/sys/fs/bpf \
      ...
    
  3. Kernel version mismatch

    • BTF for wrong kernel

    • Use -fno-BTF builds or matching kernel


Debug Mode

Enable comprehensive debugging:

telegen:
  log_level: debug
  
agent:
  ebpf:
    debug: true

Or via environment:

TELEGEN_LOG_LEVEL=debug \
TELEGEN_AGENT_EBPF_DEBUG=true \
telegen

Getting Help

Collect Diagnostics

# Create diagnostic bundle
telegen diagnostics > telegen-diagnostics.tar.gz

Bundle includes:

  • Configuration (sanitized)

  • Metrics snapshot

  • eBPF program list

  • Kernel info

  • Recent logs

Log an Issue

When reporting issues, include:

  • Telegen version: telegen version

  • Kernel version: uname -a

  • Distribution: cat /etc/os-release

  • Diagnostic bundle

  • Steps to reproduce


Next Steps