AI/ML Observability

Telegen provides observability for AI/ML workloads, including GPU monitoring and LLM inference metrics.

Overview

AI/ML observability includes:

GPU monitoring - NVIDIA and AMD GPU metrics
LLM inference - Token throughput, latency, TTFT
Model serving - Batch size, queue depth, inference time
Training metrics - Loss, throughput, GPU utilization

GPU Monitoring

NVIDIA GPU Metrics

Telegen collects NVIDIA GPU metrics via NVML (NVIDIA Management Library):

Metric	Description
`gpu_utilization_percent`	GPU compute utilization
`gpu_memory_used_bytes`	GPU memory used
`gpu_memory_total_bytes`	GPU memory total
`gpu_memory_free_bytes`	GPU memory free
`gpu_temperature_celsius`	GPU temperature
`gpu_power_usage_watts`	Current power draw
`gpu_power_limit_watts`	Power limit
`gpu_sm_clock_hz`	Streaming multiprocessor clock
`gpu_memory_clock_hz`	Memory clock
`gpu_pcie_tx_bytes`	PCIe transmit throughput
`gpu_pcie_rx_bytes`	PCIe receive throughput
`gpu_encoder_utilization_percent`	Video encoder utilization
`gpu_decoder_utilization_percent`	Video decoder utilization

Per-Process GPU Metrics

Track GPU usage per process:

Metric	Description
`gpu_process_memory_bytes`	Memory used by process
`gpu_process_sm_utilization_percent`	SM utilization by process

Configuration

agent:
  gpu:
    enabled: true
    
    # NVIDIA support
    nvidia: true
    
    # AMD support (via ROCm SMI)
    amd: false
    
    # Polling interval
    poll_interval: 10s
    
    # Metrics to collect
    metrics:
      utilization: true
      memory: true
      temperature: true
      power: true
      clock: true
      pcie_throughput: true
      encoder_decoder: true
      
    # Per-process tracking
    per_process: true

AMD GPU Metrics

For AMD GPUs, Telegen uses ROCm SMI:

Metric	Description
`gpu_utilization_percent`	GPU utilization
`gpu_memory_used_bytes`	VRAM used
`gpu_memory_total_bytes`	VRAM total
`gpu_temperature_celsius`	GPU temperature
`gpu_power_usage_watts`	Power consumption
`gpu_fan_speed_percent`	Fan speed

Configuration

agent:
  gpu:
    enabled: true
    nvidia: false
    amd: true
    poll_interval: 10s

LLM Inference Metrics

Track LLM inference performance:

Key Metrics

Metric	Description
`llm_request_total`	Total inference requests
`llm_request_duration_seconds`	End-to-end request duration
`llm_time_to_first_token_seconds`	Time to first token (TTFT)
`llm_inter_token_latency_seconds`	Time between tokens
`llm_tokens_generated_total`	Total tokens generated
`llm_tokens_per_second`	Token generation throughput
`llm_prompt_tokens_total`	Input prompt tokens
`llm_queue_depth`	Requests waiting in queue
`llm_batch_size`	Current batch size
`llm_kv_cache_usage_bytes`	KV cache memory usage

Example Metrics

# Average time to first token
histogram_quantile(0.95, 
  sum(rate(llm_time_to_first_token_seconds_bucket[5m])) by (le, model)
)

# Token throughput
sum(rate(llm_tokens_generated_total[5m])) by (model)

# Request rate by model
sum(rate(llm_request_total[5m])) by (model)

# Queue depth
llm_queue_depth{model="llama-3-70b"}

Labels

Label	Description
`model`	Model name/version
`instance`	Server instance
`gpu`	GPU device index

Model Serving Frameworks

Supported Frameworks

Framework	Auto-Instrumentation
vLLM	✅ Full metrics
TGI (Text Generation Inference)	✅ Full metrics
NVIDIA Triton	✅ Full metrics
TensorFlow Serving	✅ Basic metrics
TorchServe	✅ Basic metrics
ONNX Runtime	✅ Basic metrics

vLLM Integration

agent:
  aiml:
    frameworks:
      vllm:
        enabled: true
        # Collect all vLLM metrics
        metrics:
          - request_duration
          - time_to_first_token
          - tokens_per_second
          - kv_cache_usage
          - batch_size

Triton Integration

agent:
  aiml:
    frameworks:
      triton:
        enabled: true
        metrics_endpoint: "http://localhost:8002/metrics"

Training Observability

Monitor ML training jobs:

Metrics

Metric	Description
`training_loss`	Current training loss
`training_step`	Current training step
`training_epoch`	Current epoch
`training_learning_rate`	Current learning rate
`training_throughput_samples_per_second`	Training throughput
`training_gpu_utilization_percent`	GPU utilization during training
`training_gradient_norm`	Gradient norm

Configuration

agent:
  aiml:
    training:
      enabled: true
      
      # Detect common training frameworks
      detect_frameworks:
        - pytorch
        - tensorflow
        - jax
      
      # Log training metrics to OTLP
      export_metrics: true

Multi-GPU Monitoring

GPU Labels

All GPU metrics include device labels:

gpu_utilization_percent{
  device="0",
  name="NVIDIA A100-SXM4-80GB",
  uuid="GPU-abc123",
  k8s_pod="llm-server-abc"
} 85.5

Multi-Node Training

Track distributed training across nodes:

# Total GPU utilization across all training nodes
sum(gpu_utilization_percent{job="distributed-training"})

# GPU memory per node
gpu_memory_used_bytes{job="distributed-training"} by (node)

# Communication overhead (NCCL)
rate(gpu_nccl_send_bytes_total[5m]) + rate(gpu_nccl_recv_bytes_total[5m])

Alerting Examples

GPU Alerts

groups:
  - name: gpu
    rules:
      - alert: GPUHighTemperature
        expr: gpu_temperature_celsius > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.device }} temperature is {{ $value }}°C"
      
      - alert: GPUOutOfMemory
        expr: gpu_memory_used_bytes / gpu_memory_total_bytes > 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.device }} memory is {{ $value | humanizePercentage }}"
      
      - alert: GPULowUtilization
        expr: gpu_utilization_percent < 10
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "GPU {{ $labels.device }} underutilized"

LLM Alerts

groups:
  - name: llm
    rules:
      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency is {{ $value | humanizeDuration }}"
      
      - alert: LLMHighQueueDepth
        expr: llm_queue_depth > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM queue depth is {{ $value }}"
      
      - alert: LLMSlowTTFT
        expr: histogram_quantile(0.95, rate(llm_time_to_first_token_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM time-to-first-token P95 is {{ $value | humanizeDuration }}"

Kubernetes GPU Support

NVIDIA GPU Operator

When using NVIDIA GPU Operator in Kubernetes:

# DaemonSet config
spec:
  containers:
    - name: telegen
      resources:
        limits:
          nvidia.com/gpu: 0  # Don't request GPU, just monitor
      volumeMounts:
        # Mount NVML socket
        - name: nvidia-mps
          mountPath: /var/run/nvidia
  volumes:
    - name: nvidia-mps
      hostPath:
        path: /var/run/nvidia

MIG (Multi-Instance GPU) Support

Monitor MIG partitions:

gpu_utilization_percent{
  device="0",
  mig_device="mig-1g.5gb-0",
  mig_profile="1g.5gb"
} 75.2

Dashboard Examples

GPU Overview

# GPU fleet summary
sum(gpu_utilization_percent) by (name) / count(gpu_utilization_percent) by (name)

# Memory pressure
sum(gpu_memory_used_bytes) / sum(gpu_memory_total_bytes) * 100

# Power efficiency (tokens per watt)
sum(rate(llm_tokens_generated_total[5m])) / sum(gpu_power_usage_watts)

LLM Performance

# Requests per second
sum(rate(llm_request_total[5m])) by (model)

# Token generation rate
sum(rate(llm_tokens_generated_total[5m])) by (model)

# Latency percentiles
histogram_quantile(0.50, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.99, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))

Best Practices

1. Enable Per-Process Tracking

Identify which processes use GPU resources:

agent:
  gpu:
    per_process: true

2. Monitor KV Cache

KV cache is critical for LLM performance:

# Alert when KV cache is near capacity
llm_kv_cache_usage_bytes / llm_kv_cache_capacity_bytes > 0.9

3. Correlate with Traces

Link inference metrics to traces:

agent:
  aiml:
    # Add trace context to LLM metrics
    trace_correlation: true

Next Steps

Continuous Profiling - Profile GPU workloads
Agent Mode Configuration - GPU configuration
Monitoring Telegen - GPU dashboards