AI/ML Observability
Telegen provides observability for AI/ML workloads, including GPU monitoring and LLM inference metrics.
Overview
AI/ML observability includes:
GPU monitoring - NVIDIA and AMD GPU metrics
LLM inference - Token throughput, latency, TTFT
Model serving - Batch size, queue depth, inference time
Training metrics - Loss, throughput, GPU utilization
GPU Monitoring
NVIDIA GPU Metrics
Telegen collects NVIDIA GPU metrics via NVML (NVIDIA Management Library):
Metric |
Description |
|---|---|
|
GPU compute utilization |
|
GPU memory used |
|
GPU memory total |
|
GPU memory free |
|
GPU temperature |
|
Current power draw |
|
Power limit |
|
Streaming multiprocessor clock |
|
Memory clock |
|
PCIe transmit throughput |
|
PCIe receive throughput |
|
Video encoder utilization |
|
Video decoder utilization |
Per-Process GPU Metrics
Track GPU usage per process:
Metric |
Description |
|---|---|
|
Memory used by process |
|
SM utilization by process |
Configuration
agent:
gpu:
enabled: true
# NVIDIA support
nvidia: true
# AMD support (via ROCm SMI)
amd: false
# Polling interval
poll_interval: 10s
# Metrics to collect
metrics:
utilization: true
memory: true
temperature: true
power: true
clock: true
pcie_throughput: true
encoder_decoder: true
# Per-process tracking
per_process: true
AMD GPU Metrics
For AMD GPUs, Telegen uses ROCm SMI:
Metric |
Description |
|---|---|
|
GPU utilization |
|
VRAM used |
|
VRAM total |
|
GPU temperature |
|
Power consumption |
|
Fan speed |
Configuration
agent:
gpu:
enabled: true
nvidia: false
amd: true
poll_interval: 10s
LLM Inference Metrics
Track LLM inference performance:
Key Metrics
Metric |
Description |
|---|---|
|
Total inference requests |
|
End-to-end request duration |
|
Time to first token (TTFT) |
|
Time between tokens |
|
Total tokens generated |
|
Token generation throughput |
|
Input prompt tokens |
|
Requests waiting in queue |
|
Current batch size |
|
KV cache memory usage |
Example Metrics
# Average time to first token
histogram_quantile(0.95,
sum(rate(llm_time_to_first_token_seconds_bucket[5m])) by (le, model)
)
# Token throughput
sum(rate(llm_tokens_generated_total[5m])) by (model)
# Request rate by model
sum(rate(llm_request_total[5m])) by (model)
# Queue depth
llm_queue_depth{model="llama-3-70b"}
Labels
Label |
Description |
|---|---|
|
Model name/version |
|
Server instance |
|
GPU device index |
Model Serving Frameworks
Supported Frameworks
Framework |
Auto-Instrumentation |
|---|---|
vLLM |
✅ Full metrics |
TGI (Text Generation Inference) |
✅ Full metrics |
NVIDIA Triton |
✅ Full metrics |
TensorFlow Serving |
✅ Basic metrics |
TorchServe |
✅ Basic metrics |
ONNX Runtime |
✅ Basic metrics |
vLLM Integration
agent:
aiml:
frameworks:
vllm:
enabled: true
# Collect all vLLM metrics
metrics:
- request_duration
- time_to_first_token
- tokens_per_second
- kv_cache_usage
- batch_size
Triton Integration
agent:
aiml:
frameworks:
triton:
enabled: true
metrics_endpoint: "http://localhost:8002/metrics"
Training Observability
Monitor ML training jobs:
Metrics
Metric |
Description |
|---|---|
|
Current training loss |
|
Current training step |
|
Current epoch |
|
Current learning rate |
|
Training throughput |
|
GPU utilization during training |
|
Gradient norm |
Configuration
agent:
aiml:
training:
enabled: true
# Detect common training frameworks
detect_frameworks:
- pytorch
- tensorflow
- jax
# Log training metrics to OTLP
export_metrics: true
Multi-GPU Monitoring
GPU Labels
All GPU metrics include device labels:
gpu_utilization_percent{
device="0",
name="NVIDIA A100-SXM4-80GB",
uuid="GPU-abc123",
k8s_pod="llm-server-abc"
} 85.5
Multi-Node Training
Track distributed training across nodes:
# Total GPU utilization across all training nodes
sum(gpu_utilization_percent{job="distributed-training"})
# GPU memory per node
gpu_memory_used_bytes{job="distributed-training"} by (node)
# Communication overhead (NCCL)
rate(gpu_nccl_send_bytes_total[5m]) + rate(gpu_nccl_recv_bytes_total[5m])
Alerting Examples
GPU Alerts
groups:
- name: gpu
rules:
- alert: GPUHighTemperature
expr: gpu_temperature_celsius > 85
for: 5m
labels:
severity: warning
annotations:
summary: "GPU {{ $labels.device }} temperature is {{ $value }}°C"
- alert: GPUOutOfMemory
expr: gpu_memory_used_bytes / gpu_memory_total_bytes > 0.95
for: 5m
labels:
severity: warning
annotations:
summary: "GPU {{ $labels.device }} memory is {{ $value | humanizePercentage }}"
- alert: GPULowUtilization
expr: gpu_utilization_percent < 10
for: 30m
labels:
severity: info
annotations:
summary: "GPU {{ $labels.device }} underutilized"
LLM Alerts
groups:
- name: llm
rules:
- alert: LLMHighLatency
expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30
for: 5m
labels:
severity: warning
annotations:
summary: "LLM P95 latency is {{ $value | humanizeDuration }}"
- alert: LLMHighQueueDepth
expr: llm_queue_depth > 100
for: 5m
labels:
severity: warning
annotations:
summary: "LLM queue depth is {{ $value }}"
- alert: LLMSlowTTFT
expr: histogram_quantile(0.95, rate(llm_time_to_first_token_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "LLM time-to-first-token P95 is {{ $value | humanizeDuration }}"
Kubernetes GPU Support
NVIDIA GPU Operator
When using NVIDIA GPU Operator in Kubernetes:
# DaemonSet config
spec:
containers:
- name: telegen
resources:
limits:
nvidia.com/gpu: 0 # Don't request GPU, just monitor
volumeMounts:
# Mount NVML socket
- name: nvidia-mps
mountPath: /var/run/nvidia
volumes:
- name: nvidia-mps
hostPath:
path: /var/run/nvidia
MIG (Multi-Instance GPU) Support
Monitor MIG partitions:
gpu_utilization_percent{
device="0",
mig_device="mig-1g.5gb-0",
mig_profile="1g.5gb"
} 75.2
Dashboard Examples
GPU Overview
# GPU fleet summary
sum(gpu_utilization_percent) by (name) / count(gpu_utilization_percent) by (name)
# Memory pressure
sum(gpu_memory_used_bytes) / sum(gpu_memory_total_bytes) * 100
# Power efficiency (tokens per watt)
sum(rate(llm_tokens_generated_total[5m])) / sum(gpu_power_usage_watts)
LLM Performance
# Requests per second
sum(rate(llm_request_total[5m])) by (model)
# Token generation rate
sum(rate(llm_tokens_generated_total[5m])) by (model)
# Latency percentiles
histogram_quantile(0.50, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
histogram_quantile(0.99, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model))
Best Practices
1. Enable Per-Process Tracking
Identify which processes use GPU resources:
agent:
gpu:
per_process: true
2. Monitor KV Cache
KV cache is critical for LLM performance:
# Alert when KV cache is near capacity
llm_kv_cache_usage_bytes / llm_kv_cache_capacity_bytes > 0.9
3. Correlate with Traces
Link inference metrics to traces:
agent:
aiml:
# Add trace context to LLM metrics
trace_correlation: true
Next Steps
Continuous Profiling - Profile GPU workloads
Agent Mode Configuration - GPU configuration
Monitoring Telegen - GPU dashboards