# AI/ML Observability Telegen provides observability for AI/ML workloads, including GPU monitoring and LLM inference metrics. ## Overview AI/ML observability includes: - **GPU monitoring** - NVIDIA and AMD GPU metrics - **LLM inference** - Token throughput, latency, TTFT - **Model serving** - Batch size, queue depth, inference time - **Training metrics** - Loss, throughput, GPU utilization --- ## GPU Monitoring ### NVIDIA GPU Metrics Telegen collects NVIDIA GPU metrics via NVML (NVIDIA Management Library): | Metric | Description | |--------|-------------| | `gpu_utilization_percent` | GPU compute utilization | | `gpu_memory_used_bytes` | GPU memory used | | `gpu_memory_total_bytes` | GPU memory total | | `gpu_memory_free_bytes` | GPU memory free | | `gpu_temperature_celsius` | GPU temperature | | `gpu_power_usage_watts` | Current power draw | | `gpu_power_limit_watts` | Power limit | | `gpu_sm_clock_hz` | Streaming multiprocessor clock | | `gpu_memory_clock_hz` | Memory clock | | `gpu_pcie_tx_bytes` | PCIe transmit throughput | | `gpu_pcie_rx_bytes` | PCIe receive throughput | | `gpu_encoder_utilization_percent` | Video encoder utilization | | `gpu_decoder_utilization_percent` | Video decoder utilization | ### Per-Process GPU Metrics Track GPU usage per process: | Metric | Description | |--------|-------------| | `gpu_process_memory_bytes` | Memory used by process | | `gpu_process_sm_utilization_percent` | SM utilization by process | ### Configuration ```yaml agent: gpu: enabled: true # NVIDIA support nvidia: true # AMD support (via ROCm SMI) amd: false # Polling interval poll_interval: 10s # Metrics to collect metrics: utilization: true memory: true temperature: true power: true clock: true pcie_throughput: true encoder_decoder: true # Per-process tracking per_process: true ``` --- ## AMD GPU Metrics For AMD GPUs, Telegen uses ROCm SMI: | Metric | Description | |--------|-------------| | `gpu_utilization_percent` | GPU utilization | | `gpu_memory_used_bytes` | VRAM used | | `gpu_memory_total_bytes` | VRAM total | | `gpu_temperature_celsius` | GPU temperature | | `gpu_power_usage_watts` | Power consumption | | `gpu_fan_speed_percent` | Fan speed | ### Configuration ```yaml agent: gpu: enabled: true nvidia: false amd: true poll_interval: 10s ``` --- ## LLM Inference Metrics Track LLM inference performance: ### Key Metrics | Metric | Description | |--------|-------------| | `llm_request_total` | Total inference requests | | `llm_request_duration_seconds` | End-to-end request duration | | `llm_time_to_first_token_seconds` | Time to first token (TTFT) | | `llm_inter_token_latency_seconds` | Time between tokens | | `llm_tokens_generated_total` | Total tokens generated | | `llm_tokens_per_second` | Token generation throughput | | `llm_prompt_tokens_total` | Input prompt tokens | | `llm_queue_depth` | Requests waiting in queue | | `llm_batch_size` | Current batch size | | `llm_kv_cache_usage_bytes` | KV cache memory usage | ### Example Metrics ```promql # Average time to first token histogram_quantile(0.95, sum(rate(llm_time_to_first_token_seconds_bucket[5m])) by (le, model) ) # Token throughput sum(rate(llm_tokens_generated_total[5m])) by (model) # Request rate by model sum(rate(llm_request_total[5m])) by (model) # Queue depth llm_queue_depth{model="llama-3-70b"} ``` ### Labels | Label | Description | |-------|-------------| | `model` | Model name/version | | `instance` | Server instance | | `gpu` | GPU device index | --- ## Model Serving Frameworks ### Supported Frameworks | Framework | Auto-Instrumentation | |-----------|---------------------| | **vLLM** | ✅ Full metrics | | **TGI (Text Generation Inference)** | ✅ Full metrics | | **NVIDIA Triton** | ✅ Full metrics | | **TensorFlow Serving** | ✅ Basic metrics | | **TorchServe** | ✅ Basic metrics | | **ONNX Runtime** | ✅ Basic metrics | ### vLLM Integration ```yaml agent: aiml: frameworks: vllm: enabled: true # Collect all vLLM metrics metrics: - request_duration - time_to_first_token - tokens_per_second - kv_cache_usage - batch_size ``` ### Triton Integration ```yaml agent: aiml: frameworks: triton: enabled: true metrics_endpoint: "http://localhost:8002/metrics" ``` --- ## Training Observability Monitor ML training jobs: ### Metrics | Metric | Description | |--------|-------------| | `training_loss` | Current training loss | | `training_step` | Current training step | | `training_epoch` | Current epoch | | `training_learning_rate` | Current learning rate | | `training_throughput_samples_per_second` | Training throughput | | `training_gpu_utilization_percent` | GPU utilization during training | | `training_gradient_norm` | Gradient norm | ### Configuration ```yaml agent: aiml: training: enabled: true # Detect common training frameworks detect_frameworks: - pytorch - tensorflow - jax # Log training metrics to OTLP export_metrics: true ``` --- ## Multi-GPU Monitoring ### GPU Labels All GPU metrics include device labels: ```yaml gpu_utilization_percent{ device="0", name="NVIDIA A100-SXM4-80GB", uuid="GPU-abc123", k8s_pod="llm-server-abc" } 85.5 ``` ### Multi-Node Training Track distributed training across nodes: ```promql # Total GPU utilization across all training nodes sum(gpu_utilization_percent{job="distributed-training"}) # GPU memory per node gpu_memory_used_bytes{job="distributed-training"} by (node) # Communication overhead (NCCL) rate(gpu_nccl_send_bytes_total[5m]) + rate(gpu_nccl_recv_bytes_total[5m]) ``` --- ## Alerting Examples ### GPU Alerts ```yaml groups: - name: gpu rules: - alert: GPUHighTemperature expr: gpu_temperature_celsius > 85 for: 5m labels: severity: warning annotations: summary: "GPU {{ $labels.device }} temperature is {{ $value }}°C" - alert: GPUOutOfMemory expr: gpu_memory_used_bytes / gpu_memory_total_bytes > 0.95 for: 5m labels: severity: warning annotations: summary: "GPU {{ $labels.device }} memory is {{ $value | humanizePercentage }}" - alert: GPULowUtilization expr: gpu_utilization_percent < 10 for: 30m labels: severity: info annotations: summary: "GPU {{ $labels.device }} underutilized" ``` ### LLM Alerts ```yaml groups: - name: llm rules: - alert: LLMHighLatency expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30 for: 5m labels: severity: warning annotations: summary: "LLM P95 latency is {{ $value | humanizeDuration }}" - alert: LLMHighQueueDepth expr: llm_queue_depth > 100 for: 5m labels: severity: warning annotations: summary: "LLM queue depth is {{ $value }}" - alert: LLMSlowTTFT expr: histogram_quantile(0.95, rate(llm_time_to_first_token_seconds_bucket[5m])) > 5 for: 5m labels: severity: warning annotations: summary: "LLM time-to-first-token P95 is {{ $value | humanizeDuration }}" ``` --- ## Kubernetes GPU Support ### NVIDIA GPU Operator When using NVIDIA GPU Operator in Kubernetes: ```yaml # DaemonSet config spec: containers: - name: telegen resources: limits: nvidia.com/gpu: 0 # Don't request GPU, just monitor volumeMounts: # Mount NVML socket - name: nvidia-mps mountPath: /var/run/nvidia volumes: - name: nvidia-mps hostPath: path: /var/run/nvidia ``` ### MIG (Multi-Instance GPU) Support Monitor MIG partitions: ```yaml gpu_utilization_percent{ device="0", mig_device="mig-1g.5gb-0", mig_profile="1g.5gb" } 75.2 ``` --- ## Dashboard Examples ### GPU Overview ```promql # GPU fleet summary sum(gpu_utilization_percent) by (name) / count(gpu_utilization_percent) by (name) # Memory pressure sum(gpu_memory_used_bytes) / sum(gpu_memory_total_bytes) * 100 # Power efficiency (tokens per watt) sum(rate(llm_tokens_generated_total[5m])) / sum(gpu_power_usage_watts) ``` ### LLM Performance ```promql # Requests per second sum(rate(llm_request_total[5m])) by (model) # Token generation rate sum(rate(llm_tokens_generated_total[5m])) by (model) # Latency percentiles histogram_quantile(0.50, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model)) histogram_quantile(0.95, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model)) histogram_quantile(0.99, sum(rate(llm_request_duration_seconds_bucket[5m])) by (le, model)) ``` --- ## Best Practices ### 1. Enable Per-Process Tracking Identify which processes use GPU resources: ```yaml agent: gpu: per_process: true ``` ### 2. Monitor KV Cache KV cache is critical for LLM performance: ```promql # Alert when KV cache is near capacity llm_kv_cache_usage_bytes / llm_kv_cache_capacity_bytes > 0.9 ``` ### 3. Correlate with Traces Link inference metrics to traces: ```yaml agent: aiml: # Add trace context to LLM metrics trace_correlation: true ``` --- ## Next Steps - {doc}`continuous-profiling` - Profile GPU workloads - {doc}`../configuration/agent-mode` - GPU configuration - {doc}`../operations/monitoring` - GPU dashboards