# Distributed Tracing

Telegen provides zero-configuration distributed tracing using eBPF.

## Overview

Telegen automatically traces:

- **HTTP/HTTPS** - All HTTP/1.1 and HTTP/2 traffic
- **gRPC** - All gRPC calls
- **Database queries** - PostgreSQL, MySQL, MongoDB, Redis
- **Message queues** - Kafka, RabbitMQ
- **Internal function calls** - For supported runtimes

No code changes or SDK integration are required.

:::{tip}
For targeted tracing, use **port-based discovery** to instrument only specific services.
See {doc}`auto-discovery` for details.

```yaml
discovery:
  instrument:
    - open_ports: "8080-8089"  # Only trace these ports
```
:::

---

## How It Works

```{mermaid}
flowchart TB
    subgraph Kernel["Linux Kernel"]
        K["eBPF Programs"]
    end
    
    subgraph App["Application"]
        A["HTTP Handler"]
        B["gRPC Client"]
        C["DB Query"]
    end
    
    subgraph Telegen["Telegen Agent"]
        T["Trace Correlator"]
        E["OTLP Exporter"]
    end
    
    K -->|"Intercept"| A
    K -->|"Intercept"| B
    K -->|"Intercept"| C
    A --> K
    B --> K
    C --> K
    K --> T
    T --> E
    E -->|"OTLP"| OC["OTel Collector"]
```

### Trace Context Propagation

Telegen automatically extracts and propagates trace context:

1. **Incoming requests** - Extract the W3C Trace Context headers (`traceparent`/`tracestate`)
2. **Outgoing requests** - Inject trace context into outgoing calls
3. **Cross-service correlation** - Link spans across service boundaries
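
To make the extract and inject steps concrete, here is a minimal Python sketch of parsing and rebuilding a W3C `traceparent` header. It is an illustration of the header format only, not Telegen's internal implementation; `TraceContext`, `parse_traceparent`, and `format_traceparent` are hypothetical names, and the sketch accepts version `00` headers only.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>"
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid per the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext, new_span_id: str) -> str:
    """Build the header for an outgoing call made under span `new_span_id`."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{new_span_id}-{flags}"
```

The trace ID stays constant across hops while the span ID changes, which is what lets spans from different services be linked into one trace.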

---

## Protocol Support

### HTTP Tracing

```yaml
# Automatically captured for every HTTP request
span:
  name: "GET /api/users/{id}"
  kind: SERVER
  attributes:
    http.method: GET
    http.url: "https://api.example.com/api/users/123"
    http.route: "/api/users/{id}"
    http.status_code: 200
    http.request_content_length: 0
    http.response_content_length: 1234
    http.user_agent: "curl/7.88.0"
    net.peer.ip: "10.0.1.50"
    net.peer.port: 45678
    net.host.ip: "10.0.1.100"
    net.host.port: 8080
```

### gRPC Tracing

```yaml
span:
  name: "/users.UserService/GetUser"
  kind: SERVER
  attributes:
    rpc.system: grpc
    rpc.service: users.UserService
    rpc.method: GetUser
    rpc.grpc.status_code: 0
    net.peer.ip: "10.0.1.50"
    net.peer.port: 45678
```

### Database Tracing

```yaml
span:
  name: "SELECT users"
  kind: CLIENT
  attributes:
    db.system: postgresql
    db.name: mydb
    db.user: appuser
    db.statement: "SELECT * FROM users WHERE id = $1"
    db.operation: SELECT
    db.sql.table: users
    net.peer.ip: "10.0.2.100"
    net.peer.port: 5432
```

### Message Queue Tracing

```yaml
# Kafka produce
span:
  name: "orders send"
  kind: PRODUCER
  attributes:
    messaging.system: kafka
    messaging.destination.name: orders
    messaging.kafka.partition: 3
    messaging.kafka.message.offset: 12345
    messaging.message.payload_size_bytes: 256

# Kafka consume
span:
  name: "orders receive"
  kind: CONSUMER
  attributes:
    messaging.system: kafka
    messaging.destination.name: orders
    messaging.kafka.consumer.group: order-processor
    messaging.kafka.partition: 3
    messaging.kafka.message.offset: 12345
```

---

## Runtime-Specific Tracing

### Go Applications

Telegen traces Go applications at the runtime level:

- **Goroutine tracking** - Track execution across goroutines
- **HTTP handlers** - `net/http`, Gin, Echo, Chi, Fiber
- **gRPC** - All gRPC calls
- **Database drivers** - `database/sql`, pgx, go-redis

### Java Applications

Telegen integrates with JFR (Java Flight Recorder) to capture:

- **Method tracing** - Hot methods and stack traces
- **GC events** - Garbage collection correlation
- **Lock contention** - Synchronized blocks and locks
- **Thread events** - Thread creation, blocking

### Python Applications

- **ASGI/WSGI** - FastAPI, Django, Flask
- **asyncio** - Async operation tracking
- **Database** - psycopg2, SQLAlchemy, pymongo

### Node.js Applications

- **HTTP** - Express, Fastify, Koa
- **Async hooks** - Promise and callback tracking
- **Database** - pg, mysql2, mongodb, redis

---

## Trace Correlation

### Automatic Signal Linking

Telegen automatically links the trace, metrics, logs, and profile produced by a single request through shared trace and span IDs:

```{mermaid}
flowchart LR
    subgraph Request["Single Request"]
        T["Trace\n(span_id: abc123)"]
        M["Metrics\n(labeled: span_id=abc123)"]
        L["Logs\n(trace_id, span_id)"]
        P["Profile\n(span_id: abc123)"]
    end
    
    T --- M
    T --- L
    T --- P
```

### Log Correlation

Logs are automatically enriched with trace context:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "User created successfully",
  "trace_id": "a1b2c3d4e5f67890a1b2c3d4e5f67890",
  "span_id": "abc123def4567890",
  "service.name": "user-service",
  "k8s.pod.name": "user-service-xyz"
}
```
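
Because every enriched record carries a `trace_id`, collecting the logs for one request reduces to a filter. The sketch below assumes logs are available as JSON lines with the field names shown above; `logs_for_trace` is a hypothetical helper, not part of Telegen.

```python
import json
from typing import Iterable, List


def logs_for_trace(lines: Iterable[str], trace_id: str) -> List[dict]:
    """Return parsed log records whose trace_id matches, skipping non-JSON lines."""
    matched = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text log line, no trace context to read
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matched.append(record)
    return matched
```

In practice a log backend would run this query for you; the point is that the join key is the trace ID itself.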

### Metric Exemplars

Metrics include exemplars linking to traces:

```yaml
http_server_duration:
  type: histogram
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
  exemplars:
    - value: 0.045
      trace_id: "a1b2c3d4e5f67890a1b2c3d4e5f67890"
      span_id: "abc123def4567890"
```

---

## Configuration

### Basic Configuration

```yaml
otlp:
  endpoint: "otel-collector:4317"
  traces:
    enabled: true
    sample_rate: 1.0  # 100% sampling
```

### Sampling

```yaml
otlp:
  traces:
    enabled: true
    # Sample 10% of traces
    sample_rate: 0.1
    
    # Head-based sampling (default)
    sampler: parent_based_traceidratio
```
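
The following Python sketch shows one common way a parent-based, trace-ID-ratio sampler can reach its decision. It is an assumption about the general technique, not a description of Telegen's code; `ratio_sampled` and `parent_based_sampled` are hypothetical names.

```python
from typing import Optional

_TWO_POW_64 = 1 << 64


def ratio_sampled(trace_id_hex: str, sample_rate: float) -> bool:
    """Decide from the trace ID alone, so every service reaches the same answer."""
    low_64_bits = int(trace_id_hex, 16) & (_TWO_POW_64 - 1)
    # Keep the trace if its low 64 bits fall below sample_rate * 2^64.
    return low_64_bits < int(sample_rate * _TWO_POW_64)


def parent_based_sampled(
    trace_id_hex: str, sample_rate: float, parent_sampled: Optional[bool]
) -> bool:
    """Follow the caller's decision when there is one; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sampled(trace_id_hex, sample_rate)
```

Deriving the decision from the trace ID rather than a random draw keeps traces complete: either every span of a trace is kept or none is.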

### Network Filtering

```yaml
agent:
  ebpf:
    network:
      enabled: true
      http: true
      grpc: true
      
      # Exclude noisy endpoints
      exclude_paths:
        - "/health"
        - "/healthz"
        - "/ready"
        - "/metrics"
      
      # Exclude by port
      exclude_ports:
        - 22    # SSH
        - 2379  # etcd
```
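
As a rough mental model of these filters, the sketch below drops a request when its port or path is on an exclusion list. It assumes exact path matching with the query string ignored; whether Telegen matches exactly, by prefix, or by pattern is not stated here, so treat `should_trace` as a hypothetical illustration.

```python
from typing import Collection
from urllib.parse import urlsplit


def should_trace(
    url_or_path: str,
    port: int,
    exclude_paths: Collection[str],
    exclude_ports: Collection[int],
) -> bool:
    """Return False for requests that the exclusion lists would filter out."""
    if port in exclude_ports:
        return False
    # Compare only the path component, so "/health?probe=1" matches "/health".
    path = urlsplit(url_or_path).path
    return path not in exclude_paths
```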

### Database Query Settings

```yaml
agent:
  database:
    # Capture full query text
    capture_queries: true
    
    # Sanitize sensitive data
    sanitize_queries: true
    
    # Max query length
    max_query_length: 1024
    
    # Capture query parameters (disabled here because values may contain sensitive data)
    capture_parameters: false
```
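
To illustrate what sanitizing and truncating a query can mean, the sketch below replaces string and numeric literals with `?` and then caps the length. This is an assumed, regex-based approximation for illustration; Telegen's actual sanitizer may work differently, and `sanitize_query` is a hypothetical name.

```python
import re

# Single-quoted SQL string, with '' as the escaped quote.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# Standalone number; the lookbehind skips identifiers (users2) and placeholders ($1).
_NUMBER_LITERAL = re.compile(r"(?<![\w$])\d+(?:\.\d+)?\b")


def sanitize_query(statement: str, max_query_length: int = 1024) -> str:
    """Replace literal values with '?' and truncate to max_query_length."""
    sanitized = _STRING_LITERAL.sub("?", statement)
    sanitized = _NUMBER_LITERAL.sub("?", sanitized)
    return sanitized[:max_query_length]
```

Parameterized statements such as `SELECT * FROM users WHERE id = $1` pass through unchanged, since they contain no literal values to hide.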

---

## Span Enrichment

### Automatic Enrichment

All spans are automatically enriched with:

| Attribute | Source |
|-----------|--------|
| `service.name` | Discovery or config |
| `service.version` | Binary analysis |
| `host.name` | System |
| `k8s.pod.name` | Kubernetes |
| `cloud.region` | Cloud metadata |
| `process.pid` | System |

### Custom Attributes

Add custom attributes via environment variables:

```yaml
# Kubernetes deployment
env:
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "team=platform,cost_center=engineering"
```
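
The `OTEL_RESOURCE_ATTRIBUTES` value is a comma-separated list of `key=value` pairs. The sketch below shows how such a string can be turned into attributes, assuming percent-encoded values as in the OpenTelemetry convention; `parse_resource_attributes` is a hypothetical helper, not a Telegen API.

```python
from typing import Dict
from urllib.parse import unquote


def parse_resource_attributes(raw: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, skipping malformed pairs."""
    attributes: Dict[str, str] = {}
    for pair in raw.split(","):
        key, separator, value = pair.partition("=")
        if not separator or not key.strip():
            continue  # no '=' or empty key: ignore this entry
        attributes[key.strip()] = unquote(value.strip())
    return attributes
```

Applied to the deployment above, this yields `team` and `cost_center` as two attributes on every span from that workload.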

---

## Performance Impact

Telegen is designed for minimal overhead:

| Metric | Overhead |
|--------|----------|
| **Latency** | < 100 μs per request |
| **CPU** | < 1% additional |
| **Memory** | ~50 MB for trace buffers |
| **Network** | Compressed OTLP batches |

### Optimizations

- **Ring buffers** - Efficient kernel-to-userspace transfer
- **Batching** - Spans batched before export
- **Compression** - gzip compression by default
- **Sampling** - Configurable head-based sampling

---

## Troubleshooting

### Missing Traces

1. **Check eBPF status**:
   ```bash
   # Verify eBPF programs loaded
   bpftool prog list | grep telegen
   ```

2. **Check OTLP connectivity**:
   ```bash
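   # Verify the endpoint accepts connections. 4317 is the OTLP/gRPC port, so
   # a successful connect is the signal; curl will not get a valid HTTP response.
   curl -v http://otel-collector:4317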
   ```

3. **Check sampling rate**:
   ```yaml
   otlp:
     traces:
       sample_rate: 1.0  # Ensure 100% for debugging
   ```
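
Where `curl` is not available in the container, the connectivity check from step 2 can be approximated with a plain TCP connection. This is a minimal Python sketch; `tcp_reachable` is a hypothetical helper, and it only proves that the port accepts connections, not that the collector speaks OTLP.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling `tcp_reachable("otel-collector", 4317)` from the agent's network namespace mirrors the example endpoint used in this guide.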

### Missing Span Correlation

1. **Verify trace context propagation**:
   - Check that incoming requests carry a `traceparent` header
   - Verify that the header follows the W3C Trace Context format

2. **Check time synchronization**:
   - Ensure NTP is configured on every host
   - Clock drift between hosts can make spans appear out of order

---

## Next Steps

- {doc}`continuous-profiling` - Link profiles to traces
- {doc}`database-tracing` - Deep database tracing
- {doc}`../configuration/agent-mode` - Trace configuration options