OpenTelemetry Observability Stack

This guide explains how to use the OpenTelemetry (OTEL) observability stack included with the MADSci example lab.

Overview

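The MADSci example lab includes a complete observability stack that provides distributed tracing (Jaeger), metrics (Prometheus), log aggregation (Loki), and unified dashboards (Grafana), all fed by a shared OTEL Collector.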

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        MADSci Example Lab                            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │
│  │ EventManager │  │ WorkcellMgr  │  │ DataManager  │  ...          │
│  │ (traces)     │  │ (traces)     │  │ (traces)     │               │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘               │
│         │                  │                  │                      │
│         └──────────────────┼──────────────────┘                      │
│                            │ OTLP (gRPC :4317)                       │
│                            ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────┐│
│  │                    OTEL Collector                                ││
│  │  Receives → Processes → Exports to backends                     ││
│  └──────────────────────────────────────────────────────────────────┘│
│                            │                                         │
│         ┌──────────────────┼──────────────────┐                      │
│         ▼                  ▼                  ▼                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐               │
│  │    Jaeger    │  │  Prometheus  │  │    Loki      │               │
│  │  (Traces)    │  │  (Metrics)   │  │   (Logs)     │               │
│  │  :16686      │  │  :9090       │  │  :3100       │               │
│  └──────────────┘  └──────────────┘  └──────────────┘               │
│                            │                                         │
│                            ▼                                         │
│  ┌──────────────────────────────────────────────────────────────────┐│
│  │                      Grafana                                     ││
│  │           Unified Dashboard (Traces + Metrics + Logs)            ││
│  │                        :3000                                     ││
│  └──────────────────────────────────────────────────────────────────┘│
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Quick Start

Basic Mode (Collector + Debug Output)

The base compose.yaml includes the OTEL collector with debug output. Traces and metrics are received but only logged to the collector’s stdout.

docker compose up

Full Observability Stack

To enable the full observability stack with Jaeger, Prometheus, Loki, and Grafana:

docker compose --profile otel up

Accessing the UIs

All services use host network mode:

Service      URL                       Description
Grafana      http://localhost:3000     Unified dashboards (admin/admin)
Jaeger       http://localhost:16686    Distributed tracing UI
Prometheus   http://localhost:9090     Metrics querying
Loki         http://localhost:3100     Log aggregation API

Note: Jaeger’s OTLP receiver is configured on ports 14317 (gRPC) and 14318 (HTTP) to avoid conflicts with the OTEL collector’s receiver ports (4317/4318).
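Once the stack is up, a quick way to confirm that each UI is listening is to probe its port. The following is a minimal sketch using only the Python standard library; the `UI_PORTS` mapping and both function names are illustrative and not part of MADSci.

```python
import socket
from typing import Dict, Mapping

# Ports copied from the table above; this mapping is illustrative only.
UI_PORTS: Dict[str, int] = {
    "Grafana": 3000,
    "Jaeger": 16686,
    "Prometheus": 9090,
    "Loki": 3100,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost",
                   ports: Mapping[str, int] = UI_PORTS) -> Dict[str, bool]:
    """Probe every named port and report which ones accept connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<12}{'reachable' if up else 'not reachable'}")
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.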

Default Credentials

Grafana’s default login is admin/admin. To set a different Grafana admin password, set the environment variable before starting:

export GRAFANA_ADMIN_PASSWORD=your-secure-password
docker compose --profile otel up

What’s Included

Pre-configured Grafana

OTEL Collector Pipelines

  1. Traces Pipeline: OTLP → Batch Processing → Jaeger

  2. Metrics Pipeline: OTLP → Batch Processing → Prometheus Remote Write

  3. Logs Pipeline: OTLP → Batch Processing → Loki

Configuration

Enabling OTEL in Managers

OTEL is enabled per-manager via environment variables. The example lab’s .env file has these enabled by default:

# Event Manager
EVENT_OTEL_ENABLED=true
EVENT_OTEL_SERVICE_NAME="madsci.event"
EVENT_OTEL_EXPORTER="otlp"
EVENT_OTEL_ENDPOINT="http://localhost:4317"
EVENT_OTEL_PROTOCOL="grpc"

# Similar for other managers...
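The variables above follow a `<MANAGER>_OTEL_<SETTING>` naming pattern. To illustrate how such a group of variables could be read into one typed object, here is a hedged sketch. It is not MADSci's actual settings loader: `OtelSettings`, `load_otel_settings`, and the fallback defaults are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class OtelSettings:
    """Hypothetical container for one manager's OTEL settings."""
    enabled: bool
    service_name: str
    exporter: str
    endpoint: str
    protocol: str


def load_otel_settings(prefix: str,
                       env: Optional[Mapping[str, str]] = None) -> OtelSettings:
    """Read <prefix>_OTEL_* variables; defaults here are assumed, not MADSci's."""
    source = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        # .env values are often quoted, so strip surrounding double quotes.
        return source.get(f"{prefix}_OTEL_{name}", default).strip().strip('"')

    return OtelSettings(
        enabled=get("ENABLED", "false").lower() in ("1", "true", "yes"),
        service_name=get("SERVICE_NAME", f"madsci.{prefix.lower()}"),
        exporter=get("EXPORTER", "otlp"),
        endpoint=get("ENDPOINT", "http://localhost:4317"),
        protocol=get("PROTOCOL", "grpc"),
    )


if __name__ == "__main__":
    print(load_otel_settings("EVENT"))
```

Passing a plain dictionary as `env` makes the function easy to try without touching the real environment.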

Disabling OTEL

To disable OTEL for a specific manager, set:

EVENT_OTEL_ENABLED=false

Custom Configuration

Configuration files are located in examples/example_lab/otel/.

Viewing Traces

In Jaeger

  1. Open http://localhost:16686

  2. Select a service from the dropdown (e.g., madsci.event)

  3. Click “Find Traces”

  4. Click on a trace to see the full request flow
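The same search can be scripted. The sketch below assumes the HTTP JSON endpoint that the Jaeger UI itself calls (`/api/traces`), which Jaeger treats as internal and may change between versions; the function names are hypothetical and the response fields used (`data`, `traceID`, `spans`) should be checked against your Jaeger version.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def jaeger_traces_url(service: str,
                      base_url: str = "http://localhost:16686",
                      limit: int = 20,
                      lookback: str = "1h") -> str:
    """Build a trace-search URL for Jaeger's internal query API (assumed)."""
    params = urlencode({"service": service, "limit": limit, "lookback": lookback})
    return f"{base_url}/api/traces?{params}"


def summarize_traces(payload: dict) -> List[Tuple[str, int]]:
    """Return (trace ID, span count) pairs from a decoded response body."""
    return [(t["traceID"], len(t["spans"])) for t in payload.get("data", [])]


def fetch_trace_summaries(service: str) -> List[Tuple[str, int]]:
    """Query a running Jaeger instance and summarize the matching traces."""
    with urlopen(jaeger_traces_url(service), timeout=5) as resp:
        return summarize_traces(json.load(resp))


if __name__ == "__main__":
    # Requires the full stack to be running (docker compose --profile otel up).
    for trace_id, span_count in fetch_trace_summaries("madsci.event"):
        print(trace_id, span_count)
```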

In Grafana

  1. Open http://localhost:3000

  2. Navigate to “Explore”

  3. Select “Jaeger” datasource

  4. Search for traces by service or trace ID

Viewing Metrics

In Prometheus

  1. Open http://localhost:9090

  2. Use PromQL queries like:

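    • sum(rate(otelcol_receiver_accepted_spans[5m])) - Spans accepted by the collector per second, averaged over 5 minutes

    • sum(rate(otelcol_receiver_accepted_metric_points[5m])) - Metric points accepted by the collector per second, averaged over 5 minutes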
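These queries can also be run outside the UI through Prometheus's HTTP API (`/api/v1/query`). The following is a small sketch with hypothetical helper names; it assumes the Prometheus instance from this stack is reachable at http://localhost:9090.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen


def prometheus_query_url(promql: str,
                         base_url: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def extract_values(payload: dict) -> List[float]:
    """Pull the sample values out of an instant-vector response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def run_query(promql: str) -> List[float]:
    """Execute the query against a running Prometheus and return its values."""
    with urlopen(prometheus_query_url(promql), timeout=5) as resp:
        return extract_values(json.load(resp))


if __name__ == "__main__":
    # Requires the full stack to be running (docker compose --profile otel up).
    print(run_query("sum(rate(otelcol_receiver_accepted_spans[5m]))"))
```

An empty list means the query matched no series, which usually indicates that the collector has not received any data yet or that the metric name differs in your collector version.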

In Grafana

  1. Open http://localhost:3000

  2. Go to the “MADSci Lab Overview” dashboard

  3. View telemetry rates and trends

Viewing Logs

In Grafana

  1. Open http://localhost:3000

  2. Navigate to “Explore”

  3. Select “Loki” datasource

  4. Use LogQL queries like:

    • {job=~".+"} - All logs

    • {service_name="madsci.event"} - Logs from Event Manager
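For scripted access, the same LogQL can be sent to Loki's HTTP API (`/loki/api/v1/query_range`). This is a sketch with hypothetical helper names, assuming Loki is reachable at http://localhost:3100. When no time range is given, Loki is expected to search roughly the last hour.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def loki_query_range_url(logql: str,
                         base_url: str = "http://localhost:3100",
                         limit: int = 100) -> str:
    """Build a range-query URL for the Loki HTTP API."""
    params = urlencode({"query": logql, "limit": limit})
    return f"{base_url}/loki/api/v1/query_range?{params}"


def extract_lines(payload: dict) -> List[Tuple[int, str]]:
    """Flatten all streams into (nanosecond timestamp, log line), oldest first."""
    lines = []
    for stream in payload["data"]["result"]:
        # Each entry in "values" is [timestamp-as-string, log line].
        for timestamp, line in stream["values"]:
            lines.append((int(timestamp), line))
    return sorted(lines)


def fetch_logs(logql: str) -> List[Tuple[int, str]]:
    """Run the query against a running Loki instance."""
    with urlopen(loki_query_range_url(logql), timeout=5) as resp:
        return extract_lines(json.load(resp))


if __name__ == "__main__":
    # Requires the full stack to be running (docker compose --profile otel up).
    for timestamp, line in fetch_logs('{service_name="madsci.event"}'):
        print(timestamp, line)
```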

Log Bridge

When OTEL is enabled, MADSci automatically bridges Python’s standard logging module to the OTEL log pipeline using LoggingInstrumentor from the opentelemetry-instrumentation-logging package.

This approach replaces the older LoggingHandler (deprecated in the OpenTelemetry SDK). The opentelemetry-instrumentation-logging package is included as a dependency of madsci_common.
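To give a feel for what attaching trace identifiers to standard logging records involves, here is a standard-library-only sketch that wraps the log record factory. It is not the code used by MADSci or by LoggingInstrumentor; `install_trace_fields`, the `trace_id`/`span_id` attribute names, and the hard-coded IDs are assumptions for illustration.

```python
import io
import logging
from typing import Callable, Tuple


def install_trace_fields(get_ids: Callable[[], Tuple[str, str]]):
    """Wrap the global LogRecord factory so every record carries trace IDs.

    Returns the previous factory so the caller can restore it.
    """
    previous = logging.getLogRecordFactory()

    def factory(*args, **kwargs):
        record = previous(*args, **kwargs)
        record.trace_id, record.span_id = get_ids()
        return record

    logging.setLogRecordFactory(factory)
    return previous


def demo() -> str:
    """Log one message with injected IDs and return the formatted output."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    logger = logging.getLogger("madsci.example")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Hard-coded IDs stand in for whatever the active span would provide.
    previous = install_trace_fields(
        lambda: ("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
    try:
        logger.info("workflow started")
    finally:
        logging.setLogRecordFactory(previous)
        logger.removeHandler(handler)
    return buffer.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

With the instrumentor in place none of this is needed in application code; the sketch only shows why ordinary `logging` calls can end up carrying trace context.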

Trace Correlation

When OTEL is enabled, MADSci automatically:

  1. Propagates trace context across HTTP requests between services

  2. Includes trace_id/span_id in structured log events (via the LoggingInstrumentor bridge)

  3. Links logs to traces via Grafana’s derived fields
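Trace context usually travels between services in an HTTP header. The sketch below assumes the W3C Trace Context `traceparent` header, which is the default propagation format in OpenTelemetry; whether MADSci overrides that default is not stated here. The helper names are hypothetical, and only version `00` headers are handled.

```python
import re
from typing import NamedTuple, Optional

# version - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def format_traceparent(ctx: TraceContext) -> str:
    """Render a version-00 traceparent header value."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent value; return None if it is malformed or invalid."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(parse_traceparent(header))
```

The `trace_id` carried in this header is the same value that appears in the structured log events, which is what makes the log-to-trace link possible.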

Together, these let you move from a log line in Grafana to the trace that produced it.

Troubleshooting

No traces appearing in Jaeger

  1. Check that OTEL is enabled in the manager:

    docker compose logs event_manager | grep -i otel

  2. Verify the collector is receiving data:

    docker compose logs otel_collector | grep -i span

  3. Ensure the collector is running:

    docker compose ps otel_collector

Collector showing export errors

Check that all backend services are running:

docker compose --profile otel ps

Grafana can’t connect to datasources

The datasources use localhost URLs, which work with network_mode: host. If you’re not using host networking, update the datasource URLs in grafana/provisioning/datasources/datasources.yaml to point at addresses reachable from the Grafana container.

Data Persistence

Telemetry data is stored in subdirectories of .madsci/ (jaeger, prometheus, loki, and grafana).

To reset all observability data:

docker compose --profile otel down
rm -rf .madsci/jaeger .madsci/prometheus .madsci/loki .madsci/grafana

Production Considerations

This observability stack is designed for development and demonstration. For production:

  1. Security: Use proper authentication, change the default Grafana admin/admin login, and don’t expose the UIs publicly without protection

  2. Storage: Configure appropriate retention periods and storage backends

  3. Sampling: Enable trace sampling to reduce data volume

  4. High Availability: Consider distributed deployments for Jaeger, Prometheus, and Loki

  5. Secrets: Use proper secrets management for credentials
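To make the sampling item concrete, the sketch below shows the idea behind trace-ID ratio sampling as used in OpenTelemetry SDKs: the decision is derived from the trace ID itself, so every service that sees the same trace makes the same choice. It is an illustration only, not configuration for this stack, and `should_sample` is a hypothetical name.

```python
TRACE_ID_LIMIT = (1 << 64) - 1  # mask for the low 64 bits of a trace ID


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (int(trace_id_hex, 16) & TRACE_ID_LIMIT) < bound


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    for ratio in (0.0, 0.5, 0.75, 1.0):
        print(ratio, should_sample(trace_id, ratio))
```

Because trace IDs are generated randomly, a ratio of 0.1 keeps roughly one trace in ten while never splitting a single trace across the keep and drop groups.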
