This guide explains how to use the OpenTelemetry (OTEL) observability stack included with the MADSci example lab.
Overview¶
The MADSci example lab includes a complete observability stack that provides:
Traces: Distributed tracing to understand request flow across services
Metrics: Performance and operational metrics from all managers
Logs: Centralized log aggregation with trace correlation
Architecture¶
┌──────────────────────────────────────────────────────────────────────┐
│ MADSci Example Lab │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ EventManager │ │ WorkcellMgr │ │ DataManager │ ... │
│ │ (traces) │ │ (traces) │ │ (traces) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ OTLP (gRPC :4317) │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐│
│ │ OTEL Collector ││
│ │ Receives → Processes → Exports to backends ││
│ └──────────────────────────────────────────────────────────────────┘│
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Jaeger │ │ Prometheus │ │ Loki │ │
│ │ (Traces) │ │ (Metrics) │ │ (Logs) │ │
│ │ :16686 │ │ :9090 │ │ :3100 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐│
│ │ Grafana ││
│ │ Unified Dashboard (Traces + Metrics + Logs) ││
│ │ :3000 ││
│ └──────────────────────────────────────────────────────────────────┘│
│ │
└──────────────────────────────────────────────────────────────────────┘

Quick Start¶
Basic Mode (Collector + Debug Output)¶
The base compose.yaml includes the OTEL collector with debug output. Traces and metrics are received but only logged to the collector’s stdout.
docker compose up

Full Observability Stack¶
To enable the full observability stack with Jaeger, Prometheus, Loki, and Grafana:
docker compose --profile otel up

Accessing the UIs¶
All services use host network mode:
| Service | URL | Description |
|---|---|---|
| Grafana | http://localhost:3000 | Unified dashboards (admin/admin) |
| Jaeger | http://localhost:16686 | Distributed tracing UI |
| Prometheus | http://localhost:9090 | Metrics querying |
| Loki | http://localhost:3100 | Log aggregation API |
Note: Jaeger’s OTLP receiver is configured on ports 14317 (gRPC) and 14318 (HTTP) to avoid conflicts with the OTEL collector’s receiver ports (4317/4318).
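In a compose file, that remapping might look like the fragment below. This is a sketch, not the shipped config: the service name and image tag are illustrative, and the `COLLECTOR_OTLP_*_HOST_PORT` variables assume Jaeger's standard flag-to-environment-variable mapping.

```yaml
# Hypothetical excerpt; see the example lab's compose.yaml for the real settings
jaeger:
  image: jaegertracing/all-in-one
  network_mode: host
  environment:
    # Move Jaeger's OTLP receiver off the default 4317/4318,
    # which the OTEL collector already occupies on the host network
    COLLECTOR_OTLP_GRPC_HOST_PORT: ":14317"
    COLLECTOR_OTLP_HTTP_HOST_PORT: ":14318"
```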
Default Credentials¶
Grafana: admin / admin (you’ll be prompted to change on first login)
To set a different Grafana password, set the environment variable before starting:
export GRAFANA_ADMIN_PASSWORD=your-secure-password
docker compose --profile otel up

What’s Included¶
Pre-configured Grafana¶
Datasources: Jaeger, Prometheus, and Loki are automatically configured
Dashboards: MADSci Lab Overview dashboard is pre-installed
Correlations: Trace IDs in logs link directly to Jaeger traces
OTEL Collector Pipelines¶
Traces Pipeline: OTLP → Batch Processing → Jaeger
Metrics Pipeline: OTLP → Batch Processing → Prometheus Remote Write
Logs Pipeline: OTLP → Batch Processing → Loki
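The three pipelines above correspond to a collector configuration along these lines. This is a simplified sketch: the endpoints and exporter names are assumptions based on the ports listed earlier, and the shipped otel-collector-full.yaml is authoritative.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: localhost:14317   # Jaeger's remapped OTLP gRPC port
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
  loki:
    endpoint: http://localhost:3100/loki/api/v1/push
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```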
Configuration¶
Enabling OTEL in Managers¶
OTEL is enabled per-manager via environment variables. The example lab’s .env file has these enabled by default:
# Event Manager
EVENT_OTEL_ENABLED=true
EVENT_OTEL_SERVICE_NAME="madsci.event"
EVENT_OTEL_EXPORTER="otlp"
EVENT_OTEL_ENDPOINT="http://localhost:4317"
EVENT_OTEL_PROTOCOL="grpc"
# Similar for other managers...

Disabling OTEL¶
To disable OTEL for a specific manager, set:
EVENT_OTEL_ENABLED=false

Custom Configuration¶
Configuration files are located in examples/example_lab/otel/:
otel-collector-full.yaml: Full collector config with all exporters
prometheus.yaml: Prometheus configuration
loki.yaml: Loki configuration
grafana/provisioning/: Grafana auto-provisioning configs
Viewing Traces¶
In Jaeger¶
Select a service from the dropdown (e.g., madsci.event)
Click “Find Traces”
Click on a trace to see the full request flow
In Grafana¶
Navigate to “Explore”
Select “Jaeger” datasource
Search for traces by service or trace ID
Viewing Metrics¶
In Prometheus¶
Use PromQL queries like:
sum(rate(otelcol_receiver_accepted_spans[5m])) - Spans received per second
sum(rate(otelcol_receiver_accepted_metric_points[5m])) - Metric points received per second
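The receiver-side metrics above can be paired with the collector's exporter-side counters to see whether data is making it out the other end. Metric names assume a reasonably recent collector version:

```promql
sum(rate(otelcol_exporter_sent_spans[5m]))        # spans successfully exported
sum(rate(otelcol_exporter_send_failed_spans[5m])) # spans that failed to export
```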
In Grafana¶
Go to the “MADSci Lab Overview” dashboard
View telemetry rates and trends
Viewing Logs¶
In Grafana¶
Navigate to “Explore”
Select “Loki” datasource
Use LogQL queries like:
{job=~".+"} - All logs
{service_name="madsci.event"} - Logs from Event Manager
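Because trace context is injected into log records (see the next section), you can also narrow to correlated entries with a line filter. The exact label and field names depend on how the collector's Loki exporter is configured, so treat this as a starting point:

```logql
{service_name="madsci.event"} |= "trace_id"
```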
Log Bridge¶
When OTEL is enabled, MADSci automatically bridges Python’s standard logging module to the OTEL log pipeline using LoggingInstrumentor from the opentelemetry-instrumentation-logging package. This means:
All stdlib log records are automatically exported to the OTEL collector
Trace context (trace_id, span_id) is injected into every log record
No manual log handler configuration is needed; the bridge is set up during OTEL bootstrap
This replaces the older LoggingHandler approach (deprecated in the OpenTelemetry SDK) and is included as a dependency of madsci_common.
Trace Correlation¶
When OTEL is enabled, MADSci automatically:
Propagates trace context across HTTP requests between services
Includes trace_id/span_id in structured log events (via the LoggingInstrumentor bridge)
Links logs to traces via Grafana’s derived fields
This allows you to:
Click on a trace ID in logs to jump to the full trace
See which logs were generated during a specific request
Understand the full flow of a workflow across all managers
Troubleshooting¶
No traces appearing in Jaeger¶
Check that OTEL is enabled in the manager:
docker compose logs event_manager | grep -i otel
Verify the collector is receiving data:
docker compose logs otel_collector | grep -i span
Ensure the collector is running:
docker compose ps otel_collector
Collector showing export errors¶
Check that all backend services are running:
docker compose --profile otel ps

Grafana can’t connect to datasources¶
The datasources use localhost URLs which work with network_mode: host. If you’re not using host networking, update the datasource URLs in grafana/provisioning/datasources/datasources.yaml.
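The entries you would edit follow Grafana's standard datasource provisioning schema and look roughly like the sketch below. The names and URLs mirror the ports used in this stack; check the shipped datasources.yaml for the real values.

```yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    url: http://localhost:16686   # change to the Jaeger service hostname
  - name: Prometheus
    type: prometheus
    url: http://localhost:9090    # change to the Prometheus service hostname
  - name: Loki
    type: loki
    url: http://localhost:3100    # change to the Loki service hostname
```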
Data Persistence¶
Telemetry data is stored in .madsci/ subdirectories:
.madsci/jaeger/ - Trace data
.madsci/prometheus/ - Metrics data
.madsci/loki/ - Log data
.madsci/grafana/ - Grafana settings and dashboards
To reset all observability data:
docker compose --profile otel down
rm -rf .madsci/jaeger .madsci/prometheus .madsci/loki .madsci/grafana

Production Considerations¶
This observability stack is designed for development and demonstration. For production:
Security: Use proper authentication, don’t expose UIs publicly without protection
Storage: Configure appropriate retention periods and storage backends
Sampling: Enable trace sampling to reduce data volume
High Availability: Consider distributed deployments for Jaeger, Prometheus, and Loki
Secrets: Use proper secrets management for credentials
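As one concrete example, trace sampling can be enabled in the collector with the probabilistic sampler processor. This is a sketch: the exporter name assumes the Jaeger OTLP exporter used elsewhere in this guide, and the percentage should be tuned to your trace volume.

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keep roughly 10% of traces
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlp/jaeger]
```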