Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Monitoring

Audience: Lab Operator Prerequisites: Daily Operations Time: ~25 minutes

Overview

Effective monitoring helps you detect issues before they affect experiments. MADSci provides multiple monitoring layers: the CLI, the TUI, structured event logs, and an optional observability stack (OpenTelemetry + Jaeger + Prometheus + Grafana).

CLI Monitoring

Service Status

# One-shot status check
madsci status

# Continuous monitoring (refreshes every 5 seconds)
madsci status --watch

# Faster refresh
madsci status --watch --interval 2

# JSON for scripting/alerting
madsci status --json | python -c "
import sys, json
data = json.load(sys.stdin)
unhealthy = [s for s in data['services'] if s['status'] != 'healthy']
if unhealthy:
    print(f'ALERT: {len(unhealthy)} unhealthy services')
    for s in unhealthy:
        print(f'  {s[\"name\"]}: {s[\"status\"]}')
    sys.exit(1)
"

Log Monitoring

# Follow all logs in real time
madsci logs --follow

# Only errors and critical
madsci logs --follow --level ERROR

# Filter to specific component
madsci logs --follow --grep "workcell"

# Show timestamps
madsci logs --follow --timestamps

TUI Monitoring

The TUI provides a live dashboard for monitoring:

madsci tui

Dashboard Screen (d)

Shows an overview of all services with color-coded health status:

Status Screen (s)

Detailed table view of each service including:

Logs Screen (l)

Live log viewer with:

Keyboard Shortcuts

KeyAction
dDashboard screen
sStatus screen
lLogs screen
rRefresh data
qQuit

Event System

MADSci’s Event Manager captures structured events from all components. This is the primary audit trail for your lab.

Querying Events

from madsci.client import EventClient

ec = EventClient()

# Recent events
events = ec.get_events(number=50)
for event_id, event in events.items():
    print(f"[{event.log_level}] {event.event_timestamp}: {event.event_data}")

# Query by type
from madsci.common.types.event_types import EventType

workflow_events = ec.query_events({
    "event_type": EventType.WORKFLOW_COMPLETE.value
})

# Query by time range
error_events = ec.query_events({
    "log_level": {"$gte": 40},  # ERROR and above
    "event_timestamp": {"$gte": "2026-02-09T00:00:00Z"}
})

Event Types

Key event types to monitor:

Event TypeSignificance
WORKFLOW_COMPLETEWorkflow finished successfully
WORKFLOW_ABORTWorkflow failed or was cancelled
WORKFLOW_STEP_FAILEDIndividual step failed
NODE_ERRORNode reported an error
NODE_STATUS_UPDATENode status changed
EXPERIMENT_FAILEDExperiment failed
MANAGER_ERRORManager service error
BACKUP_CREATEBackup was created

Utilization Reports

The Event Manager provides utilization analytics:

from madsci.client import EventClient

ec = EventClient()

# Daily utilization summary
daily = ec.get_utilization_periods(
    analysis_type="daily",
    start_time="2026-02-01T00:00:00Z",
    end_time="2026-02-09T00:00:00Z",
)

# Session-based utilization
sessions = ec.get_session_utilization(
    start_time="2026-02-01T00:00:00Z",
    end_time="2026-02-09T00:00:00Z",
)

# Export as CSV
csv_data = ec.get_user_utilization_report(
    csv_export=True,
    start_time="2026-02-01T00:00:00Z",
)

Observability Stack (OpenTelemetry)

For production labs, MADSci integrates with OpenTelemetry for distributed tracing, metrics, and log correlation.

Architecture

MADSci Services
      │
      ▼ (OTLP gRPC/HTTP)
┌─────────────────────┐
│  OTEL Collector      │  Port 4317
│  (receive, process,  │
│   export)            │
└─────────────────────┘
      │
      ├──► Jaeger (traces)      Port 16686
      ├──► Prometheus (metrics)  Port 9090
      ├──► Loki (logs)          Port 3100
      └──► Grafana (dashboards) Port 3000

Enabling OTEL

Add the observability stack to your Docker Compose:

The OTEL stack is included automatically via the root compose.yaml. Enable it with the otel profile:

docker compose --profile otel up

Enable OTEL per manager via environment variables:

# .env
EVENT_OTEL_ENABLED=true
EVENT_OTEL_SERVICE_NAME=madsci.event
EVENT_OTEL_EXPORTER=otlp
EVENT_OTEL_ENDPOINT=http://otel-collector:4317
EVENT_OTEL_PROTOCOL=grpc

WORKCELL_OTEL_ENABLED=true
WORKCELL_OTEL_SERVICE_NAME=madsci.workcell
WORKCELL_OTEL_EXPORTER=otlp
WORKCELL_OTEL_ENDPOINT=http://otel-collector:4317
WORKCELL_OTEL_PROTOCOL=grpc

Jaeger (Traces)

Access at http://localhost:16686

Use Jaeger to:

Prometheus (Metrics)

Access at http://localhost:9090

Key metrics to monitor:

Grafana (Dashboards)

Access at http://localhost:3000 (default: admin/admin)

Pre-configured dashboards include:

Loki (Log Aggregation)

Access through Grafana’s Explore view.

Query logs across all services:

{service_name="madsci.workcell"} |= "error"

Automated Alerting

Simple Script-Based Alerting

#!/usr/bin/env python3
"""Simple health check alerter. Run via cron."""

import httpx
import sys

SERVICES = {
    "Lab Manager": "http://localhost:8000/health",
    "Event Manager": "http://localhost:8001/health",
    "Workcell Manager": "http://localhost:8005/health",
}

failures = []
for name, url in SERVICES.items():
    try:
        r = httpx.get(url, timeout=10)
        if r.status_code != 200:
            failures.append(f"{name}: HTTP {r.status_code}")
    except Exception as e:
        failures.append(f"{name}: {e}")

if failures:
    print("ALERT: Service health check failures:")
    for f in failures:
        print(f"  - {f}")
    sys.exit(1)
else:
    print("All services healthy")

Cron-Based Monitoring

# Check every 5 minutes
*/5 * * * * /path/to/health_check.py >> /var/log/madsci_health.log 2>&1

Prometheus Alerting

For production, configure Prometheus alerting rules:

# prometheus_alerts.yml
groups:
  - name: madsci
    rules:
      - alert: ServiceDown
        expr: up{job="madsci"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "MADSci service {{ $labels.instance }} is down"

      - alert: HighErrorRate
        expr: rate(madsci_events_total{level="ERROR"}[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

What’s Next?