Audience: Lab Operator Prerequisites: Daily Operations Time: ~25 minutes
Overview¶
Effective monitoring helps you detect issues before they affect experiments. MADSci provides multiple monitoring layers: the CLI, the TUI, structured event logs, and an optional observability stack (OpenTelemetry + Jaeger + Prometheus + Grafana).
CLI Monitoring¶
Service Status¶
# One-shot status check
madsci status
# Continuous monitoring (refreshes every 5 seconds)
madsci status --watch
# Faster refresh
madsci status --watch --interval 2
# JSON for scripting/alerting
madsci status --json | python -c "
import sys, json
data = json.load(sys.stdin)
unhealthy = [s for s in data['services'] if s['status'] != 'healthy']
if unhealthy:
print(f'ALERT: {len(unhealthy)} unhealthy services')
for s in unhealthy:
print(f' {s[\"name\"]}: {s[\"status\"]}')
sys.exit(1)
"Log Monitoring¶
# Follow all logs in real time
madsci logs --follow
# Only errors and critical
madsci logs --follow --level ERROR
# Filter to specific component
madsci logs --follow --grep "workcell"
# Show timestamps
madsci logs --follow --timestampsTUI Monitoring¶
The TUI provides a live dashboard for monitoring:
madsci tuiDashboard Screen (d)¶
Shows an overview of all services with color-coded health status:
Green: Service is healthy
Yellow: Service is responding but degraded
Red: Service is offline or unhealthy
Status Screen (s)¶
Detailed table view of each service including:
Service name and URL
Health status
Response time
Version information
Logs Screen (l)¶
Live log viewer with:
Level filtering (DEBUG through CRITICAL)
Text search
Auto-scroll with follow mode
Keyboard Shortcuts¶
| Key | Action |
|---|---|
d | Dashboard screen |
s | Status screen |
l | Logs screen |
r | Refresh data |
q | Quit |
Event System¶
MADSci’s Event Manager captures structured events from all components. This is the primary audit trail for your lab.
Querying Events¶
from madsci.client import EventClient
ec = EventClient()
# Recent events
events = ec.get_events(number=50)
for event_id, event in events.items():
print(f"[{event.log_level}] {event.event_timestamp}: {event.event_data}")
# Query by type
from madsci.common.types.event_types import EventType
workflow_events = ec.query_events({
"event_type": EventType.WORKFLOW_COMPLETE.value
})
# Query by time range
error_events = ec.query_events({
"log_level": {"$gte": 40}, # ERROR and above
"event_timestamp": {"$gte": "2026-02-09T00:00:00Z"}
})Event Types¶
Key event types to monitor:
| Event Type | Significance |
|---|---|
WORKFLOW_COMPLETE | Workflow finished successfully |
WORKFLOW_ABORT | Workflow failed or was cancelled |
WORKFLOW_STEP_FAILED | Individual step failed |
NODE_ERROR | Node reported an error |
NODE_STATUS_UPDATE | Node status changed |
EXPERIMENT_FAILED | Experiment failed |
MANAGER_ERROR | Manager service error |
BACKUP_CREATE | Backup was created |
Utilization Reports¶
The Event Manager provides utilization analytics:
from madsci.client import EventClient
ec = EventClient()
# Daily utilization summary
daily = ec.get_utilization_periods(
analysis_type="daily",
start_time="2026-02-01T00:00:00Z",
end_time="2026-02-09T00:00:00Z",
)
# Session-based utilization
sessions = ec.get_session_utilization(
start_time="2026-02-01T00:00:00Z",
end_time="2026-02-09T00:00:00Z",
)
# Export as CSV
csv_data = ec.get_user_utilization_report(
csv_export=True,
start_time="2026-02-01T00:00:00Z",
)Observability Stack (OpenTelemetry)¶
For production labs, MADSci integrates with OpenTelemetry for distributed tracing, metrics, and log correlation.
Architecture¶
MADSci Services
│
▼ (OTLP gRPC/HTTP)
┌─────────────────────┐
│ OTEL Collector │ Port 4317
│ (receive, process, │
│ export) │
└─────────────────────┘
│
├──► Jaeger (traces) Port 16686
├──► Prometheus (metrics) Port 9090
├──► Loki (logs) Port 3100
└──► Grafana (dashboards) Port 3000Enabling OTEL¶
Add the observability stack to your Docker Compose:
The OTEL stack is included automatically via the root compose.yaml. Enable it with the otel profile:
docker compose --profile otel upEnable OTEL per manager via environment variables:
# .env
EVENT_OTEL_ENABLED=true
EVENT_OTEL_SERVICE_NAME=madsci.event
EVENT_OTEL_EXPORTER=otlp
EVENT_OTEL_ENDPOINT=http://otel-collector:4317
EVENT_OTEL_PROTOCOL=grpc
WORKCELL_OTEL_ENABLED=true
WORKCELL_OTEL_SERVICE_NAME=madsci.workcell
WORKCELL_OTEL_EXPORTER=otlp
WORKCELL_OTEL_ENDPOINT=http://otel-collector:4317
WORKCELL_OTEL_PROTOCOL=grpcJaeger (Traces)¶
Access at http://localhost:16686
Use Jaeger to:
Trace workflow execution: See the full timeline of a workflow from submission through each step
Identify bottlenecks: Find slow steps or services
Debug failures: See where in the chain a failure occurred
Correlate events: Link traces to logs and metrics
Prometheus (Metrics)¶
Access at http://localhost:9090
Key metrics to monitor:
madsci_events_total: Total events by type and levelmadsci_event_send_latency_seconds: Event send latencymadsci_event_buffer_size: Event buffer size (indicates backpressure)madsci_event_retries_total: Event send retries (indicates connectivity issues)
Grafana (Dashboards)¶
Access at http://localhost:3000 (default: admin/admin)
Pre-configured dashboards include:
MADSci Overview: Service health, event rates, error rates
Workflow Performance: Step durations, success rates
Resource Utilization: Node busy/idle time
Loki (Log Aggregation)¶
Access through Grafana’s Explore view.
Query logs across all services:
{service_name="madsci.workcell"} |= "error"Automated Alerting¶
Simple Script-Based Alerting¶
#!/usr/bin/env python3
"""Simple health check alerter. Run via cron."""
import httpx
import sys
SERVICES = {
"Lab Manager": "http://localhost:8000/health",
"Event Manager": "http://localhost:8001/health",
"Workcell Manager": "http://localhost:8005/health",
}
failures = []
for name, url in SERVICES.items():
try:
r = httpx.get(url, timeout=10)
if r.status_code != 200:
failures.append(f"{name}: HTTP {r.status_code}")
except Exception as e:
failures.append(f"{name}: {e}")
if failures:
print("ALERT: Service health check failures:")
for f in failures:
print(f" - {f}")
sys.exit(1)
else:
print("All services healthy")Cron-Based Monitoring¶
# Check every 5 minutes
*/5 * * * * /path/to/health_check.py >> /var/log/madsci_health.log 2>&1Prometheus Alerting¶
For production, configure Prometheus alerting rules:
# prometheus_alerts.yml
groups:
- name: madsci
rules:
- alert: ServiceDown
expr: up{job="madsci"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "MADSci service {{ $labels.instance }} is down"
- alert: HighErrorRate
expr: rate(madsci_events_total{level="ERROR"}[5m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"What’s Next?¶
Backup & Recovery - Database backup strategies
Troubleshooting - Common issues and solutions