Troubleshooting - MADSci Documentation

Audience: Lab Operator
Prerequisites: Daily Operations
Time: ~20 minutes (reference guide)

Overview

This guide covers common issues you may encounter when operating a MADSci lab, organized by symptom. Use the diagnostic tools first, then consult the specific issue sections.

First Steps: Diagnostics

Before diving into specific issues, gather information:

# 1. Check all service health
madsci status

# 2. Run system diagnostics
madsci doctor

# 3. Check recent errors
madsci logs --level ERROR --tail 20

# 4. Check Docker container status
docker compose ps
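
The four checks above can also be run in one pass, for example to paste the combined output into a ticket. The following is a minimal sketch, assuming the `madsci` and `docker` executables are on `PATH`; the `run_check` helper and the report layout are illustrative and not part of MADSci.

```python
import shutil
import subprocess

# Commands taken from the four diagnostic steps above.
DIAGNOSTIC_COMMANDS = [
    ["madsci", "status"],
    ["madsci", "doctor"],
    ["madsci", "logs", "--level", "ERROR", "--tail", "20"],
    ["docker", "compose", "ps"],
]


def run_check(command, timeout=60):
    """Run one command and return (exit_code, combined_output).

    Returns (None, reason) when the executable is missing or the command
    times out, so one failing check does not abort the others.
    """
    if shutil.which(command[0]) is None:
        return None, f"{command[0]} not found on PATH"
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return None, f"timed out after {timeout}s"
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    for command in DIAGNOSTIC_COMMANDS:
        code, output = run_check(command)
        print(f"$ {' '.join(command)}  (exit code: {code})")
        print(output.rstrip())
        print()
```

Run it from the directory that contains your Compose file so that `docker compose ps` sees the lab's services.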

Service Issues

Service Won’t Start

Symptom: `docker compose up` fails, or the service exits immediately after starting.

Diagnosis:

# Check container logs
docker compose logs <service_name>

# Check exit code
docker compose ps <service_name>

Common Causes:

Cause	Log Message	Fix
Port in use	`Address already in use`	Find the conflicting process with `lsof -i :<port>`, then stop it
Database not ready	`Connection refused` to FerretDB/PostgreSQL	Wait for DB startup or check DB health
Missing env vars	`ValidationError` or `Field required`	Check `.env` file has all required variables
Image not built	`No such image`	Run `docker compose build <service>`
Volume permissions	`Permission denied`	Check ownership of the mounted host directory with `ls -la`
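
When a container log is long, it can help to scan it for the messages listed in the table above. The sketch below only restates that table as data; `KNOWN_CAUSES` and `diagnose_log` are hypothetical names, not a MADSci API, and the substring matching is deliberately simple.

```python
# Log fragments, causes, and fixes restated from the table above.
_ENV_FIX = "Check that .env has all required variables"
KNOWN_CAUSES = [
    ("Address already in use", "Port in use",
     "Find and stop the conflicting process (lsof -i :<port>)"),
    # The table expects this message for FerretDB/PostgreSQL connections.
    ("Connection refused", "Database not ready",
     "Wait for DB startup or check DB health"),
    ("ValidationError", "Missing env vars", _ENV_FIX),
    ("Field required", "Missing env vars", _ENV_FIX),
    ("No such image", "Image not built",
     "Run docker compose build <service>"),
    ("Permission denied", "Volume permissions",
     "Check volume ownership (ls -la)"),
]


def diagnose_log(log_text):
    """Return (cause, fix) pairs whose log fragment occurs in log_text.

    Matching is case-insensitive; results keep table order and each
    cause is reported once.
    """
    lowered = log_text.lower()
    found = []
    for fragment, cause, fix in KNOWN_CAUSES:
        if fragment.lower() in lowered and (cause, fix) not in found:
            found.append((cause, fix))
    return found


# Example: feed it text captured from `docker compose logs <service_name>`.
sample = "OSError: [Errno 98] Address already in use"
for cause, fix in diagnose_log(sample):
    print(f"{cause}: {fix}")
```

A match is only a hint: the same message can come from an unrelated dependency, so confirm against the surrounding log lines.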

Service Unhealthy

Symptom: `madsci status` shows UNHEALTHY or OFFLINE.

Diagnosis:

# Check the specific health endpoint
curl -v http://localhost:<port>/health

# Check container resource usage
docker stats <container_name>

Common Causes:

Cause	Fix
Database connection lost	Restart the database: `docker compose restart madsci_ferretdb`
Out of memory	Increase Docker memory limit or reduce workload
Deadlocked process	Restart the service: `docker compose restart <service>`
Network issue	Check Docker network: `docker network inspect madsci_default`
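
The health probe shown under Diagnosis can be scripted when several services need checking. This is a minimal sketch using only the Python standard library; `check_health` is an illustrative helper rather than a MADSci API, and the response body is treated as opaque text because its schema is not described in this guide. The two URLs in the usage section reuse ports that appear elsewhere in this guide and may differ in your lab.

```python
import urllib.error
import urllib.request


def check_health(url, timeout=5.0):
    """Return (healthy, detail) for a health endpoint URL.

    healthy is True only when the server answers with a successful
    (2xx) status. The body is returned as text, truncated to 200 chars.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return True, f"HTTP {response.status}: {body[:200]}"
    except urllib.error.HTTPError as exc:
        # The service answered, but with an error status (for example 503).
        return False, f"HTTP {exc.code}: {exc.reason}"
    except OSError as exc:
        # Nothing is listening, name resolution failed, or the request timed out.
        return False, f"unreachable: {exc}"


if __name__ == "__main__":
    # Assumed example endpoints; replace with your services' ports.
    for url in ("http://localhost:8005/health", "http://localhost:2000/health"):
        healthy, detail = check_health(url)
        print(f"{'OK  ' if healthy else 'FAIL'} {url}  {detail}")
```

An "unreachable" result points to the container or network rows of the table, while an HTTP error status means the process is running but reports a problem itself.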

Service Crash Loop

Symptom: The service keeps restarting (visible in `docker compose ps`).

# Check restart count
docker inspect <container> --format '{{.RestartCount}}'

# View logs from the crash
docker compose logs --tail 50 <service_name>

# Stop the restarting container, then run the service once in the foreground to debug
docker compose stop <service_name>
docker compose run --rm <service_name>  # One-off container; output goes to your terminal

Database Issues

FerretDB Connection Failures

Symptom: Managers report document database connection errors.

# Check FerretDB is running
docker compose ps madsci_ferretdb

# Test connection (FerretDB uses the MongoDB wire protocol)
docker compose exec madsci_ferretdb mongosh --eval "db.runCommand({ping: 1})"

# Check FerretDB logs
docker compose logs madsci_ferretdb

Fixes:

- Restart FerretDB: `docker compose restart madsci_ferretdb`
- Check disk space: `df -h` (FerretDB needs free space for its backing store)
- Check the connection string in the environment variables

PostgreSQL Connection Failures

Symptom: Resource Manager reports database errors.

# Check PostgreSQL is running
docker compose ps postgres

# Test connection
docker compose exec postgres pg_isready

# Check PostgreSQL logs
docker compose logs postgres

Valkey Connection Failures

Symptom: Workcell Manager can’t queue workflows.

# Check Valkey
docker compose ps madsci_valkey
docker compose exec madsci_valkey valkey-cli ping
# Should return: PONG
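
The commands in the three database sections run inside the containers. To check from the host whether each store accepts connections at all, a plain TCP probe is often enough. The sketch below assumes the stores are published on their conventional default ports (27017, 5432, 6379) on `localhost`; those ports are assumptions, so adjust them to the mappings in your Compose file.

```python
import socket

# Assumed default ports; adjust to match your Compose port mappings.
BACKING_SERVICES = {
    "FerretDB": ("localhost", 27017),
    "PostgreSQL": ("localhost", 5432),
    "Valkey": ("localhost", 6379),
}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, (host, port) in BACKING_SERVICES.items():
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{name:<12} {host}:{port}  {state}")
```

An open port shows only that something is listening. The `ping`, `pg_isready`, and `valkey-cli ping` commands above remain the real test that the database is answering queries.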

Workflow Issues

Workflow Stuck

Symptom: The workflow shows as active but makes no progress.

from madsci.client import WorkcellClient

wc = WorkcellClient(workcell_server_url="http://localhost:8005/")

# Check workflow status
wf = wc.query_workflow("workflow_id_here")
print(f"Status: {wf.status}")
print(f"Current step: {wf.status.current_step_index}")

# Check the current step
step = wf.steps[wf.status.current_step_index]
print(f"Step: {step.name}")
print(f"Step status: {step.status}")
print(f"Node: {step.node}")
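
To tell a slow step from a stuck one, the query above can be repeated and the step index compared over time. The following is a sketch under the assumption that `current_step_index` advances whenever a step completes; `watch_for_progress` is a hypothetical helper that takes a callable, so it does not depend on any client details beyond those shown above.

```python
import time


def watch_for_progress(get_step_index, checks=6, interval=10.0, sleep=time.sleep):
    """Poll get_step_index() and report whether the value ever changes.

    Returns (progressed, last_index). progressed is False when the index
    was identical on all `checks` polls, which suggests a stuck step.
    """
    first = get_step_index()
    last = first
    for _ in range(checks - 1):
        sleep(interval)
        last = get_step_index()
        if last != first:
            return True, last
    return False, last


# Usage with the client from the example above (not executed here):
# progressed, index = watch_for_progress(
#     lambda: wc.query_workflow("workflow_id_here").status.current_step_index
# )
# if not progressed:
#     print(f"No progress; still on step {index}")
```

A long-running action legitimately stays on one step, so choose `checks` and `interval` to exceed the step's expected duration before treating the workflow as stuck.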

Common Causes:

Cause	Fix
Node is locked	Unlock the node: `curl -X POST http://<node>:2000/admin/unlock`
Node is offline	Restart the node: `docker compose restart <node>`
Action timed out	Check node logs, restart node if needed
Resource condition not met	Check resource state in Resource Manager

Workflow Fails on Specific Step

# Check the workcell logs around the failure time
madsci logs --grep "workflow_id" --tail 50

# Check the specific node's logs
docker compose logs <node_name> --tail 50

All Workflows Queued But Not Running

Symptom: Workflows are submitted but never start.

Check:

- Is the workcell in a paused state?
- Are all required nodes registered and healthy?
- Is Valkey accessible?

# Check workcell state
curl http://localhost:8005/state | python -m json.tool

Node Issues

Node Not Responding

# Check if the node process is running
docker compose ps <node_name>

# Check node health directly
curl http://localhost:2000/health

# Restart the node
docker compose restart <node_name>

Action Fails

# Check the node's action history
curl http://localhost:2000/actions/history | python -m json.tool

# Check node logs
docker compose logs <node_name> --tail 50

# Try the action directly (bypassing workcell)
curl -X POST http://localhost:2000/actions/<action_name> \
  -H "Content-Type: application/json" \
  -d '{"param": "value"}'
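
The direct call above can also be made from Python, which makes it easier to repeat with different parameters and to read the node's error body. This is a minimal standard-library sketch that mirrors the curl command; `build_action_request` and `send_action` are illustrative helpers, and the URL layout is taken from the command above rather than from a node API reference.

```python
import json
import urllib.error
import urllib.request


def build_action_request(node_url, action_name, params):
    """Build a POST request equivalent to the curl command above."""
    url = f"{node_url.rstrip('/')}/actions/{action_name}"
    return urllib.request.Request(
        url,
        data=json.dumps(params).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_action(node_url, action_name, params, timeout=30.0):
    """Send the request and return (status_code, response_text).

    Error responses (4xx/5xx) are returned rather than raised, because
    the body usually explains why the action failed.
    """
    request = build_action_request(node_url, action_name, params)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        return exc.code, exc.read().decode("utf-8", errors="replace")


# Usage (not executed here; replace <action_name> and the parameters):
# status, body = send_action("http://localhost:2000", "<action_name>", {"param": "value"})
# print(status, body)
```

Calling a node directly bypasses the workcell's scheduling and resource checks, so make sure no workflow is using the node at the same time.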

Hardware Communication Error

See the Debugging Guide for detailed hardware troubleshooting.

Quick checks:

# Check if device is visible
ls -la /dev/ttyUSB*  # Serial devices
lsusb               # USB devices

# Check device permissions
groups $USER  # Should include 'dialout' for serial access

# Check if Docker has device access
docker compose exec <node_name> ls -la /dev/ttyUSB0
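
The visibility and permission checks above can be combined into one report. The sketch below assumes serial adapters appear as `/dev/ttyUSB*`, as in the commands above; `list_serial_devices` is a hypothetical helper. Run it on the host, or inside the node container if Python is available there, to see what that environment can actually access.

```python
import glob
import os


def list_serial_devices(pattern="/dev/ttyUSB*"):
    """Return (path, accessible) for each device matching pattern.

    accessible is True when the current user may both read and write
    the device node.
    """
    devices = []
    for path in sorted(glob.glob(pattern)):
        accessible = os.access(path, os.R_OK | os.W_OK)
        devices.append((path, accessible))
    return devices


if __name__ == "__main__":
    found = list_serial_devices()
    if not found:
        print("No devices match /dev/ttyUSB*; check cabling and drivers.")
    for path, accessible in found:
        note = "ok" if accessible else "no read/write permission for the current user"
        print(f"{path}: {note}")
```

A device that is listed but not accessible usually points to the group membership check above; a device that is missing inside the container points to the Compose device mapping.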

Network Issues

Services Can’t Find Each Other

Symptom: Services report connection errors to other services.

# Check Docker network
docker network ls
docker network inspect madsci_default

# Test connectivity between containers
docker compose exec workcell_manager curl http://event_manager:8001/health

# Check DNS resolution
docker compose exec workcell_manager nslookup event_manager
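
Some container images do not include `nslookup`. If the container has a Python interpreter (an assumption, though plausible for Python-based services), name resolution can be checked with the standard library instead. `resolve` below is an illustrative helper; it must run inside the container whose view of the network you want to test.

```python
import socket


def resolve(hostname):
    """Return the sorted IP addresses hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Usage inside a container (service name taken from the commands above):
# print(resolve("event_manager") or "event_manager does not resolve")
```

An empty result means Docker's embedded DNS does not know the name, which typically happens when the two services are not attached to the same network.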

Port Conflicts

# Find what's using a port
lsof -i :8000  # macOS/Linux
netstat -tlnp | grep 8000  # Linux

# Kill the conflicting process
kill <PID>

# Or change the MADSci port in .env
EVENT_SERVER_PORT=8011  # Use a different port
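
Before changing a port in `.env`, it is worth confirming that the replacement is actually free. The following sketch tries to bind each candidate port the way a published container port would; `port_is_free` is a hypothetical helper, and the port list simply reuses numbers that appear in this guide.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now.

    Ports below 1024 also report False for unprivileged users, because
    binding them is not permitted.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Example ports taken from this guide; adjust to your configuration.
    for port in (2000, 8000, 8001, 8005, 8011):
        print(f"{port}: {'free' if port_is_free(port) else 'in use'}")
```

The result is a snapshot: another process can still claim the port between this check and `docker compose up`.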

Performance Issues

Slow Workflow Execution

Check node response times:
```
time curl http://localhost:2000/health
```
Check database performance:
```
docker stats  # Watch CPU/memory usage
```
Check disk I/O:
```
iostat -x 1  # Linux
```
If OpenTelemetry (OTEL) tracing is enabled, check the Jaeger traces at http://localhost:16686 for bottlenecks.

High Memory Usage

# Check container memory usage
docker stats --no-stream

# Check FerretDB memory
docker compose exec madsci_ferretdb mongosh --eval "db.serverStatus().mem"

# Restart memory-heavy service
docker compose restart <service_name>

Disk Space Issues

# Check disk usage
df -h

# Check Docker disk usage
docker system df

# Clean up unused Docker resources
docker system prune  # Remove stopped containers, unused networks, dangling images, and build cache

# Check log file sizes (the Docker data directory is normally readable only by root)
sudo sh -c 'du -sh /var/lib/docker/containers/*/'
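
Free space can also be checked from a script, for example before starting a long workflow. This is a minimal sketch using `shutil.disk_usage`; the 10% threshold is an arbitrary illustration and not a MADSci recommendation, and `disk_report` is a hypothetical helper.

```python
import shutil


def disk_report(path="/", warn_below_percent=10.0):
    """Return (free_percent, low) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    free_percent = usage.free / usage.total * 100
    return free_percent, free_percent < warn_below_percent


if __name__ == "__main__":
    free_percent, low = disk_report("/")
    print(f"/ has {free_percent:.1f}% free")
    if low:
        print("Low disk space: clean up Docker resources or logs before continuing.")
```

Point `path` at the filesystem that holds Docker's data directory if it is mounted separately from `/`.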

Recovery Procedures

Full Lab Reset (Preserving Data)

# Stop everything
docker compose down

# Restart everything
docker compose up -d

# Verify
madsci status

Full Lab Reset (Clean Slate)

# WARNING: -v also removes the volumes declared in the Compose file, so all data stored in them is deleted!
docker compose down -v
docker compose up -d

Restore from Backup

See Backup & Recovery for detailed restore procedures.

Getting Help

If you can’t resolve an issue:

1. Collect diagnostic information:

madsci doctor --json > diagnostics.json
madsci status --json > status.json
madsci logs --tail 200 --json > recent_logs.json

2. Check the MADSci documentation.
3. Open an issue on GitHub with:
- MADSci version (madsci version)
- Operating system and Docker version
- Steps to reproduce
- Diagnostic output from step 1
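
The files from step 1 can be gathered and packed into a single archive for attaching to an issue. The sketch below reuses the three commands from step 1; the `collect` and `bundle` helpers, the output directory, and the archive name are illustrative choices, not MADSci conventions.

```python
import subprocess
import tarfile
from pathlib import Path

# Commands taken from step 1 above; each writes JSON to stdout.
COLLECT = {
    "diagnostics.json": ["madsci", "doctor", "--json"],
    "status.json": ["madsci", "status", "--json"],
    "recent_logs.json": ["madsci", "logs", "--tail", "200", "--json"],
}


def collect(output_dir, commands=COLLECT):
    """Run each command, save its stdout to a file, and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, command in commands.items():
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=120
            )
            content = result.stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of stopping, so the bundle stays useful.
            content = f"failed to run {' '.join(command)}: {exc}\n"
        path = out / filename
        path.write_text(content)
        written.append(path)
    return written


def bundle(paths, archive="madsci_diagnostics.tar.gz"):
    """Pack the collected files into one gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=Path(path).name)
    return archive


# Usage (not executed here):
# print(bundle(collect("madsci_diagnostics")))
```

Review the collected files for credentials or other sensitive values before attaching the archive to a public issue.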

What’s Next?

Updates & Maintenance - Upgrading MADSci
Monitoring - Set up proactive monitoring