Operational Runbook¶
Day-to-day tasks for maintaining the Zygy platform.
Checking System Health¶
All services status¶
All services should show Up or Up (healthy). Any service in Exit or Restarting state needs investigation.
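The status listing described here comes from docker-compose ps. As a minimal sketch, the helper below filters that output down to the rows that need investigation. The name list_not_up is hypothetical (not part of the platform), and it assumes a healthy row contains the word Up, as described above.

```shell
# Hypothetical helper: reads `docker-compose ps` output on stdin and prints
# only the service rows whose state does not contain "Up".
# Header rows, separator rows (-----), and blank lines are skipped.
list_not_up() {
  awk 'NF && !/^(NAME|Name)/ && !/^-+$/ && !/Up/'
}

# Usage on the VPS, from the project directory:
#   docker-compose ps | list_not_up
```

Empty output means every listed service reported Up.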
View logs for a service¶
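A minimal sketch for following one service's logs, assuming it is run from the project directory. The function name service_logs and the default of 100 lines are illustrative choices, not project settings.

```shell
# Sketch: follow the most recent log lines for one service.
# COMPOSE can be overridden (e.g. COMPOSE=echo) to preview the command.
service_logs() {
  service=${1:?usage: service_logs SERVICE [TAIL_LINES]}
  tail_lines=${2:-100}   # arbitrary default, not a project setting
  ${COMPOSE:-docker-compose} logs -f --tail="$tail_lines" "$service"
}

# Usage:
#   service_logs backend-streamsearch
#   service_logs backend-streamsearch 500
```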
View all logs together¶
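Omitting the service name makes docker-compose logs interleave output from every service. The sketch below adds timestamps so interleaved lines can be ordered. The function name all_logs and the 50-line default are assumptions for illustration.

```shell
# Sketch: follow every service's logs in one stream, with timestamps (-t).
# COMPOSE can be overridden (e.g. COMPOSE=echo) to preview the command.
all_logs() {
  ${COMPOSE:-docker-compose} logs -f -t --tail="${1:-50}"
}

# Usage:
#   all_logs        # last 50 lines per service, then follow
#   all_logs 200
```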
Check monitoring dashboards¶
Visit grafana.zygy.com (credentials in .env as GF_SECURITY_ADMIN_PASSWORD).
- Container metrics — cAdvisor panels (CPU, memory per container)
- System metrics — Node Exporter panels (disk, host memory)
- Logs — Loki panels (search logs from all services)
Deploying a Code Change¶
Normal path (automatic): push to main, master, or newbranch — GitHub Actions handles it.
Manual deploy via SSH (use when you need to force a rebuild without a code change, or Actions is unavailable):
ssh zygy@172.237.81.37
cd $PROJECT_PATH
git pull
# Rebuild one service (always use explicit docker build, not restart)
docker-compose stop backend-streamsearch
docker build -t backend-streamsearch -f ./backend-streamsearch/Dockerfile .
docker-compose up -d backend-streamsearch
# Check it started correctly
docker-compose logs -f backend-streamsearch
Never use docker-compose restart to deploy a code change
restart reuses the existing image. You must rebuild the image for your code change to take effect.
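The stop, build, and up sequence above can be sketched as a reusable function. This is an illustration under stated assumptions, not the project's deploy script: redeploy_service is a hypothetical name, it must run from the project directory after git pull, and it assumes each service's Dockerfile lives at ./SERVICE/Dockerfile, as it does for backend-streamsearch.

```shell
# Sketch of the manual deploy steps for one service.
# DOCKER and COMPOSE can be set to "echo" to print the commands instead of
# running them. Each step runs only if the previous one succeeded.
redeploy_service() {
  service=${1:?usage: redeploy_service SERVICE}
  ${COMPOSE:-docker-compose} stop "$service" &&
    ${DOCKER:-docker} build -t "$service" -f "./$service/Dockerfile" . &&
    ${COMPOSE:-docker-compose} up -d "$service"
}

# Usage:
#   redeploy_service backend-streamsearch
#   docker-compose logs -f backend-streamsearch   # confirm it started
```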
Changing the LLM Model¶
See MODEL_CHANGER_GUIDE.md in the project root for full steps. Summary:
- Update MODEL_NAME (and/or CHART_MODEL_NAME, LLM_MODEL_NAME) in .env
- Stop, rebuild, and restart the affected service; always pass --build
docker-compose stop backend-streamsearch
docker build -t backend-streamsearch -f ./backend-streamsearch/Dockerfile .
docker-compose up -d backend-streamsearch
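The .env edit in the first step can also be scripted. The helper below is a minimal sketch, assuming .env holds plain KEY=VALUE lines. set_env_var is a hypothetical name and example-model is a placeholder value, not a real model identifier.

```shell
# Hypothetical helper: set KEY=VALUE in an env file, replacing an existing
# KEY= line or appending one if the key is absent. Commented-out lines are
# left alone. Note: awk -v interprets backslash escapes in VALUE.
set_env_var() {
  key=${1:?usage: set_env_var KEY VALUE [FILE]}
  value=$2
  file=${3:-.env}
  tmp=$(mktemp) || return 1
  awk -v k="$key" -v v="$value" '
    BEGIN { FS = "=" }
    $1 == k { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's permissions
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Usage (placeholder value), followed by the rebuild steps above:
#   set_env_var MODEL_NAME example-model
```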
Adding a New LLM Provider¶
- Update LLM_PROVIDER in .env to the new provider value (openai, zai, openrouter, etc.)
- Add the corresponding API key variable (OPENAI_API_KEY, etc.)
- Rebuild and restart backend-streamsearch, backend-generatereport, and backend-vectorindexing
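Before rebuilding, it can help to confirm that both settings actually landed in .env. The check below is a hedged sketch: env_has is a hypothetical helper, and the key variable name for providers other than OpenAI is not given here, so it is passed in as an argument rather than guessed.

```shell
# Hypothetical pre-flight check: succeeds only if KEY is present in the env
# file with a non-empty value. Commented-out lines are ignored.
# Caveat: a quoted empty value such as KEY="" still counts as set.
env_has() {
  key=${1:?usage: env_has KEY [FILE]}
  file=${2:-.env}
  grep -Eq "^${key}=.+" "$file"
}

# Usage before rebuilding the three services:
#   env_has LLM_PROVIDER   || echo "LLM_PROVIDER is not set"
#   env_has OPENAI_API_KEY || echo "OPENAI_API_KEY is not set"
```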
Restarting the Entire Stack¶
Warning
docker-compose down stops all containers but does not delete volumes. Data in /mnt/blockstorage/zygy-data/ and Docker named volumes (Prometheus, Grafana, etc.) is preserved.
To also remove containers and networks (but still keep volumes):
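A minimal sketch of that sequence, assuming it runs from the project directory. docker-compose down deletes named volumes only when -v (--volumes) is passed, so that flag is deliberately left out. The function name restart_stack is hypothetical.

```shell
# Sketch: full stack restart that removes containers and networks but keeps
# volumes. Do NOT add -v/--volumes to `down`; that would delete named volumes.
# COMPOSE can be overridden (e.g. COMPOSE=echo) to preview the commands.
restart_stack() {
  ${COMPOSE:-docker-compose} down &&
    ${COMPOSE:-docker-compose} up -d
}

# Usage:
#   restart_stack
#   docker-compose ps   # confirm every service is Up afterwards
```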
Disk Space Cleanup¶
Docker accumulates unused images over time. Clean up with:
docker system prune -f # removes stopped containers, unused networks, dangling images
docker image prune -a -f # removes ALL unused images (more aggressive)
Check disk usage:
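The usual tools here are df for filesystems and docker system df for Docker's own usage. As a sketch, the helper below flags mounts above a usage threshold. mounts_over is a hypothetical name, and the default of 80 percent is an arbitrary example, not a project policy.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints the mount
# point and usage of every filesystem at or above the given percentage.
# Mount points containing spaces are not handled.
mounts_over() {
  awk -v limit="${1:-80}" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)
      if (use + 0 >= limit + 0) print $6, $5
    }
  '
}

# Usage:
#   df -P / /mnt/blockstorage/zygy-data | mounts_over 85
#   docker system df    # space used by images, containers, and volumes
```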
Restoring a Database Backup¶
See Scheduled Tasks — Restoring from Backup.
Updating Caddy IP Whitelist¶
The services streamsearch, vectorindexing, generatereport, and pageindex only accept requests from two IP addresses. To update this list:
- Edit caddy/Caddyfile on the VPS
- Find the @blocked not remote_ip ... lines for the relevant services
- Update the IP addresses
- Reload Caddy:
docker-compose exec caddy caddy reload --config /etc/caddy/Caddyfile
Or, if you changed the file on your local machine, commit and push; GitHub Actions will redeploy Caddy automatically.
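When reloading by hand, a typo in the IP list is easier to catch if the file is validated first. The sketch below uses Caddy's validate subcommand before the reload. reload_caddy is a hypothetical name, and the container path is the one used in the reload command above.

```shell
# Sketch: validate the Caddyfile inside the container, then reload only if
# validation passes. COMPOSE can be overridden (e.g. COMPOSE=echo) to
# preview the commands.
reload_caddy() {
  config=/etc/caddy/Caddyfile
  ${COMPOSE:-docker-compose} exec caddy caddy validate --config "$config" &&
    ${COMPOSE:-docker-compose} exec caddy caddy reload --config "$config"
}

# Usage on the VPS, from the project directory:
#   reload_caddy
```

If validation fails, the running configuration is left untouched.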
Elasticsearch Operations¶
Take a snapshot (backup)¶
Full guide in ELASTICSEARCH_BACKUP_GUIDE.md. Quick reference:
# Using the kubeconfig in the project root:
export KUBECONFIG=$PROJECT_PATH/elastic-kubeconfig.yaml
# Or via the Kibana UI at 172.236.132.82:5601
# Stack Management → Snapshot and Restore → Snapshots → Take snapshot
Check cluster health¶
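Elasticsearch reports cluster health through its _cluster/health API, whose response carries a status of green, yellow, or red. The helper below is a minimal sketch for pulling out that field. cluster_status is a hypothetical name, and ES_URL is a placeholder because the cluster's HTTP endpoint is not stated in this runbook.

```shell
# Hypothetical helper: extracts the "status" field (green, yellow, or red)
# from an Elasticsearch _cluster/health JSON response read on stdin.
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Usage (ES_URL is a placeholder; add credentials if the cluster needs them):
#   curl -s "$ES_URL/_cluster/health" | cluster_status
```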
Mongo Tunnel Recovery¶
If mongo-tunnel is down and workflow features are broken:
ssh zygy@172.237.81.37
cd $PROJECT_PATH
docker-compose logs mongo-tunnel # check why it's failing
docker-compose restart mongo-tunnel
docker-compose logs -f mongo-tunnel # watch for "Connection established"
If the tunnel cannot connect after restart, see Mongo Tunnel — Troubleshooting.
Checking Backup Health¶
Expected output pattern:
[2025-12-01 02:00:05] Backup completed successfully
[2025-12-01 02:00:05] Backup uploaded to Object Storage
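A log in this format can be checked mechanically. The sketch below succeeds only when both expected lines appear for a given date. backup_ok_on is a hypothetical helper, and BACKUP_LOG is a placeholder because the log location is not stated here.

```shell
# Hypothetical helper: reads a backup log on stdin and succeeds only if it
# contains both the "completed" and "uploaded" lines for the given date.
backup_ok_on() {
  day=${1:?usage: backup_ok_on YYYY-MM-DD}
  awk -v day="$day" '
    index($0, "[" day " ") == 1 && /Backup completed successfully/     { completed = 1 }
    index($0, "[" day " ") == 1 && /Backup uploaded to Object Storage/ { uploaded = 1 }
    END { exit !(completed && uploaded) }
  '
}

# Usage (BACKUP_LOG is a placeholder path):
#   backup_ok_on "$(date +%F)" < "$BACKUP_LOG" || echo "no complete backup logged today"
```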
Environment Variable Changes¶
After editing .env:
- Services do not pick up .env changes automatically
- You must stop and start (not just restart) affected services for the new values to load:
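A minimal sketch of that stop-and-start cycle, assuming it runs from the project directory. Bringing services back with up -d makes Compose re-read .env and recreate containers whose configuration changed, which a plain restart does not do. The function name reload_env is hypothetical.

```shell
# Sketch: stop the given services, then bring them back with `up -d` so the
# new .env values are applied. COMPOSE can be overridden (e.g. COMPOSE=echo)
# to preview the commands.
reload_env() {
  [ "$#" -gt 0 ] || { echo "usage: reload_env SERVICE..." >&2; return 1; }
  ${COMPOSE:-docker-compose} stop "$@" &&
    ${COMPOSE:-docker-compose} up -d "$@"
}

# Usage:
#   reload_env backend-streamsearch
#   reload_env backend-workflow backend-accounts backend-agent
```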
If API_KEYS or REDIS_PASSWORD changed, restart all services.
Useful One-Liners¶
# Follow logs for multiple services
docker-compose logs -f backend-workflow backend-accounts backend-agent
# See which services are unhealthy
docker-compose ps | grep -v "Up"
# Exec into a running container
docker exec -it backend-streamsearch bash
# See real-time resource usage
docker stats
# Inspect Caddy: list installed modules, then dump the running config
# (including TLS settings) from the admin API
docker exec caddy caddy list-modules
docker exec caddy curl -s http://localhost:2019/config/ | python3 -m json.tool