Operational Runbook

Day-to-day tasks for maintaining the Zygy platform.


Checking System Health

All services status

ssh zygy@172.237.81.37
cd $PROJECT_PATH
docker-compose ps

All services should show Up or Up (healthy). Any service in Exit or Restarting state needs investigation.
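
To surface only the rows that need attention, the ps output can be filtered. This is a minimal sketch, assuming the state column uses the words named above (Exit, Restarting) or reports (unhealthy). flag_problem_services is a hypothetical helper name, not something that ships with the platform.

```shell
# Hypothetical helper: reads `docker-compose ps` output on stdin and prints
# only the rows whose state suggests a problem.
# Assumption: problem states appear as "Exit", "Restarting", or "(unhealthy)".
flag_problem_services() {
  grep -E 'Exit|Restarting|\(unhealthy\)' || true   # no match is not an error
}

# Usage on the VPS:
#   docker-compose ps | flag_problem_services
```

An empty result means no row matched those states. It does not prove that every service is present, so the full docker-compose ps output is still worth a glance.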

View logs for a service

docker-compose logs -f backend-streamsearch
docker-compose logs --tail=100 backend-workflow

View all logs together

docker-compose logs -f --tail=50

Check monitoring dashboards

Visit grafana.zygy.com. The admin password is stored in .env as GF_SECURITY_ADMIN_PASSWORD.

  • Container metrics — cAdvisor panels (CPU, memory per container)
  • System metrics — Node Exporter panels (disk, host memory)
  • Logs — Loki panels (search logs from all services)

Deploying a Code Change

Normal path (automatic): push to main, master, or newbranch — GitHub Actions handles it.

Manual deploy via SSH (use when you need to force a rebuild without a code change, or Actions is unavailable):

ssh zygy@172.237.81.37
cd $PROJECT_PATH
git pull

# Rebuild one service (always use explicit docker build, not restart)
docker-compose stop backend-streamsearch
docker build -t backend-streamsearch -f ./backend-streamsearch/Dockerfile .
docker-compose up -d backend-streamsearch

# Check it started correctly
docker-compose logs -f backend-streamsearch

Never use docker-compose restart to deploy a code change

restart reuses the existing image. You must rebuild the image for your code change to take effect.
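
The stop, build, and up sequence can be wrapped so that no step gets skipped. This is a sketch only: redeploy_service and run are hypothetical helpers, and the function assumes every service follows the same layout as backend-streamsearch (image tag equal to the service name, Dockerfile at ./<service>/Dockerfile). Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: stop, rebuild, and start one service.
# Assumption: image tag == service name, Dockerfile at ./<service>/Dockerfile.
redeploy_service() {
  svc="$1"
  run docker-compose stop "$svc" &&
  run docker build -t "$svc" -f "./$svc/Dockerfile" . &&
  run docker-compose up -d "$svc"
}

# Usage (from $PROJECT_PATH):
#   DRY_RUN=1 redeploy_service backend-streamsearch   # preview only
#   redeploy_service backend-streamsearch             # actually redeploy
```

Chaining with && stops the sequence at the first failure, so a failed build never leads to starting the old image unnoticed.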


Changing the LLM Model

See MODEL_CHANGER_GUIDE.md in the project root for full steps. Summary:

  1. Update MODEL_NAME (and/or CHART_MODEL_NAME, LLM_MODEL_NAME) in .env
  2. Stop, rebuild, and start the affected service. Always rebuild the image explicitly with docker build:
docker-compose stop backend-streamsearch
docker build -t backend-streamsearch -f ./backend-streamsearch/Dockerfile .
docker-compose up -d backend-streamsearch

Adding a New LLM Provider

  1. Update LLM_PROVIDER in .env to the new provider value (openai, zai, openrouter, etc.)
  2. Add the corresponding API key variable (OPENAI_API_KEY, etc.)
  3. Rebuild and restart backend-streamsearch, backend-generatereport, and backend-vectorindexing
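
Before rebuilding, it can help to confirm that .env actually contains a key for the selected provider. The sketch below assumes the key variable is named after the provider in upper case with an _API_KEY suffix, which matches the openai / OPENAI_API_KEY pair above but is only a guess for other providers. check_provider_key is a hypothetical helper.

```shell
# Hypothetical helper: verify that the provider named by LLM_PROVIDER has a
# non-empty API key in the given env file.
# Assumption: key variable is <PROVIDER in upper case>_API_KEY.
check_provider_key() {
  env_file="${1:-.env}"
  # Take the last LLM_PROVIDER assignment and strip quotes / carriage returns.
  provider=$(grep -E '^LLM_PROVIDER=' "$env_file" | tail -n 1 | cut -d= -f2- | tr -d '"\r')
  if [ -z "$provider" ]; then
    echo "LLM_PROVIDER is not set in $env_file"
    return 1
  fi
  key_var="$(printf '%s' "$provider" | tr '[:lower:]' '[:upper:]')_API_KEY"
  if grep -Eq "^${key_var}=.+" "$env_file"; then
    echo "ok: $provider uses $key_var"
  else
    echo "missing: $key_var"
    return 1
  fi
}

# Usage (from $PROJECT_PATH):
#   check_provider_key .env
```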

Restarting the Entire Stack

ssh zygy@172.237.81.37
cd $PROJECT_PATH
docker-compose down
docker-compose up -d

Warning

docker-compose down stops and removes all containers and the project network, but does not delete volumes. Data in /mnt/blockstorage/zygy-data/ and Docker named volumes (Prometheus, Grafana, etc.) is preserved.

To also remove leftover containers for services that are no longer defined in the compose file (volumes are still kept):

docker-compose down --remove-orphans
docker-compose up -d

Disk Space Cleanup

Docker accumulates unused images over time. Clean up with:

docker system prune -f          # removes stopped containers, unused networks, dangling images, unused build cache
docker image prune -a -f        # removes ALL images not used by any container (more aggressive)

Check disk usage:

df -h /
docker system df
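
To decide whether a cleanup is due, the df output can be reduced to a single number and compared against a threshold. This is a sketch: the 80% default is an assumed threshold, not a platform policy, and both function names are hypothetical.

```shell
# Hypothetical helper: reads `df -P <path>` output on stdin and prints the
# used percentage of the first filesystem as a bare integer.
disk_used_percent() {
  awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Hypothetical helper: succeeds when usage is at or above the threshold.
# Assumption: 80 is an arbitrary default, tune it to your disk size.
needs_cleanup() {
  used="$1"
  threshold="${2:-80}"
  [ "$used" -ge "$threshold" ]
}

# Usage:
#   used=$(df -P / | disk_used_percent)
#   needs_cleanup "$used" && docker system prune -f
```

The -P flag asks df for the POSIX output format, which keeps each filesystem on one line so the column position stays predictable.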

Restoring a Database Backup

See Scheduled Tasks — Restoring from Backup.


Updating Caddy IP Whitelist

The services streamsearch, vectorindexing, generatereport, and pageindex only accept requests from two IP addresses. To update this list:

  1. Edit caddy/Caddyfile on the VPS
  2. Find the @blocked not remote_ip ... lines for the relevant services
  3. Update the IP addresses
  4. Reload Caddy: docker-compose exec caddy caddy reload --config /etc/caddy/Caddyfile

Or if you changed the file on the local machine, commit and push — GitHub Actions will redeploy Caddy automatically.
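
When editing directly on the VPS, validating the file before the reload helps avoid pushing a broken whitelist. The sketch below uses caddy validate, a standard Caddy subcommand, followed by the reload command from step 4. reload_caddy and run are hypothetical helpers, and DRY_RUN=1 only prints the commands.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: validate the Caddyfile, and reload only if it is valid.
reload_caddy() {
  cfg="/etc/caddy/Caddyfile"
  run docker-compose exec caddy caddy validate --config "$cfg" &&
  run docker-compose exec caddy caddy reload --config "$cfg"
}

# Usage (from $PROJECT_PATH):
#   DRY_RUN=1 reload_caddy   # preview only
#   reload_caddy             # validate, then reload
```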


Elasticsearch Operations

Take a snapshot (backup)

Full guide in ELASTICSEARCH_BACKUP_GUIDE.md. Quick reference:

# Using the kubeconfig in the project root:
export KUBECONFIG=$PROJECT_PATH/elastic-kubeconfig.yaml

# Or via the Kibana UI at 172.236.132.82:5601
# Stack Management → Snapshot and Restore → Snapshots → Take snapshot

Check cluster health

curl -u "$ES_USER:$ES_PASSWORD" https://172.236.132.81:9200/_cluster/health
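
The health endpoint returns JSON with a status field that is green, yellow, or red. For scripted checks, the status can be pulled out without extra tooling. This is a minimal sketch, and cluster_status is a hypothetical helper name.

```shell
# Hypothetical helper: reads the _cluster/health JSON on stdin and prints the
# value of its "status" field (green, yellow, or red).
cluster_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Usage:
#   curl -s -u "$ES_USER:$ES_PASSWORD" https://172.236.132.81:9200/_cluster/health | cluster_status
```

If jq happens to be installed on the host, jq -r .status does the same job more robustly. The sed version is shown only because it has no dependencies.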

Mongo Tunnel Recovery

If mongo-tunnel is down and workflow features are broken:

ssh zygy@172.237.81.37
cd $PROJECT_PATH
docker-compose logs mongo-tunnel    # check why it's failing
docker-compose restart mongo-tunnel
docker-compose logs -f mongo-tunnel  # watch for "Connection established"

If the tunnel cannot connect after restart, see Mongo Tunnel — Troubleshooting.


Checking Backup Health

ssh zygy@172.237.81.37
cd $PROJECT_PATH
docker-compose logs sqlite-backup | grep -E "(completed|ERROR|uploaded)"

Expected output pattern:

[2025-12-01 02:00:05] Backup completed successfully
[2025-12-01 02:00:05] Backup uploaded to Object Storage
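
The same check can be turned into a pass/fail result for use in a script. The sketch below assumes the log wording matches the expected output pattern above exactly. backup_log_ok is a hypothetical helper.

```shell
# Hypothetical helper: reads sqlite-backup log lines on stdin and reports
# whether they show a completed, uploaded backup with no errors.
# Assumption: messages are worded exactly as in the expected output pattern.
backup_log_ok() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'ERROR'; then
    echo "backup log contains ERROR"
    return 1
  fi
  if ! printf '%s\n' "$log" | grep -q 'Backup completed successfully'; then
    echo "no completed backup found"
    return 1
  fi
  if ! printf '%s\n' "$log" | grep -q 'Backup uploaded to Object Storage'; then
    echo "backup was not uploaded"
    return 1
  fi
  echo "backup ok"
}

# Usage (from $PROJECT_PATH):
#   docker-compose logs --tail=200 sqlite-backup | backup_log_ok
```

This confirms only that matching lines exist in the inspected log window. Check the timestamps by eye to confirm the most recent run is the one that succeeded.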


Environment Variable Changes

After editing .env:

  1. Services do not pick up .env changes automatically
  2. You must stop and start (not just restart) affected services for the new values to load:
docker-compose stop backend-streamsearch
docker-compose up -d backend-streamsearch

If API_KEYS or REDIS_PASSWORD changed, stop and start all services (a plain restart does not load the new values).
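
To confirm that a container actually picked up a new value, the entry in .env can be compared with what the running container reports. This is a sketch: env_value is a hypothetical helper, and the comparison in the comments assumes the service image ships printenv and that the variable is passed through to the container unchanged.

```shell
# Hypothetical helper: print the value of VAR from an env file.
# If VAR is assigned more than once, the last assignment wins.
env_value() {
  var="$1"
  env_file="${2:-.env}"
  grep -E "^${var}=" "$env_file" | tail -n 1 | cut -d= -f2-
}

# Usage (from $PROJECT_PATH), assuming printenv exists in the image:
#   expected=$(env_value MODEL_NAME .env)
#   actual=$(docker-compose exec -T backend-streamsearch printenv MODEL_NAME | tr -d '\r')
#   [ "$expected" = "$actual" ] && echo "in sync" || echo "container has a different value"
```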


Useful One-Liners

# Follow logs for multiple services
docker-compose logs -f backend-workflow backend-accounts backend-agent

# See which services are not Up, then any that are Up but unhealthy
docker-compose ps | grep -v "Up"
docker-compose ps | grep "unhealthy"

# Exec into a running container
docker exec -it backend-streamsearch bash

# See real-time resource usage
docker stats

# Inspect Caddy: list installed modules, then dump the running config from the admin API
docker exec caddy caddy list-modules
docker exec caddy curl -s http://localhost:2019/config/ | python3 -m json.tool