6.1 Cluster Health Monitoring

Node Status Verification

Check Cluster Health:

# Overall cluster status
docker node ls

# Expected output:
# ID            HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
# [SECRET]      p0         Ready    Active         Leader           28.4.0
# [SECRET]      p1         Ready    Active                          28.4.0
# [SECRET]      p2         Ready    Active                          28.4.0
# [SECRET]      p3         Ready    Active                          28.4.0

# Detailed node information
docker node inspect p0 --pretty

# Disk usage (images, containers, volumes) and engine details for the local node
docker system df
docker system info

Node Health Indicators:

Status: Should be “Ready”
Availability: Should be “Active”
Manager Status: Leader on p0, blank on workers
Engine Version: Consistent across all nodes (28.4.0)
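
The indicators above can be checked mechanically. The following is a minimal sketch, not part of the original procedure: check_nodes is a hypothetical helper that reads one "HOSTNAME STATUS AVAILABILITY VERSION" line per node on standard input and reports every deviation. It assumes the expected engine version is passed as the first argument (defaulting to 28.4.0).

```shell
#!/bin/sh
# check_nodes (hypothetical helper): read "HOSTNAME STATUS AVAILABILITY VERSION"
# lines on stdin, print one line per deviation, return non-zero if any is found.
check_nodes() {
  awk -v want="${1:-28.4.0}" '
    $2 != "Ready"  { printf "%s: status is %s\n", $1, $2; bad = 1 }
    $3 != "Active" { printf "%s: availability is %s\n", $1, $3; bad = 1 }
    # Appending "" forces a string comparison of the version numbers.
    ($4 "") != (want "") {
      printf "%s: engine version is %s, expected %s\n", $1, $4, want; bad = 1
    }
    END { exit bad + 0 }
  '
}

# Usage on a manager node:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}} {{.EngineVersion}}' | check_nodes 28.4.0
```

The helper only parses text, so it can be tried against saved output before it is wired into a scheduled check.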

Troubleshooting Node Issues:

# Check node connectivity
docker node inspect node-name | grep -i state

# View node events
docker system events --filter type=node

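# Rejoin a failed node
docker swarm leave --force  # On the failed node
docker swarm join-token worker  # On a manager: prints the worker token and join command
docker swarm join --token [worker-token] manager-ip:2377  # On the failed node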
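
After a node leaves and rejoins, it registers under a new ID and the old entry typically remains in the node list with status Down. The sketch below is a hypothetical dry-run helper (print_stale_node_cleanup is not an existing command): it reads "ID HOSTNAME STATUS" lines and prints a removal command for each Down entry without running anything. It works on IDs because the hostname can appear twice after a rejoin. A Down node may also just be temporarily offline, so review the output before acting on it.

```shell
#!/bin/sh
# print_stale_node_cleanup (hypothetical helper): read "ID HOSTNAME STATUS"
# lines on stdin and print, without running it, a removal command per Down node.
print_stale_node_cleanup() {
  awk '$3 == "Down" { printf "docker node rm %s  # stale entry for %s\n", $1, $2 }'
}

# Usage on a manager node (review the output before running any of it):
#   docker node ls --format '{{.ID}} {{.Hostname}} {{.Status}}' | print_stale_node_cleanup
```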

Service Health Checks

Service Status Overview:

# All services status
docker service ls

# Expected output shows all services with matching REPLICAS:
# NAME                            REPLICAS   IMAGE
# adminer_adminer                 1/1        adminer:latest
# auth_authentik_redis            1/1        redis:alpine
# auth_authentik_server           1/1        ghcr.io/goauthentik/server:latest
# [... continue for all 18 services]

# Detailed service information
docker service ps service-name

# Service configuration
docker service inspect service-name --pretty
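
With 18 services, comparing the REPLICAS column by eye is error-prone. As a minimal sketch, the hypothetical helper find_degraded_services below reads "NAME REPLICAS" lines (for example "auth_authentik_server 1/1") and prints only the services whose running count differs from the desired count.

```shell
#!/bin/sh
# find_degraded_services (hypothetical helper): read "NAME RUNNING/DESIRED"
# lines on stdin, print those where the two counts differ, return non-zero
# if at least one such service is found.
find_degraded_services() {
  awk '
    {
      split($2, r, "/")
      if (r[1] + 0 != r[2] + 0) { print $1, $2; bad = 1 }
    }
    END { exit bad + 0 }
  '
}

# Usage on a manager node:
#   docker service ls --format '{{.Name}} {{.Replicas}}' | find_degraded_services
```

An empty result with exit status 0 means every service reports matching replica counts.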

Health Check Patterns:

# Services with health checks
docker service ps uptime_uptime-kuma
docker service ps paperless_paperless_webserver
docker service ps auth_authentik_server

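# "docker service ps" reports task state only (e.g. "Running 2 hours ago").
# The "(healthy)" marker appears in the STATUS column of "docker ps" on the node running the task.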
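
Container health is reported by docker ps on the node that hosts the task, in a STATUS value such as "Up 5 minutes (healthy)". The sketch below is a hypothetical helper, report_unhealthy, that reads "NAME STATUS..." lines and flags containers marked unhealthy or still starting. Containers without a health check carry no marker and are ignored.

```shell
#!/bin/sh
# report_unhealthy (hypothetical helper): read "NAME STATUS..." lines on stdin.
# Prints unhealthy containers (and returns non-zero), and notes containers
# whose health check has not completed yet.
report_unhealthy() {
  awk '
    /\(unhealthy\)/        { print $1 ": unhealthy"; bad = 1; next }
    /\(health: starting\)/ { print $1 ": health check still starting" }
    END { exit bad + 0 }
  '
}

# Usage on each node:
#   docker ps --format '{{.Names}} {{.Status}}' | report_unhealthy
```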

Service Recovery Procedures:

# Restart failed service
docker service update --force service-name

# Scale down and up
docker service scale service-name=0
docker service scale service-name=1

# Check service constraints
docker service inspect service-name | grep -A 5 Constraints
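
A forced update or a scale operation returns before the service has necessarily settled, so it helps to wait for the replica counts to match before moving on. The following is a sketch under assumptions: wait_for_service is a hypothetical helper, the 2-second poll interval is arbitrary, and the timeout (second argument, in seconds) defaults to 60.

```shell
#!/bin/sh
# wait_for_service (hypothetical helper): poll until running replicas equal
# desired replicas for one service, or give up after a timeout in seconds.
wait_for_service() {
  service=$1
  timeout=${2:-60}
  waited=0
  while :; do
    # --filter name= matches by prefix, so awk keeps only the exact name.
    replicas=$(docker service ls --filter "name=$service" \
      --format '{{.Name}} {{.Replicas}}' | awk -v s="$service" '$1 == s { print $2 }')
    running=${replicas%%/*}
    desired=${replicas##*/}
    if [ -n "$replicas" ] && [ "$running" = "$desired" ]; then
      echo "$service converged at $replicas"
      return 0
    fi
    [ "$waited" -ge "$timeout" ] && break
    sleep 2
    waited=$((waited + 2))
  done
  echo "$service did not converge within ${timeout}s (last seen: ${replicas:-none})" >&2
  return 1
}

# Usage on a manager node:
#   docker service update --force service-name && wait_for_service service-name 120
```

Note that a service scaled to zero reports 0/0 and therefore counts as converged.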

Log Analysis and Troubleshooting

Service Log Analysis:

# Real-time service logs
docker service logs -f service-name

# Historical logs with timestamps
docker service logs --since 24h --timestamps service-name

# Filter logs by keyword
docker service logs service-name 2>&1 | grep ERROR

# Container-specific logs
docker logs container-id
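
When a service runs several tasks, it is useful to see which task produces the errors. The sketch below assumes the default docker service logs line layout, where each line starts with a "task@node" prefix followed by "|" and the message, and that --timestamps is not used (the timestamp would otherwise become part of the prefix). count_errors_by_task is a hypothetical helper name.

```shell
#!/bin/sh
# count_errors_by_task (hypothetical helper): read service log lines on stdin
# and print "COUNT TASK@NODE" for lines containing ERROR, highest count first.
count_errors_by_task() {
  awk -F'|' '
    /ERROR/ {
      task = $1
      sub(/[ \t]+$/, "", task)   # drop the padding before the "|" separator
      count[task]++
    }
    END { for (t in count) print count[t], t }
  ' | sort -rn
}

# Usage on a manager node:
#   docker service logs --since 24h service-name 2>&1 | count_errors_by_task
```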

Common Log Locations:

# Docker daemon logs (these include Swarm orchestration messages)
journalctl -u docker.service -f

# Application logs within containers
docker exec -it container-name tail -f /var/log/app.log

# Errors reported by the Traefik service
docker service logs traefik_traefik 2>&1 | grep -i error

Log Rotation and Management:

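# Check overall Docker disk usage (images, containers, volumes, build cache)
docker system df

# Check json-file log sizes on a node (assumes the default data root /var/lib/docker)
sudo sh -c 'du -sh /var/lib/docker/containers/*/*-json.log'

# Per-service log rotation
# Add to the service definition in the stack file: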
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
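
The same limits can also be applied to running services from the command line with the --log-driver and --log-opt flags of docker service update. This is a sketch rather than the documented procedure for this cluster: apply_log_rotation is a hypothetical helper, and a later redeploy from a stack file that lacks the logging block is likely to revert the change.

```shell
#!/bin/sh
# apply_log_rotation (hypothetical helper): set json-file rotation limits
# (10 MB per file, 3 files) on every service named on the command line.
# Stops at the first service that fails to update.
apply_log_rotation() {
  for service in "$@"; do
    docker service update \
      --log-driver json-file \
      --log-opt max-size=10m \
      --log-opt max-file=3 \
      "$service" || return 1
  done
}

# Usage on a manager node:
#   apply_log_rotation adminer_adminer auth_authentik_redis
```

Changing the log options updates the service spec, so expect the affected tasks to be restarted.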

Troubleshooting Checklist:

Network Connectivity: Verify overlay networks
Resource Availability: Check CPU/memory usage
Storage Access: Verify bind mount accessibility
Service Dependencies: Ensure dependent services are running
Configuration: Validate environment variables and secrets
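
The first checklist item can be partly automated. The sketch below covers only overlay networks: missing_overlay_networks is a hypothetical helper that reads existing network names on standard input and reports which of the expected names (given as arguments) are absent. In the usage line, ingress is Swarm's default routing-mesh network, while traefik_public is a placeholder for whatever networks the stacks actually define.

```shell
#!/bin/sh
# missing_overlay_networks (hypothetical helper): read network names on stdin,
# one per line, and report each expected name (given as arguments) that is
# absent. Returns non-zero if anything is missing.
missing_overlay_networks() {
  actual=$(cat)
  rc=0
  for net in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    if ! printf '%s\n' "$actual" | grep -qxF -- "$net"; then
      echo "missing overlay network: $net"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on a manager node (traefik_public is a placeholder name):
#   docker network ls --filter driver=overlay --format '{{.Name}}' | missing_overlay_networks ingress traefik_public
```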