Node Status Verification
Check Cluster Health:
# Overall cluster status
docker node ls
# Expected output:
# ID         HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
# [SECRET]   p0         Ready    Active         Leader           28.4.0
# [SECRET]   p1         Ready    Active                          28.4.0
# [SECRET]   p2         Ready    Active                          28.4.0
# [SECRET]   p3         Ready    Active                          28.4.0
# Detailed node information
docker node inspect p0 --pretty
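# Node resource usage
# Disk usage by images, containers, local volumes, and build cache
docker system df
# Engine and host details: CPU count, total memory, Swarm state
docker system info
# Live CPU and memory usage of containers on this node
docker stats --no-stream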
Node Health Indicators:
- Status: Should be “Ready”
- Availability: Should be “Active”
- Manager Status: Leader on p0, blank on workers
- Engine Version: Consistent across all nodes (28.4.0)
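The indicators above can be checked in one pass on a manager. A minimal sketch, assuming the node list is piped in from docker node ls with a custom --format template; check_nodes is a hypothetical helper name, and it treats the first node's engine version as the reference rather than any fixed version.

```shell
# check_nodes (hypothetical helper): reads "HOSTNAME STATUS AVAILABILITY ENGINE_VERSION"
# lines on stdin, reports any node that is not Ready/Active or whose engine
# version differs from the first node's, and returns 1 if anything was reported.
check_nodes() {
  awk '
    NR == 1 { ref = $4 }
    $2 != "Ready"  { print $1 ": status " $2; bad = 1 }
    $3 != "Active" { print $1 ": availability " $3; bad = 1 }
    $4 != ref      { print $1 ": engine " $4 " (expected " ref ")"; bad = 1 }
    END { exit bad }
  '
}
```

Usage on a manager: docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}} {{.EngineVersion}}' | check_nodes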
Troubleshooting Node Issues:
# Check node state and the address the manager last saw
docker node inspect --format '{{.Status.State}} {{.Status.Addr}}' node-name
# View node events from the last hour, then stream new ones (run on a manager)
docker system events --since 1h --filter type=node
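# Rejoin a failed worker
docker swarm leave --force                               # On the failed node
docker node rm node-name                                 # On a manager: remove the stale entry
docker swarm join-token worker                           # On a manager: print the current join command
docker swarm join --token [worker-token] manager-ip:2377 # On the failed node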
Service Health Checks
Service Status Overview:
# All services status
docker service ls
# Expected output (abridged: ID, MODE, and PORTS columns omitted); in REPLICAS the running count should equal the desired count:
# NAME                    REPLICAS   IMAGE
# adminer_adminer         1/1        adminer:latest
# auth_authentik_redis    1/1        redis:alpine
# auth_authentik_server   1/1        ghcr.io/goauthentik/server:latest
# [... continue for all 18 services]
# Task list: node placement, current state, and full error text for each replica
docker service ps --no-trunc service-name
# Service configuration
docker service inspect service-name --pretty
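To spot services whose running count lags the desired count without reading the whole table, a short sketch, assuming input from docker service ls with a custom --format template; find_unconverged is a hypothetical name. Replicated jobs, which legitimately report 0 running after they complete, are not special-cased.

```shell
# find_unconverged (hypothetical helper): reads "NAME RUNNING/DESIRED" lines on
# stdin and prints every service whose running count differs from the desired
# count. Returns 1 if any service was printed.
find_unconverged() {
  awk '{
    split($2, r, "/")
    if (r[1] != r[2]) { print $1 " " $2; bad = 1 }
  }
  END { exit bad }'
}
```

Usage: docker service ls --format '{{.Name}} {{.Replicas}}' | find_unconverged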
Health Check Patterns:
# Services with health checks
docker service ps uptime_uptime-kuma
docker service ps paperless_paperless_webserver
docker service ps auth_authentik_server
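# docker service ps shows task state (Running, Failed, Shutdown) but not health.
# On the node running a task, health appears in the STATUS column of docker ps,
# e.g. "Up 2 hours (healthy)"; list unhealthy containers with:
docker ps --filter health=unhealthy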
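When a health check keeps failing, Swarm replaces the task, so the service's task history fills with Failed or Shutdown entries. A rough sketch that tallies task states by their first word, assuming tab-separated input from docker service ps with a custom --format template; task_state_summary is a hypothetical helper name.

```shell
# task_state_summary (hypothetical helper): reads "NAME<TAB>CURRENT STATE" lines
# on stdin and prints one "STATE COUNT" line per state word (Running, Failed, ...),
# sorted by state name.
task_state_summary() {
  awk 'BEGIN { FS = "\t" }
    { split($2, w, " "); count[w[1]]++ }
    END { for (s in count) print s, count[s] }' | sort
}
```

Usage: docker service ps --format '{{.Name}}\t{{.CurrentState}}' uptime_uptime-kuma | task_state_summary. A growing Failed count next to a single Running task suggests the container is being replaced repeatedly.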
Service Recovery Procedures:
# Restart failed service
docker service update --force service-name
# Scale down and back up (replicated services only; restore the original replica count, 1 here)
docker service scale service-name=0
docker service scale service-name=1
# Check placement constraints and preferences
docker service inspect --format '{{json .Spec.TaskTemplate.Placement}}' service-name
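Scaling back to a fixed 1 shrinks a service that normally runs more replicas. A sketch that records the current replica count first and restores it, assuming a replicated (not global) service; bounce_service is a hypothetical name. docker service scale waits for the service to converge unless --detach is given, so the second scale starts only after the first completes.

```shell
# bounce_service (hypothetical helper): scales a replicated service to 0 and
# back to the replica count it had before. Global services have no replica
# count, so the inspect step fails and the function returns 1 without scaling.
bounce_service() {
  local svc="$1" replicas
  replicas=$(docker service inspect --format '{{.Spec.Mode.Replicated.Replicas}}' "$svc") || return 1
  case "$replicas" in
    ''|*[!0-9]*) echo "cannot read replica count for $svc" >&2; return 1 ;;
  esac
  docker service scale "$svc=0" && docker service scale "$svc=$replicas"
}
```

Usage on a manager: bounce_service paperless_paperless_webserver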
Log Analysis and Troubleshooting
Service Log Analysis:
# Real-time service logs
docker service logs -f service-name
# Historical logs with timestamps
docker service logs --since 24h --timestamps service-name
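# Filter logs by keyword (case-insensitive; many applications log "error" in lowercase)
docker service logs service-name 2>&1 | grep -i error
# Container-specific logs (run on the node hosting the container)
docker logs container-id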
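A quick severity tally helps decide which service's logs to read first. A rough sketch using keyword matching, so a line such as "no errors found" is also counted; level_summary is a hypothetical name, and the keywords are an assumption about how the applications label severity.

```shell
# level_summary (hypothetical helper): counts log lines on stdin by severity
# keyword, case-insensitively. Each line is counted once, under the first match
# in the order fatal/panic, error, warn.
level_summary() {
  awk '{
    line = tolower($0)
    if (line ~ /fatal|panic/) n["fatal"]++
    else if (line ~ /error/) n["error"]++
    else if (line ~ /warn/) n["warn"]++
  }
  END { printf "fatal=%d error=%d warn=%d\n", n["fatal"], n["error"], n["warn"] }'
}
```

Usage: docker service logs --since 24h service-name 2>&1 | level_summary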
Common Log Locations:
# System logs
journalctl -u docker.service -f
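# Application logs within containers (the log path depends on the application)
docker exec -it container-name tail -f /var/log/app.log
# Traefik (reverse proxy) errors; Swarm's own orchestration messages are in the Docker daemon journal
docker service logs traefik_traefik 2>&1 | grep -i error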
Log Rotation and Management:
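# Check log sizes (docker system df does not count container log files; run on each node)
sudo sh -c 'du -h /var/lib/docker/containers/*/*-json.log' | sort -h | tail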
# Container log configuration
# Add under each service in the stack (compose) file:
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
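With max-size "10m" and max-file "3", json-file keeps at most three files of about 10 MB each, so one container's logs stay under roughly 30 MB. A small worked helper for a node's worst case, assuming you supply the number of containers on that node; log_budget_mb is a hypothetical name, and sizes are whole megabytes.

```shell
# log_budget_mb (hypothetical helper): upper bound on json-file log disk usage.
# $1 = max-size in MB, $2 = max-file, $3 = number of containers on the node.
log_budget_mb() {
  echo $(( $1 * $2 * $3 ))
}
```

For example, log_budget_mb 10 3 18 prints 540, the cap if all 18 services ran one container each on the same node.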
Troubleshooting Checklist:
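- Network Connectivity: Confirm the overlay networks exist (docker network ls --filter driver=overlay) and that nodes can reach each other on 2377/tcp, 7946/tcp and udp, and 4789/udp
- Resource Availability: Check per-container CPU and memory with docker stats --no-stream on each node
- Storage Access: Confirm every bind-mount source path exists on each node where the task can be scheduled; Swarm does not create missing bind-mount paths
- Service Dependencies: Ensure dependent services (databases, Redis) are running; docker stack deploy ignores depends_on, so services must tolerate dependencies that start late
- Configuration: Validate environment variables (docker service inspect --format '{{json .Spec.TaskTemplate.ContainerSpec.Env}}' service-name) and secrets (docker secret ls)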
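The network and storage items in the checklist can be scripted. A minimal sketch, assuming network names are piped in from docker network ls with a custom --format template and that you pass the expected overlay networks and bind-mount source paths yourself; missing_networks, check_bind_paths, and the example names traefik-public and /srv/paperless/data are hypothetical.

```shell
# missing_networks (hypothetical helper): reads existing network names on stdin
# and prints each expected name (given as arguments) that is absent.
# Returns 1 if any network is missing.
missing_networks() {
  local present net status=0
  present=$(cat)
  for net in "$@"; do
    if ! printf '%s\n' "$present" | grep -Fxq -- "$net"; then
      echo "missing network: $net"
      status=1
    fi
  done
  return $status
}

# check_bind_paths (hypothetical helper): run on each node; reports bind-mount
# source paths that do not exist or are not readable by the current user.
check_bind_paths() {
  local p status=0
  for p in "$@"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p"; status=1
    elif [ ! -r "$p" ]; then
      echo "unreadable: $p"; status=1
    fi
  done
  return $status
}
```

Usage: docker network ls --filter driver=overlay --format '{{.Name}}' | missing_networks traefik-public on a manager, and check_bind_paths /srv/paperless/data on each node that can run the task.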