Common Issues and Solutions
Service Won’t Start:
# Check service status
docker service ps service-name
# Common causes and solutions:
# 1. Image not found
docker service update --image correct-image:tag service-name
# 2. Resource constraints
docker service inspect service-name | grep -A 10 Resources
# 3. Network issues
docker network ls
docker network inspect homelab
# 4. Volume mount failures
docker service logs service-name | grep -i "mount\|volume"
ls -la /mnt/swarm-data/service-path/
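The four causes above usually show up in the error column of `docker service ps`. The following is a minimal sketch that maps an error string to one of those causes. The helper name `classify_task_error` is hypothetical, and the match patterns are illustrative substrings of common Docker messages, not an exhaustive list.

```shell
#!/bin/bash
# classify_task_error: map a task error string to one of the four causes above.
# The patterns are assumptions for illustration; extend them for your environment.
classify_task_error() {
  case "$1" in
    *"No such image"*|*"manifest unknown"*|*"pull access denied"*)
      echo "image" ;;
    *"insufficient resources"*)
      echo "resources" ;;
    *"network"*"not found"*)
      echo "network" ;;
    *"invalid mount config"*|*"bind source path does not exist"*)
      echo "volume" ;;
    "")
      echo "none" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage (run on a swarm manager):
#   docker service ps --no-trunc --format '{{.Error}}' service-name |
#     while IFS= read -r err; do classify_task_error "$err"; done
```

The `--no-trunc` flag matters here because truncated error text may cut off the substring being matched.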
Network Connectivity Issues:
# Test service-to-service connectivity
docker exec -it container-name ping -c 3 service-name
docker exec -it container-name nslookup service-name
# Check overlay network health
docker network inspect homelab | grep -A 20 "Containers"
# Verify VXLAN connectivity
ip -d link show | grep vx
# Test external connectivity
docker exec -it container-name curl -I https://google.com
Performance Issues:
# Identify resource bottlenecks
docker stats --no-stream
htop
iotop -o
# Check service resource usage
docker service ps service-name
docker inspect container-id | grep -A 10 Resources
# Database performance issues
# PostgreSQL
docker exec postgres-container psql -U admin -c "SELECT * FROM pg_stat_activity;"
# MariaDB
docker exec mariadb-container mysql -u root -p[SECRET] -e "SHOW PROCESSLIST;"
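To turn the `docker stats` output above into a short list of suspects, the per-container CPU column can be filtered against a threshold. This is a sketch only: the helper name `high_cpu` and the default threshold of 80 are assumptions.

```shell
#!/bin/bash
# high_cpu: read "name cpu%" pairs on stdin and print the names whose
# CPU percentage exceeds the threshold (default 80, an assumed value).
high_cpu() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '{
    cpu = $2
    sub(/%/, "", cpu)            # strip the trailing percent sign
    if (cpu + 0 > t + 0) print $1
  }'
}

# Usage:
#   docker stats --no-stream --format '{{.Name}} {{.CPUPerc}}' | high_cpu 80
```

Note that `docker stats` reports CPU relative to a single core, so values above 100% are possible on multi-core hosts.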
Storage Issues:
# Check disk space
df -h /mnt/swarm-data/
# Check for mount issues
mount | grep swarm-data
findmnt /mnt
# Verify permissions
ls -la /mnt/swarm-data/
# Should show appropriate ownership (usually 1000:1000)
# Test I/O performance (conv=fdatasync flushes to disk so the page cache does not inflate the result)
dd if=/dev/zero of=/mnt/swarm-data/test bs=1M count=100 conv=fdatasync
rm /mnt/swarm-data/test
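The ownership check above can be automated instead of read by eye. A minimal sketch, assuming GNU `stat` (the `-c` option) and a hypothetical helper name `check_owner`. The `1000:1000` default mirrors the note above and may not match every service.

```shell
#!/bin/bash
# check_owner: succeed if the path is owned by the expected uid:gid.
# Returns 1 on mismatch and 2 if the path cannot be inspected.
check_owner() {
  local path="$1" expected="${2:-1000:1000}" actual
  actual=$(stat -c '%u:%g' "$path") || return 2
  if [ "$actual" != "$expected" ]; then
    echo "MISMATCH: $path is $actual, expected $expected"
    return 1
  fi
  echo "OK: $path is $actual"
}

# Usage:
#   check_owner /mnt/swarm-data/service-path 1000:1000
```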
Diagnostic Commands
Comprehensive Health Check Script:
#!/bin/bash
# Swarm Health Check Script
echo "=== Docker Swarm Health Check ==="
echo "Date: $(date)"
echo
echo "=== Node Status ==="
docker node ls
echo
echo "=== Service Status ==="
docker service ls
echo
echo "=== Failed Services ==="
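# docker service ls has no desired-state filter; compare running and desired replica counts instead
docker service ls --format "{{.Name}} {{.Replicas}}" | awk '{split($2, r, "/"); if (r[1] != r[2]) print}'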
echo
echo "=== Resource Usage ==="
docker system df
echo
echo "=== Network Status ==="
docker network ls
echo
echo "=== Storage Status ==="
df -h /mnt/swarm-data/
echo
echo "=== Recent Errors ==="
docker service logs --since 1h traefik_traefik 2>&1 | grep -i error | tail -5
echo
echo "=== System Load ==="
uptime
free -h
echo
echo "Health check completed."
Network Diagnostic Script:
#!/bin/bash
# Network Connectivity Test
echo "=== Network Connectivity Test ==="
# Test internal service resolution
services=("postgres" "mariadb" "authentik_redis" "traefik")
for service in "${services[@]}"; do
echo "Testing $service..."
docker run --rm --network homelab alpine nslookup "$service" || echo "FAILED: $service"
done
# Test external connectivity
echo "Testing external connectivity..."
docker run --rm alpine ping -c 3 8.8.8.8 || echo "FAILED: External connectivity"
# Test HTTP services
urls=("https://auth.bitfrost.me" "https://home.bitfrost.me" "https://docs.bitfrost.me")
for url in "${urls[@]}"; do
echo "Testing $url..."
curl -s -o /dev/null -w "%{http_code}\n" "$url" || echo "FAILED: $url"
done
echo "Network test completed."
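One caveat with the HTTP test above: `curl` exits with status 0 for any completed HTTP exchange, including a 502 from the proxy, so the `|| echo "FAILED"` branch only fires on connection-level errors. A sketch that judges the status code instead follows. The helper names are hypothetical, and treating 3xx as healthy is an assumption that fits services redirecting to a login page.

```shell
#!/bin/bash
# http_status_ok: treat 2xx and 3xx as healthy; everything else,
# including curl's "000" for connection failures, counts as failed.
http_status_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# check_url: fetch only the status code for a URL and report OK or FAILED.
check_url() {
  local url="$1" code
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")
  if http_status_ok "$code"; then
    echo "OK ($code): $url"
  else
    echo "FAILED ($code): $url"
    return 1
  fi
}

# Usage:
#   check_url "https://auth.bitfrost.me"
```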
Emergency Procedures
Complete Service Recovery:
#!/bin/bash
# Emergency service recovery procedure
echo "Starting emergency recovery..."
# Stop all services
echo "Stopping all services..."
for stack in $(docker stack ls --format "{{.Name}}"); do
docker stack rm "$stack"
sleep 30
done
# Verify all services stopped
docker service ls
# Clean up networks (except defaults)
docker network prune -f
# Recreate homelab network
docker network create --driver overlay --attachable homelab
# Restart core infrastructure
echo "Restarting infrastructure..."
docker stack deploy -c /mnt/docker-configs/swarm/traefik/traefik-stack.yml traefik
sleep 60
docker stack deploy -c /mnt/docker-configs/swarm/database/master-db.yml postgresql17
sleep 60
docker stack deploy -c /mnt/docker-configs/swarm/database/mariab-service.yml mariadb
sleep 60
# Wait for databases to be ready
echo "Waiting for databases..."
sleep 120
# Restart authentication
docker stack deploy -c /mnt/docker-configs/swarm/authentication/authentik-stack.yml auth
sleep 60
# Restart applications
echo "Restarting applications..."
docker stack deploy -c /mnt/docker-configs/swarm/applications/nextcloud-stack.yml nextcloud
docker stack deploy -c /mnt/docker-configs/swarm/applications/paperless-stack.yml paperless
docker stack deploy -c /mnt/docker-configs/swarm/applications/vikunja-stack.yml vikunja
docker stack deploy -c /mnt/docker-configs/swarm/applications/bookstack-stack.yml books
# Restart monitoring and management
docker stack deploy -c /mnt/docker-configs/swarm/monitoring/uptime-kuma-stack.yml uptime
docker stack deploy -c /mnt/docker-configs/swarm/monitoring/homarr-service.yml homarr
docker stack deploy -c /mnt/docker-configs/swarm/management/portainer-stack.yml portainer
docker stack deploy -c /mnt/docker-configs/swarm/database/adminer.yml adminer
# Restart web services
docker stack deploy -c /mnt/docker-configs/swarm/webservers/taylors-tracker-prod.yml tracker-prod
echo "Emergency recovery completed. Verify services:"
docker service ls
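The fixed `sleep` intervals in the recovery script are either too long or too short depending on the day. One alternative is to poll until a service reports its desired replica count. The sketch below assumes hypothetical helpers `wait_until` and `service_converged`; the timeout and interval values in the usage line are placeholders.

```shell
#!/bin/bash
# wait_until: re-run a command until it succeeds or the timeout (seconds) expires.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "Timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# service_converged: true when the running replica count equals the desired count.
# Note: a service scaled to 0/0 also counts as converged.
service_converged() {
  local replicas
  replicas=$(docker service ls --filter "name=$1" --format '{{.Replicas}}' | head -n 1)
  replicas=${replicas%% *}     # drop suffixes such as "(max 1 per node)"
  [ -n "$replicas" ] && [ "${replicas%%/*}" = "${replicas##*/}" ]
}

# Usage, replacing a fixed sleep after deploying the database stack:
#   wait_until 180 5 service_converged postgresql17_postgres || exit 1
```

A replica count of 1/1 only means the task is running, not that the database accepts connections, so a stricter probe may still be needed before starting dependent stacks.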
Node Failure Recovery:
#!/bin/bash
# Node failure recovery procedure
failed_node="$1"
if [ -z "$failed_node" ]; then
echo "Usage: $0 <node-name>"
exit 1
fi
echo "Recovering from node failure: $failed_node"
# Remove failed node from swarm
docker node rm "$failed_node" --force
# Check service distribution
echo "Checking affected services..."
docker service ls
docker service ps $(docker service ls -q) | grep "$failed_node"
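# For worker node failure - services should redistribute automatically
# For manager node failure - promote worker to manager
# NOTE: docker node rm and docker node promote must run on a healthy manager.
# If the failed node was the swarm's only manager, these commands cannot succeed;
# restore that node or rebuild the swarm (see Critical System Failure).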
if [ "$failed_node" = "p0" ]; then
echo "Manager node failed! Promoting worker node..."
docker node promote p1 # Promote first available worker
fi
echo "Node recovery completed. Monitor service health."
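The script above hard-codes `p1` as the promotion target, which fails if `p1` is itself down. A sketch that picks the first worker reported as Ready and Active is shown below. The helper name `pick_promotion_candidate` is hypothetical, and it relies on `docker node ls` leaving the manager-status column empty for workers.

```shell
#!/bin/bash
# pick_promotion_candidate: read "hostname status availability [manager-status]"
# lines on stdin and print the first Ready, Active node with no manager status.
pick_promotion_candidate() {
  awk '$2 == "Ready" && $3 == "Active" && NF == 3 { print $1; exit }'
}

# Usage (run on a healthy manager):
#   candidate=$(docker node ls \
#     --format '{{.Hostname}} {{.Status}} {{.Availability}} {{.ManagerStatus}}' |
#     pick_promotion_candidate)
#   [ -n "$candidate" ] && docker node promote "$candidate"
```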
Data Corruption Recovery:
#!/bin/bash
# Data corruption recovery procedure
service_name="$1"
backup_date="$2"
if [ -z "$service_name" ] || [ -z "$backup_date" ]; then
echo "Usage: $0 <service-name> <backup-date>"
echo "Example: $0 paperless 20241201"
exit 1
fi
echo "Recovering $service_name from backup $backup_date"
# Stop affected service
docker service scale "${service_name}"=0
# Backup current corrupted data
mv "/mnt/swarm-data/$service_name" "/mnt/swarm-data/${service_name}_corrupted_$(date +%Y%m%d)"
# Restore from backup
tar -xzf "/mnt/backups/swarm_backup_${backup_date}.tar.gz" -C /tmp/
cp -r "/tmp/swarm_backup_${backup_date}/swarm-data/$service_name" "/mnt/swarm-data/"
# Restore database if applicable
if [ "$service_name" = "paperless" ] || [ "$service_name" = "vikunja" ] || [ "$service_name" = "nextcloud" ]; then
echo "Restoring database for $service_name..."
postgres_container=$(docker ps --filter "name=postgresql17_postgres" --format "{{.ID}}")
docker exec "$postgres_container" dropdb -U admin "$service_name" --if-exists
docker exec "$postgres_container" createdb -U admin "$service_name"
docker exec -i "$postgres_container" psql -U admin "$service_name" < "/tmp/swarm_backup_${backup_date}/postgres_${service_name}_${backup_date}.sql"
fi
# Restart service
docker service scale "${service_name}"=1
# Verify recovery
sleep 60
docker service ps "${service_name}"
echo "Data recovery completed for $service_name"
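The recovery script moves the live data aside before confirming that the backup exists, so a mistyped date leaves the service with no data at all. A preflight check along these lines could run before the service is scaled down. The helper names are hypothetical; the archive path follows the naming used in the script above.

```shell
#!/bin/bash
# valid_backup_date: accept only eight digits (YYYYMMDD), as in the usage example.
valid_backup_date() {
  [[ "$1" =~ ^[0-9]{8}$ ]]
}

# preflight: verify the date format and that the archive exists and is readable
# by tar before any destructive step runs.
preflight() {
  local backup_date="$1" archive
  valid_backup_date "$backup_date" || { echo "Invalid date: $backup_date" >&2; return 1; }
  archive="/mnt/backups/swarm_backup_${backup_date}.tar.gz"
  [ -r "$archive" ] || { echo "Missing archive: $archive" >&2; return 1; }
  # tar -tzf lists the contents without extracting; failure means a damaged archive.
  tar -tzf "$archive" >/dev/null || { echo "Corrupt archive: $archive" >&2; return 1; }
}

# Usage:
#   preflight "$backup_date" || exit 1
```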
Critical System Failure:
#!/bin/bash
# Critical system failure - nuclear option
echo "WARNING: This will destroy and rebuild the entire swarm!"
echo "Press Ctrl+C to cancel, or Enter to continue..."
read
# Document current state
docker node ls > /tmp/pre-failure-nodes.txt
docker service ls > /tmp/pre-failure-services.txt
docker stack ls > /tmp/pre-failure-stacks.txt
# Leave swarm on all nodes
for node in p1 p2 p3; do
ssh "$node" "docker swarm leave --force"
done
docker swarm leave --force
# Reinitialize swarm
docker swarm init --advertise-addr 10.0.4.11
# Get new join tokens
worker_token=$(docker swarm join-token worker -q)
# Rejoin worker nodes
for node in p1 p2 p3; do
ssh "$node" "docker swarm join --token $worker_token 10.0.4.11:2377"
done
# Recreate networks
docker network create --driver overlay --attachable homelab
# Follow emergency service recovery procedure
echo "Swarm rebuilt. Run emergency service recovery script."
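Because a bare Enter keypress is enough to start the rebuild above, a stricter confirmation may be worth the extra typing. This sketch requires the operator to type a specific word. The helper name `confirm_destructive` and the word `REBUILD` are assumptions.

```shell
#!/bin/bash
# confirm_destructive: succeed only if the operator types the expected word exactly.
confirm_destructive() {
  local expected="$1" answer
  printf 'Type "%s" to continue: ' "$expected" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$expected" ]
}

# Usage:
#   confirm_destructive REBUILD || { echo "Aborted."; exit 1; }
```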
Performance Monitoring and Alerting
Monitoring Script for Critical Metrics:
#!/bin/bash
# Performance monitoring script
# CPU and Memory thresholds
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90
# Check system resources (cpu_usage below is the user-space share reported by top, not total CPU)
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
mem_usage=$(free | grep Mem | awk '{printf("%.0f", $3/$2 * 100.0)}')
disk_usage=$(df /mnt | tail -1 | awk '{print $5}' | sed 's/%//')
echo "=== System Performance Report ==="
echo "CPU Usage: ${cpu_usage}%"
echo "Memory Usage: ${mem_usage}%"
echo "Disk Usage: ${disk_usage}%"
# Check for alerts
if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
echo "ALERT: High CPU usage detected!"
fi
if [ "$mem_usage" -gt "$MEM_THRESHOLD" ]; then
echo "ALERT: High memory usage detected!"
fi
if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
echo "ALERT: High disk usage detected!"
fi
# Check service health (running replica count differs from desired count)
failed_services=$(docker service ls --format "{{.Name}} {{.Replicas}}" | awk '{split($2, r, "/"); if (r[1] != r[2]) print}')
if [ -n "$failed_services" ]; then
echo "ALERT: Failed services detected!"
echo "$failed_services"
fi
echo "Performance check completed at $(date)"
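The `top`-based CPU reading in the monitoring script captures only the user-space share and returns a decimal that needs `bc` for comparison. An alternative sketch derives total busy time from two samples of the first line of `/proc/stat` and returns an integer. The helper name `cpu_busy_percent` is hypothetical, and counting iowait as idle is a design assumption.

```shell
#!/bin/bash
# cpu_busy_percent: given two aggregate "cpu ..." lines from /proc/stat, print the
# integer percentage of non-idle time between them (idle + iowait count as idle).
cpu_busy_percent() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      # Fields 2..9: user nice system idle iowait irq softirq steal
      for (i = 2; i <= 9 && i <= NF; i++) total += $i
      idle = $5 + $6
      if (NR == 1) { t1 = total; i1 = idle } else { t2 = total; i2 = idle }
    }
    END {
      dt = t2 - t1
      if (dt <= 0) { print 0; exit }
      printf "%.0f\n", 100 * (dt - (i2 - i1)) / dt
    }'
}

# Usage:
#   a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#   cpu_usage=$(cpu_busy_percent "$a" "$b")
#   [ "$cpu_usage" -gt "$CPU_THRESHOLD" ] && echo "ALERT: High CPU usage detected!"
```

Because the result is an integer, it can be compared with `-gt` in the same way as the memory and disk checks.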
This completes the comprehensive technical handbook for your Docker Swarm homelab. The handbook provides:
- Complete architecture overview with your specific hardware setup
- Detailed networking configuration including Traefik and SSL management
- Storage strategies optimized for your centralized NVMe setup
- Service catalog covering all 18 deployed services
- Deployment procedures with real commands and examples
- Operations and maintenance including backup strategies
- Advanced troubleshooting and emergency procedures
All sensitive information has been marked as [SECRET] while preserving the technical value of the documentation. The handbook serves as both a reference guide and operational manual for managing your sophisticated homelab infrastructure.