Common Issues and Solutions

Service Won’t Start:

# Check service status
docker service ps service-name

# Common causes and solutions:
# 1. Image not found
docker service update --image correct-image:tag service-name

# 2. Resource constraints
docker service inspect service-name | grep -A 10 Resources

# 3. Network issues
docker network ls
docker network inspect homelab

# 4. Volume mount failures
docker service logs service-name | grep -i "mount\|volume"
ls -la /mnt/swarm-data/service-path/
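
When a task is rejected, the error text reported by docker service ps usually points at one of the four causes above. The sketch below maps an error message to a cause. The function name classify_task_error is hypothetical, and the match patterns are assumptions based on messages commonly seen from the Docker daemon, so adjust them to what your nodes actually report.

```shell
# classify_task_error: map a task error message to one of the four causes
# listed above. The patterns are illustrative assumptions, not a complete list.
classify_task_error() {
    # Lowercase the message so matching is case-insensitive
    msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$msg" in
        *"no such image"*|*"manifest unknown"*) echo "image" ;;
        *"insufficient resources"*)             echo "resources" ;;
        *"network"*"not found"*)                echo "network" ;;
        *"mount"*|*"volume"*)                   echo "volume" ;;
        *)                                      echo "unknown" ;;
    esac
}
```

One possible use is to feed it the untruncated error column, for example: classify_task_error "$(docker service ps --no-trunc --format '{{.Error}}' service-name | head -1)".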

Network Connectivity Issues:

# Test service-to-service connectivity
docker exec -it container-name ping -c 3 service-name
docker exec -it container-name nslookup service-name

# Check overlay network health
docker network inspect homelab | grep -A 20 "Containers"

# Verify VXLAN connectivity
ip -d link show | grep vx

# Test external connectivity
docker exec -it container-name curl -I https://google.com

Performance Issues:

# Identify resource bottlenecks
docker stats --no-stream
htop
iotop -o

# Check service resource usage
docker service ps service-name
docker inspect container-id | grep -A 10 Resources

# Database performance issues
# PostgreSQL
docker exec postgres-container psql -U admin -c "SELECT * FROM pg_stat_activity;"

# MariaDB
docker exec mariadb-container mysql -u root -p[SECRET] -e "SHOW PROCESSLIST;"

Storage Issues:

# Check disk space
df -h /mnt/swarm-data/

# Check for mount issues
mount | grep swarm-data
findmnt /mnt

# Verify permissions
ls -la /mnt/swarm-data/
# Should show appropriate ownership (usually 1000:1000)

# Test I/O performance
dd if=/dev/zero of=/mnt/swarm-data/test bs=1M count=100 conv=fsync
rm /mnt/swarm-data/test

Diagnostic Commands

Comprehensive Health Check Script:

#!/bin/bash
# Swarm Health Check Script

echo "=== Docker Swarm Health Check ==="
echo "Date: $(date)"
echo

echo "=== Node Status ==="
docker node ls
echo

echo "=== Service Status ==="
docker service ls
echo

echo "=== Failed Services ==="
docker service ls --format '{{.Name}} {{.Replicas}}' | awk '{split($2, r, "/"); if (r[1] != r[2]) print}'
echo

echo "=== Resource Usage ==="
docker system df
echo

echo "=== Network Status ==="
docker network ls
echo

echo "=== Storage Status ==="
df -h /mnt/swarm-data/
echo

echo "=== Recent Errors ==="
docker service logs --since 1h traefik_traefik 2>&1 | grep -i error | tail -5
echo

echo "=== System Load ==="
uptime
free -h
echo

echo "Health check completed."

Network Diagnostic Script:

#!/bin/bash
# Network Connectivity Test

echo "=== Network Connectivity Test ==="

# Test internal service resolution
services=("postgres" "mariadb" "authentik_redis" "traefik")
for service in "${services[@]}"; do
    echo "Testing $service..."
    docker run --rm --network homelab alpine nslookup "$service" || echo "FAILED: $service"
done

# Test external connectivity
echo "Testing external connectivity..."
docker run --rm alpine ping -c 3 8.8.8.8 || echo "FAILED: External connectivity"

# Test HTTP services
urls=("https://auth.bitfrost.me" "https://home.bitfrost.me" "https://docs.bitfrost.me")
for url in "${urls[@]}"; do
    echo "Testing $url..."
    curl -s -o /dev/null -w "%{http_code}\n" "$url" || echo "FAILED: $url"
done

echo "Network test completed."
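
The HTTP loop in this script prints the status code but only reports a failure when curl itself cannot connect, so a 502 from the proxy would pass silently. A minimal sketch of a status check follows. The function name http_status_ok is hypothetical, and treating 2xx and 3xx as healthy is an assumption (a login redirect from the auth service would count as up).

```shell
# http_status_ok: succeed for 2xx and 3xx status codes, fail for anything else.
# curl reports 000 when no HTTP response was received, which also fails here.
http_status_ok() {
    case "$1" in
        2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use inside the URL loop (not executed here):
#   code=$(curl -s -o /dev/null -w "%{http_code}" "$url")
#   http_status_ok "$code" || echo "FAILED: $url (HTTP $code)"
```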

Emergency Procedures

Complete Service Recovery:

#!/bin/bash
# Emergency service recovery procedure

echo "Starting emergency recovery..."

# Stop all services
echo "Stopping all services..."
for stack in $(docker stack ls --format "{{.Name}}"); do
    docker stack rm "$stack"
    sleep 30
done

# Verify all services stopped
docker service ls

# Clean up networks (except defaults)
docker network prune -f

# Recreate homelab network
docker network create --driver overlay --attachable homelab

# Restart core infrastructure
echo "Restarting infrastructure..."
docker stack deploy -c /mnt/docker-configs/swarm/traefik/traefik-stack.yml traefik
sleep 60

docker stack deploy -c /mnt/docker-configs/swarm/database/master-db.yml postgresql17
sleep 60

docker stack deploy -c /mnt/docker-configs/swarm/database/mariab-service.yml mariadb
sleep 60

# Wait for databases to be ready
echo "Waiting for databases..."
sleep 120

# Restart authentication
docker stack deploy -c /mnt/docker-configs/swarm/authentication/authentik-stack.yml auth
sleep 60

# Restart applications
echo "Restarting applications..."
docker stack deploy -c /mnt/docker-configs/swarm/applications/nextcloud-stack.yml nextcloud
docker stack deploy -c /mnt/docker-configs/swarm/applications/paperless-stack.yml paperless
docker stack deploy -c /mnt/docker-configs/swarm/applications/vikunja-stack.yml vikunja
docker stack deploy -c /mnt/docker-configs/swarm/applications/bookstack-stack.yml books

# Restart monitoring and management
docker stack deploy -c /mnt/docker-configs/swarm/monitoring/uptime-kuma-stack.yml uptime
docker stack deploy -c /mnt/docker-configs/swarm/monitoring/homarr-service.yml homarr
docker stack deploy -c /mnt/docker-configs/swarm/management/portainer-stack.yml portainer
docker stack deploy -c /mnt/docker-configs/swarm/database/adminer.yml adminer

# Restart web services
docker stack deploy -c /mnt/docker-configs/swarm/webservers/taylors-tracker-prod.yml tracker-prod

echo "Emergency recovery completed. Verify services:"
docker service ls
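
The fixed sleep intervals in the recovery script either waste time or are too short on a slow start. One alternative is to poll until a readiness command succeeds. The helper below is a sketch; the name wait_until and its argument order are hypothetical, and the readiness command you pass in is up to you.

```shell
# wait_until: run a command repeatedly until it succeeds or the attempts run out.
# usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
    attempts="$1"
    delay="$2"
    shift 2
    while [ "$attempts" -gt 0 ]; do
        if "$@"; then
            return 0    # command succeeded
        fi
        attempts=$((attempts - 1))
        sleep "$delay"
    done
    return 1            # gave up
}

# Example (not executed here): wait up to 30 x 5 seconds for PostgreSQL,
# assuming $postgres_container holds the container ID:
#   wait_until 30 5 docker exec "$postgres_container" pg_isready -U admin
```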

Node Failure Recovery:

#!/bin/bash
# Node failure recovery procedure

failed_node="$1"
if [ -z "$failed_node" ]; then
    echo "Usage: $0 <node-name>"
    exit 1
fi

echo "Recovering from node failure: $failed_node"

# Check which tasks were placed on the failed node (before it is removed)
echo "Checking affected services..."
docker service ls
docker service ps $(docker service ls -q) | grep "$failed_node"

# For manager node failure - promote a worker, then demote the failed node
# (a manager must be demoted before docker node rm will accept it)
if [ "$failed_node" = "p0" ]; then
    echo "Manager node failed! Promoting worker node..."
    docker node promote p1  # Promote first available worker
    docker node demote "$failed_node"
fi

# Remove failed node from swarm
# For worker node failure - services should redistribute automatically
docker node rm --force "$failed_node"

echo "Node recovery completed. Monitor service health."
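
Whether the manager branch of this script can work at all depends on how many managers the swarm has. Swarm managers use Raft and need a majority to accept changes, so N managers tolerate the loss of (N-1)/2 of them, rounded down. The helper below is a sketch of that arithmetic; the function name is hypothetical.

```shell
# manager_fault_tolerance: print how many manager failures a swarm with
# the given number of managers can survive while keeping a Raft majority.
manager_fault_tolerance() {
    echo $(( ($1 - 1) / 2 ))
}

# Example: a single-manager swarm survives zero manager failures, so losing
# that manager means rebuilding rather than promoting a worker.
#   manager_fault_tolerance 1
#   manager_fault_tolerance 3
```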

Data Corruption Recovery:

#!/bin/bash
# Data corruption recovery procedure

service_name="$1"
backup_date="$2"

if [ -z "$service_name" ] || [ -z "$backup_date" ]; then
    echo "Usage: $0 <service-name> <backup-date>"
    echo "Example: $0 paperless 20241201"
    exit 1
fi

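# Refuse to continue if the backup archive is missing, before any data is moved
backup_archive="/mnt/backups/swarm_backup_${backup_date}.tar.gz"
if [ ! -f "$backup_archive" ]; then
    echo "Backup archive not found: $backup_archive"
    exit 1
fi

echo "Recovering $service_name from backup $backup_date"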

# Stop affected service
docker service scale "${service_name}"=0

# Backup current corrupted data
mv "/mnt/swarm-data/$service_name" "/mnt/swarm-data/${service_name}_corrupted_$(date +%Y%m%d)"

# Restore from backup
tar -xzf "/mnt/backups/swarm_backup_${backup_date}.tar.gz" -C /tmp/
cp -r "/tmp/swarm_backup_${backup_date}/swarm-data/$service_name" "/mnt/swarm-data/"

# Restore database if applicable
if [ "$service_name" = "paperless" ] || [ "$service_name" = "vikunja" ] || [ "$service_name" = "nextcloud" ]; then
    echo "Restoring database for $service_name..."
    postgres_container=$(docker ps --filter "name=postgresql17_postgres" --format "{{.ID}}")
    docker exec "$postgres_container" dropdb -U admin "$service_name" --if-exists
    docker exec "$postgres_container" createdb -U admin "$service_name"
    docker exec -i "$postgres_container" psql -U admin "$service_name" < "/tmp/swarm_backup_${backup_date}/postgres_${service_name}_${backup_date}.sql"
fi

# Restart service
docker service scale "${service_name}"=1

# Verify recovery
sleep 60
docker service ps "${service_name}"

echo "Data recovery completed for $service_name"
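
Because this script moves live data aside before restoring, it may be worth rejecting a malformed date argument up front. The sketch below checks for the YYYYMMDD form used in the usage example. The function name valid_backup_date is hypothetical, and it only checks ranges (month 01-12, day 01-31), not whether the day exists in that month.

```shell
# valid_backup_date: succeed only for an 8-digit YYYYMMDD string with a
# plausible month and day.
valid_backup_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) return 1 ;;
    esac
    month=${1#????}      # drop the year, leaving MMDD
    month=${month%??}    # drop the day, leaving MM
    day=${1#??????}      # drop year and month, leaving DD
    case "$month" in
        0[1-9]|1[0-2]) ;;
        *) return 1 ;;
    esac
    case "$day" in
        0[1-9]|[12][0-9]|3[01]) ;;
        *) return 1 ;;
    esac
}

# Example (not executed here):
#   valid_backup_date "$backup_date" || { echo "Bad date: $backup_date"; exit 1; }
```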

Critical System Failure:

#!/bin/bash
# Critical system failure - nuclear option

echo "WARNING: This will destroy and rebuild the entire swarm!"
echo "Press Ctrl+C to cancel, or Enter to continue..."
read

# Document current state
docker node ls > /tmp/pre-failure-nodes.txt
docker service ls > /tmp/pre-failure-services.txt
docker stack ls > /tmp/pre-failure-stacks.txt

# Leave swarm on all nodes
for node in p1 p2 p3; do
    ssh "$node" "docker swarm leave --force"
done
docker swarm leave --force

# Reinitialize swarm
docker swarm init --advertise-addr 10.0.4.11

# Get new join tokens
worker_token=$(docker swarm join-token worker -q)

# Rejoin worker nodes
for node in p1 p2 p3; do
    ssh "$node" "docker swarm join --token $worker_token 10.0.4.11:2377"
done

# Recreate networks
docker network create --driver overlay --attachable homelab

# Follow emergency service recovery procedure
echo "Swarm rebuilt. Run emergency service recovery script."
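
The confirmation prompt in this script continues on a bare Enter, which is easy to hit by accident. A stricter variant, sketched below, requires a specific word to be typed. The function name confirm_rebuild and the word REBUILD are hypothetical choices.

```shell
# confirm_rebuild: read one line from stdin and succeed only if it is
# exactly the word REBUILD. The prompt goes to stderr.
confirm_rebuild() {
    printf 'Type REBUILD to destroy and rebuild the swarm: ' >&2
    IFS= read -r answer || return 1
    [ "$answer" = "REBUILD" ]
}

# Example (not executed here):
#   confirm_rebuild || { echo "Aborted."; exit 1; }
```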

Performance Monitoring and Alerting

Monitoring Script for Critical Metrics:

#!/bin/bash
# Performance monitoring script

# CPU and Memory thresholds
CPU_THRESHOLD=80
MEM_THRESHOLD=85
DISK_THRESHOLD=90

# Check system resources
cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
mem_usage=$(free | grep Mem | awk '{printf("%.0f", $3/$2 * 100.0)}')
disk_usage=$(df /mnt | tail -1 | awk '{print $5}' | sed 's/%//')

echo "=== System Performance Report ==="
echo "CPU Usage: ${cpu_usage}%"
echo "Memory Usage: ${mem_usage}%"
echo "Disk Usage: ${disk_usage}%"

# Check for alerts
if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
    echo "ALERT: High CPU usage detected!"
fi

if [ "$mem_usage" -gt "$MEM_THRESHOLD" ]; then
    echo "ALERT: High memory usage detected!"
fi

if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
    echo "ALERT: High disk usage detected!"
fi

# Check service health
failed_services=$(docker service ls --format '{{.Name}} {{.Replicas}}' | awk '{split($2, r, "/"); if (r[1] != r[2]) print}')
if [ -n "$failed_services" ]; then
    echo "ALERT: Failed services detected!"
    echo "$failed_services"
fi

echo "Performance check completed at $(date)"
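
The CPU figure in this script is the second field of the Cpu(s) line from top, which is user time only and ignores system and I/O wait time. One alternative is to derive busy time from the idle column. The sketch below assumes the procps top summary format with a period as the decimal separator (for example under LC_ALL=C); the function name cpu_busy_percent is hypothetical.

```shell
# cpu_busy_percent: read top's batch output on stdin and print 100 minus
# the idle percentage from the Cpu(s) summary line.
cpu_busy_percent() {
    awk '/Cpu\(s\)/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^id,?$/) {
                v = $(i - 1)
                sub(/^.*,/, "", v)   # handle fused fields such as "ni,100.0"
                printf "%.1f\n", 100 - v
                exit
            }
        }
    }'
}

# Example (not executed here):
#   cpu_usage=$(LC_ALL=C top -bn1 | cpu_busy_percent)
```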

This completes the comprehensive technical handbook for your Docker Swarm homelab. The handbook provides:

  • Complete architecture overview with your specific hardware setup
  • Detailed networking configuration including Traefik and SSL management
  • Storage strategies optimized for your centralized NVMe setup
  • Service catalog covering all 18 deployed services
  • Deployment procedures with real commands and examples
  • Operations and maintenance including backup strategies
  • Advanced troubleshooting and emergency procedures

All sensitive information has been marked as [SECRET] while preserving the technical value of the documentation. The handbook serves as both a reference guide and operational manual for managing your sophisticated homelab infrastructure.