Troubleshooting Guide

Common issues and solutions for itsup infrastructure.

General Troubleshooting Steps

1. Check Status

itsup status              # Infrastructure status
docker ps -a              # All containers
docker network ls         # Networks

2. Check Logs

itsup proxy logs traefik  # Traefik logs
itsup svc {project} logs  # Project logs
tail -f logs/*.log        # System logs (access, api, monitor)

3. Validate Configuration

itsup validate            # All projects
itsup validate {project}  # Specific project

4. Restart Services

itsup svc {project} restart     # Restart project
itsup proxy restart traefik     # Restart Traefik
itsup run                       # Restart everything

Infrastructure Issues

DNS Stack Won't Start

Symptom: itsup dns up fails with network error.

Possible Causes:

  1. Port 53 already in use
  2. Network conflict
  3. Docker daemon not running

Solutions:

Check port 53:

sudo netstat -tlnp | grep ':53 '   # trailing space avoids matching :5353
# Or
sudo lsof -i :53
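To match the port number exactly (so that 5353 is not reported when looking for 53), the listener list can be filtered on the port field itself. This is a minimal sketch that assumes the `ss` tool from iproute2 is installed, which this guide does not otherwise require; `filter_port` is a hypothetical helper name.

```shell
# filter_port PORT
# Reads `ss -H -tuln` output on stdin and prints only the sockets whose
# local port is exactly PORT. With both -t and -u selected, the first column
# is the protocol (Netid), so the local address:port is column 5.
filter_port() {
  awk -v port="$1" '{
    n = split($5, parts, ":")      # the port is the text after the last colon
    if (parts[n] == port) print
  }'
}

# On a live host (assumes ss is available; -p adds the owning process):
#   sudo ss -H -tulnp | filter_port 53
```

Splitting on the last colon keeps the filter working for IPv6 listeners such as `[::]:53`.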

If systemd-resolved is using port 53:

# Disable stub resolver
sudo sed -i 's/#DNSStubListener=yes/DNSStubListener=no/' /etc/systemd/resolved.conf
sudo systemctl restart systemd-resolved
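The sed expression above only matches the commented-out default line. The sketch below handles both the commented and the uncommented form and reports when the key is missing altogether. `disable_stub_listener` is a hypothetical helper, and it writes through a temporary file rather than relying on `sed -i`.

```shell
# disable_stub_listener FILE
# Rewrites any existing DNSStubListener line (commented out or not) to
# DNSStubListener=no. Returns non-zero if the file has no such line, in
# which case the setting has to be added under [Resolve] by hand.
disable_stub_listener() {
  conf="$1"
  tmp="$(mktemp)" || return 1
  sed -E 's/^#?DNSStubListener=.*/DNSStubListener=no/' "$conf" > "$tmp" &&
    cat "$tmp" > "$conf"           # overwrite contents, keep owner and mode
  rm -f "$tmp"
  grep -q '^DNSStubListener=no$' "$conf"
}

# On a live host (run as root), followed by the restart shown above:
#   disable_stub_listener /etc/systemd/resolved.conf
```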

Check Docker daemon:

sudo systemctl status docker
sudo systemctl start docker  # If not running

Check for conflicting networks:

docker network ls | grep proxynet
docker network rm proxynet  # If exists but corrupted
itsup dns up  # Will recreate

Proxy Stack Won't Start

Symptom: itsup proxy up fails or Traefik is not responding.

Check Traefik logs:

itsup proxy logs traefik

Common errors and fixes:

"Error while creating certificate":

# Let's Encrypt rate limit hit
# Wait for the limit to reset (some limits clear within an hour, others take
# up to a week) or use the staging environment while testing
vim projects/traefik.yml
# Add:
certificatesResolvers:
  letsencrypt:
    acme:
      caServer: https://acme-staging-v02.api.letsencrypt.org/directory

"Cannot connect to Docker daemon":

# dockerproxy not running or misconfigured
itsup proxy logs dockerproxy

# Check dockerproxy is accessible
curl http://localhost:2375/version  # Should return JSON

# Restart dockerproxy
itsup proxy restart dockerproxy

"Address already in use :80":

# Another service using port 80
sudo netstat -tlnp | grep ':80 '   # trailing space avoids matching :8080

# Stop conflicting service
sudo systemctl stop nginx  # Or apache2, etc.

API Won't Start

Symptom: bin/start-api.sh fails or the API is not responding.

Check API logs:

tail -f logs/api.log

Common issues:

"ModuleNotFoundError":

# Missing dependencies
source .venv/bin/activate
pip install -r requirements.txt

"Port 8080 already in use":

# Find process using port
sudo lsof -i :8080
# Kill or change API port

"Permission denied":

# Check file permissions
ls -l bin/start-api.sh
chmod +x bin/start-api.sh

Monitor Won't Start

Symptom: itsup monitor start fails.

Common causes:

"Permission denied (eBPF)":

# Monitor requires root for eBPF
sudo itsup monitor start

"OpenSnitch database not found":

# Check OpenSnitch is installed
sudo systemctl status opensnitch

# Check database exists
ls -l /var/lib/opensnitch/opensnitch.sqlite3

# Start without OpenSnitch integration
itsup monitor start  # Without --use-opensnitch flag

Project Deployment Issues

Deployment Fails

Symptom: itsup apply {project} fails with error.

Check deployment logs:

itsup apply {project} --verbose

Common errors:

"Service '{service}' failed to build":

# Build context issue or Dockerfile error
# Check Dockerfile syntax
docker build projects/{project}/

# Check build context
ls projects/{project}/

"Cannot start service: port already allocated":

# Port conflict with another container
docker ps | grep {port}

# Change port in docker-compose.yml or stop conflicting container

"Network 'proxynet' not found":

# DNS stack not running
itsup dns up

# Verify network exists
docker network ls | grep proxynet

"Error while fetching server API version":

# Docker daemon not running or not accessible
sudo systemctl status docker
sudo systemctl start docker

Container Keeps Restarting

Symptom: Container starts but immediately exits and restarts.

Check container logs:

itsup svc {project} logs {service}

Common causes:

Application crash on startup:

  • Check logs for error messages
  • Verify environment variables are set correctly
  • Test image manually: docker run -it {image} sh

Health check failing:

# Check health check status
docker inspect {container} | jq '.[0].State.Health'

# Disable health check temporarily (for debugging)
vim projects/{project}/docker-compose.yml
# Comment out healthcheck section
itsup apply {project}

Missing volume or file:

# Check volume mounts
docker inspect {container} | jq '.[0].Mounts'

# Verify host paths exist
ls -l /path/to/volume
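When the logs are inconclusive, the exit code of the last run often narrows down which of these causes applies. The sketch below maps common codes to a likely reason. `explain_exit_code` is a hypothetical helper, and the wording of each hint is an interpretation rather than anything Docker prints.

```shell
# explain_exit_code CODE
# Maps a container exit code to a likely cause. Codes above 128 follow the
# shell convention 128 + signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "main process exited cleanly; check the container command" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "killed by SIGKILL (often the OOM killer)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application error (exit $1); check the logs"
      fi
      ;;
  esac
}

# On a live host:
#   code="$(docker inspect --format '{{.State.ExitCode}}' {container})"
#   explain_exit_code "$code"
#   docker inspect --format '{{.RestartCount}} {{.State.OOMKilled}}' {container}
```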

Service Not Reachable

Symptom: Container running but domain returns 404 or connection refused.

Check step-by-step:

1. Verify container is running:

itsup svc {project} ps
docker ps | grep {project}

2. Check container network:

docker inspect {container} | jq '.[0].NetworkSettings.Networks'
# Should show connection to proxynet

3. Check Traefik sees the service:

itsup proxy logs traefik | grep {project}
# Should show "Adding route" or "Server added"

4. Check Traefik labels:

docker inspect {container} | jq '.[0].Config.Labels' | grep traefik
# Should show traefik.enable=true and routing labels

5. Test direct access (bypass Traefik):

# Find container IP
docker inspect {container} | jq -r '.[0].NetworkSettings.Networks.proxynet.IPAddress'

# Test directly
curl http://{container-ip}:{port}

6. Test via Traefik:

# Test with Host header
curl -H "Host: {domain}" http://localhost/

# Should return service response
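Steps 5 and 6 can be combined: comparing the HTTP status of the direct request with the status seen through Traefik usually shows which side is at fault. This is a sketch only. `http_status` and `diagnose` are hypothetical helpers, and the hints assume Traefik's usual behavior of answering 404 when no router matches and 502 or 504 when the backend cannot be reached.

```shell
# http_status URL [HOST]
# Prints the HTTP status code for URL, or 000 if no response was received.
# HOST, if given, is sent as the Host header.
http_status() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -m 5 -w '%{http_code}' -H "Host: $2" "$1"
  else
    curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
  fi
}

# diagnose DIRECT_STATUS PROXIED_STATUS
diagnose() {
  if [ "$1" = "000" ]; then
    echo "service is not answering on that port; check the container"
  elif [ "$2" = "000" ]; then
    echo "Traefik is not answering; check the proxy stack"
  elif [ "$1" = "$2" ]; then
    echo "both paths return $1; routing works"
  elif [ "$2" = "404" ]; then
    echo "Traefik has no matching router; check labels and the domain"
  elif [ "$2" = "502" ] || [ "$2" = "504" ]; then
    echo "Traefik cannot reach the service; check proxynet and the port"
  else
    echo "statuses differ (direct $1, via Traefik $2); check middleware"
  fi
}

# On a live host:
#   diagnose "$(http_status http://{container-ip}:{port})" \
#            "$(http_status http://localhost/ {domain})"
```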

Common fixes:

Missing Traefik labels:

# Regenerate config
itsup apply {project}

# Verify labels in generated file
grep "traefik.enable" upstream/{project}/docker-compose.yml

Wrong domain in ingress.yml:

vim projects/{project}/ingress.yml
# Verify domain matches DNS/host file
itsup apply {project}

Service not listening on configured port:

# Check what port service actually uses
docker exec {container} netstat -tlnp

# Update ingress.yml to match actual port

Network Issues

Cannot Reach External Services

Symptom: Container can't reach internet or external APIs.

Check container connectivity:

# Test DNS resolution
docker exec {container} nslookup google.com

# Test internet connectivity
docker exec {container} ping -c 3 8.8.8.8

# Test HTTPS
docker exec {container} curl https://www.google.com

Common causes:

DNS not working:

# Check container's DNS config
docker inspect {container} | jq '.[0].HostConfig.Dns'

# Use Docker's default DNS
vim projects/{project}/docker-compose.yml
# Remove any custom DNS settings

Firewall blocking:

# Check iptables rules
sudo iptables -L DOCKER-USER -n -v

# Check if monitor blocked the connection
itsup monitor logs | grep {container}

# Whitelist destination
echo "destination-ip-or-domain" >> config/monitor-whitelist.txt
itsup monitor restart

Network isolation:

# Verify container has internet access
docker run --rm --network {network} alpine ping -c 3 8.8.8.8

# If fails, check Docker network configuration
docker network inspect {network}

Inter-Container Communication Fails

Symptom: Container A can't reach container B.

Check both containers are on same network:

docker network inspect proxynet
# Should show both containers

Test connectivity:

# From container A
docker exec {container-a} ping {container-b}
docker exec {container-a} curl http://{container-b}:{port}

Common fixes:

Not on same network:

# In docker-compose.yml
services:
  app:
    networks:
      - proxynet
      - backend
  db:
    networks:
      - backend  # Add proxynet if needed

Wrong hostname:

# Use service name as hostname (not container name)
# Correct: http://db:5432
# Wrong: http://project-db-1:5432

TLS/HTTPS Issues

Certificate Not Issued

Symptom: HTTPS returns "certificate not valid" or "NET::ERR_CERT_AUTHORITY_INVALID".

Check certificate status:

itsup proxy logs traefik | grep -i certificate

Common causes:

Rate limit hit:

Error while obtaining certificate: too many certificates already issued

Fix: Wait for the limit to reset (some limits clear within an hour, others take up to a week) or use the staging server (see Proxy Stack Won't Start).

Challenge failed:

Error while obtaining certificate: challenge failed

Fix:

# Verify domain DNS points to server
nslookup {domain}

# Verify port 80 is accessible from internet
curl http://{domain}

# Check Traefik logs for specific challenge error
itsup proxy logs traefik | grep -i challenge

Fix by forcing renewal:

# Remove the certificate store (forces re-issue of every certificate in it)
rm proxy/traefik/acme.json
itsup proxy restart traefik

# Watch certificate issuance
itsup proxy logs traefik | grep -i certificate

Certificate Expired

Symptom: HTTPS works but browser shows "certificate expired".

Check certificate expiry:

echo | openssl s_client -connect {domain}:443 2>/dev/null | openssl x509 -noout -dates
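Instead of reading the dates by eye, `openssl x509 -checkend` can turn the check into a yes/no answer suitable for a cron job. The sketch below is one way to wrap it. `cert_valid_for` is a hypothetical helper name and the 14-day threshold in the usage comment is only an example.

```shell
# cert_valid_for PEM_FILE DAYS
# Succeeds if the certificate in PEM_FILE is still valid DAYS days from now.
# Fails if it expires sooner or if the file cannot be read as a certificate.
cert_valid_for() {
  openssl x509 -in "$1" -noout -checkend "$(( $2 * 86400 ))" >/dev/null 2>&1
}

# Against a live endpoint, save the served certificate first:
#   echo | openssl s_client -connect {domain}:443 -servername {domain} 2>/dev/null \
#     | openssl x509 > /tmp/cert.pem
#   cert_valid_for /tmp/cert.pem 14 || echo "expires within 14 days"
```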

Auto-renewal should handle this. If not:

Force renewal:

rm proxy/traefik/acme.json
itsup proxy restart traefik

Check renewal is working:

# Traefik should log renewal attempts
itsup proxy logs traefik | grep -i renew

Mixed Content Warnings

Symptom: HTTPS site loads but browser shows "mixed content" warnings.

Cause: The site loads HTTP resources on an HTTPS page.

Fix in application:

  • Use protocol-relative URLs: //cdn.example.com/script.js
  • Or force HTTPS: https://cdn.example.com/script.js
  • Add middleware to Traefik to enforce HTTPS headers

Add security headers:

# In projects/traefik.yml
http:
  middlewares:
    security-headers:
      headers:
        forceSTSHeader: true
        stsSeconds: 31536000
        stsIncludeSubdomains: true
        contentSecurityPolicy: "upgrade-insecure-requests"

# In ingress.yml
ingress:
  - service: web
    middleware: [security-headers]

Secret Issues

Secrets Not Loading

Symptom: Container starts but environment variables are empty or undefined.

Check secrets file exists:

ls -l secrets/{project}.txt
grep {VAR} secrets/{project}.txt

Decrypt if encrypted:

itsup decrypt {project}

Verify variable in compose file:

grep {VAR} projects/{project}/docker-compose.yml
# Should show: - VAR=${VAR}

Check container environment:

docker exec {container} env | grep {VAR}

Force reload:

itsup svc {project} down
itsup apply {project}
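The per-variable checks above can be run for every variable at once. The sketch below lists each `${VAR}` referenced in the compose file that has no non-empty value in the secrets file. It assumes the decrypted secrets file contains plain `KEY=value` lines, which this guide implies but does not state, and `missing_secrets` is a hypothetical helper.

```shell
# missing_secrets COMPOSE_FILE SECRETS_FILE
# Prints every variable referenced as ${VAR} in COMPOSE_FILE that is absent
# or empty in SECRETS_FILE. Variables written with a default, such as
# ${VAR:-x}, are reported too if they are not in the secrets file.
missing_secrets() {
  compose="$1"
  secrets="$2"
  grep -oE '\$\{[A-Za-z_][A-Za-z0-9_]*' "$compose" | tr -d '${' | sort -u |
    while read -r var; do
      grep -qE "^${var}=." "$secrets" || echo "$var"
    done
}

# On a live checkout:
#   missing_secrets projects/{project}/docker-compose.yml secrets/{project}.txt
```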

Encryption/Decryption Fails

Symptom: itsup encrypt or itsup decrypt fails.

Check SOPS is installed:

sops --version

Check SOPS configuration:

cat .sops.yaml

Check GPG/age keys:

# For GPG
gpg --list-secret-keys

# For age
ls -l ~/.config/sops/age/keys.txt

Manual decryption (debug):

sops -d secrets/{project}.enc.txt

If corrupt, restore from git:

git checkout HEAD -- secrets/{project}.enc.txt
itsup decrypt {project}

Performance Issues

High CPU Usage

Symptom: Server CPU constantly high.

Check which container:

docker stats --no-stream
# Shows CPU usage per container

Inspect container:

# Check process list
docker exec {container} ps aux

# Check logs for errors
itsup svc {project} logs {service}

Common causes:

Restart loop: Container crashing and restarting constantly

  • Fix: Check logs, fix application error

Infinite loop: Application bug causing CPU spin

  • Fix: Stop container, fix bug, redeploy

Resource exhaustion: Container needs more CPU

  • Fix: Add resource limits or increase host capacity

High Memory Usage

Symptom: Server memory constantly high or OOM errors.

Check which container:

docker stats --no-stream
# Shows memory usage per container

Add memory limits (prevent one container from hogging all memory):

# In docker-compose.yml
services:
  app:
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M

Check for memory leaks:

# Monitor over time
watch -n 5 "docker stats --no-stream | grep {container}"

Slow Response Times

Symptom: Application responds slowly or times out.

Check Traefik logs:

tail -f logs/access.log
# Look for response times (last column in CLF format)

Test direct vs through Traefik:

# Direct (should be fast)
time curl http://{container-ip}:{port}

# Through Traefik (compare)
time curl https://{domain}
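For a number that is easier to compare than the output of `time`, curl can report the total request time itself. The sketch below measures both paths and prints the difference in milliseconds. `request_seconds` and `overhead_ms` are hypothetical helpers.

```shell
# request_seconds URL [extra curl args]
# Prints the total time of one request in seconds, as measured by curl.
request_seconds() {
  curl -s -o /dev/null -w '%{time_total}\n' "$@"
}

# overhead_ms DIRECT_SECONDS PROXIED_SECONDS
# Prints the extra time spent on the proxied path, in whole milliseconds.
overhead_ms() {
  awk -v d="$1" -v p="$2" 'BEGIN { printf "%.0f\n", (p - d) * 1000 }'
}

# On a live host:
#   direct="$(request_seconds http://{container-ip}:{port})"
#   proxied="$(request_seconds https://{domain})"
#   overhead_ms "$direct" "$proxied"
```

The proxied figure includes DNS lookup and the TLS handshake, so a single measurement overstates Traefik's own share; repeating the requests a few times gives a fairer picture.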

If Traefik is slow:

  • Check middleware (auth, rate limiting can slow requests)
  • Check Traefik logs for errors
  • Check Traefik resource usage

If application is slow:

  • Check application logs
  • Check database connection
  • Profile application

Docker Issues

Docker Daemon Not Responding

Symptom: Any docker command hangs or fails.

Check daemon status:

sudo systemctl status docker

Restart daemon:

sudo systemctl restart docker

Check logs:

sudo journalctl -u docker -n 100

Disk Space Issues

Symptom: "no space left on device" errors.

Check disk usage:

df -h
docker system df  # Docker-specific disk usage

Clean up Docker:

# Remove stopped containers
docker container prune -f

# Remove unused images
docker image prune -a -f

# Remove unused volumes
docker volume prune -f

# Remove unused networks
docker network prune -f

# All-in-one cleanup
docker system prune -a --volumes -f
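Pruning volumes deletes their data for good, so it can be worth confirming that the disk is actually close to full before running the all-in-one cleanup. This is a minimal sketch: `disk_used_pct` and `needs_cleanup` are hypothetical helpers, the 90% threshold is an arbitrary example, and `/var/lib/docker` is Docker's default data directory, which may differ on a given host.

```shell
# disk_used_pct PATH
# Prints the usage percentage (without the % sign) of the filesystem
# that holds PATH.
disk_used_pct() {
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# needs_cleanup PATH THRESHOLD
# Succeeds if usage is at or above THRESHOLD percent.
needs_cleanup() {
  [ "$(disk_used_pct "$1")" -ge "$2" ]
}

# On a live host:
#   needs_cleanup /var/lib/docker 90 && docker system df
```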

For itsup containers specifically:

itsup down --clean  # Removes stopped itsup containers

Cannot Remove Container

Symptom: docker rm fails with "container is running" or "device or resource busy".

Force stop and remove:

docker stop -t 1 {container}  # Stop with 1s timeout
docker rm -f {container}      # Force remove

If still fails:

# Check if container is being recreated
docker events | grep {container}

# Restart Docker daemon
sudo systemctl restart docker

Getting Help

Collecting Debug Information

Before asking for help, collect:

  1. System information:

uname -a
docker --version
docker compose version

  2. itsup version:

itsup --version
git log -1

  3. Status:

itsup status
docker ps -a
docker network ls

  4. Logs (with verbose output):

itsup apply {project} --verbose > debug.log 2>&1

  5. Configuration (redact secrets):

cat projects/{project}/docker-compose.yml
cat projects/{project}/ingress.yml
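The read-only commands from the list above (system information, itsup version, status) can be gathered into one file in a single pass. The sketch below does only that. `collect_debug` and the default file name are hypothetical, and the verbose apply and the configuration dump are left as manual steps because the first changes the deployment and the second needs secrets redacted.

```shell
# collect_debug [OUTFILE]
# Runs each read-only diagnostic command and writes its output, with a
# heading per command, to OUTFILE. A failing command is noted, not fatal.
collect_debug() {
  out="${1:-debug-info.txt}"
  {
    for cmd in "uname -a" "docker --version" "docker compose version" \
               "itsup --version" "git log -1" "itsup status" \
               "docker ps -a" "docker network ls"; do
      echo "### $cmd"
      # $cmd is left unquoted on purpose so it splits into command and args
      $cmd 2>&1 || echo "(command failed with exit $?)"
      echo
    done
  } > "$out"
}

# Usage:
#   collect_debug debug-info.txt
```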

Where to Get Help

Common Pitfalls

  1. Editing upstream/ instead of projects/

    • Always edit source (projects/), not generated artifacts
  2. Forgetting to decrypt secrets

    • Run itsup decrypt {project} after cloning repo
  3. Not loading secrets at deployment

    • itsup apply loads secrets automatically
    • Manual docker compose commands do not; pass the secrets explicitly (for example with --env-file)
  4. Committing plaintext secrets

    • Only commit .enc.txt files
    • .txt files are gitignored
  5. Not restarting after config changes

    • itsup apply regenerates and restarts
    • Manual changes to upstream/ are lost on next apply