spotlightcam/docs/MONITORING.md
Radosław Gierwiało 642c8f6d6f feat: add production operations scripts and monitoring guide
Add comprehensive tooling for production deployment:

Scripts (scripts/):
- backup-db.sh: Automated database backups with 7-day retention
- restore-db.sh: Safe database restore with confirmation prompts
- health-check.sh: Complete service health monitoring
- README.md: Operational scripts documentation

Monitoring (docs/MONITORING.md):
- Application health monitoring
- Docker container monitoring
- External monitoring setup (UptimeRobot, Pingdom)
- Log monitoring and rotation
- Alerting configuration
- Incident response procedures
- SLA targets and metrics

All scripts include:
- Environment support (dev/prod)
- Error handling and validation
- Detailed status reporting
- Safety confirmations where needed
2025-11-20 22:22:22 +01:00

Monitoring Guide - spotlight.cam

Complete guide for monitoring spotlight.cam in production.

📊 Monitoring Strategy

Three-Layer Approach

  1. Application Monitoring - Health checks, logs, metrics
  2. Infrastructure Monitoring - Docker containers, system resources
  3. External Monitoring - Uptime, response times, SSL certificates

🏥 Application Monitoring

Built-in Health Check

Endpoint: GET /api/health

Response (healthy):

{
  "status": "ok",
  "timestamp": "2025-11-20T12:00:00.000Z",
  "uptime": 3600,
  "environment": "production"
}

Usage:

# Check health
curl https://spotlight.cam/api/health

# Automated check (exit code 0 = healthy)
curl -f -s https://spotlight.cam/api/health > /dev/null
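
The `-f` flag only inspects the HTTP status code. A stricter check can also confirm that the body reports `"status": "ok"`. The following is a minimal sketch, assuming the response shape shown above; `health_body_ok` is a hypothetical helper name, and it uses `grep` rather than a real JSON parser.

```shell
# health_body_ok (hypothetical helper): reads a health-check response body on
# stdin and succeeds only if it contains "status": "ok".
health_body_ok() {
    grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Example wiring (commented out so the snippet has no network side effects):
# curl -fsS --max-time 10 https://spotlight.cam/api/health | health_body_ok \
#     && echo "healthy" || echo "unhealthy"
```

The function exits non-zero for any body that lacks the expected field, so it can be chained into cron jobs or CI steps the same way as `curl -f`.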

Health Check Script

Use the built-in health check script:

# Check all services
./scripts/health-check.sh prod

# Output:
# ✅ nginx: Running
# ✅ Frontend: Running
# ✅ Backend: Running
# ✅ Database: Running
# ✅ API responding
# ✅ Database accepting connections

🐳 Docker Container Monitoring

Check Container Status

# List all containers
docker compose --profile prod ps

# Check specific container
docker inspect slc-backend-prod --format='{{.State.Status}}'

# View resource usage
docker stats --no-stream

Container Health Checks

Built into docker-compose.yml:

  • Backend: curl localhost:3000/api/health
  • Database: pg_isready -U spotlightcam

# View health status
docker compose --profile prod ps
# Look for "(healthy)" in STATUS column

📝 Log Monitoring

View Logs

# All services
docker compose --profile prod logs -f

# Specific service
docker logs -f slc-backend-prod

# Last 100 lines
docker logs --tail 100 slc-backend-prod

# With timestamps
docker logs -f --timestamps slc-backend-prod

# Filter errors only
docker logs slc-backend-prod 2>&1 | grep -i error

Log Rotation

Configured in docker-compose.yml:

logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
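
With these options each container keeps at most `max-file` log files of up to `max-size` each, so the worst case per container is their product. A small sketch of that arithmetic follows; `max_log_mb` is a hypothetical helper, not part of the repository.

```shell
# max_log_mb (hypothetical helper): upper bound on json-file log usage per container.
# $1 = max-size in megabytes, $2 = max-file count.
max_log_mb() {
    echo $(( $1 * $2 ))
}

max_log_mb 10 3   # prints 30

# To see the options a running container actually uses (needs Docker):
# docker inspect --format '{{json .HostConfig.LogConfig}}' slc-backend-prod
```

Logging options are fixed when a container is created, so a container that predates the setting keeps its old configuration until it is recreated.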

Important Log Patterns

Authentication errors:

docker logs slc-backend-prod 2>&1 | grep "401\|403\|locked"

Database errors:

docker logs slc-backend-prod 2>&1 | grep -i "prisma\|database"

Rate limiting:

docker logs slc-backend-prod 2>&1 | grep "Too many requests"

Email failures:

docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
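
The four patterns above can be counted in a single pass, which is handy for a daily summary. This is a sketch only: `summarize_log_patterns` is a hypothetical helper, and the patterns are copied from the commands above rather than from the application's actual log format.

```shell
# summarize_log_patterns (hypothetical helper): reads log lines on stdin and
# prints one count per category, using the same patterns as the grep commands.
summarize_log_patterns() {
    awk '
        /401|403|locked/                { auth++ }
        tolower($0) ~ /prisma|database/ { db++ }
        /Too many requests/             { rate++ }
        /Failed to send.*email/         { email++ }
        END {
            printf "auth=%d db=%d rate_limit=%d email=%d\n", auth, db, rate, email
        }
    '
}

# Example (needs Docker):
# docker logs --since 1h slc-backend-prod 2>&1 | summarize_log_patterns
```

A line that matches several patterns is counted once in each matching category.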

🌐 External Monitoring

1. UptimeRobot (Free)

Setup:

  1. Create account
  2. Add HTTP monitor: https://spotlight.cam
  3. Add HTTP monitor: https://spotlight.cam/api/health
  4. Set alert contacts
  5. Create public status page (optional)

2. Pingdom

  • URL: https://pingdom.com
  • Features:
    • 1-minute checks
    • Transaction monitoring
    • Real user monitoring
    • SSL monitoring

3. Better Uptime

Monitor These Endpoints

| Endpoint | Check Type | Expected |
|---|---|---|
| https://spotlight.cam | HTTP | 200 OK |
| https://spotlight.cam/api/health | HTTP + JSON | {"status":"ok"} |
| spotlight.cam | SSL | Valid, not expiring |
| spotlight.cam | DNS | Resolves correctly |

📈 Metrics to Track

Application Metrics

  1. Response Times

    • API endpoints: < 200ms
    • Frontend load: < 1s
  2. Error Rates

    • 4xx errors: < 1%
    • 5xx errors: < 0.1%
  3. Authentication

    • Failed logins
    • Account lockouts
    • Password resets
  4. WebRTC

    • Connection success rate
    • File transfer completions
    • Peer connection failures
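
The error-rate targets above can be checked against a list of HTTP status codes. The sketch below assumes one status code per line on stdin; `error_rates` is a hypothetical helper, and how the codes are extracted depends on the log format in use.

```shell
# error_rates (hypothetical helper): reads one HTTP status code per line on
# stdin and prints the share of 4xx and 5xx responses as percentages.
error_rates() {
    awk '
        /^[0-9][0-9][0-9]$/ { total++ }
        /^4[0-9][0-9]$/     { c4++ }
        /^5[0-9][0-9]$/     { c5++ }
        END {
            if (total == 0) { print "no requests"; exit }
            printf "4xx=%.2f%% 5xx=%.2f%%\n", 100 * c4 / total, 100 * c5 / total
        }
    '
}

# Example, assuming nginx writes the default "combined" access log format,
# where the status code is the 9th whitespace-separated field:
# docker logs slc-proxy-prod 2>&1 | awk '{ print $9 }' | error_rates
```

Lines that are not three-digit codes are ignored, so stray error-log lines do not skew the totals.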

Infrastructure Metrics

  1. CPU Usage

    docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
    
  2. Memory Usage

    docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
    
  3. Disk Space

    df -h
    du -sh /var/lib/docker
    
  4. Database Size

    docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
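
For automated disk checks, the `df` output can be filtered against a threshold. This is a sketch under assumptions: `disk_over_threshold` is a hypothetical helper, the 80% default is illustrative, and mount points containing spaces are not handled.

```shell
# disk_over_threshold (hypothetical helper): reads `df -P` output on stdin and
# prints every mount point whose usage is at or above the threshold
# percentage ($1, default 80).
disk_over_threshold() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }
    '
}

# Example:
# df -P | disk_over_threshold 80
```

The `-P` flag selects the POSIX output format, which keeps each filesystem on one line and makes the column positions predictable.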
    

🚨 Alerting Setup

Email Alerts (Simple)

Create alert script:

#!/bin/bash
# /usr/local/bin/alert-spotlight.sh

SUBJECT="⚠️ spotlight.cam Alert"
RECIPIENT="admin@example.com"

# Run health check
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
    echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
fi

Add to crontab:

*/5 * * * * /usr/local/bin/alert-spotlight.sh
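
As written, the cron job sends an email every five minutes for as long as the service stays down. One common refinement is to notify only when the status changes. The sketch below illustrates that idea; `should_alert` and the state-file path are hypothetical and not part of the repository scripts.

```shell
# should_alert (hypothetical helper): $1 is the current status ("ok" or "fail"),
# $2 is a state file remembering the previous status.
# Prints "alert" on ok -> fail, "recovered" on fail -> ok, nothing otherwise.
should_alert() {
    sa_current="$1"
    sa_state_file="$2"
    sa_previous="ok"
    if [ -f "$sa_state_file" ]; then
        sa_previous="$(cat "$sa_state_file")"
    fi
    printf '%s\n' "$sa_current" > "$sa_state_file"
    if [ "$sa_previous" = "ok" ] && [ "$sa_current" = "fail" ]; then
        echo "alert"
    elif [ "$sa_previous" = "fail" ] && [ "$sa_current" = "ok" ]; then
        echo "recovered"
    fi
}

# Example wiring inside the alert script (path is an assumption):
# if /path/to/spotlightcam/scripts/health-check.sh prod; then s=ok; else s=fail; fi
# [ -n "$(should_alert "$s" /var/tmp/spotlight-health.state)" ] && echo "notify"
```

A missing state file is treated as a previous "ok", so the first failure after installation still triggers a notification.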

Slack Alerts (Advanced)

#!/bin/bash
# /usr/local/bin/alert-slack.sh

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
    curl -X POST "$SLACK_WEBHOOK" \
         -H 'Content-Type: application/json' \
         -d '{
           "text": "🚨 spotlight.cam health check failed",
           "username": "Monitoring Bot"
         }'
fi
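
The payload above is a fixed string. If the message is extended with dynamic details such as a hostname or an error line, quotes and backslashes must be escaped before being embedded in JSON. A minimal sketch, assuming single-line messages; `slack_payload` is a hypothetical helper.

```shell
# slack_payload (hypothetical helper): builds the webhook JSON body for the
# message in $1, escaping backslashes and double quotes.
# Newlines and other control characters are not handled here.
slack_payload() {
    sp_escaped="$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')"
    printf '{"text":"%s","username":"Monitoring Bot"}\n' "$sp_escaped"
}

# Example (SLACK_WEBHOOK as defined in the script above):
# curl -X POST "$SLACK_WEBHOOK" -H 'Content-Type: application/json' \
#      -d "$(slack_payload "health check failed on $(hostname)")"
```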

📊 Dashboard (Optional)

Simple Dashboard with Grafana

  1. Set up Prometheus and Grafana:
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana

# Named volumes must be declared at the top level
volumes:
  grafana_data:

  2. Add metrics endpoint to backend (optional enhancement)

🔍 Troubleshooting Monitoring

Health Check Always Fails

# Test API manually
curl -v https://spotlight.cam/api/health

# Check nginx logs
docker logs slc-proxy-prod

# Check backend logs
docker logs slc-backend-prod

# Test from within container
docker exec slc-proxy-prod curl localhost:80/api/health

High CPU/Memory Usage

# Identify problematic container
docker stats --no-stream

# Check container logs
docker logs --tail 100 slc-backend-prod

# Restart if needed
docker compose --profile prod restart backend-prod

Logs Not Rotating

# Check Docker log files
ls -lh /var/lib/docker/containers/*/*-json.log

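# Manual cleanup (careful: after "down", "prune -af" deletes all unused images,
# networks, and build cache, so "up -d" must pull or rebuild the images)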
docker compose --profile prod down
docker system prune -af
docker compose --profile prod up -d

Monitoring Checklist

Daily Checks (Automated)

  • Health check endpoint responding
  • All containers running
  • Database accepting connections
  • No critical errors in logs

Weekly Checks (Manual)

  • Review error logs
  • Check disk space
  • Verify backups are running
  • Test restore from backup
  • Review failed login attempts

Monthly Checks

  • SSL certificate expiry (renew if < 30 days)
  • Update dependencies
  • Review and rotate secrets
  • Performance review
  • Security audit

📞 Incident Response

When Alert Triggers

  1. Check severity

    ./scripts/health-check.sh prod
    docker compose --profile prod ps
    
  2. Check logs

    docker logs --tail 100 slc-backend-prod
    docker logs --tail 100 slc-db-prod
    
  3. Attempt automatic recovery

    docker compose --profile prod restart
    
  4. If still down, investigate

    • Database connection issues
    • Disk space full
    • Memory exhaustion
    • Network issues
  5. Document incident

    • Time of failure
    • Symptoms observed
    • Actions taken
    • Resolution

🎯 SLA Targets

Uptime

  • Target: 99.9% (about 43 minutes of downtime per 30-day month)
  • Measurement: External monitoring (UptimeRobot)
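
The downtime figure follows directly from the target: a 30-day month has 43,200 minutes, and 0.1% of that is 43.2 minutes. The sketch below reproduces the arithmetic for other targets; `downtime_budget_minutes` is a hypothetical helper.

```shell
# downtime_budget_minutes (hypothetical helper): allowed downtime in minutes
# for an uptime target in percent ($1) over a number of days ($2, default 30).
downtime_budget_minutes() {
    awk -v pct="$1" -v days="${2:-30}" \
        'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - pct) / 100 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```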

Performance

  • API Response: < 200ms (95th percentile)
  • Page Load: < 2s (95th percentile)
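
A 95th-percentile target means 95% of samples must fall at or below the limit. The sketch below computes it with the nearest-rank method, which is one of several percentile definitions; `p95` is a hypothetical helper and expects one numeric sample per line.

```shell
# p95 (hypothetical helper): reads one numeric sample per line on stdin and
# prints the 95th percentile using the nearest-rank method.
p95() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((NR * 95 + 99) / 100)   # ceiling of 0.95 * NR
            print v[rank]
        }
    '
}

# Example: 20 response times in seconds, then the percentile
# (compare against 0.2 for the 200ms API target):
# for i in $(seq 1 20); do
#     curl -o /dev/null -s -w '%{time_total}\n' https://spotlight.cam/api/health
# done | p95
```

With few samples the result is dominated by the slowest requests, so a larger sample gives a more stable figure.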

Recovery

  • Detection: < 5 minutes
  • Response: < 15 minutes
  • Resolution: < 1 hour (non-critical)

Last Updated: 2025-11-20