# Monitoring Guide - spotlight.cam

Complete guide for monitoring spotlight.cam in production.

## 📊 Monitoring Strategy

### Three-Layer Approach

1. **Application Monitoring** - Health checks, logs, metrics
2. **Infrastructure Monitoring** - Docker containers, system resources
3. **External Monitoring** - Uptime, response times, SSL certificates

---

## 🏥 Application Monitoring

### Built-in Health Check

**Endpoint:** `GET /api/health`

**Response (healthy):**
```json
{
  "status": "ok",
  "timestamp": "2025-11-20T12:00:00.000Z",
  "uptime": 3600,
  "environment": "production"
}
```

**Usage:**
```bash
# Check health
curl https://spotlight.cam/api/health

# Automated check (exit code 0 = healthy)
curl -f -s https://spotlight.cam/api/health > /dev/null
```

### Health Check Script

Use the built-in health check script:

```bash
# Check all services
./scripts/health-check.sh prod

# Output:
# ✅ nginx: Running
# ✅ Frontend: Running
# ✅ Backend: Running
# ✅ Database: Running
# ✅ API responding
# ✅ Database accepting connections
```

---

## 🐳 Docker Container Monitoring

### Check Container Status

```bash
# List all containers
docker compose --profile prod ps

# Check specific container
docker inspect slc-backend-prod --format='{{.State.Status}}'

# View resource usage
docker stats --no-stream
```

### Container Health Checks

Built into docker-compose.yml:

- **Backend:** `curl localhost:3000/api/health`
- **Database:** `pg_isready -U spotlightcam`

```bash
# View health status
docker compose --profile prod ps

# Look for "(healthy)" in STATUS column
```

---

## 📝 Log Monitoring

### View Logs

```bash
# All services
docker compose --profile prod logs -f

# Specific service
docker logs -f slc-backend-prod

# Last 100 lines
docker logs --tail 100 slc-backend-prod

# With timestamps
docker logs -f --timestamps slc-backend-prod

# Filter errors only
docker logs slc-backend-prod 2>&1 | grep -i error
```

### Log Rotation

Configured in docker-compose.yml:

```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```

### Important Log Patterns

`docker logs` replays the container's stderr on stderr, so each command redirects it with `2>&1` before piping to `grep`.

**Authentication errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep "401\|403\|locked"
```

**Database errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep -i "prisma\|database"
```

**Rate limiting:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Too many requests"
```

**Email failures:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
```

---

## 🌐 External Monitoring

### Recommended Services

#### 1. UptimeRobot (Free)

- **URL:** https://uptimerobot.com
- **Features:**
  - 5-minute checks
  - Email/SMS alerts
  - 50 monitors free
  - Status pages

**Setup:**
1. Create account
2. Add HTTP monitor: `https://spotlight.cam`
3. Add HTTP monitor: `https://spotlight.cam/api/health`
4. Set alert contacts
5. Create public status page (optional)

#### 2. Pingdom

- **URL:** https://pingdom.com
- **Features:**
  - 1-minute checks
  - Transaction monitoring
  - Real user monitoring
  - SSL monitoring

#### 3. Better Uptime

- **URL:** https://betteruptime.com
- **Features:**
  - Free tier available
  - Incident management
  - On-call scheduling
  - Status pages

### Monitor These Endpoints

| Endpoint | Check Type | Expected |
|----------|-----------|----------|
| `https://spotlight.cam` | HTTP | 200 OK |
| `https://spotlight.cam/api/health` | HTTP + JSON | `{"status":"ok"}` |
| `spotlight.cam` | SSL | Valid, not expiring |
| `spotlight.cam` | DNS | Resolves correctly |

---

## 📈 Metrics to Track

### Application Metrics

1. **Response Times**
   - API endpoints: < 200ms
   - Frontend load: < 1s

2. **Error Rates**
   - 4xx errors: < 1%
   - 5xx errors: < 0.1%

3. **Authentication**
   - Failed logins
   - Account lockouts
   - Password resets

4. **WebRTC**
   - Connection success rate
   - File transfer completions
   - Peer connection failures

### Infrastructure Metrics

1. **CPU Usage**
   ```bash
   docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
   ```

2. **Memory Usage**
   ```bash
   docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
   ```

3. **Disk Space**
   ```bash
   df -h
   du -sh /var/lib/docker
   ```

4. **Database Size**
   ```bash
   docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
   ```

---

## 🚨 Alerting Setup

### Email Alerts (Simple)

Create alert script:

```bash
#!/bin/bash
# /usr/local/bin/alert-spotlight.sh

SUBJECT="⚠️ spotlight.cam Alert"
RECIPIENT="admin@example.com"

# Run health check
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
  echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
fi
```

Add to crontab:

```bash
*/5 * * * * /usr/local/bin/alert-spotlight.sh
```

### Slack Alerts (Advanced)

```bash
#!/bin/bash
# /usr/local/bin/alert-slack.sh

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d '{
      "text": "🚨 spotlight.cam health check failed",
      "username": "Monitoring Bot"
    }'
fi
```

---

## 📊 Dashboard (Optional)

### Simple Dashboard with Grafana

1. **Set up Prometheus and Grafana:**
   ```yaml
   # docker-compose.monitoring.yml
   services:
     prometheus:
       image: prom/prometheus
       volumes:
         - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
       ports:
         - "9090:9090"

     grafana:
       image: grafana/grafana
       ports:
         - "3001:3000"
       volumes:
         - grafana_data:/var/lib/grafana

   # Named volumes must be declared at the top level
   volumes:
     grafana_data:
   ```

2. **Add a metrics endpoint to the backend** (optional enhancement)

---

## 🔍 Troubleshooting Monitoring

### Health Check Always Fails

```bash
# Test API manually
curl -v https://spotlight.cam/api/health

# Check nginx logs
docker logs slc-proxy-prod

# Check backend logs
docker logs slc-backend-prod

# Test from within container
docker exec slc-proxy-prod curl localhost:80/api/health
```

### High CPU/Memory Usage

```bash
# Identify problematic container
docker stats --no-stream

# Check container logs
docker logs --tail 100 slc-backend-prod

# Restart if needed
docker compose --profile prod restart backend-prod
```

### Logs Not Rotating

```bash
# Check Docker log files
ls -lh /var/lib/docker/containers/*/*-json.log

# Manual cleanup (careful!)
# "down" removes the containers together with their logs;
# "prune -af" also deletes every unused image on the host.
docker compose --profile prod down
docker system prune -af
docker compose --profile prod up -d
```

---

## ✅ Monitoring Checklist

### Daily Checks (Automated)

- [ ] Health check endpoint responding
- [ ] All containers running
- [ ] Database accepting connections
- [ ] No critical errors in logs

### Weekly Checks (Manual)

- [ ] Review error logs
- [ ] Check disk space
- [ ] Verify backups are running
- [ ] Test restore from backup
- [ ] Review failed login attempts

### Monthly Checks

- [ ] SSL certificate expiry (renew if < 30 days)
- [ ] Update dependencies
- [ ] Review and rotate secrets
- [ ] Performance review
- [ ] Security audit

---

## 📞 Incident Response

### When Alert Triggers

1. **Check severity**
   ```bash
   ./scripts/health-check.sh prod
   docker compose --profile prod ps
   ```

2. **Check logs**
   ```bash
   docker logs --tail 100 slc-backend-prod
   docker logs --tail 100 slc-db-prod
   ```

3. **Attempt recovery with a restart**
   ```bash
   docker compose --profile prod restart
   ```

4. **If still down, investigate**
   - Database connection issues
   - Disk space full
   - Memory exhaustion
   - Network issues

5. **Document incident**
   - Time of failure
   - Symptoms observed
   - Actions taken
   - Resolution

---

## 🎯 SLA Targets

### Uptime

- **Target:** 99.9% (about 43 minutes of downtime per month)
- **Measurement:** External monitoring (UptimeRobot)

### Performance

- **API Response:** < 200ms (95th percentile)
- **Page Load:** < 2s (95th percentile)

### Recovery

- **Detection:** < 5 minutes
- **Response:** < 15 minutes
- **Resolution:** < 1 hour (non-critical)

---

**Last Updated:** 2025-11-20
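
---

## 🧪 Appendix: Checking the Health Response Body

The automated `curl -f` check in this guide only verifies the HTTP status, while the endpoint table expects the body to contain `{"status":"ok"}`. The following is a minimal sketch of a check that covers both. The function names `health_body_ok` and `check_health` and the 10-second timeout are assumptions for illustration; they are not part of `scripts/health-check.sh`.

```bash
#!/usr/bin/env bash
# Sketch only: health_body_ok and check_health are hypothetical helpers,
# not functions provided by the spotlight.cam repository.

# Return 0 when the given JSON text contains "status": "ok".
# The pattern tolerates optional whitespace around the colon.
health_body_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetch the health endpoint and verify both the HTTP status and the body.
# -f makes curl fail on HTTP errors; --max-time bounds the whole request.
check_health() {
  local url="${1:-https://spotlight.cam/api/health}"
  local body
  body="$(curl -fsS --max-time 10 "$url")" || return 1
  health_body_ok "$body"
}
```

A cron job or alert script could then call `check_health || echo "unhealthy"`. The `grep` match is a deliberate simplification that avoids a dependency on a JSON parser; a tool such as `jq` would be stricter if it is available on the host.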