feat: add production operations scripts and monitoring guide

Add comprehensive tooling for production deployment:

Scripts (scripts/):
- backup-db.sh: Automated database backups with 7-day retention
- restore-db.sh: Safe database restore with confirmation prompts
- health-check.sh: Complete service health monitoring
- README.md: Operational scripts documentation

Monitoring (docs/MONITORING.md):
- Application health monitoring
- Docker container monitoring
- External monitoring setup (UptimeRobot, Pingdom)
- Log monitoring and rotation
- Alerting configuration
- Incident response procedures
- SLA targets and metrics

All scripts include:
- Environment support (dev/prod)
- Error handling and validation
- Detailed status reporting
- Safety confirmations where needed
2025-11-20 22:22:22 +01:00
parent 2e194e1640
commit 642c8f6d6f
5 changed files with 827 additions and 0 deletions
--- a/docs/MONITORING.md
+++ b/docs/MONITORING.md
@@ -0,0 +1,427 @@
+# Monitoring Guide - spotlight.cam
+
+Complete guide for monitoring spotlight.cam in production.
+
+## 📊 Monitoring Strategy
+
+### Three-Layer Approach
+
+1. **Application Monitoring** - Health checks, logs, metrics
+2. **Infrastructure Monitoring** - Docker containers, system resources
+3. **External Monitoring** - Uptime, response times, SSL certificates
+
+---
+
+## 🏥 Application Monitoring
+
+### Built-in Health Check
+
+**Endpoint:** `GET /api/health`
+
+**Response (healthy):**
+```json
+{
+  "status": "ok",
+  "timestamp": "2025-11-20T12:00:00.000Z",
+  "uptime": 3600,
+  "environment": "production"
+}
+```
+
+**Usage:**
+```bash
+# Check health
+curl https://spotlight.cam/api/health
+
+# Automated check (exit code 0 = healthy)
+curl -f -s https://spotlight.cam/api/health > /dev/null
+```
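+
+`curl -f` only looks at the HTTP status code. A monitor can additionally assert on the response body. The following is a minimal sketch of one way to do that without extra tools; the function name `health_status_ok` is hypothetical, and it assumes the response keeps the `"status": "ok"` field shown above.
+
+```bash
+#!/bin/bash
+# health_status_ok: hypothetical helper, not part of scripts/.
+# Reads a JSON health response on stdin and returns 0 only if it
+# contains "status": "ok" (whitespace around the colon is tolerated).
+health_status_ok() {
+    grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
+}
+
+# Usage (assumes the endpoint documented above):
+#   curl -fsS https://spotlight.cam/api/health | health_status_ok && echo healthy
+```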
+
+### Health Check Script
+
+Use the built-in health check script:
+```bash
+# Check all services
+./scripts/health-check.sh prod
+
+# Output:
+# ✅ nginx: Running
+# ✅ Frontend: Running
+# ✅ Backend: Running
+# ✅ Database: Running
+# ✅ API responding
+# ✅ Database accepting connections
+```
+
+---
+
+## 🐳 Docker Container Monitoring
+
+### Check Container Status
+
+```bash
+# List all containers
+docker compose --profile prod ps
+
+# Check specific container
+docker inspect slc-backend-prod --format='{{.State.Status}}'
+
+# View resource usage
+docker stats --no-stream
+```
+
+### Container Health Checks
+
+Built into docker-compose.yml:
+- **Backend:** `curl localhost:3000/api/health`
+- **Database:** `pg_isready -U spotlightcam`
+
+```bash
+# View health status
+docker compose --profile prod ps
+# Look for "(healthy)" in STATUS column
+```
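+
+Reading the STATUS column by eye does not scale to automated checks. As a rough sketch, the filter below prints only the containers that need attention; `list_unhealthy` is a hypothetical name, and the rule it applies (flag anything marked `(unhealthy)` or not in the `Up` state) is an assumption about what counts as a problem here.
+
+```bash
+#!/bin/bash
+# list_unhealthy: hypothetical helper, not part of scripts/.
+# Reads "name<TAB>status" lines on stdin and prints the names of
+# containers that are marked (unhealthy) or are not in the "Up" state.
+list_unhealthy() {
+    awk -F '\t' '$2 ~ /\(unhealthy\)/ || $2 !~ /^Up / { print $1 }'
+}
+
+# Usage (-a also lists stopped containers):
+#   docker ps -a --format '{{.Names}}\t{{.Status}}' | list_unhealthy
+```
+
+Containers without a health check (status `Up ...` with no suffix) are treated as fine, since Docker reports no health state for them.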
+
+---
+
+## 📝 Log Monitoring
+
+### View Logs
+
+```bash
+# All services
+docker compose --profile prod logs -f
+
+# Specific service
+docker logs -f slc-backend-prod
+
+# Last 100 lines
+docker logs --tail 100 slc-backend-prod
+
+# With timestamps
+docker logs -f --timestamps slc-backend-prod
+
+# Filter errors only
+docker logs slc-backend-prod 2>&1 | grep -i error
+```
+
+### Log Rotation
+
+Configured in docker-compose.yml:
+```yaml
+logging:
+  driver: "json-file"
+  options:
+    max-size: "10m"
+    max-file: "3"
+```
+
+### Important Log Patterns
+
+**Authentication errors:**
+```bash
+docker logs slc-backend-prod 2>&1 | grep "401\|403\|locked"
+```
+
+**Database errors:**
+```bash
+docker logs slc-backend-prod 2>&1 | grep -i "prisma\|database"
+```
+
+**Rate limiting:**
+```bash
+docker logs slc-backend-prod 2>&1 | grep "Too many requests"
+```
+
+**Email failures:**
+```bash
+docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
+```
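+
+To get a quick overview instead of running each grep separately, the patterns above can be counted in one pass. This is only a sketch: `count_log_patterns` and its output labels are hypothetical, and like the greps it matches substrings, so `401` inside an unrelated number is counted too.
+
+```bash
+#!/bin/bash
+# count_log_patterns: hypothetical helper, not part of scripts/.
+# Reads log lines on stdin and prints one "label: count" line for each
+# of the four patterns listed above.
+count_log_patterns() {
+    awk '
+        /401|403|locked/                { auth++ }
+        tolower($0) ~ /prisma|database/ { db++ }
+        /Too many requests/             { rate++ }
+        /Failed to send.*email/         { email++ }
+        END {
+            printf "auth: %d\ndatabase: %d\nrate-limit: %d\nemail: %d\n", auth, db, rate, email
+        }
+    '
+}
+
+# Usage (last hour of backend logs, stdout and stderr combined):
+#   docker logs --since 1h slc-backend-prod 2>&1 | count_log_patterns
+```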
+
+---
+
+## 🌐 External Monitoring
+
+### Recommended Services
+
+#### 1. UptimeRobot (Free)
+- **URL:** https://uptimerobot.com
+- **Features:**
+  - 5-minute checks
+  - Email/SMS alerts
+  - 50 monitors free
+  - Status pages
+
+**Setup:**
+1. Create account
+2. Add HTTP monitor: `https://spotlight.cam`
+3. Add HTTP monitor: `https://spotlight.cam/api/health`
+4. Set alert contacts
+5. Create public status page (optional)
+
+#### 2. Pingdom
+- **URL:** https://pingdom.com
+- **Features:**
+  - 1-minute checks
+  - Transaction monitoring
+  - Real user monitoring
+  - SSL monitoring
+
+#### 3. Better Uptime
+- **URL:** https://betteruptime.com
+- **Features:**
+  - Free tier available
+  - Incident management
+  - On-call scheduling
+  - Status pages
+
+### Monitor These Endpoints
+
+| Endpoint | Check Type | Expected |
+|----------|-----------|----------|
+| `https://spotlight.cam` | HTTP | 200 OK |
+| `https://spotlight.cam/api/health` | HTTP + JSON | `{"status":"ok"}` |
+| `spotlight.cam` | SSL | Valid, not expiring |
+| `spotlight.cam` | DNS | Resolves correctly |
+
+---
+
+## 📈 Metrics to Track
+
+### Application Metrics
+
+1. **Response Times**
+   - API endpoints: < 200ms
+   - Frontend load: < 1s
+
+2. **Error Rates**
+   - 4xx errors: < 1%
+   - 5xx errors: < 0.1%
+
+3. **Authentication**
+   - Failed logins
+   - Account lockouts
+   - Password resets
+
+4. **WebRTC**
+   - Connection success rate
+   - File transfer completions
+   - Peer connection failures
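+
+The error-rate targets above are percentages of all requests. As an illustration of the arithmetic, the sketch below computes both shares from a stream of HTTP status codes; `error_rates` is a hypothetical helper, and where the status codes come from is left open (the usage line assumes nginx writes the default `combined` access log format to the container's stdout).
+
+```bash
+#!/bin/bash
+# error_rates: hypothetical helper, not part of scripts/.
+# Reads one HTTP status code per line on stdin and prints the 4xx and
+# 5xx shares as percentages of all three-digit status lines seen.
+error_rates() {
+    awk '
+        /^[0-9][0-9][0-9]$/ { total++ }
+        /^4[0-9][0-9]$/     { c4++ }
+        /^5[0-9][0-9]$/     { c5++ }
+        END {
+            if (total == 0) { print "no requests"; exit }
+            printf "4xx: %.2f%%\n5xx: %.2f%%\n", 100 * c4 / total, 100 * c5 / total
+        }
+    '
+}
+
+# Usage (assumption: status code is the 9th whitespace-separated field):
+#   docker logs slc-proxy-prod 2>/dev/null | awk '{ print $9 }' | error_rates
+```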
+
+### Infrastructure Metrics
+
+1. **CPU Usage**
+   ```bash
+   docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
+   ```
+
+2. **Memory Usage**
+   ```bash
+   docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
+   ```
+
+3. **Disk Space**
+   ```bash
+   df -h
+   du -sh /var/lib/docker
+   ```
+
+4. **Database Size**
+   ```bash
+   docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
+   ```
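+
+The disk space command above prints raw numbers; for alerting, a threshold check is usually more useful. A minimal sketch, assuming an 80% default threshold (not a value from this guide) and a hypothetical helper name `disk_over_threshold`:
+
+```bash
+#!/bin/bash
+# disk_over_threshold: hypothetical helper, not part of scripts/.
+# Reads `df -P` output on stdin and prints "mount use%" for every
+# filesystem at or above the given usage percentage (default: 80).
+# Mount points containing spaces are not handled.
+disk_over_threshold() {
+    local limit="${1:-80}"
+    awk -v limit="$limit" 'NR > 1 {
+        use = $5
+        sub(/%$/, "", use)            # strip the trailing percent sign
+        if (use + 0 >= limit + 0) print $6, $5
+    }'
+}
+
+# Usage:
+#   df -P | disk_over_threshold 80
+```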
+
+---
+
+## 🚨 Alerting Setup
+
+### Email Alerts (Simple)
+
+Create an alert script:
+```bash
+#!/bin/bash
+# /usr/local/bin/alert-spotlight.sh
+
+SUBJECT="⚠️ spotlight.cam Alert"
+RECIPIENT="admin@example.com"
+
+# Run health check
+if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
+    echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
+fi
+```
+
+Add to crontab:
+```bash
+*/5 * * * * /usr/local/bin/alert-spotlight.sh
+```
+
+### Slack Alerts (Advanced)
+
+```bash
+#!/bin/bash
+# /usr/local/bin/alert-slack.sh
+
+SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
+
+if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
+    curl -X POST "$SLACK_WEBHOOK" \
+         -H 'Content-Type: application/json' \
+         -d '{
+           "text": "🚨 spotlight.cam health check failed",
+           "username": "Monitoring Bot"
+         }'
+fi
+```
+
+---
+
+## 📊 Dashboard (Optional)
+
+### Simple Dashboard with Grafana
+
+1. **Set up Prometheus and Grafana:**
+```yaml
+# docker-compose.monitoring.yml
+services:
+  prometheus:
+    image: prom/prometheus
+    volumes:
+      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
+    ports:
+      - "9090:9090"
+
+  grafana:
+    image: grafana/grafana
+    ports:
+      - "3001:3000"
+    volumes:
+      - grafana_data:/var/lib/grafana
+
+# Named volumes must be declared at the top level
+volumes:
+  grafana_data:
+```
+
+2. **Add metrics endpoint to backend** (optional enhancement)
+
+---
+
+## 🔍 Troubleshooting Monitoring
+
+### Health Check Always Fails
+
+```bash
+# Test API manually
+curl -v https://spotlight.cam/api/health
+
+# Check nginx logs
+docker logs slc-proxy-prod
+
+# Check backend logs
+docker logs slc-backend-prod
+
+# Test from within container
+docker exec slc-proxy-prod curl localhost:80/api/health
+```
+
+### High CPU/Memory Usage
+
+```bash
+# Identify problematic container
+docker stats --no-stream
+
+# Check container logs
+docker logs --tail 100 slc-backend-prod
+
+# Restart if needed
+docker compose --profile prod restart backend-prod
+```
+
+### Logs Not Rotating
+
+```bash
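+# Check Docker log file sizes (the directory is readable by root only,
+# so the glob has to be expanded inside a root shell)
+sudo sh -c 'ls -lh /var/lib/docker/containers/*/*-json.log'
+
+# Manual cleanup (careful!)
+# Logging options only apply to newly created containers, so recreate them.
+# `down` removes the containers together with their logs;
+# `docker system prune -af` also deletes ALL unused images, networks, and build cache.
+docker compose --profile prod down
+docker system prune -af
+docker compose --profile prod up -d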
+```
+
+---
+
+## ✅ Monitoring Checklist
+
+### Daily Checks (Automated)
+- [ ] Health check endpoint responding
+- [ ] All containers running
+- [ ] Database accepting connections
+- [ ] No critical errors in logs
+
+### Weekly Checks (Manual)
+- [ ] Review error logs
+- [ ] Check disk space
+- [ ] Verify backups are running
+- [ ] Test restore from backup
+- [ ] Review failed login attempts
+
+### Monthly Checks
+- [ ] SSL certificate expiry (renew if < 30 days)
+- [ ] Update dependencies
+- [ ] Review and rotate secrets
+- [ ] Performance review
+- [ ] Security audit
+
+---
+
+## 📞 Incident Response
+
+### When Alert Triggers
+
+1. **Check severity**
+   ```bash
+   ./scripts/health-check.sh prod
+   docker compose --profile prod ps
+   ```
+
+2. **Check logs**
+   ```bash
+   docker logs --tail 100 slc-backend-prod
+   docker logs --tail 100 slc-db-prod
+   ```
+
+3. **Attempt recovery by restarting services**
+   ```bash
+   docker compose --profile prod restart
+   ```
+
+4. **If still down, investigate**
+   - Database connection issues
+   - Disk space full
+   - Memory exhaustion
+   - Network issues
+
+5. **Document incident**
+   - Time of failure
+   - Symptoms observed
+   - Actions taken
+   - Resolution
+
+---
+
+## 🎯 SLA Targets
+
+### Uptime
+- **Target:** 99.9% (about 43 minutes of downtime per 30-day month)
+- **Measurement:** External monitoring (UptimeRobot)
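+
+The downtime figure follows directly from the target: the allowed downtime is the non-uptime fraction multiplied by the length of the period. A small sketch of that arithmetic, using a hypothetical helper `downtime_budget_minutes` and assuming a 30-day month:
+
+```bash
+#!/bin/bash
+# downtime_budget_minutes: hypothetical helper, not part of scripts/.
+# Prints the allowed downtime in minutes for an uptime target (percent)
+# over a period of the given number of days (default: 30).
+downtime_budget_minutes() {
+    local target="$1" days="${2:-30}"
+    awk -v t="$target" -v d="$days" \
+        'BEGIN { printf "%.1f\n", (100 - t) / 100 * d * 24 * 60 }'
+}
+
+downtime_budget_minutes 99.9 30   # prints 43.2
+```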
+
+### Performance
+- **API Response:** < 200ms (95th percentile)
+- **Page Load:** < 2s (95th percentile)
+
+### Recovery
+- **Detection:** < 5 minutes
+- **Response:** < 15 minutes
+- **Resolution:** < 1 hour (non-critical)
+
+---
+
+**Last Updated:** 2025-11-20