diff --git a/docs/MONITORING.md b/docs/MONITORING.md new file mode 100644 index 0000000..c36556a --- /dev/null +++ b/docs/MONITORING.md @@ -0,0 +1,427 @@ +# Monitoring Guide - spotlight.cam + +Complete guide for monitoring spotlight.cam in production. + +## ๐Ÿ“Š Monitoring Strategy + +### Three-Layer Approach + +1. **Application Monitoring** - Health checks, logs, metrics +2. **Infrastructure Monitoring** - Docker containers, system resources +3. **External Monitoring** - Uptime, response times, SSL certificates + +--- + +## ๐Ÿฅ Application Monitoring + +### Built-in Health Check + +**Endpoint:** `GET /api/health` + +**Response (healthy):** +```json +{ + "status": "ok", + "timestamp": "2025-11-20T12:00:00.000Z", + "uptime": 3600, + "environment": "production" +} +``` + +**Usage:** +```bash +# Check health +curl https://spotlight.cam/api/health + +# Automated check (exit code 0 = healthy) +curl -f -s https://spotlight.cam/api/health > /dev/null +``` + +### Health Check Script + +Use built-in health check script: +```bash +# Check all services +./scripts/health-check.sh prod + +# Output: +# โœ… nginx: Running +# โœ… Frontend: Running +# โœ… Backend: Running +# โœ… Database: Running +# โœ… API responding +# โœ… Database accepting connections +``` + +--- + +## ๐Ÿณ Docker Container Monitoring + +### Check Container Status + +```bash +# List all containers +docker compose --profile prod ps + +# Check specific container +docker inspect slc-backend-prod --format='{{.State.Status}}' + +# View resource usage +docker stats --no-stream +``` + +### Container Health Checks + +Built into docker-compose.yml: +- **Backend:** `curl localhost:3000/api/health` +- **Database:** `pg_isready -U spotlightcam` + +```bash +# View health status +docker compose --profile prod ps +# Look for "(healthy)" in STATUS column +``` + +--- + +## ๐Ÿ“ Log Monitoring + +### View Logs + +```bash +# All services +docker compose --profile prod logs -f + +# Specific service +docker logs -f slc-backend-prod + +# Last 100 lines +docker logs --tail 100 slc-backend-prod + +# With timestamps +docker logs -f --timestamps slc-backend-prod + +# Filter errors only +docker logs slc-backend-prod 2>&1 | grep -i error +``` + +### Log Rotation + +Configured in docker-compose.yml: +```yaml +logging: + driver: "json-file" + options: + max-size: "10m" + max-file: "3" +``` + +### Important Log Patterns + +**Authentication errors:** +```bash +docker logs slc-backend-prod | grep "401\|403\|locked" +``` + +**Database errors:** +```bash +docker logs slc-backend-prod | grep -i "prisma\|database" +``` + +**Rate limiting:** +```bash +docker logs slc-backend-prod | grep "Too many requests" +``` + +**Email failures:** +```bash +docker logs slc-backend-prod | grep "Failed to send.*email" +``` + +--- + +## ๐ŸŒ External Monitoring + +### Recommended Services + +#### 1. UptimeRobot (Free) +- **URL:** https://uptimerobot.com +- **Features:** + - 5-minute checks + - Email/SMS alerts + - 50 monitors free + - Status pages + +**Setup:** +1. Create account +2. Add HTTP monitor: `https://spotlight.cam` +3. Add HTTP monitor: `https://spotlight.cam/api/health` +4. Set alert contacts +5. Create public status page (optional) + +#### 2. Pingdom +- **URL:** https://pingdom.com +- **Features:** + - 1-minute checks + - Transaction monitoring + - Real user monitoring + - SSL monitoring + +#### 3. Better Uptime +- **URL:** https://betteruptime.com +- **Features:** + - Free tier available + - Incident management + - On-call scheduling + - Status pages + +### Monitor These Endpoints + +| Endpoint | Check Type | Expected | +|----------|-----------|----------| +| `https://spotlight.cam` | HTTP | 200 OK | +| `https://spotlight.cam/api/health` | HTTP + JSON | `{"status":"ok"}` | +| `spotlight.cam` | SSL | Valid, not expiring | +| `spotlight.cam` | DNS | Resolves correctly | + +--- + +## ๐Ÿ“ˆ Metrics to Track + +### Application Metrics + +1. **Response Times** + - API endpoints: < 200ms + - Frontend load: < 1s + +2. **Error Rates** + - 4xx errors: < 1% + - 5xx errors: < 0.1% + +3. **Authentication** + - Failed logins + - Account lockouts + - Password resets + +4. **WebRTC** + - Connection success rate + - File transfer completions + - Peer connection failures + +### Infrastructure Metrics + +1. **CPU Usage** + ```bash + docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}" + ``` + +2. **Memory Usage** + ```bash + docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}" + ``` + +3. **Disk Space** + ```bash + df -h + du -sh /var/lib/docker + ``` + +4. **Database Size** + ```bash + docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));" + ``` + +--- + +## ๐Ÿšจ Alerting Setup + +### Email Alerts (Simple) + +Create alert script: +```bash +#!/bin/bash +# /usr/local/bin/alert-spotlight.sh + +SUBJECT="โš ๏ธ spotlight.cam Alert" +RECIPIENT="admin@example.com" + +# Run health check +if ! /path/to/spotlightcam/scripts/health-check.sh prod; then + echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT" +fi +``` + +Add to crontab: +```bash +*/5 * * * * /usr/local/bin/alert-spotlight.sh +``` + +### Slack Alerts (Advanced) + +```bash +#!/bin/bash +# /usr/local/bin/alert-slack.sh + +SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL" + +if ! /path/to/spotlightcam/scripts/health-check.sh prod; then + curl -X POST "$SLACK_WEBHOOK" \ + -H 'Content-Type: application/json' \ + -d '{ + "text": "๐Ÿšจ spotlight.cam health check failed", + "username": "Monitoring Bot" + }' +fi +``` + +--- + +## ๐Ÿ“Š Dashboard (Optional) + +### Simple Dashboard with Grafana + +1. **Setup Prometheus:** +```yaml +# docker-compose.monitoring.yml +services: + prometheus: + image: prom/prometheus + volumes: + - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml + ports: + - "9090:9090" + + grafana: + image: grafana/grafana + ports: + - "3001:3000" + volumes: + - grafana_data:/var/lib/grafana +``` + +2. **Add metrics endpoint to backend** (optional enhancement) + +--- + +## ๐Ÿ” Troubleshooting Monitoring + +### Health Check Always Fails + +```bash +# Test API manually +curl -v https://spotlight.cam/api/health + +# Check nginx logs +docker logs slc-proxy-prod + +# Check backend logs +docker logs slc-backend-prod + +# Test from within container +docker exec slc-proxy-prod curl localhost:80/api/health +``` + +### High CPU/Memory Usage + +```bash +# Identify problematic container +docker stats --no-stream + +# Check container logs +docker logs --tail 100 slc-backend-prod + +# Restart if needed +docker compose --profile prod restart backend-prod +``` + +### Logs Not Rotating + +```bash +# Check Docker log files +ls -lh /var/lib/docker/containers/*/*-json.log + +# Manual cleanup (careful!) +docker compose --profile prod down +docker system prune -af +docker compose --profile prod up -d +``` + +--- + +## โœ… Monitoring Checklist + +### Daily Checks (Automated) +- [ ] Health check endpoint responding +- [ ] All containers running +- [ ] Database accepting connections +- [ ] No critical errors in logs + +### Weekly Checks (Manual) +- [ ] Review error logs +- [ ] Check disk space +- [ ] Verify backups are running +- [ ] Test restore from backup +- [ ] Review failed login attempts + +### Monthly Checks +- [ ] SSL certificate expiry (renew if < 30 days) +- [ ] Update dependencies +- [ ] Review and rotate secrets +- [ ] Performance review +- [ ] Security audit + +--- + +## ๐Ÿ“ž Incident Response + +### When Alert Triggers + +1. **Check severity** + ```bash + ./scripts/health-check.sh prod + docker compose --profile prod ps + ``` + +2. **Check logs** + ```bash + docker logs --tail 100 slc-backend-prod + docker logs --tail 100 slc-db-prod + ``` + +3. **Attempt automatic recovery** + ```bash + docker compose --profile prod restart + ``` + +4. **If still down, investigate** + - Database connection issues + - Disk space full + - Memory exhaustion + - Network issues + +5. **Document incident** + - Time of failure + - Symptoms observed + - Actions taken + - Resolution + +--- + +## ๐ŸŽฏ SLA Targets + +### Uptime +- **Target:** 99.9% (43 minutes downtime/month) +- **Measurement:** External monitoring (UptimeRobot) + +### Performance +- **API Response:** < 200ms (95th percentile) +- **Page Load:** < 2s (95th percentile) + +### Recovery +- **Detection:** < 5 minutes +- **Response:** < 15 minutes +- **Resolution:** < 1 hour (non-critical) + +--- + +**Last Updated:** 2025-11-20 diff --git a/scripts/README.md b/scripts/README.md new file mode 100644 index 0000000..3d7ac83 --- /dev/null +++ b/scripts/README.md @@ -0,0 +1,186 @@ +# spotlight.cam - Operations Scripts + +Utility scripts for managing spotlight.cam deployment, backups, and monitoring. + +## ๐Ÿ“‹ Available Scripts + +### 1. Database Backup (`backup-db.sh`) + +Creates a timestamped backup of the PostgreSQL database. + +```bash +# Backup development database +./scripts/backup-db.sh dev + +# Backup production database +./scripts/backup-db.sh prod +``` + +**Features:** +- Automatic timestamping +- Automatic cleanup (keeps last 7 days) +- Backup size reporting +- Error handling + +**Backups location:** `./backups/` + +**Setup cron job for automatic backups:** +```bash +# Edit crontab +crontab -e + +# Add daily backup at 2 AM (production) +0 2 * * * cd /path/to/spotlightcam && ./scripts/backup-db.sh prod >> logs/backup.log 2>&1 +``` + +--- + +### 2. Database Restore (`restore-db.sh`) + +Restores database from a backup file. + +```bash +# Restore development database +./scripts/restore-db.sh ./backups/backup_dev_20251120_120000.sql dev + +# Restore production database +./scripts/restore-db.sh ./backups/backup_prod_20251120_120000.sql prod +``` + +**Safety features:** +- Confirmation prompt +- File existence check +- Container status validation + +--- + +### 3. Health Check (`health-check.sh`) + +Monitors all services and checks system health. + +```bash +# Check development environment +./scripts/health-check.sh dev + +# Check production environment +./scripts/health-check.sh prod +``` + +**Checks:** +- โœ… nginx container status +- โœ… Frontend container status +- โœ… Backend container status +- โœ… Database container status +- โœ… API health endpoint +- โœ… Database connection + +**Exit codes:** +- `0` - All systems operational +- `1` - One or more services unhealthy + +**Setup monitoring:** +```bash +# Check every 5 minutes and send alerts on failure +*/5 * * * * /path/to/spotlightcam/scripts/health-check.sh prod || /path/to/alert-script.sh +``` + +--- + +## ๐Ÿ”ง Setup Instructions + +### Make scripts executable +```bash +chmod +x scripts/*.sh +``` + +### Create backups directory +```bash +mkdir -p backups +``` + +### Test scripts +```bash +# Test backup +./scripts/backup-db.sh dev + +# Test health check +./scripts/health-check.sh dev +``` + +--- + +## ๐Ÿ“Š Monitoring Setup + +### Option 1: Cron-based monitoring +```bash +# Add to crontab +crontab -e + +# Health check every 5 minutes +*/5 * * * * /path/to/spotlightcam/scripts/health-check.sh prod || echo "Health check failed" | mail -s "spotlight.cam Alert" admin@example.com + +# Daily backup at 2 AM +0 2 * * * /path/to/spotlightcam/scripts/backup-db.sh prod +``` + +### Option 2: External monitoring +Recommended external services: +- **UptimeRobot** - Free, checks every 5 min +- **Pingdom** - Advanced monitoring +- **StatusCake** - Free tier available + +Monitor these endpoints: +- `https://spotlight.cam` - Frontend +- `https://spotlight.cam/api/health` - Backend API + +--- + +## ๐Ÿšจ Troubleshooting + +### Backup fails +```bash +# Check if database container is running +docker ps | grep slc-db + +# Check database logs +docker logs slc-db-prod + +# Manually test connection +docker exec slc-db-prod pg_isready -U spotlightcam +``` + +### Health check fails +```bash +# View detailed container status +docker ps -a + +# Check specific service logs +docker logs slc-backend-prod +docker logs slc-db-prod + +# Restart services +docker compose --profile prod restart +``` + +### Restore fails +```bash +# Check backup file integrity +head -n 10 ./backups/backup_prod_20251120_120000.sql + +# Verify database is accepting connections +docker exec slc-db-prod psql -U spotlightcam -c "SELECT version();" +``` + +--- + +## ๐Ÿ“ Best Practices + +1. **Always test backups** - Regularly test restore process in dev environment +2. **Monitor disk space** - Ensure backups directory has enough space +3. **Secure backups** - Store backups off-server (AWS S3, Backblaze B2) +4. **Regular testing** - Test health checks and disaster recovery procedures +5. **Log rotation** - Use logrotate for script logs + +--- + +**Last Updated:** 2025-11-20 diff --git a/scripts/backup-db.sh b/scripts/backup-db.sh new file mode 100644 index 0000000..f03d13c --- /dev/null +++ b/scripts/backup-db.sh @@ -0,0 +1,61 @@ +#!/bin/bash +# Database backup script for spotlight.cam +# Usage: ./scripts/backup-db.sh [dev|prod] + +set -e + +# Default to development if no argument provided +ENV=${1:-dev} + +# Configuration +DATE=$(date +%Y%m%d_%H%M%S) +BACKUP_DIR="./backups" + +# Create backup directory if it doesn't exist +mkdir -p "$BACKUP_DIR" + +# Set container name based on environment +if [ "$ENV" = "prod" ]; then + DB_CONTAINER="slc-db-prod" + DB_NAME="spotlightcam" + BACKUP_FILE="$BACKUP_DIR/backup_prod_$DATE.sql" +else + DB_CONTAINER="slc-db" + DB_NAME="spotlightcam" + BACKUP_FILE="$BACKUP_DIR/backup_dev_$DATE.sql" +fi + +echo "๐Ÿ”„ Starting database backup..." +echo "๐Ÿ“ฆ Environment: $ENV" +echo "๐Ÿ—„๏ธ Container: $DB_CONTAINER" +echo "๐Ÿ’พ Backup file: $BACKUP_FILE" + +# Check if container is running +if ! docker ps --format '{{.Names}}' | grep -q "^${DB_CONTAINER}$"; then + echo "โŒ Error: Container $DB_CONTAINER is not running" + exit 1 +fi + +# Create backup +docker exec "$DB_CONTAINER" pg_dump -U spotlightcam "$DB_NAME" > "$BACKUP_FILE" + +# Check if backup was successful +if [ $? -eq 0 ]; then + BACKUP_SIZE=$(du -h "$BACKUP_FILE" | cut -f1) + echo "โœ… Backup completed successfully!" + echo "๐Ÿ“Š Backup size: $BACKUP_SIZE" + echo "๐Ÿ“ Location: $BACKUP_FILE" +else + echo "โŒ Backup failed!" + exit 1 +fi + +# Keep only last 7 days of backups +echo "๐Ÿงน Cleaning old backups (keeping last 7 days)..." +find "$BACKUP_DIR" -name "backup_*.sql" -mtime +7 -delete + +# Count remaining backups +BACKUP_COUNT=$(find "$BACKUP_DIR" -name "backup_*.sql" | wc -l) +echo "๐Ÿ“š Total backups: $BACKUP_COUNT" + +echo "โœจ Done!" diff --git a/scripts/health-check.sh b/scripts/health-check.sh new file mode 100644 index 0000000..0756bc5 --- /dev/null +++ b/scripts/health-check.sh @@ -0,0 +1,88 @@ +#!/bin/bash +# Health check script for spotlight.cam +# Usage: ./scripts/health-check.sh [dev|prod] + +set -e + +ENV=${1:-dev} + +# Set service names based on environment +if [ "$ENV" = "prod" ]; then + NGINX_CONTAINER="slc-proxy-prod" + FRONTEND_CONTAINER="slc-frontend-prod" + BACKEND_CONTAINER="slc-backend-prod" + DB_CONTAINER="slc-db-prod" + API_URL="https://spotlight.cam/api/health" +else + NGINX_CONTAINER="slc-proxy" + FRONTEND_CONTAINER="slc-frontend" + BACKEND_CONTAINER="slc-backend" + DB_CONTAINER="slc-db" + API_URL="http://localhost:8080/api/health" +fi + +echo "๐Ÿฅ spotlight.cam Health Check" +echo "๐Ÿ“ฆ Environment: $ENV" +echo "================================" +echo "" + +# Function to check container status +check_container() { + local container=$1 + local service=$2 + + if docker ps --format '{{.Names}}' | grep -q "^${container}$"; then + local status=$(docker inspect --format='{{.State.Status}}' "$container") + if [ "$status" = "running" ]; then + echo "โœ… $service: Running" + return 0 + else + echo "โš ๏ธ $service: Container exists but not running (status: $status)" + return 1 + fi + else + echo "โŒ $service: Container not found" + return 1 + fi +} + +# Check all containers +ALL_OK=true + +check_container "$NGINX_CONTAINER" "nginx" || ALL_OK=false +check_container "$FRONTEND_CONTAINER" "Frontend" || ALL_OK=false +check_container "$BACKEND_CONTAINER" "Backend" || ALL_OK=false +check_container "$DB_CONTAINER" "Database" || ALL_OK=false + +echo "" + +# Check API health endpoint +echo "๐Ÿ”Œ API Health Check:" +if curl -f -s "$API_URL" > /dev/null 2>&1; then + echo "โœ… API responding at $API_URL" +else + echo "โŒ API not responding at $API_URL" + ALL_OK=false +fi + +echo "" + +# Database connection test +echo "๐Ÿ—„๏ธ Database Connection:" +if docker exec "$DB_CONTAINER" pg_isready -U spotlightcam > /dev/null 2>&1; then + echo "โœ… Database accepting connections" +else + echo "โŒ Database not accepting connections" + ALL_OK=false +fi + +echo "" +echo "================================" + +if [ "$ALL_OK" = true ]; then + echo "โœ… All systems operational!" + exit 0 +else + echo "โš ๏ธ Some services are not healthy" + exit 1 +fi diff --git a/scripts/restore-db.sh b/scripts/restore-db.sh new file mode 100644 index 0000000..bf23db7 --- /dev/null +++ b/scripts/restore-db.sh @@ -0,0 +1,65 @@ +#!/bin/bash +# Database restore script for spotlight.cam +# Usage: ./scripts/restore-db.sh [dev|prod] + +set -e + +# Check if backup file is provided +if [ -z "$1" ]; then + echo "โŒ Error: Backup file not specified" + echo "Usage: ./scripts/restore-db.sh [dev|prod]" + echo "Example: ./scripts/restore-db.sh ./backups/backup_dev_20251120_120000.sql dev" + exit 1 +fi + +BACKUP_FILE=$1 +ENV=${2:-dev} + +# Check if backup file exists +if [ ! -f "$BACKUP_FILE" ]; then + echo "โŒ Error: Backup file not found: $BACKUP_FILE" + exit 1 +fi + +# Set container name based on environment +if [ "$ENV" = "prod" ]; then + DB_CONTAINER="slc-db-prod" + DB_NAME="spotlightcam" +else + DB_CONTAINER="slc-db" + DB_NAME="spotlightcam" +fi + +echo "โš ๏ธ WARNING: This will REPLACE the current database!" +echo "๐Ÿ“ฆ Environment: $ENV" +echo "๐Ÿ—„๏ธ Container: $DB_CONTAINER" +echo "๐Ÿ’พ Backup file: $BACKUP_FILE" +echo "" +read -p "Are you sure you want to continue? (yes/no): " -r +echo + +if [[ ! $REPLY =~ ^[Yy][Ee][Ss]$ ]]; then + echo "โŒ Restore cancelled" + exit 0 +fi + +# Check if container is running +if ! docker ps --format '{{.Names}}' | grep -q "^${DB_CONTAINER}$"; then + echo "โŒ Error: Container $DB_CONTAINER is not running" + exit 1 +fi + +echo "๐Ÿ”„ Starting database restore..." + +# Restore backup +cat "$BACKUP_FILE" | docker exec -i "$DB_CONTAINER" psql -U spotlightcam "$DB_NAME" + +# Check if restore was successful +if [ $? -eq 0 ]; then + echo "โœ… Restore completed successfully!" +else + echo "โŒ Restore failed!" + exit 1 +fi + +echo "โœจ Done!"