Monitoring Guide - spotlight.cam
Complete guide for monitoring spotlight.cam in production.
📊 Monitoring Strategy
Three-Layer Approach
- Application Monitoring - Health checks, logs, metrics
- Infrastructure Monitoring - Docker containers, system resources
- External Monitoring - Uptime, response times, SSL certificates
🏥 Application Monitoring
Built-in Health Check
Endpoint: GET /api/health
Response (healthy):
{
"status": "ok",
"timestamp": "2025-11-20T12:00:00.000Z",
"uptime": 3600,
"environment": "production"
}
Usage:
# Check health
curl https://spotlight.cam/api/health
# Automated check (exit code 0 = healthy)
curl -f -s https://spotlight.cam/api/health > /dev/null
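The `-f` check above only looks at the HTTP status code. A minimal sketch of a stricter check that also confirms the body reports `"status": "ok"` might look like this. The helper name `health_body_ok` is hypothetical, and the regex assumes the flat JSON shape shown in the sample response rather than parsing JSON properly.

```shell
# health_body_ok: hypothetical helper. Reads a health response body on stdin
# and succeeds only if it contains "status": "ok".
# Assumption: the response keeps the flat shape shown above (no JSON parser used).
health_body_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Example against the live endpoint (not run here):
#   curl -fsS --max-time 10 https://spotlight.cam/api/health | health_body_ok \
#     && echo "healthy" || echo "unhealthy"
```

If `curl` fails, the pipe delivers an empty body and the helper reports failure, so both a bad status code and a bad body end up as "unhealthy".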
Health Check Script
Use the built-in health check script:
# Check all services
./scripts/health-check.sh prod
# Output:
# ✅ nginx: Running
# ✅ Frontend: Running
# ✅ Backend: Running
# ✅ Database: Running
# ✅ API responding
# ✅ Database accepting connections
🐳 Docker Container Monitoring
Check Container Status
# List all containers
docker compose --profile prod ps
# Check specific container
docker inspect slc-backend-prod --format='{{.State.Status}}'
# View resource usage
docker stats --no-stream
Container Health Checks
Built into docker-compose.yml:
- Backend: curl localhost:3000/api/health
- Database: pg_isready -U spotlightcam
# View health status
docker compose --profile prod ps
# Look for "(healthy)" in STATUS column
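Reading the STATUS column by eye does not scale to cron jobs. The sketch below filters `docker ps` output for containers that are not up or that report `(unhealthy)`. The helper name `flag_unhealthy` is hypothetical, and `docker ps -a` lists every container on the host, not only this project's.

```shell
# flag_unhealthy: hypothetical helper. Reads "name<TAB>status" lines on stdin,
# prints containers that are not "Up" or that report "(unhealthy)", and exits
# non-zero if any were found.
flag_unhealthy() {
  awk -F'\t' '
    $2 !~ /^Up/ || $2 ~ /\(unhealthy\)/ { print $1 ": " $2; bad = 1 }
    END { exit bad + 0 }
  '
}

# Example (not run here):
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | flag_unhealthy
```

Containers without a health check (for example a plain nginx proxy) show no health suffix at all, which is why the filter looks for `(unhealthy)` instead of requiring `(healthy)`.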
📝 Log Monitoring
View Logs
# All services
docker compose --profile prod logs -f
# Specific service
docker logs -f slc-backend-prod
# Last 100 lines
docker logs --tail 100 slc-backend-prod
# With timestamps
docker logs -f --timestamps slc-backend-prod
# Filter errors only
docker logs slc-backend-prod 2>&1 | grep -i error
Log Rotation
Configured in docker-compose.yml:
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
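To see what these settings mean for disk usage, the worst case is `max-size` times `max-file` per container. The sketch below does that arithmetic; the helper name is hypothetical, and the count of four containers assumes the same logging block is applied to nginx, frontend, backend, and database.

```shell
# log_cap_mb: hypothetical helper. Upper bound on json-file log disk usage in MB:
# max-size (MB) x max-file x number of containers.
log_cap_mb() {
  local max_size_mb=$1 max_file=$2 containers=$3
  echo $(( max_size_mb * max_file * containers ))
}

# Assumption: all four services use the logging block above.
log_cap_mb 10 3 4   # prints 120
```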
Important Log Patterns
Authentication errors:
docker logs slc-backend-prod 2>&1 | grep "401\|403\|locked"
Database errors:
docker logs slc-backend-prod 2>&1 | grep -i "prisma\|database"
Rate limiting:
docker logs slc-backend-prod 2>&1 | grep "Too many requests"
Email failures:
docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
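Running four separate greps gives four separate outputs. A rough sketch that counts all four categories in one pass over the log stream could look like the following. The helper name `summarize_log_patterns` is hypothetical; the patterns are the ones listed above.

```shell
# summarize_log_patterns: hypothetical helper. Reads log lines on stdin and
# prints one count per category, reusing the patterns from the greps above.
summarize_log_patterns() {
  awk '
    /401|403|locked/                { auth++ }
    tolower($0) ~ /prisma|database/ { db++ }
    /Too many requests/             { rate++ }
    /Failed to send.*email/         { mail++ }
    END {
      printf "auth=%d db=%d rate_limit=%d email=%d\n", auth, db, rate, mail
    }
  '
}

# Example (not run here): summarize the last hour of backend logs
#   docker logs --since 1h slc-backend-prod 2>&1 | summarize_log_patterns
```

The `401|403` pattern is a plain substring match, so it can also count unrelated numbers that happen to contain those digits; treat the totals as a hint, not an exact figure.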
🌐 External Monitoring
Recommended Services
1. UptimeRobot (Free)
- URL: https://uptimerobot.com
- Features:
- 5-minute checks
- Email/SMS alerts
- 50 monitors free
- Status pages
Setup:
- Create account
- Add HTTP monitor: https://spotlight.cam
- Add HTTP monitor: https://spotlight.cam/api/health
- Set alert contacts
- Create public status page (optional)
2. Pingdom
- URL: https://pingdom.com
- Features:
- 1-minute checks
- Transaction monitoring
- Real user monitoring
- SSL monitoring
3. Better Uptime
- URL: https://betteruptime.com
- Features:
- Free tier available
- Incident management
- On-call scheduling
- Status pages
Monitor These Endpoints
| Endpoint | Check Type | Expected |
|---|---|---|
| https://spotlight.cam | HTTP | 200 OK |
| https://spotlight.cam/api/health | HTTP + JSON | {"status":"ok"} |
| spotlight.cam | SSL | Valid, not expiring |
| spotlight.cam | DNS | Resolves correctly |
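If no external service is watching the certificate, the SSL row can be approximated locally. This is a sketch only: both helper names are hypothetical, and `cert_days_left` assumes `openssl` and GNU `date` (for `date -d`) are installed on the host running the check.

```shell
# days_between: whole days from epoch second $1 to epoch second $2.
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# cert_days_left: hypothetical helper. Prints the number of days until the
# certificate served by host $1 on port 443 expires.
# Assumptions: openssl is installed and `date -d` is GNU date.
cert_days_left() {
  local host=$1 end_date end_epoch
  end_date=$(openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
  [ -n "$end_date" ] || return 1
  end_epoch=$(date -d "$end_date" +%s) || return 1
  days_between "$(date +%s)" "$end_epoch"
}

# Example (not run here):
#   days=$(cert_days_left spotlight.cam) && [ "$days" -ge 30 ] || echo "certificate needs attention"
```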
📈 Metrics to Track
Application Metrics
- Response Times
  - API endpoints: < 200ms
  - Frontend load: < 1s
- Error Rates
  - 4xx errors: < 1%
  - 5xx errors: < 0.1%
- Authentication
  - Failed logins
  - Account lockouts
  - Password resets
- WebRTC
  - Connection success rate
  - File transfer completions
  - Peer connection failures
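The error-rate targets above can be checked against a list of HTTP status codes. The following is a sketch under assumptions: the helper name is hypothetical, and how the status codes are extracted depends on the proxy's log format, which this guide does not specify.

```shell
# check_error_rates: hypothetical helper. Reads HTTP status codes on stdin
# (one per line), prints the 4xx and 5xx rates, and exits non-zero when
# 4xx >= 1% or 5xx >= 0.1% of all requests.
check_error_rates() {
  awk '
    /^4[0-9][0-9]$/ { c4++ }
    /^5[0-9][0-9]$/ { c5++ }
    { total++ }
    END {
      if (total == 0) { print "no requests"; exit 0 }
      r4 = 100 * c4 / total
      r5 = 100 * c5 / total
      printf "4xx=%.2f%% 5xx=%.2f%% (n=%d)\n", r4, r5, total
      bad = (r4 >= 1 || r5 >= 0.1)
      exit bad
    }
  '
}

# Example (not run here). ASSUMPTION: the proxy writes nginx "combined" access
# logs to stdout, where the status code is the 9th whitespace-separated field.
#   docker logs --since 1h slc-proxy-prod 2>/dev/null | awk '{ print $9 }' | check_error_rates
```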
Infrastructure Metrics
- CPU Usage
  docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
- Memory Usage
  docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
- Disk Space
  df -h
  du -sh /var/lib/docker
- Database Size
  docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
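The disk space command above is meant for reading by eye. For an automated check, the percentage can be pulled out of `df -P` output, as in this sketch. The helper name is hypothetical and the 80% threshold in the usage comment is an assumption, not a documented target.

```shell
# disk_use_pct: hypothetical helper. Reads `df -P <path>` output on stdin and
# prints the usage percentage of the first listed filesystem as a bare integer.
disk_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Example (not run here). ASSUMPTION: 80% is a reasonable warning threshold.
#   pct=$(df -P / | disk_use_pct)
#   [ "$pct" -lt 80 ] || echo "disk usage at ${pct}%"
```

The `-P` flag requests the POSIX output format, which keeps each filesystem on a single line so the field positions stay stable.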
🚨 Alerting Setup
Email Alerts (Simple)
Create alert script:
#!/bin/bash
# /usr/local/bin/alert-spotlight.sh
SUBJECT="⚠️ spotlight.cam Alert"
RECIPIENT="admin@example.com"

# Run health check and send mail only on failure
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
  echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
fi
Add to crontab:
*/5 * * * * /usr/local/bin/alert-spotlight.sh
Slack Alerts (Advanced)
#!/bin/bash
# /usr/local/bin/alert-slack.sh
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d '{
      "text": "🚨 spotlight.cam health check failed",
      "username": "Monitoring Bot"
    }'
fi
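Run every five minutes, either alert script sends a new message on each run for as long as the service stays down. One possible way to alert only when the state changes to failing is to remember the previous result in a small state file. This is a sketch: the helper name and the state file path are hypothetical.

```shell
# should_alert: hypothetical helper that suppresses repeat alerts.
#   $1 = state file path, $2 = current status ("ok" or "fail")
# Records the current status and succeeds only when the status has just
# changed to "fail" (a missing state file counts as previously "ok").
should_alert() {
  local state_file=$1 current=$2 previous="ok"
  if [ -f "$state_file" ]; then
    previous=$(cat "$state_file")
  fi
  printf '%s\n' "$current" > "$state_file"
  [ "$current" = "fail" ] && [ "$previous" != "fail" ]
}

# Example (not run here); the state file location is an assumption:
#   if /path/to/spotlightcam/scripts/health-check.sh prod; then status=ok; else status=fail; fi
#   if should_alert /var/tmp/spotlight-health.state "$status"; then
#     echo "Health check failed at $(date)" | mail -s "spotlight.cam Alert" admin@example.com
#   fi
```

Because the state resets to `ok` after a recovery, the next outage triggers a fresh alert.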
📊 Dashboard (Optional)
Simple Dashboard with Grafana
- Set up Prometheus:
  # docker-compose.monitoring.yml
  services:
    prometheus:
      image: prom/prometheus
      volumes:
        - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      ports:
        - "9090:9090"
    grafana:
      image: grafana/grafana
      ports:
        - "3001:3000"
      volumes:
        - grafana_data:/var/lib/grafana
  # Named volumes must be declared at the top level
  volumes:
    grafana_data:
- Add metrics endpoint to backend (optional enhancement)
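The compose file mounts `./monitoring/prometheus.yml`, which this guide does not show. A minimal sketch that generates one is below. The helper name is hypothetical, and the `backend` scrape job is an assumption: it only works once the backend exposes a `/metrics` endpoint and Prometheus shares a Docker network with it; the target `backend-prod:3000` is a guess based on the service name and port used elsewhere in this guide.

```shell
# write_prometheus_config: hypothetical helper. Writes a minimal
# prometheus.yml into directory $1, creating the directory if needed.
write_prometheus_config() {
  local dir=$1
  mkdir -p "$dir"
  cat > "$dir/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  # Prometheus scraping its own metrics.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # ASSUMPTION: the backend has no /metrics endpoint yet. This job is a
  # placeholder, and the service name and port are guesses.
  - job_name: backend
    metrics_path: /metrics
    static_configs:
      - targets: ["backend-prod:3000"]
EOF
}

# Example (not run here):
#   write_prometheus_config ./monitoring
```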
🔍 Troubleshooting Monitoring
Health Check Always Fails
# Test API manually
curl -v https://spotlight.cam/api/health
# Check nginx logs
docker logs slc-proxy-prod
# Check backend logs
docker logs slc-backend-prod
# Test from within container
docker exec slc-proxy-prod curl localhost:80/api/health
High CPU/Memory Usage
# Identify problematic container
docker stats --no-stream
# Check container logs
docker logs --tail 100 slc-backend-prod
# Restart if needed
docker compose --profile prod restart backend-prod
Logs Not Rotating
# Check Docker log files
ls -lh /var/lib/docker/containers/*/*-json.log
# Manual cleanup (careful: causes downtime, and prune -af also deletes all unused images)
docker compose --profile prod down
docker system prune -af
docker compose --profile prod up -d
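Before reaching for a full cleanup, it can help to confirm which log files actually exceed the configured limit. The sketch below lists them; the helper name is hypothetical. Note that logging options only take effect when a container is created, so a container started before the options were added keeps its old settings until it is recreated.

```shell
# oversized_logs: hypothetical helper. Lists *-json.log files under directory $1
# that are larger than $2 MiB. Rotated files (*-json.log.1, ...) are not matched.
oversized_logs() {
  local dir=$1 limit_mb=$2
  find "$dir" -type f -name '*-json.log' -size +"${limit_mb}"M
}

# Examples (not run here; reading /var/lib/docker usually requires root):
#   oversized_logs /var/lib/docker/containers 10
# Show the logging settings a container was created with:
#   docker inspect --format '{{.HostConfig.LogConfig}}' slc-backend-prod
# Recreate containers so updated logging options apply:
#   docker compose --profile prod up -d --force-recreate
```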
✅ Monitoring Checklist
Daily Checks (Automated)
- Health check endpoint responding
- All containers running
- Database accepting connections
- No critical errors in logs
Weekly Checks (Manual)
- Review error logs
- Check disk space
- Verify backups are running
- Test restore from backup
- Review failed login attempts
Monthly Checks
- SSL certificate expiry (renew if < 30 days)
- Update dependencies
- Review and rotate secrets
- Performance review
- Security audit
📞 Incident Response
When Alert Triggers
- Check severity
  ./scripts/health-check.sh prod
  docker compose --profile prod ps
- Check logs
  docker logs --tail 100 slc-backend-prod
  docker logs --tail 100 slc-db-prod
- Attempt automatic recovery
  docker compose --profile prod restart
- If still down, investigate
  - Database connection issues
  - Disk space full
  - Memory exhaustion
  - Network issues
- Document incident
  - Time of failure
  - Symptoms observed
  - Actions taken
  - Resolution
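For the documentation step, a small helper can print a blank record with the four fields listed above so every incident is written up the same way. This is only a sketch; the helper name and the record layout are hypothetical.

```shell
# new_incident_record: hypothetical helper. Prints a blank incident record.
#   $1 = time of failure (optional; defaults to the current UTC time)
new_incident_record() {
  local when=${1:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}
  cat <<EOF
Incident record
Time of failure: $when
Symptoms observed:
Actions taken:
Resolution:
EOF
}

# Example (not run here); the target directory is an assumption:
#   new_incident_record "2025-11-20T12:00:00Z" > incidents/2025-11-20.txt
```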
🎯 SLA Targets
Uptime
- Target: 99.9% (43 minutes downtime/month)
- Measurement: External monitoring (UptimeRobot)
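The 43-minute figure follows from allowing 0.1% downtime over a 30-day month. A short sketch of that arithmetic, using a hypothetical helper name:

```shell
# downtime_budget_minutes: hypothetical helper. Allowed downtime in minutes
# for an uptime target of $1 percent over $2 days.
downtime_budget_minutes() {
  awk -v pct="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - pct) / 100 * days * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```

A 30-day month has 43,200 minutes, so 0.1% of it is 43.2 minutes; a 31-day month allows slightly more.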
Performance
- API Response: < 200ms (95th percentile)
- Page Load: < 2s (95th percentile)
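A 95th-percentile target needs a list of measured latencies, which this guide does not say how to collect. Assuming one latency in milliseconds per line, the nearest-rank percentile can be sketched as follows; the helper name is hypothetical.

```shell
# p95: hypothetical helper. Reads one numeric latency per line on stdin and
# prints the 95th percentile using the nearest-rank method:
# rank = ceil(0.95 * N), computed with integer-friendly arithmetic.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)
      print v[rank]
    }
  '
}

# Example (not run here). ASSUMPTION: latencies.txt holds API response times in ms.
#   [ "$(p95 < latencies.txt)" -lt 200 ] || echo "API p95 above target"
```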
Recovery
- Detection: < 5 minutes
- Response: < 15 minutes
- Resolution: < 1 hour (non-critical)
Last Updated: 2025-11-20