feat: add production operations scripts and monitoring guide
Add comprehensive tooling for production deployment:

Scripts (scripts/):
- backup-db.sh: Automated database backups with 7-day retention
- restore-db.sh: Safe database restore with confirmation prompts
- health-check.sh: Complete service health monitoring
- README.md: Operational scripts documentation

Monitoring (docs/MONITORING.md):
- Application health monitoring
- Docker container monitoring
- External monitoring setup (UptimeRobot, Pingdom)
- Log monitoring and rotation
- Alerting configuration
- Incident response procedures
- SLA targets and metrics

All scripts include:
- Environment support (dev/prod)
- Error handling and validation
- Detailed status reporting
- Safety confirmations where needed
427
docs/MONITORING.md
Normal file
@@ -0,0 +1,427 @@
# Monitoring Guide - spotlight.cam

Complete guide for monitoring spotlight.cam in production.

## 📊 Monitoring Strategy

### Three-Layer Approach

1. **Application Monitoring** - Health checks, logs, metrics
2. **Infrastructure Monitoring** - Docker containers, system resources
3. **External Monitoring** - Uptime, response times, SSL certificates

---

## 🏥 Application Monitoring

### Built-in Health Check

**Endpoint:** `GET /api/health`

**Response (healthy):**
```json
{
  "status": "ok",
  "timestamp": "2025-11-20T12:00:00.000Z",
  "uptime": 3600,
  "environment": "production"
}
```

**Usage:**
```bash
# Check health
curl https://spotlight.cam/api/health

# Automated check (exit code 0 = healthy)
curl -f -s https://spotlight.cam/api/health > /dev/null
```

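The automated check above only looks at the HTTP status. A minimal sketch of also checking the response body follows. The function name `is_healthy_response` is hypothetical, and it uses a `grep` pattern instead of a real JSON parser, so treat it as an illustration, not as part of `health-check.sh`.

```bash
# Succeeds only when the JSON body contains "status": "ok".
# This is a pattern match, not a JSON parser: it tolerates whitespace
# around the colon but knows nothing about nesting.
is_healthy_response() {
  printf '%s' "$1" | grep -q -E '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Possible usage against the live endpoint (the 10-second timeout is an assumption):
#   body=$(curl -f -s --max-time 10 https://spotlight.cam/api/health) \
#     && is_healthy_response "$body" \
#     || echo "unhealthy"
```

Checking the body as well catches the case where a proxy or error page answers with HTTP 200 but the backend is not actually reporting `ok`.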
### Health Check Script

Use the built-in health check script:
```bash
# Check all services
./scripts/health-check.sh prod

# Output:
# ✅ nginx: Running
# ✅ Frontend: Running
# ✅ Backend: Running
# ✅ Database: Running
# ✅ API responding
# ✅ Database accepting connections
```

---

## 🐳 Docker Container Monitoring

### Check Container Status

```bash
# List all containers
docker compose --profile prod ps

# Check specific container
docker inspect slc-backend-prod --format='{{.State.Status}}'

# View resource usage
docker stats --no-stream
```

### Container Health Checks

Health checks are built into `docker-compose.yml`:
- **Backend:** `curl localhost:3000/api/health`
- **Database:** `pg_isready -U spotlightcam`

```bash
# View health status
docker compose --profile prod ps
# Look for "(healthy)" in the STATUS column
```

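Scanning the STATUS column by eye does not scale to automated checks. The sketch below filters `docker ps` output for containers that are not yet, or no longer, healthy. The function name `unhealthy_containers` is hypothetical, and it assumes the status strings Docker normally prints, such as `(unhealthy)` and `(health: starting)`.

```bash
# Reads "name<TAB>status" lines on stdin and prints the names of containers
# whose status contains "(unhealthy)" or "(health: starting)".
# Containers without a health check are ignored.
unhealthy_containers() {
  awk -F '\t' '$2 ~ /\(unhealthy\)|\(health: starting\)/ { print $1 }'
}

# Possible usage:
#   docker ps --format '{{.Names}}\t{{.Status}}' | unhealthy_containers
```

An empty result means every container with a health check currently reports healthy.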
---

## 📝 Log Monitoring

### View Logs

```bash
# All services
docker compose --profile prod logs -f

# Specific service
docker logs -f slc-backend-prod

# Last 100 lines
docker logs --tail 100 slc-backend-prod

# With timestamps
docker logs -f --timestamps slc-backend-prod

# Filter errors only
docker logs slc-backend-prod 2>&1 | grep -i error
```

### Log Rotation

Log rotation is configured in `docker-compose.yml`. With these settings each container keeps at most three log files of 10 MB each, about 30 MB in total:
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```

### Important Log Patterns

`docker logs` replays the container's stderr on stderr, so each command redirects it with `2>&1` before filtering.

**Authentication errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep "401\|403\|locked"
```

**Database errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep -i "prisma\|database"
```

**Rate limiting:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Too many requests"
```

**Email failures:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
```

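To get a quick overview instead of running each filter separately, the patterns can be counted in one pass. This is a minimal sketch: the function name `summarize_log_patterns` and the labels are hypothetical, and it matches case-insensitively for every pattern, which is a simplification of the individual commands.

```bash
# Reads log lines on stdin and prints one "label=count" line per pattern.
# Counts are numbers of matching lines, not numbers of matches.
summarize_log_patterns() {
  logs=$(cat)
  while IFS=: read -r label pattern; do
    count=$(printf '%s\n' "$logs" | grep -c -i -E "$pattern" || true)
    printf '%s=%s\n' "$label" "$count"
  done <<'EOF'
auth:401|403|locked
database:prisma|database
rate_limit:Too many requests
email:Failed to send.*email
EOF
}

# Possible usage:
#   docker logs --since 1h slc-backend-prod 2>&1 | summarize_log_patterns
```

A sudden jump in any of these counts between runs is a reasonable trigger for a closer look at the full logs.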
---

## 🌐 External Monitoring

### Recommended Services

#### 1. UptimeRobot (Free)
- **URL:** https://uptimerobot.com
- **Features:**
  - 5-minute checks
  - Email/SMS alerts
  - 50 monitors free
  - Status pages

**Setup:**
1. Create account
2. Add HTTP monitor: `https://spotlight.cam`
3. Add HTTP monitor: `https://spotlight.cam/api/health`
4. Set alert contacts
5. Create public status page (optional)

#### 2. Pingdom
- **URL:** https://pingdom.com
- **Features:**
  - 1-minute checks
  - Transaction monitoring
  - Real user monitoring
  - SSL monitoring

#### 3. Better Uptime
- **URL:** https://betteruptime.com
- **Features:**
  - Free tier available
  - Incident management
  - On-call scheduling
  - Status pages

### Monitor These Endpoints

| Endpoint | Check Type | Expected |
|----------|-----------|----------|
| `https://spotlight.cam` | HTTP | 200 OK |
| `https://spotlight.cam/api/health` | HTTP + JSON | `{"status":"ok"}` |
| `spotlight.cam` | SSL | Valid, not expiring |
| `spotlight.cam` | DNS | Resolves correctly |

---

## 📈 Metrics to Track

### Application Metrics

1. **Response Times**
   - API endpoints: < 200ms
   - Frontend load: < 1s

2. **Error Rates**
   - 4xx errors: < 1%
   - 5xx errors: < 0.1%

3. **Authentication**
   - Failed logins
   - Account lockouts
   - Password resets

4. **WebRTC**
   - Connection success rate
   - File transfer completions
   - Peer connection failures

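The error-rate targets are percentages of all requests, so they need a total to divide by. A minimal sketch of that calculation follows. The function name `error_rates` is hypothetical, and where the status codes come from (for example an nginx access log) is an assumption, since this guide does not define the log format.

```bash
# Reads one HTTP status code per line on stdin and prints the share of
# 4xx and 5xx responses as percentages of all requests.
error_rates() {
  awk '
    /^4[0-9][0-9]$/ { c4++ }
    /^5[0-9][0-9]$/ { c5++ }
    NF { total++ }
    END {
      if (total == 0) { print "no requests"; exit 1 }
      printf "4xx=%.2f%% 5xx=%.2f%%\n", 100 * c4 / total, 100 * c5 / total
    }'
}

# Possible usage, assuming the status code is field 9 of each access log line
# (true for the default nginx "combined" format):
#   docker logs slc-proxy-prod 2>/dev/null | awk '{ print $9 }' | error_rates
```

Compare the two numbers against the 1% and 0.1% targets above.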
### Infrastructure Metrics

1. **CPU Usage**
   ```bash
   docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
   ```

2. **Memory Usage**
   ```bash
   docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
   ```

3. **Disk Space**
   ```bash
   df -h
   du -sh /var/lib/docker
   ```

4. **Database Size**
   ```bash
   docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
   ```

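Disk space is the easiest of these metrics to turn into an automated warning. The sketch below flags file systems above a usage threshold. The function name `disk_usage_over` and the default of 80% are assumptions, not values taken from the project's scripts.

```bash
# Reads `df -P` output on stdin and prints every mount point whose usage
# is at or above the given percentage (default 80, an assumed threshold).
# Mount points containing spaces are not handled.
disk_usage_over() {
  threshold="${1:-80}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t + 0) print $6 " " $5
  }'
}

# Possible usage:
#   df -P | disk_usage_over 90
```

The `-P` flag asks `df` for the POSIX output format, which keeps each file system on a single line and makes the column positions predictable.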
---

## 🚨 Alerting Setup

### Email Alerts (Simple)

Create an alert script:
```bash
#!/bin/bash
# /usr/local/bin/alert-spotlight.sh
# Requires a working `mail` command on the host

SUBJECT="⚠️ spotlight.cam Alert"
RECIPIENT="admin@example.com"

# Run health check
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
    echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
fi
```

Make the script executable and add it to the crontab to run every 5 minutes:
```bash
*/5 * * * * /usr/local/bin/alert-spotlight.sh
```

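With a 5-minute cron schedule, the script above sends a new email on every run for as long as the outage lasts. One way to avoid that is to alert only when the state changes. The sketch below is an optional refinement under stated assumptions: the function name `state_changed` and the state file path are hypothetical and not part of the existing scripts.

```bash
# Records the latest state ("ok" or "fail") in a state file and succeeds
# only when it differs from the state recorded by the previous run.
# A missing state file is treated as a previous state of "ok".
state_changed() {
  new_state="$1"
  state_file="$2"
  old_state="ok"
  if [ -f "$state_file" ]; then
    old_state=$(cat "$state_file")
  fi
  printf '%s\n' "$new_state" > "$state_file"
  [ "$new_state" != "$old_state" ]
}

# Possible usage inside the alert script (the path is an assumption):
#   if /path/to/spotlightcam/scripts/health-check.sh prod; then state=ok; else state=fail; fi
#   if state_changed "$state" /var/tmp/spotlight-health.state; then
#     echo "spotlight.cam is now: $state ($(date))" | mail -s "$SUBJECT" "$RECIPIENT"
#   fi
```

This sends one message when the service goes down and one when it recovers.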
### Slack Alerts (Advanced)

```bash
#!/bin/bash
# /usr/local/bin/alert-slack.sh

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
    curl -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d '{
            "text": "🚨 spotlight.cam health check failed",
            "username": "Monitoring Bot"
        }'
fi
```

---

## 📊 Dashboard (Optional)

### Simple Dashboard with Grafana

1. **Set up Prometheus and Grafana:**
   ```yaml
   # docker-compose.monitoring.yml
   services:
     prometheus:
       image: prom/prometheus
       volumes:
         - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
       ports:
         - "9090:9090"

     grafana:
       image: grafana/grafana
       ports:
         - "3001:3000"
       volumes:
         - grafana_data:/var/lib/grafana

   # Named volumes must be declared at the top level
   volumes:
     grafana_data:
   ```

2. **Add metrics endpoint to backend** (optional enhancement)

---

## 🔍 Troubleshooting Monitoring

### Health Check Always Fails

```bash
# Test API manually
curl -v https://spotlight.cam/api/health

# Check nginx logs
docker logs slc-proxy-prod

# Check backend logs
docker logs slc-backend-prod

# Test from within the proxy container (requires curl in the image)
docker exec slc-proxy-prod curl localhost:80/api/health
```

### High CPU/Memory Usage

```bash
# Identify problematic container
docker stats --no-stream

# Check container logs
docker logs --tail 100 slc-backend-prod

# Restart if needed
docker compose --profile prod restart backend-prod
```

### Logs Not Rotating

```bash
# Check Docker log files
ls -lh /var/lib/docker/containers/*/*-json.log

# Manual cleanup (careful! This causes downtime, and
# `docker system prune -af` also removes all unused images,
# networks, and build cache)
docker compose --profile prod down
docker system prune -af
docker compose --profile prod up -d
```

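Before resorting to a full cleanup, it helps to confirm which log files have actually outgrown the rotation limit. The sketch below wraps `find` for that purpose. The function name `large_log_files` is hypothetical, and the default limit of 20 MiB is an arbitrary margin above the configured `max-size` of `10m`.

```bash
# Prints container log files under the given directory that are larger
# than the given size in MiB.
# Defaults: Docker's container directory and 20 MiB (an assumed margin).
large_log_files() {
  dir="${1:-/var/lib/docker/containers}"
  limit_mb="${2:-20}"
  find "$dir" -type f -name '*-json.log' -size +"${limit_mb}M"
}

# Possible usage (reading Docker's data directory usually needs root):
#   sudo sh -c 'find /var/lib/docker/containers -type f -name "*-json.log" -size +20M'
```

If this prints nothing, rotation is working and the disk usage is probably coming from somewhere else, such as images or volumes.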
---

## ✅ Monitoring Checklist

### Daily Checks (Automated)
- [ ] Health check endpoint responding
- [ ] All containers running
- [ ] Database accepting connections
- [ ] No critical errors in logs

### Weekly Checks (Manual)
- [ ] Review error logs
- [ ] Check disk space
- [ ] Verify backups are running
- [ ] Test restore from backup
- [ ] Review failed login attempts

### Monthly Checks
- [ ] SSL certificate expiry (renew if < 30 days)
- [ ] Update dependencies
- [ ] Review and rotate secrets
- [ ] Performance review
- [ ] Security audit

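The 30-day rule for SSL certificates is simple date arithmetic. The sketch below shows one way to express it. The function names `days_between` and `needs_renewal` are hypothetical, and it relies on GNU `date -d`, which is an assumption about the host.

```bash
# Prints the whole number of days from the first date to the second.
# Relies on GNU `date -d`; arguments are dates such as 2025-11-20.
days_between() {
  start=$(date -u -d "$1" +%s) || return 1
  end=$(date -u -d "$2" +%s) || return 1
  echo $(( (end - start) / 86400 ))
}

# Succeeds when fewer than 30 days remain between "today" (first argument)
# and the certificate expiry date (second argument).
needs_renewal() {
  [ "$(days_between "$1" "$2")" -lt 30 ]
}

# Possible usage; the expiry date would come from the certificate, e.g.:
#   openssl s_client -connect spotlight.cam:443 -servername spotlight.cam \
#     </dev/null 2>/dev/null | openssl x509 -noout -enddate
#   needs_renewal "$(date -u +%F)" "2026-01-15" && echo "renew soon"
```

For a plain yes/no answer, `openssl x509 -noout -checkend 2592000` performs the same 30-day test directly on a certificate.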
---

## 📞 Incident Response

### When an Alert Triggers

1. **Check severity**
   ```bash
   ./scripts/health-check.sh prod
   docker compose --profile prod ps
   ```

2. **Check logs**
   ```bash
   docker logs --tail 100 slc-backend-prod
   docker logs --tail 100 slc-db-prod
   ```

3. **Attempt recovery with a restart**
   ```bash
   docker compose --profile prod restart
   ```

4. **If still down, investigate**
   - Database connection issues
   - Disk space full
   - Memory exhaustion
   - Network issues

5. **Document incident**
   - Time of failure
   - Symptoms observed
   - Actions taken
   - Resolution

---

## 🎯 SLA Targets

### Uptime
- **Target:** 99.9% (about 43 minutes of downtime per 30-day month)
- **Measurement:** External monitoring (UptimeRobot)

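The downtime figure follows from the target: 0.1% of a 30-day month (43,200 minutes) is 43.2 minutes. A minimal sketch of the same arithmetic for other targets or periods, where the function name `downtime_budget_minutes` is hypothetical:

```bash
# Prints the allowed downtime in minutes for an uptime target (in percent)
# over a period of the given number of days.
downtime_budget_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

# Example:
#   downtime_budget_minutes 99.9 30
```

Comparing this budget with the downtime reported by the external monitor shows how much margin is left in the current month.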
### Performance
- **API Response:** < 200ms (95th percentile)
- **Page Load:** < 2s (95th percentile)

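Both performance targets are stated as 95th percentiles, meaning 95% of requests must be at or below the limit. The sketch below computes that value from a list of measurements using the nearest-rank method. The function name `p95` is hypothetical, and how the latencies are collected is left open because this guide does not define it.

```bash
# Reads one latency value (for example in milliseconds) per line on stdin
# and prints the 95th percentile using the nearest-rank method.
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR)
      print v[rank]
    }'
}

# Possible usage with a file of response times, one per line:
#   p95 < latencies.txt
```

With few samples the result is dominated by the slowest requests, so the target is only meaningful over a reasonably large measurement window.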
### Recovery
- **Detection:** < 5 minutes
- **Response:** < 15 minutes
- **Resolution:** < 1 hour (non-critical)

---

**Last Updated:** 2025-11-20