# Monitoring Guide - spotlight.cam
Complete guide for monitoring spotlight.cam in production.
## 📊 Monitoring Strategy
### Three-Layer Approach
1. **Application Monitoring** - Health checks, logs, metrics
2. **Infrastructure Monitoring** - Docker containers, system resources
3. **External Monitoring** - Uptime, response times, SSL certificates
---
## 🏥 Application Monitoring
### Built-in Health Check
**Endpoint:** `GET /api/health`
**Response (healthy):**
```json
{
  "status": "ok",
  "timestamp": "2025-11-20T12:00:00.000Z",
  "uptime": 3600,
  "environment": "production"
}
```
**Usage:**
```bash
# Check health
curl https://spotlight.cam/api/health
# Automated check (exit code 0 = healthy)
curl -f -s https://spotlight.cam/api/health > /dev/null
```
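
The `curl -f` check above only looks at the HTTP status code. To also confirm that the body reports `"status": "ok"`, a small helper along these lines could be used. This is a sketch, not part of the repository: the function names `health_status` and `is_healthy` are made up for illustration, and `sed` is used so that no extra tool such as `jq` is assumed.

```bash
# Print the value of the "status" field from a health response on stdin.
# Works for both pretty-printed and single-line JSON bodies.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 only when the body on stdin reports status "ok".
# An empty body (for example after a failed request) counts as unhealthy.
is_healthy() {
  [ "$(health_status)" = "ok" ]
}
```

For example, `curl -fsS --max-time 10 https://spotlight.cam/api/health | is_healthy` exits non-zero when the request fails or the status is anything other than `ok`.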
### Health Check Script
Use the built-in health check script:
```bash
# Check all services
./scripts/health-check.sh prod
# Output:
# ✅ nginx: Running
# ✅ Frontend: Running
# ✅ Backend: Running
# ✅ Database: Running
# ✅ API responding
# ✅ Database accepting connections
```
---
## 🐳 Docker Container Monitoring
### Check Container Status
```bash
# List all containers
docker compose --profile prod ps
# Check specific container
docker inspect slc-backend-prod --format='{{.State.Status}}'
# View resource usage
docker stats --no-stream
```
### Container Health Checks
Health checks are defined in `docker-compose.yml`:
- **Backend:** `curl localhost:3000/api/health`
- **Database:** `pg_isready -U spotlightcam`
```bash
# View health status
docker compose --profile prod ps
# Look for "(healthy)" in STATUS column
```
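
Instead of reading the STATUS column by eye, the same information can be filtered in a script. The helper below is a sketch under assumptions: the name `failing_containers` is hypothetical, and it expects `name|status` lines such as those produced by `docker ps -a --filter name=slc- --format '{{.Names}}|{{.Status}}'`. It prints every container that is not running or that reports `(unhealthy)`; containers without a health check count as fine while they are up.

```bash
# Read "name|status" lines on stdin and print the names of containers that
# are either not up (Exited, Restarting, Created, ...) or marked unhealthy.
failing_containers() {
  awk -F '|' '
    $2 !~ /^Up / || $2 ~ /\(unhealthy\)/ { print $1 }
  '
}

# Usage:
#   docker ps -a --filter name=slc- --format "{{.Names}}|{{.Status}}" | failing_containers
```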
---
## 📝 Log Monitoring
### View Logs
```bash
# All services
docker compose --profile prod logs -f
# Specific service
docker logs -f slc-backend-prod
# Last 100 lines
docker logs --tail 100 slc-backend-prod
# With timestamps
docker logs -f --timestamps slc-backend-prod
# Filter errors only
docker logs slc-backend-prod 2>&1 | grep -i error
```
### Log Rotation
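Configured in `docker-compose.yml`. With these settings Docker keeps at most three log files of 10 MB each per container (about 30 MB):
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```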
### Important Log Patterns
**Authentication errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep "401\|403\|locked"
```
**Database errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep -i "prisma\|database"
```
**Rate limiting:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Too many requests"
```
**Email failures:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
```
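
Running four separate `grep` commands gets tedious for routine checks. As a rough sketch, the patterns above can be counted in one pass. The function name `summarize_log_patterns` and the output labels are invented for this example; the patterns themselves are the ones listed in this section. It could be fed with, for instance, `docker logs --since 1h slc-backend-prod 2>&1 | summarize_log_patterns`.

```bash
# Count lines on stdin that match each log pattern from this section and
# print one "label count" pair per line. A line may count in several groups.
summarize_log_patterns() {
  awk '
    /401|403|locked/                { auth++ }
    tolower($0) ~ /prisma|database/ { db++ }
    /Too many requests/             { rate++ }
    /Failed to send.*email/         { email++ }
    END {
      printf "auth_errors %d\n", auth
      printf "database_errors %d\n", db
      printf "rate_limited %d\n", rate
      printf "email_failures %d\n", email
    }
  '
}
```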
---
## 🌐 External Monitoring
### Recommended Services
#### 1. UptimeRobot (Free)
- **URL:** https://uptimerobot.com
- **Features:**
  - 5-minute checks
  - Email/SMS alerts
  - 50 monitors free
  - Status pages

**Setup:**
1. Create account
2. Add HTTP monitor: `https://spotlight.cam`
3. Add HTTP monitor: `https://spotlight.cam/api/health`
4. Set alert contacts
5. Create public status page (optional)
#### 2. Pingdom
- **URL:** https://pingdom.com
- **Features:**
  - 1-minute checks
  - Transaction monitoring
  - Real user monitoring
  - SSL monitoring
#### 3. Better Uptime
- **URL:** https://betteruptime.com
- **Features:**
  - Free tier available
  - Incident management
  - On-call scheduling
  - Status pages
### Monitor These Endpoints
| Endpoint | Check Type | Expected |
|----------|-----------|----------|
| `https://spotlight.cam` | HTTP | 200 OK |
| `https://spotlight.cam/api/health` | HTTP + JSON | `{"status":"ok"}` |
| `spotlight.cam` | SSL | Valid, not expiring |
| `spotlight.cam` | DNS | Resolves correctly |
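
The SSL row can also be checked from a shell when no external service is available. The sketch below converts a certificate's `notAfter` date, as printed by `openssl x509 -noout -enddate`, into the number of whole days remaining. The function name `cert_days_left` is hypothetical, and the date parsing assumes GNU `date` (`date -d`), so it would need adjusting on BSD or macOS.

```bash
# Print the whole days between now (or the epoch given as $2) and the
# certificate expiry date given as $1, e.g. "Jan  1 00:00:00 2026 GMT".
# Assumes GNU date for parsing the date string.
cert_days_left() {
  local not_after="$1" now_epoch="${2:-$(date +%s)}"
  local expiry_epoch
  expiry_epoch="$(date -u -d "$not_after" +%s)" || return 1
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Usage (needs network access and openssl):
#   end="$(openssl s_client -connect spotlight.cam:443 -servername spotlight.cam \
#            </dev/null 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)"
#   [ "$(cert_days_left "$end")" -ge 30 ] || echo "certificate expires soon"
```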
---
## 📈 Metrics to Track
### Application Metrics
1. **Response Times**
   - API endpoints: < 200ms
   - Frontend load: < 1s
2. **Error Rates**
   - 4xx errors: < 1%
   - 5xx errors: < 0.1%
3. **Authentication**
   - Failed logins
   - Account lockouts
   - Password resets
4. **WebRTC**
   - Connection success rate
   - File transfer completions
   - Peer connection failures
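
The error-rate targets can be estimated from access logs without a metrics stack. The following sketch reads one HTTP status code per line and prints the 4xx and 5xx shares. The function name `error_rates` is made up, and how the status codes are extracted depends on the log format: with nginx's default `combined` format the status is the ninth whitespace-separated field, which is an assumption about the proxy configuration here.

```bash
# Read one HTTP status code per line on stdin and print the percentage of
# 4xx and 5xx responses among all valid status codes seen.
error_rates() {
  awk '
    /^4[0-9][0-9]$/     { c4++ }
    /^5[0-9][0-9]$/     { c5++ }
    /^[1-5][0-9][0-9]$/ { total++ }
    END {
      if (total == 0) { print "no requests"; exit }
      printf "4xx %.2f%%\n5xx %.2f%%\n", 100 * c4 / total, 100 * c5 / total
    }
  '
}

# Usage (assumes nginx "combined" access log lines on the proxy container):
#   docker logs --since 24h slc-proxy-prod 2>&1 | awk '{ print $9 }' | error_rates
```

Lines that are not three-digit status codes are ignored, so stray log lines do not distort the totals.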
### Infrastructure Metrics
1. **CPU Usage**
```bash
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
```
2. **Memory Usage**
```bash
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
```
3. **Disk Space**
```bash
df -h
du -sh /var/lib/docker
```
4. **Database Size**
```bash
docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
```
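
To turn the disk space check into something a script can act on, the output of `df -P` can be filtered by usage. This is a minimal sketch: the function name `disks_over_threshold` is hypothetical, the default limit of 80% is an assumed value rather than one taken from this guide, and mount points containing spaces are not handled. A typical call would be `df -P | disks_over_threshold 85`.

```bash
# Read `df -P` output on stdin and print "mountpoint usage%" for every
# filesystem at or above the limit given as $1 (default: 80).
disks_over_threshold() {
  local limit="${1:-80}"
  awk -v limit="$limit" '
    NR > 1 {                      # skip the header line
      use = $5; sub(/%/, "", use) # "Capacity" column without the % sign
      if (use + 0 >= limit + 0) print $6 " " use "%"
    }
  '
}
```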
---
## 🚨 Alerting Setup
### Email Alerts (Simple)
Create alert script:
```bash
#!/bin/bash
# /usr/local/bin/alert-spotlight.sh
SUBJECT="⚠️ spotlight.cam Alert"
RECIPIENT="admin@example.com"
# Run health check
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
  echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
fi
```
Add to crontab:
```bash
*/5 * * * * /usr/local/bin/alert-spotlight.sh
```
### Slack Alerts (Advanced)
```bash
#!/bin/bash
# /usr/local/bin/alert-slack.sh
SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-Type: application/json' \
    -d '{
      "text": "🚨 spotlight.cam health check failed",
      "username": "Monitoring Bot"
    }'
fi
```
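
Both alert scripts send a message on every failed run, so with the five-minute cron schedule an outage produces a new alert every five minutes. One possible refinement, which is not part of the original scripts, is to alert only on the transition from up to down. In this sketch the function name `should_alert` and the state file location are assumptions.

```bash
# Remember the latest state ("up" or "down") in a file and succeed only
# when the state has just changed to "down".
should_alert() {
  local state_file="$1" current="$2" previous="up"
  [ -f "$state_file" ] && previous="$(cat "$state_file")"
  printf '%s\n' "$current" > "$state_file"
  [ "$current" = "down" ] && [ "$previous" != "down" ]
}

# Usage inside an alert script (the state file path is an assumption):
#   if /path/to/spotlightcam/scripts/health-check.sh prod; then state=up; else state=down; fi
#   should_alert /var/tmp/spotlight-health.state "$state" && echo "send the alert here"
```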
---
## 📊 Dashboard (Optional)
### Simple Dashboard with Grafana
1. **Set up Prometheus and Grafana:**
```yaml
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana

# Named volumes must be declared at the top level
volumes:
  grafana_data:
```
2. **Add metrics endpoint to backend** (optional enhancement)
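
The compose file above mounts `./monitoring/prometheus.yml`, which is not shown in this guide. If that file does not exist yet, something like the following could create a minimal one. It is a sketch under assumptions: the function name `write_prometheus_config` is invented, the only active job is Prometheus scraping itself, and the commented-out backend job is hypothetical because the backend does not expose a metrics endpoint yet.

```bash
# Write a minimal Prometheus config into the given directory
# (default: ./monitoring, the path mounted by the compose file).
write_prometheus_config() {
  local dir="${1:-./monitoring}"
  mkdir -p "$dir"
  cat > "$dir/prometheus.yml" <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  # Prometheus scraping its own metrics; needs no backend changes.
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Hypothetical: enable once the backend exposes a /metrics endpoint
  # and both containers share a Docker network.
  # - job_name: backend
  #   metrics_path: /metrics
  #   static_configs:
  #     - targets: ["slc-backend-prod:3000"]
EOF
}
```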
---
## 🔍 Troubleshooting Monitoring
### Health Check Always Fails
```bash
# Test API manually
curl -v https://spotlight.cam/api/health
# Check nginx logs
docker logs slc-proxy-prod
# Check backend logs
docker logs slc-backend-prod
# Test from within container
docker exec slc-proxy-prod curl localhost:80/api/health
```
### High CPU/Memory Usage
```bash
# Identify problematic container
docker stats --no-stream
# Check container logs
docker logs --tail 100 slc-backend-prod
# Restart if needed
docker compose --profile prod restart backend-prod
```
### Logs Not Rotating
```bash
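# Check Docker log file sizes (the directory is usually readable by root only,
# so the glob has to be expanded inside a root shell)
sudo sh -c 'ls -lh /var/lib/docker/containers/*/*-json.log'
# Manual cleanup (careful: "prune -af" deletes ALL unused images, networks,
# and build cache on this host, not only those of spotlight.cam)
docker compose --profile prod down
docker system prune -af
docker compose --profile prod up -d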
```
---
## ✅ Monitoring Checklist
### Daily Checks (Automated)
- [ ] Health check endpoint responding
- [ ] All containers running
- [ ] Database accepting connections
- [ ] No critical errors in logs
### Weekly Checks (Manual)
- [ ] Review error logs
- [ ] Check disk space
- [ ] Verify backups are running
- [ ] Test restore from backup
- [ ] Review failed login attempts
### Monthly Checks
- [ ] SSL certificate expiry (renew if fewer than 30 days remain)
- [ ] Update dependencies
- [ ] Review and rotate secrets
- [ ] Performance review
- [ ] Security audit
---
## 📞 Incident Response
### When Alert Triggers
1. **Check severity**
```bash
./scripts/health-check.sh prod
docker compose --profile prod ps
```
2. **Check logs**
```bash
docker logs --tail 100 slc-backend-prod
docker logs --tail 100 slc-db-prod
```
3. **Attempt recovery by restarting all services**
```bash
docker compose --profile prod restart
```
4. **If still down, investigate**
   - Database connection issues
   - Disk space full
   - Memory exhaustion
   - Network issues
5. **Document incident**
   - Time of failure
   - Symptoms observed
   - Actions taken
   - Resolution
---
## 🎯 SLA Targets
### Uptime
- **Target:** 99.9% (about 43 minutes of downtime per 30-day month)
- **Measurement:** External monitoring (UptimeRobot)
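
The downtime figure follows from simple arithmetic: a 99.9% target leaves 0.1% of the period, and 0.1% of a 30-day month (43,200 minutes) is 43.2 minutes. A small helper, here given the hypothetical name `downtime_budget_minutes`, makes it easy to recompute the budget for other targets or period lengths.

```bash
# Print the allowed downtime in minutes for an uptime target in percent ($1)
# over a period of $2 days (default: 30).
downtime_budget_minutes() {
  local target="$1" days="${2:-30}"
  awk -v t="$target" -v d="$days" \
    'BEGIN { printf "%.1f\n", (100 - t) / 100 * d * 24 * 60 }'
}
```

`downtime_budget_minutes 99.9` prints `43.2`; passing a second argument such as `31` gives the budget for a longer month.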
### Performance
- **API Response:** < 200ms (95th percentile)
- **Page Load:** < 2s (95th percentile)
### Recovery
- **Detection:** < 5 minutes
- **Response:** < 15 minutes
- **Resolution:** < 1 hour (non-critical)
---
**Last Updated:** 2025-11-20