spotlightcam/docs/MONITORING.md

# Monitoring Guide - spotlight.cam

Complete guide for monitoring spotlight.cam in production.

## 📊 Monitoring Strategy

### Three-Layer Approach

1. **Application Monitoring** - Health checks, logs, metrics
2. **Infrastructure Monitoring** - Docker containers, system resources
3. **External Monitoring** - Uptime, response times, SSL certificates

---

## 🏥 Application Monitoring

### Built-in Health Check

**Endpoint:** `GET /api/health`

**Response (healthy):**
```json
{
  "status": "ok",
  "timestamp": "2025-11-20T12:00:00.000Z",
  "uptime": 3600,
  "environment": "production"
}
```

**Usage:**
```bash
# Check health
curl https://spotlight.cam/api/health

# Automated check (exit code 0 = healthy)
curl -f -s https://spotlight.cam/api/health > /dev/null
```
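`curl -f` only looks at the HTTP status code. A monitor that also wants to confirm the body reports `"status": "ok"` could use something like the sketch below. The function name `health_body_ok` is hypothetical, and the pattern match assumes the response shape shown above; a real JSON parser such as `jq` would be more robust if it is installed.

```bash
#!/bin/bash
# Hypothetical helper: succeeds only if the JSON on stdin contains "status": "ok".
# Assumes the response shape documented above; this is not a general JSON parser.
health_body_ok() {
    grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Example wiring (needs network access, so it is left commented out):
# curl -fsS --max-time 10 https://spotlight.cam/api/health | health_body_ok \
#     && echo "healthy" || echo "unhealthy"
```

If `curl` fails, the pipe delivers no body, the match fails, and the check reports unhealthy.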

### Health Check Script

Use the built-in health check script:
```bash
# Check all services
./scripts/health-check.sh prod

# Output:
# ✅ nginx: Running
# ✅ Frontend: Running
# ✅ Backend: Running
# ✅ Database: Running
# ✅ API responding
# ✅ Database accepting connections
```

---

## 🐳 Docker Container Monitoring

### Check Container Status

```bash
# List all containers
docker compose --profile prod ps

# Check specific container
docker inspect slc-backend-prod --format='{{.State.Status}}'

# View resource usage
docker stats --no-stream
```

### Container Health Checks

Built into docker-compose.yml:
- **Backend:** `curl localhost:3000/api/health`
- **Database:** `pg_isready -U spotlightcam`

```bash
# View health status
docker compose --profile prod ps
# Look for "(healthy)" in STATUS column
```
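For scripted checks, reading the health status directly is easier than parsing the `ps` table. The sketch below wraps `docker inspect` with the `.State.Health.Status` template field; the function name `container_is_healthy` is hypothetical, and it assumes the container defines a healthcheck (otherwise the inspect call fails and the function reports not healthy).

```bash
#!/bin/bash
# Hypothetical helper: succeeds only when Docker reports the container as "healthy".
# Containers without a healthcheck make the inspect template fail, which is
# treated as "not healthy" here.
container_is_healthy() {
    local name="$1" status
    status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null) || return 1
    [ "$status" = "healthy" ]
}

# Example (container names taken from this guide):
# for c in slc-backend-prod slc-db-prod; do
#     container_is_healthy "$c" && echo "$c: healthy" || echo "$c: NOT healthy"
# done
```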

---

## 📝 Log Monitoring

### View Logs

```bash
# All services
docker compose --profile prod logs -f

# Specific service
docker logs -f slc-backend-prod

# Last 100 lines
docker logs --tail 100 slc-backend-prod

# With timestamps
docker logs -f --timestamps slc-backend-prod

# Filter errors only
docker logs slc-backend-prod 2>&1 | grep -i error
```

### Log Rotation

Configured in docker-compose.yml:
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```

### Important Log Patterns

`docker logs` replays the container's stderr stream on its own stderr, so each command below merges it into stdout with `2>&1` before filtering.

**Authentication errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep -E "401|403|locked"
```

**Database errors:**
```bash
docker logs slc-backend-prod 2>&1 | grep -iE "prisma|database"
```

**Rate limiting:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Too many requests"
```

**Email failures:**
```bash
docker logs slc-backend-prod 2>&1 | grep "Failed to send.*email"
```
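To get a quick overview instead of running each filter separately, the patterns above can be counted in one pass. This is a minimal sketch: `summarize_log_patterns` is a hypothetical name, and the patterns simply mirror the commands above, so they inherit the same assumptions about the backend's log wording.

```bash
#!/bin/bash
# Hypothetical helper: reads log lines on stdin and prints one count per category.
# A line that matches several patterns is counted in each matching category.
summarize_log_patterns() {
    awk '
        /401|403|locked/                { auth++ }
        tolower($0) ~ /prisma|database/ { db++ }
        /Too many requests/             { rate++ }
        /Failed to send.*email/         { email++ }
        END {
            printf "auth=%d db=%d rate_limit=%d email=%d\n", auth, db, rate, email
        }
    '
}

# Example: summarize the last hour of backend logs.
# docker logs --since 1h slc-backend-prod 2>&1 | summarize_log_patterns
```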

---

## 🌐 External Monitoring

### Recommended Services

#### 1. UptimeRobot (Free)
- **URL:** https://uptimerobot.com
- **Features:**
  - 5-minute checks
  - Email/SMS alerts
  - 50 monitors free
  - Status pages

**Setup:**
1. Create account
2. Add HTTP monitor: `https://spotlight.cam`
3. Add HTTP monitor: `https://spotlight.cam/api/health`
4. Set alert contacts
5. Create public status page (optional)

#### 2. Pingdom
- **URL:** https://pingdom.com
- **Features:**
  - 1-minute checks
  - Transaction monitoring
  - Real user monitoring
  - SSL monitoring

#### 3. Better Uptime
- **URL:** https://betteruptime.com
- **Features:**
  - Free tier available
  - Incident management
  - On-call scheduling
  - Status pages

### Monitor These Endpoints

| Endpoint | Check Type | Expected |
|----------|-----------|----------|
| `https://spotlight.cam` | HTTP | 200 OK |
| `https://spotlight.cam/api/health` | HTTP + JSON | `{"status":"ok"}` |
| `spotlight.cam` | SSL | Valid, not expiring |
| `spotlight.cam` | DNS | Resolves correctly |

---

## 📈 Metrics to Track

### Application Metrics

1. **Response Times**
   - API endpoints: < 200ms
   - Frontend load: < 1s

2. **Error Rates**
   - 4xx errors: < 1%
   - 5xx errors: < 0.1%

3. **Authentication**
   - Failed logins
   - Account lockouts
   - Password resets

4. **WebRTC**
   - Connection success rate
   - File transfer completions
   - Peer connection failures
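The error-rate targets above can be checked against a stream of HTTP status codes. This is a sketch under stated assumptions: `status_class_rate` is a hypothetical helper, and it expects one status code per line, however those codes are extracted.

```bash
#!/bin/bash
# Hypothetical helper: reads one HTTP status code per line on stdin and prints
# the percentage of responses in the given class (4 for 4xx, 5 for 5xx).
# Lines that are not exactly three digits are ignored.
status_class_rate() {
    awk -v class="$1" '
        /^[0-9][0-9][0-9]$/ { total++; if (substr($0, 1, 1) == class) hits++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * hits / total
        }
    '
}

# Example, assuming the proxy writes nginx's default "combined" access log
# format, where the status code is the 9th whitespace-separated field:
# docker logs slc-proxy-prod 2>&1 | awk '{print $9}' | status_class_rate 5
```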

### Infrastructure Metrics

1. **CPU Usage**
   ```bash
   docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}"
   ```

2. **Memory Usage**
   ```bash
   docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"
   ```

3. **Disk Space**
   ```bash
   df -h
   du -sh /var/lib/docker
   ```

4. **Database Size**
   ```bash
   docker exec slc-db-prod psql -U spotlightcam -c "SELECT pg_size_pretty(pg_database_size('spotlightcam'));"
   ```
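Disk space is easier to alert on as a threshold than to read from `df -h` by eye. The sketch below parses POSIX `df -P` output; the function names and the 85% default limit are assumptions, not values taken from the project.

```bash
#!/bin/bash
# Hypothetical helper: prints the used-space percentage (integer, no % sign)
# of the filesystem that holds PATH, using POSIX `df -P` output.
disk_used_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Returns 0 when usage is below LIMIT, 1 when at or above it, 2 if df gave no data.
# The 85% default is an assumed threshold; pick a value that suits the host.
check_disk() {
    local path="${1:-/}" limit="${2:-85}" used
    used=$(disk_used_percent "$path")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$limit" ]; then
        echo "WARN: $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}

# Example:
# check_disk /var/lib/docker 85
```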

---

## 🚨 Alerting Setup

### Email Alerts (Simple)

Create an alert script:
```bash
#!/bin/bash
# /usr/local/bin/alert-spotlight.sh

SUBJECT="⚠️ spotlight.cam Alert"
RECIPIENT="admin@example.com"

# Run health check
if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
    echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT"
fi
```

Add it to the crontab so it runs every 5 minutes:
```bash
*/5 * * * * /usr/local/bin/alert-spotlight.sh
```
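With a 5-minute cron schedule, an outage that lasts an hour produces twelve identical emails. One possible refinement, sketched below, is to remember the last known state and notify only on transitions. The helper name `record_state` and the state-file location are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: stores CURRENT ("up" or "down") in STATE_FILE and prints
# "alert" on an up->down transition, "recovered" on down->up, nothing otherwise.
# A missing state file is treated as a previous state of "up".
record_state() {
    local state_file="$1" current="$2" previous="up"
    [ -f "$state_file" ] && previous=$(cat "$state_file")
    echo "$current" > "$state_file"
    if [ "$previous" = "up" ] && [ "$current" = "down" ]; then
        echo "alert"
    elif [ "$previous" = "down" ] && [ "$current" = "up" ]; then
        echo "recovered"
    fi
}

# Example wiring inside the alert script (paths are placeholders):
# if /path/to/spotlightcam/scripts/health-check.sh prod; then s=up; else s=down; fi
# case "$(record_state /var/tmp/spotlight-health.state "$s")" in
#     alert)     echo "Health check failed at $(date)" | mail -s "$SUBJECT" "$RECIPIENT" ;;
#     recovered) echo "Recovered at $(date)" | mail -s "$SUBJECT" "$RECIPIENT" ;;
# esac
```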

### Slack Alerts (Advanced)

```bash
#!/bin/bash
# /usr/local/bin/alert-slack.sh

SLACK_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

if ! /path/to/spotlightcam/scripts/health-check.sh prod; then
    curl -X POST "$SLACK_WEBHOOK" \
         -H 'Content-Type: application/json' \
         -d '{
           "text": "🚨 spotlight.cam health check failed",
           "username": "Monitoring Bot"
         }'
fi
```

---

## 📊 Dashboard (Optional)

### Simple Dashboard with Grafana

1. **Set up Prometheus and Grafana:**
```yaml
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
    volumes:
      - grafana_data:/var/lib/grafana

# Named volumes must be declared at the top level
volumes:
  grafana_data:
```

2. **Add a metrics endpoint to the backend** (optional enhancement)

---

## 🔍 Troubleshooting Monitoring

### Health Check Always Fails

```bash
# Test API manually
curl -v https://spotlight.cam/api/health

# Check nginx logs
docker logs slc-proxy-prod

# Check backend logs
docker logs slc-backend-prod

# Test from within container
docker exec slc-proxy-prod curl localhost:80/api/health
```

### High CPU/Memory Usage

```bash
# Identify problematic container
docker stats --no-stream

# Check container logs
docker logs --tail 100 slc-backend-prod

# Restart if needed
docker compose --profile prod restart backend-prod
```

### Logs Not Rotating

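```bash
# Check Docker log file sizes (the directory is usually root-only,
# so the glob has to be expanded by a root shell)
sudo sh -c 'ls -lh /var/lib/docker/containers/*/*-json.log'

# Manual cleanup (careful!): `down` removes the containers and their log files;
# `prune -af` also deletes ALL unused images, networks, and build cache
docker compose --profile prod down
docker system prune -af
docker compose --profile prod up -d
```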

---

## ✅ Monitoring Checklist

### Daily Checks (Automated)
- [ ] Health check endpoint responding
- [ ] All containers running
- [ ] Database accepting connections
- [ ] No critical errors in logs

### Weekly Checks (Manual)
- [ ] Review error logs
- [ ] Check disk space
- [ ] Verify backups are running
- [ ] Test restore from backup
- [ ] Review failed login attempts

### Monthly Checks
- [ ] SSL certificate expiry (renew if < 30 days)
- [ ] Update dependencies
- [ ] Review and rotate secrets
- [ ] Performance review
- [ ] Security audit

---

## 📞 Incident Response

### When Alert Triggers

1. **Check severity**
   ```bash
   ./scripts/health-check.sh prod
   docker compose --profile prod ps
   ```

2. **Check logs**
   ```bash
   docker logs --tail 100 slc-backend-prod
   docker logs --tail 100 slc-db-prod
   ```

3. **Attempt recovery by restarting all services**
   ```bash
   docker compose --profile prod restart
   ```

4. **If still down, investigate**
   - Database connection issues
   - Disk space full
   - Memory exhaustion
   - Network issues

5. **Document incident**
   - Time of failure
   - Symptoms observed
   - Actions taken
   - Resolution

---

## 🎯 SLA Targets

### Uptime
- **Target:** 99.9% (about 43 minutes of downtime per 30-day month)
- **Measurement:** External monitoring (UptimeRobot)
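The downtime budget follows directly from the target: a 30-day month has 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2 minutes. The small helper below (a hypothetical name, shown only to make the arithmetic reusable for other targets or periods) does the same calculation.

```bash
#!/bin/bash
# Prints the allowed downtime in minutes for an uptime target (in percent)
# over a period of DAYS days (default 30).
downtime_budget_minutes() {
    awk -v target="$1" -v days="${2:-30}" \
        'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

downtime_budget_minutes 99.9 30    # prints 43.2  (30-day month)
downtime_budget_minutes 99.9 365   # prints 525.6 (full year)
```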

### Performance
- **API Response:** < 200ms (95th percentile)
- **Page Load:** < 2s (95th percentile)

### Recovery
- **Detection:** < 5 minutes
- **Response:** < 15 minutes
- **Resolution:** < 1 hour (non-critical)

---

**Last Updated:** 2025-11-20