Operations & Maintenance Manual
Status: Complete
Version: 5.0
Last Updated: 2024
Purpose: Comprehensive guide for operating and maintaining NudgeCampaign in production
Table of Contents
- System Operations Overview
- Infrastructure Management
- Monitoring & Alerting
- Database Maintenance
- Performance Tuning
- Backup & Recovery
- Incident Response
- Security Operations
- Scaling Operations
- Disaster Recovery
- Maintenance Procedures
- Troubleshooting Guide
System Operations Overview
NudgeCampaign operates as a cloud-native, multi-tenant SaaS platform requiring 24/7 availability and proactive maintenance to ensure optimal performance, security, and reliability.
Operational Architecture
Key Operational Metrics
| Metric | Target | Critical Threshold | Response Time |
|---|---|---|---|
| Uptime | 99.9% | <99.5% | Immediate |
| Response Time (P95) | <200ms | >500ms | 5 minutes |
| Error Rate | <1% | >5% | Immediate |
| Database CPU | <70% | >90% | 5 minutes |
| Queue Depth | <1000 | >5000 | 15 minutes |
| Email Delivery Rate | >98% | <95% | 30 minutes |
Service Level Objectives (SLOs)
- Availability SLO: 99.9% uptime measured monthly (error budget worked out below)
- Performance SLO: 95% of requests under 200ms
- Data Durability: 99.999999% (eight nines)
- Email Delivery: 98% successful delivery rate
- Support Response: <4 hours for critical issues
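These targets translate directly into error budgets. A quick sketch of the arithmetic (TypeScript, illustrative only):

```typescript
// Error budget implied by an availability SLO: a 99.9% monthly target
// leaves 0.1% of the window as allowable downtime.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60
  return totalMinutes * (1 - sloPercent / 100)
}

console.log(errorBudgetMinutes(99.9, 30))  // 43.2 minutes per 30-day month
console.log(errorBudgetMinutes(99.9, 365)) // 525.6 minutes per year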
Infrastructure Management
Cloud Infrastructure Overview
Infrastructure as Code
All infrastructure is managed through Terraform:
# terraform/production/main.tf
terraform {
required_version = ">= 1.0"
backend "s3" {
bucket = "nudgecampaign-terraform-state"
key = "production/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-locks"
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Environment = "production"
ManagedBy = "terraform"
Application = "nudgecampaign"
}
}
}
module "vpc" {
source = "../modules/vpc"
cidr_block = "10.0.0.0/16"
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
public_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
private_subnet_cidrs = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
database_subnet_cidrs = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
}
module "ecs_cluster" {
source = "../modules/ecs"
cluster_name = "nudgecampaign-production"
vpc_id = module.vpc.vpc_id
subnets = module.vpc.private_subnet_ids
services = {
api = {
cpu = 2048
memory = 4096
desired_count = 4
min_count = 2
max_count = 10
}
worker = {
cpu = 1024
memory = 2048
desired_count = 3
min_count = 1
max_count = 6
}
}
}
module "database" {
source = "../modules/rds"
engine = "postgres"
engine_version = "14.9"
instance_class = "db.r6g.xlarge"
allocated_storage = 100
multi_az = true
backup_retention_period = 30
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.database_subnet_ids
}
Container Management
ECS task definitions for services:
{
"family": "nudgecampaign-api",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "2048",
"memory": "4096",
"containerDefinitions": [
{
"name": "api",
"image": "nudgecampaign/api:latest",
"portMappings": [
{
"containerPort": 3000,
"protocol": "tcp"
}
],
"environment": [
{
"name": "NODE_ENV",
"value": "production"
}
],
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:us-east-1:xxx:secret:database-url"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/nudgecampaign-api",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60
}
}
]
}
Kubernetes Operations
For Kubernetes deployments:
# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nudgecampaign-api
namespace: production
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0
selector:
matchLabels:
app: nudgecampaign-api
template:
metadata:
labels:
app: nudgecampaign-api
spec:
containers:
- name: api
image: nudgecampaign/api:v5.0.0
ports:
- containerPort: 3000
env:
- name: NODE_ENV
value: "production"
envFrom:
- secretRef:
name: api-secrets
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
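Both manifests assume the API serves /health (liveness) and /ready (readiness) endpoints. A minimal sketch, assuming an Express app and the shared Prisma client (import paths are illustrative):

```typescript
import express from 'express'
import { prisma } from './lib/prisma' // assumed location of the shared client

const app = express()

// Liveness: the process is up; Kubernetes restarts the container if this fails.
app.get('/health', (_req, res) => res.status(200).json({ status: 'ok' }))

// Readiness: dependencies are reachable; failing pulls the pod from rotation
// without restarting it.
app.get('/ready', async (_req, res) => {
  try {
    await prisma.$queryRaw`SELECT 1` // cheap database round-trip
    res.status(200).json({ status: 'ready' })
  } catch {
    res.status(503).json({ status: 'not ready' })
  }
})
```

Keeping the liveness check dependency-free avoids restart storms when the database blips; only the readiness check should gate on dependencies.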
Monitoring & Alerting
Monitoring Stack Configuration
# docker-compose.monitoring.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./grafana/datasources:/etc/grafana/provisioning/datasources
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD} # inject via environment; never commit a real password
- GF_INSTALL_PLUGINS=redis-datasource
ports:
- "3001:3000"
loki:
image: grafana/loki:latest
ports:
- "3100:3100"
volumes:
- loki_data:/loki
alertmanager:
image: prom/alertmanager:latest
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
ports:
- "9093:9093"
volumes:
prometheus_data:
grafana_data:
loki_data:
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
rule_files:
- "alerts.yml"
scrape_configs:
- job_name: 'nudgecampaign-api'
static_configs:
- targets: ['api:3000']
metrics_path: '/metrics'
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
Alert Rules
# alerts.yml
groups:
- name: application
interval: 30s
rules:
- alert: HighErrorRate
expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: High error rate detected
description: "Error rate is {{ $value | humanizePercentage }} of requests"
- alert: SlowResponseTime
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 10m
labels:
severity: warning
annotations:
summary: Slow response times
description: "95th percentile response time is {{ $value }} seconds"
- name: infrastructure
interval: 30s
rules:
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: High CPU usage
description: "CPU usage is {{ $value }}%"
- alert: LowDiskSpace
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 10
for: 5m
labels:
severity: critical
annotations:
summary: Low disk space
description: "Only {{ $value }}% disk space remaining"
- alert: DatabaseConnectionPoolExhausted
expr: sum(pg_stat_database_numbackends) / scalar(pg_settings_max_connections) > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: Database connection pool nearly exhausted
description: "{{ $value | humanizePercentage }} of connections in use"
Custom Metrics
Application metrics collection:
// src/lib/metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client'
// HTTP metrics
export const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status'],
buckets: [0.1, 0.5, 1, 2, 5]
})
export const httpRequestTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status']
})
// Business metrics
export const emailsSent = new Counter({
name: 'emails_sent_total',
help: 'Total number of emails sent',
labelNames: ['campaign_id', 'status']
})
export const activeUsers = new Gauge({
name: 'active_users',
help: 'Number of active users',
labelNames: ['organization']
})
// Database metrics
export const dbConnectionPool = new Gauge({
name: 'db_connection_pool_size',
help: 'Database connection pool size',
labelNames: ['status']
})
// Export metrics endpoint
export async function metricsHandler(req: Request): Promise<Response> {
const metrics = await register.metrics()
return new Response(metrics, {
headers: {
'Content-Type': register.contentType
}
})
}
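Business counters are incremented at the point of action. A hedged sketch of how the send path might record outcomes (recordSend is a hypothetical wrapper, not an existing helper):

```typescript
import { emailsSent } from './metrics'

// Wrap each delivery attempt so success and failure are both counted.
// Caution: campaign_id is a high-cardinality label; consider dropping or
// bucketing it at scale.
export async function recordSend(campaignId: string, send: () => Promise<void>): Promise<void> {
  try {
    await send()
    emailsSent.inc({ campaign_id: campaignId, status: 'sent' })
  } catch (err) {
    emailsSent.inc({ campaign_id: campaignId, status: 'failed' })
    throw err
  }
}
```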
Grafana Dashboards
Key dashboards for operations:
System Overview Dashboard
- Request rate and error rate
- Response time percentiles
- Active users and sessions
- CPU and memory usage
- Database connections
Email Operations Dashboard
- Emails sent per hour
- Delivery success rate
- Bounce and complaint rates
- Queue depth and processing time
- Postmark API errors
Business Metrics Dashboard
- New signups
- Active organizations
- Campaign creation rate
- Revenue metrics
- Churn indicators
Infrastructure Dashboard
- Container health
- Database performance
- Redis cache hit rate
- Network throughput
- Disk usage
Database Maintenance
Regular Maintenance Tasks
-- Daily maintenance script
-- Run at 2 AM UTC during low traffic
-- Update statistics
ANALYZE;
-- Reindex tables with high write activity
-- (CONCURRENTLY avoids exclusive locks; PostgreSQL 12+)
REINDEX TABLE CONCURRENTLY campaigns;
REINDEX TABLE CONCURRENTLY email_deliveries;
REINDEX TABLE CONCURRENTLY contacts;
-- Clean up old sessions
DELETE FROM sessions WHERE expires_at < NOW() - INTERVAL '7 days';
-- Archive old email delivery records
-- (one transaction, so no rows are lost between the copy and the delete)
BEGIN;
INSERT INTO email_deliveries_archive
SELECT * FROM email_deliveries
WHERE created_at < NOW() - INTERVAL '90 days';
DELETE FROM email_deliveries
WHERE created_at < NOW() - INTERVAL '90 days';
COMMIT;
-- Vacuum to reclaim space
VACUUM ANALYZE;
Database Performance Monitoring
-- Monitor slow queries (pg_stat_statements; *_exec_time columns are the PostgreSQL 13+ names)
SELECT
query,
calls,
total_exec_time,
mean_exec_time,
max_exec_time,
min_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
-- Check table bloat
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
n_live_tup,
n_dead_tup,
ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0), 4) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_ratio DESC;
-- Monitor connection usage
SELECT
datname,
numbackends,
ROUND(numbackends::numeric /
(SELECT setting::numeric FROM pg_settings WHERE name = 'max_connections'), 2)
AS connection_ratio
FROM pg_stat_database
WHERE datname NOT IN ('postgres', 'template0', 'template1')
ORDER BY numbackends DESC;
-- Index usage statistics
SELECT
schemaname,
tablename,
indexname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;
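The same checks can also run continuously from a small watchdog process. A sketch using node-postgres (the 100 ms threshold mirrors the query above; pg_stat_statements must be enabled):

```typescript
import { Pool } from 'pg'

const pool = new Pool({ connectionString: process.env.DATABASE_URL })

// Every 5 minutes, log statements averaging over 100 ms so they surface
// in Loki alongside application logs. Column names are PostgreSQL 13+.
async function reportSlowQueries(): Promise<void> {
  const { rows } = await pool.query(`
    SELECT query, calls, mean_exec_time
    FROM pg_stat_statements
    WHERE mean_exec_time > 100
    ORDER BY mean_exec_time DESC
    LIMIT 10
  `)
  for (const row of rows) {
    console.warn('slow query', { calls: row.calls, meanMs: row.mean_exec_time, query: row.query })
  }
}

setInterval(() => reportSlowQueries().catch(console.error), 5 * 60 * 1000)
```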
Database Backup Strategy
#!/bin/bash
# backup.sh - Database backup script
# Configuration
DB_HOST="prod-db.nudgecampaign.com"
DB_NAME="nudgecampaign"
DB_USER="backup_user"
BACKUP_DIR="/backups/postgres"
S3_BUCKET="nudgecampaign-backups"
RETENTION_DAYS=30
# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"
echo "Starting backup at $(date)"
# Perform backup with compression (DB_PASSWORD is expected in the environment,
# e.g. injected from Secrets Manager)
PGPASSWORD=$DB_PASSWORD pg_dump \
-h $DB_HOST \
-U $DB_USER \
-d $DB_NAME \
--no-owner \
--no-acl \
--clean \
--if-exists \
| gzip > $BACKUP_FILE
# Check backup size
BACKUP_SIZE=$(du -h $BACKUP_FILE | cut -f1)
echo "Backup completed: $BACKUP_FILE ($BACKUP_SIZE)"
# Upload to S3
aws s3 cp $BACKUP_FILE s3://${S3_BUCKET}/daily/ \
--storage-class STANDARD_IA
# Clean up old local backups
find $BACKUP_DIR -name "*.sql.gz" -mtime +7 -delete
# Clean up old S3 backups
aws s3 ls s3://${S3_BUCKET}/daily/ \
| while read -r line; do
createDate=$(echo $line | awk '{print $1" "$2}')
createDate=$(date -d "$createDate" +%s)
olderThan=$(date -d "$RETENTION_DAYS days ago" +%s)
if [[ $createDate -lt $olderThan ]]; then
fileName=$(echo $line | awk '{print $4}')
echo "Deleting old backup: $fileName"
aws s3 rm s3://${S3_BUCKET}/daily/$fileName
fi
done
echo "Backup process completed at $(date)"
Database Optimization
-- Create missing indexes based on query patterns
CREATE INDEX CONCURRENTLY idx_campaigns_org_status
ON campaigns(organization_id, status)
WHERE status IN ('draft', 'scheduled', 'sending');
CREATE INDEX CONCURRENTLY idx_email_deliveries_campaign_created
ON email_deliveries(campaign_id, created_at DESC);
CREATE INDEX CONCURRENTLY idx_contacts_org_status_email
ON contacts(organization_id, status, email)
WHERE status = 'subscribed';
-- Partition large tables
CREATE TABLE email_deliveries_2024_01 PARTITION OF email_deliveries
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE email_deliveries_2024_02 PARTITION OF email_deliveries
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');
-- Configure autovacuum for high-activity tables
ALTER TABLE email_deliveries SET (
autovacuum_vacuum_scale_factor = 0.01,
autovacuum_analyze_scale_factor = 0.005
);
ALTER TABLE campaigns SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_analyze_scale_factor = 0.02
);
Performance Tuning
Application Performance
// Performance monitoring middleware (Express)
import { Request, Response, NextFunction } from 'express'
import { httpRequestDuration } from '../lib/metrics'
import { logger } from '../lib/logger' // assumed shared logger module
export function performanceMiddleware(req: Request, res: Response, next: NextFunction) {
const start = process.hrtime.bigint()
res.on('finish', () => {
const duration = Number(process.hrtime.bigint() - start) / 1e6 // Convert to ms
// Log slow requests
if (duration > 1000) {
logger.warn('Slow request detected', {
method: req.method,
path: req.path,
duration,
statusCode: res.statusCode
})
}
// Update metrics
httpRequestDuration.observe(
{
method: req.method,
route: req.route?.path || 'unknown',
status: res.statusCode.toString()
},
duration / 1000 // Convert to seconds
)
})
next()
}
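Wiring this into the app is two lines plus the metrics route (Express assumed, matching the middleware signature above):

```typescript
import express from 'express'
import { register } from 'prom-client'
import { performanceMiddleware } from './middleware/performance' // assumed path

const app = express()
app.use(performanceMiddleware)

// Serve the registry on the path Prometheus scrapes (see prometheus.yml).
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})

app.listen(3000)
```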
Database Connection Pooling
// Optimized Prisma configuration
import { PrismaClient } from '@prisma/client'
const globalForPrisma = global as unknown as { prisma: PrismaClient }
export const prisma = globalForPrisma.prisma || new PrismaClient({
datasources: {
db: {
url: process.env.DATABASE_URL
}
},
log: process.env.NODE_ENV === 'development'
? ['query', 'error', 'warn']
: ['error'],
// Connection pool sizing is set on the connection string, e.g.
// DATABASE_URL="postgresql://...?connection_limit=20&pool_timeout=30";
// tune connection_limit to the instance class (db.r6g.xlarge here).
})
if (process.env.NODE_ENV !== 'production') {
globalForPrisma.prisma = prisma
}
// Monitor pool metrics
setInterval(() => {
prisma.$metrics.json().then(metrics => {
dbConnectionPool.set({ status: 'active' }, metrics.counters.find(
m => m.key === 'prisma_pool_connections_open'
)?.value || 0)
})
}, 10000)
Redis Caching Optimization
// Cache configuration
import Redis from 'ioredis'
const redis = new Redis({
host: process.env.REDIS_HOST,
port: parseInt(process.env.REDIS_PORT || '6379'),
password: process.env.REDIS_PASSWORD,
maxRetriesPerRequest: 3,
retryStrategy: (times) => Math.min(times * 50, 2000),
enableOfflineQueue: false,
lazyConnect: true
})
// Cache wrapper with metrics
// (assumes a cacheMetrics Histogram with 'operation'/'hit' labels exported from src/lib/metrics.ts)
export async function cacheGet<T>(key: string): Promise<T | null> {
const start = Date.now()
try {
const value = await redis.get(key)
const duration = Date.now() - start
cacheMetrics.observe({ operation: 'get', hit: value ? 'hit' : 'miss' }, duration)
return value ? JSON.parse(value) : null
} catch (error) {
logger.error('Cache get error', { key, error })
return null
}
}
export async function cacheSet<T>(
key: string,
value: T,
ttl: number = 300
): Promise<void> {
const start = Date.now()
try {
await redis.setex(key, ttl, JSON.stringify(value))
const duration = Date.now() - start
cacheMetrics.observe({ operation: 'set' }, duration)
} catch (error) {
logger.error('Cache set error', { key, error })
}
}
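A common consumption pattern on top of these helpers is cache-aside: read the cache, fall back to the source, then populate. A minimal sketch (loader and TTL are caller-supplied):

```typescript
// Cache-aside: because cacheGet/cacheSet swallow Redis errors, the loader is
// the worst-case path rather than a failure mode.
export async function cacheGetOrSet<T>(
  key: string,
  loader: () => Promise<T>,
  ttl: number = 300
): Promise<T> {
  const cached = await cacheGet<T>(key)
  if (cached !== null) return cached
  const fresh = await loader()
  await cacheSet(key, fresh, ttl)
  return fresh
}

// Usage (illustrative):
// const org = await cacheGetOrSet(`org:${orgId}`, () => loadOrganization(orgId), 600)
```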
CDN and Static Asset Optimization
# nginx.conf for static assets
server {
listen 80;
server_name cdn.nudgecampaign.com;
# Enable gzip compression
gzip on;
gzip_vary on;
gzip_min_length 1024;
gzip_types text/css application/javascript application/json image/svg+xml;
# Cache headers for static assets
location ~* \.(jpg|jpeg|png|gif|ico|css|js|svg|woff|woff2)$ {
expires 1y;
add_header Cache-Control "public, immutable";
add_header Vary "Accept-Encoding";
}
# Security headers
# (note: add_header directives are not inherited by location blocks that set
# their own add_header, such as the static-asset block above; repeat them there if needed)
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "DENY" always;
add_header X-XSS-Protection "1; mode=block" always;
}
Backup & Recovery
Backup Strategy Overview
| Backup Type | Frequency |
|---|---|
| Full | Weekly |
| Incremental | Daily |
| Snapshots | Hourly |
| WAL Archiving | Continuous (real-time) |
All backups land in hot storage first and then age through tiers: S3 Standard for the last 7 days, S3 Infrequent Access for days 8-30, and Glacier from day 31 onward.
Automated Backup System
#!/bin/bash
# comprehensive-backup.sh
# Function to perform database backup
backup_database() {
echo "Starting database backup..."
# Create logical backup
pg_dump $DATABASE_URL \
--format=custom \
--verbose \
--file=/tmp/db-backup-$(date +%Y%m%d-%H%M%S).dump
# Upload to S3
aws s3 cp /tmp/db-backup-*.dump \
s3://nudgecampaign-backups/database/ \
--storage-class STANDARD_IA
# Clean up local file
rm /tmp/db-backup-*.dump
}
# Function to backup application files
backup_application() {
echo "Starting application backup..."
# Create tarball of uploads
tar -czf /tmp/uploads-$(date +%Y%m%d-%H%M%S).tar.gz \
/var/www/nudgecampaign/uploads/
# Upload to S3
aws s3 cp /tmp/uploads-*.tar.gz \
s3://nudgecampaign-backups/uploads/ \
--storage-class STANDARD_IA
# Clean up
rm /tmp/uploads-*.tar.gz
}
# Function to backup configurations
backup_configs() {
echo "Starting configuration backup..."
# Backup environment variables
aws secretsmanager get-secret-value \
--secret-id nudgecampaign/production \
--query SecretString \
--output text > /tmp/env-backup-$(date +%Y%m%d).json
# Encrypt and upload
gpg --encrypt --recipient ops@nudgecampaign.com \
/tmp/env-backup-*.json
aws s3 cp /tmp/env-backup-*.json.gpg \
s3://nudgecampaign-backups/configs/
# Clean up
rm /tmp/env-backup-*
}
# Main execution
main() {
backup_database
backup_application
backup_configs
# Verify backups
aws s3 ls s3://nudgecampaign-backups/ --recursive \
--query "Contents[?LastModified>=\`$(date -u +%Y-%m-%d)\`]"
# Send notification
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:xxx:backup-notifications \
--message "Backup completed successfully at $(date)"
}
main
Recovery Procedures
#!/bin/bash
# recovery.sh - Disaster recovery script
# Recovery point objective (RPO): 1 hour
# Recovery time objective (RTO): 4 hours
recover_database() {
BACKUP_FILE=$1
echo "Recovering database from $BACKUP_FILE"
# Download backup from S3
aws s3 cp s3://nudgecampaign-backups/database/$BACKUP_FILE /tmp/
# Stop application
kubectl scale deployment nudgecampaign-api --replicas=0
# Restore database
pg_restore \
--dbname=$DATABASE_URL \
--clean \
--if-exists \
--verbose \
/tmp/$BACKUP_FILE
# Verify restoration
psql $DATABASE_URL -c "SELECT COUNT(*) FROM organizations;"
# Restart application
kubectl scale deployment nudgecampaign-api --replicas=3
}
recover_point_in_time() {
TARGET_TIME=$1
echo "Recovering to point in time: $TARGET_TIME"
# Use WAL archives for PITR
# (PostgreSQL 12+: recovery.conf is gone; recovery settings live in
# postgresql.conf plus a recovery.signal file)
recovery_conf="
restore_command = 'aws s3 cp s3://nudgecampaign-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
"
echo "$recovery_conf" >> /var/lib/postgresql/data/postgresql.conf
touch /var/lib/postgresql/data/recovery.signal
# Restart PostgreSQL
systemctl restart postgresql
}
Incident Response
Incident Response Plan
Incident Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| P1 - Critical | Complete service outage | Immediate | Site down, data loss, security breach |
| P2 - High | Major feature unavailable | 30 minutes | Email sending failed, payment processing down |
| P3 - Medium | Degraded performance | 2 hours | Slow response times, partial feature failure |
| P4 - Low | Minor issue | Next business day | UI glitch, non-critical bug |
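Severity levels normally drive automated paging rather than manual lookups. A hedged sketch of routing P1/P2 incidents to PagerDuty's Events API v2 (the routing-key variable is an assumption):

```typescript
// Page the on-call rotation for P1/P2; P3/P4 go to the ticket queue instead.
// PAGERDUTY_ROUTING_KEY is an assumed environment variable.
type IncidentLevel = 'P1' | 'P2' | 'P3' | 'P4'

export async function pageOnCall(level: IncidentLevel, summary: string): Promise<void> {
  if (level !== 'P1' && level !== 'P2') return
  await fetch('https://events.pagerduty.com/v2/enqueue', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      routing_key: process.env.PAGERDUTY_ROUTING_KEY,
      event_action: 'trigger',
      payload: {
        summary,
        source: 'nudgecampaign-production',
        severity: level === 'P1' ? 'critical' : 'error'
      }
    })
  })
}
```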
Incident Response Runbooks
Runbook: Database Connection Exhaustion
Symptoms:
- Error: "too many connections"
- Application timeouts
- Slow response times
Immediate Actions:
1. Check connection pool metrics:
   SELECT count(*) FROM pg_stat_activity;
2. Identify problematic connections:
   SELECT pid, usename, application_name, state, query_start
   FROM pg_stat_activity
   WHERE state != 'idle'
   ORDER BY query_start;
3. Kill long-running queries:
   SELECT pg_terminate_backend(pid)
   FROM pg_stat_activity
   WHERE state != 'idle'
   AND query_start < now() - interval '10 minutes';
4. Restart application pods:
   kubectl rollout restart deployment nudgecampaign-api
Root Cause Analysis:
- Check for connection leaks in code
- Review recent deployments
- Analyze query patterns
Prevention:
- Implement connection pooling limits
- Add query timeouts
- Monitor connection metrics
Runbook: High Error Rate
Symptoms:
- Error rate > 5%
- 5xx status codes
- Customer complaints
Immediate Actions:
1. Check error logs:
   kubectl logs -n production -l app=api --tail=100
2. Check recent deployments:
   kubectl rollout history deployment nudgecampaign-api
3. Roll back if necessary:
   kubectl rollout undo deployment nudgecampaign-api
4. Scale up if load-related:
   kubectl scale deployment nudgecampaign-api --replicas=6
Investigation:
- Review error tracking (Sentry)
- Check dependency services
- Analyze traffic patterns
Communication:
- Update status page
- Notify affected customers
- Post in #incidents Slack channel
Security Operations
Security Monitoring
# security-alerts.yml
groups:
- name: security
rules:
- alert: SuspiciousLoginActivity
expr: rate(failed_login_attempts[5m]) > 10
labels:
severity: warning
annotations:
summary: Suspicious login activity detected
- alert: UnauthorizedAPIAccess
expr: sum(rate(api_unauthorized_requests[5m])) > 50
labels:
severity: critical
annotations:
summary: High rate of unauthorized API requests
- alert: DataExfiltration
expr: sum(rate(data_export_size_bytes[1h])) > 1073741824
labels:
severity: critical
annotations:
summary: Unusual amount of data being exported
Security Audit Procedures
#!/bin/bash
# security-audit.sh
# Check for vulnerable dependencies
npm audit
pip-audit   # pip check only verifies dependency compatibility; pip-audit scans for CVEs
bundle audit
# Scan Docker images ('docker scan' was retired in favor of Docker Scout)
docker scout cves nudgecampaign/api:latest
# Check SSL certificates
echo | openssl s_client -connect api.nudgecampaign.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Review IAM permissions
aws iam get-account-authorization-details --output json | \
jq '.UserDetailList[] | {UserName: .UserName, Policies: .AttachedManagedPolicies}'
# Check for exposed secrets
git secrets --scan
# Database security check: find objects readable by PUBLIC
psql $DATABASE_URL -c "
SELECT
table_schema,
table_name,
privilege_type
FROM information_schema.table_privileges
WHERE grantee = 'PUBLIC'
ORDER BY table_schema, table_name;
"
Scaling Operations
Auto-Scaling Configuration
# k8s/autoscaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: nudgecampaign-api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: nudgecampaign-api
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# this Pods metric requires a custom-metrics adapter (e.g. prometheus-adapter)
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 60
Database Scaling
-- Logical replication to a read replica (publication on the primary)
CREATE PUBLICATION nudgecampaign_replica FOR ALL TABLES;
-- On replica
CREATE SUBSCRIPTION nudgecampaign_replica
CONNECTION 'host=primary-db.nudgecampaign.com dbname=nudgecampaign user=replicator'
PUBLICATION nudgecampaign_replica;
-- Monitor replication lag
SELECT
client_addr,
state,
sent_lsn,
write_lsn,
flush_lsn,
replay_lsn,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
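Once a replica is healthy, read traffic still has to be routed to it explicitly. A minimal sketch with two node-postgres pools (both connection-string variables are assumptions):

```typescript
import { Pool } from 'pg'

// Writes always hit the primary; reads that tolerate replication lag
// (dashboards, exports, list views) can use the replica.
const primary = new Pool({ connectionString: process.env.DATABASE_URL })
const replica = new Pool({ connectionString: process.env.DATABASE_REPLICA_URL })

export function db(opts: { readOnly?: boolean } = {}): Pool {
  return opts.readOnly ? replica : primary
}

// Usage (illustrative):
// await db({ readOnly: true }).query('SELECT id, name FROM campaigns WHERE organization_id = $1', [orgId])
// await db().query('UPDATE campaigns SET status = $1 WHERE id = $2', ['sending', id])
```

Keep an eye on replay_lag from the query above before routing read-after-write traffic to the replica.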
Disaster Recovery
DR Plan Overview
Failover Procedures
#!/bin/bash
# failover.sh - Disaster recovery failover script
failover_to_dr() {
echo "Starting failover to DR region..."
# 1. Promote standby database
aws rds promote-read-replica \
--db-instance-identifier nudgecampaign-dr-db \
--region us-west-2
# 2. Update application configuration
aws secretsmanager update-secret \
--secret-id nudgecampaign/production \
--secret-string '{"DATABASE_URL": "postgresql://dr-db.nudgecampaign.com/..."}'
# 3. Scale up DR environment
aws ecs update-service \
--cluster nudgecampaign-dr \
--service api \
--desired-count 4 \
--region us-west-2
# 4. Update Route53 DNS
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456789 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.nudgecampaign.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z098765432",
"DNSName": "dr-alb.us-west-2.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}]
}'
# 5. Verify services
for i in {1..10}; do
if curl -f https://api.nudgecampaign.com/health; then
echo "DR site is responding"
break
fi
sleep 30
done
# 6. Send notifications
aws sns publish \
--topic-arn arn:aws:sns:us-east-1:xxx:critical-alerts \
--message "Failover to DR completed successfully"
}
Maintenance Procedures
Scheduled Maintenance Windows
Production Maintenance Schedule:
- Weekly: Sunday 2:00-4:00 AM UTC (Low traffic period)
- Monthly: First Sunday 2:00-6:00 AM UTC (Extended window)
- Quarterly: Announced 2 weeks in advance
Zero-Downtime Deployment
#!/bin/bash
# zero-downtime-deploy.sh
deploy_application() {
VERSION=$1
echo "Starting zero-downtime deployment of version $VERSION"
# 1. Build and push new image
docker build -t nudgecampaign/api:$VERSION .
docker push nudgecampaign/api:$VERSION
# 2. Update deployment with new image
kubectl set image deployment/nudgecampaign-api \
api=nudgecampaign/api:$VERSION \
--record
# 3. Wait for rollout to complete
kubectl rollout status deployment/nudgecampaign-api
# 4. Run smoke tests
npm run test:smoke
# 5. Check metrics
check_deployment_metrics
# 6. Rollback if needed
if [ $? -ne 0 ]; then
echo "Deployment failed, rolling back..."
kubectl rollout undo deployment/nudgecampaign-api
fi
}
check_deployment_metrics() {
# Check error rate (jq -r emits a bare number that bc can compare)
ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query \
--data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))' \
| jq -r '.data.result[0].value[1]')
if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
echo "Error rate too high: $ERROR_RATE"
return 1
fi
# Check response time
RESPONSE_TIME=$(curl -s http://prometheus:9090/api/v1/query \
--data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))' \
| jq -r '.data.result[0].value[1]')
if (( $(echo "$RESPONSE_TIME > 0.5" | bc -l) )); then
echo "Response time too slow: $RESPONSE_TIME"
return 1
fi
return 0
}
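The npm run test:smoke step can stay deliberately small: one request that proves the new pods serve traffic. A sketch of what it might contain (the base-URL variable is an assumption; requires Node 18+ for global fetch):

```typescript
// Post-deploy smoke test: a non-zero exit aborts the rollout in the script above.
const BASE_URL = process.env.SMOKE_BASE_URL ?? 'https://api.nudgecampaign.com'

async function smoke(): Promise<void> {
  const started = Date.now()
  const res = await fetch(`${BASE_URL}/health`)
  if (!res.ok) throw new Error(`/health returned ${res.status}`)
  if (Date.now() - started > 2000) throw new Error('/health took longer than 2s')
}

smoke().then(
  () => console.log('smoke test passed'),
  (err) => { console.error(err); process.exit(1) }
)
```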
Troubleshooting Guide
Common Issues and Solutions
Issue: Slow Database Queries
Symptoms:
- High response times
- Database CPU > 80%
- Slow query logs
Diagnosis:
-- Find slow queries (*_exec_time columns are the PostgreSQL 13+ names)
SELECT
query,
calls,
mean_exec_time,
total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;
-- Check for missing indexes
SELECT
schemaname,
tablename,
attname,
n_distinct,
correlation
FROM pg_stats
WHERE tablename = 'problem_table';
Solution:
- Add missing indexes
- Optimize query structure
- Implement caching
- Consider read replicas
Issue: Memory Leaks
Symptoms:
- Increasing memory usage over time
- Container OOM kills
- Performance degradation
Diagnosis:
# Check memory usage
kubectl top pods
# Get heap snapshot (assumes the Node 'heapdump' module is loaded;
# it writes a .heapsnapshot file when the process receives SIGUSR2)
kubectl exec -it pod-name -- kill -USR2 1
kubectl cp pod-name:/tmp/heapdump.heapsnapshot ./heapdump.heapsnapshot
# Analyze with Chrome DevTools or a heap profiler
Solution:
- Fix memory leaks in code
- Adjust container memory limits
- Implement proper cleanup
- Add memory monitoring (see the sketch below)
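For the monitoring item, prom-client's collectDefaultMetrics() already exports heap statistics; a hand-rolled sketch if a custom gauge is preferred:

```typescript
import { collectDefaultMetrics, Gauge } from 'prom-client'

// Simplest option: default metrics include nodejs_heap_size_used_bytes.
collectDefaultMetrics()

// Custom gauge variant (metric name here is illustrative):
const rssBytes = new Gauge({ name: 'process_rss_bytes_custom', help: 'Resident set size in bytes' })
setInterval(() => rssBytes.set(process.memoryUsage().rss), 15_000)
```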
Issue: Email Delivery Failures
Symptoms:
- Low delivery rates
- Bounce notifications
- Customer complaints
Diagnosis:
// Check Postmark delivery stats via the /stats/outbound API
// (method and field names follow the official postmark.js client; verify against your version)
const stats = await postmarkClient.getOutboundOverview()
console.log('Bounce rate:', stats.BounceRate)
console.log('Spam complaint rate:', stats.SpamComplaintsRate)
// Check domain authentication via the /domains API
// (accountClient: a postmark AccountClient created with the account-level token)
const domains = await accountClient.getDomains()
domains.Domains.forEach(domain => {
  console.log(domain.Name, domain.SPFVerified, domain.DKIMVerified)
})
Solution:
- Verify domain authentication
- Check IP reputation
- Review email content
- Implement list hygiene (see the webhook sketch below)
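List hygiene is easiest to automate from Postmark's bounce webhook. A hedged sketch (payload field names follow Postmark's bounce webhook documentation; suppressContact is hypothetical):

```typescript
import express from 'express'

const app = express()
app.use(express.json())

// Hypothetical helper: mark the contact unsubscribed so it is never mailed again.
async function suppressContact(email: string): Promise<void> {
  // e.g. prisma.contact.updateMany({ where: { email }, data: { status: 'unsubscribed' } })
}

// Postmark POSTs bounce events here; hard bounces are suppressed immediately.
app.post('/webhooks/postmark/bounce', async (req, res) => {
  const { Type, Email } = req.body // e.g. Type: 'HardBounce'
  if (Type === 'HardBounce') {
    await suppressContact(Email)
  }
  res.sendStatus(200)
})
```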
Emergency Contacts
| Role | Name | Phone | Email |
|---|---|---|---|
| On-Call Engineer | Rotation | Via PagerDuty | oncall@nudgecampaign.com |
| VP Engineering | John Smith | +1-555-0100 | john@nudgecampaign.com |
| Database Admin | Jane Doe | +1-555-0101 | jane@nudgecampaign.com |
| Security Lead | Bob Wilson | +1-555-0102 | security@nudgecampaign.com |
Vendor Support
| Service | Support Level | Contact | Response Time |
|---|---|---|---|
| AWS | Enterprise | AWS Support Console | 15 minutes (critical) |
| Postmark | Priority | support@postmarkapp.com | 1 hour |
| Stripe | Premium | Stripe Dashboard | 4 hours |
| Datadog | Pro | support@datadoghq.com | 24 hours |
Conclusion
This operations and maintenance manual provides comprehensive procedures for managing NudgeCampaign in production. Regular maintenance, proactive monitoring, and well-practiced incident response procedures ensure high availability and performance.
Key operational priorities:
- Monitoring: Continuous observation of system health
- Maintenance: Regular optimization and cleanup
- Security: Ongoing threat detection and mitigation
- Scalability: Prepared for growth with auto-scaling
- Recovery: Tested backup and disaster recovery procedures
Remember to keep this documentation updated as systems and procedures evolve.