
Operations & Maintenance Manual

Status: Complete
Version: 5.0
Last Updated: 2024
Purpose: Comprehensive guide for operating and maintaining NudgeCampaign in production

Table of Contents

  1. System Operations Overview
  2. Infrastructure Management
  3. Monitoring & Alerting
  4. Database Maintenance
  5. Performance Tuning
  6. Backup & Recovery
  7. Incident Response
  8. Security Operations
  9. Scaling Operations
  10. Disaster Recovery
  11. Maintenance Procedures
  12. Troubleshooting Guide

System Operations Overview

NudgeCampaign operates as a cloud-native, multi-tenant SaaS platform requiring 24/7 availability and proactive maintenance to ensure optimal performance, security, and reliability.

Operational Architecture

graph TB subgraph "Operations Stack" subgraph "Monitoring" Metrics[Prometheus] Logs[Loki] Traces[Jaeger] Uptime[UptimeRobot] end subgraph "Alerting" Alert[AlertManager] PD[PagerDuty] Slack[Slack] Email[Email] end subgraph "Visualization" Grafana[Grafana] Kibana[Kibana] Custom[Custom Dashboard] end subgraph "Management" Terraform[Terraform] Ansible[Ansible] K8s[Kubernetes] end end Metrics --> Grafana Logs --> Kibana Traces --> Grafana Alert --> PD Alert --> Slack Alert --> Email

Key Operational Metrics

| Metric | Target | Critical Threshold | Response Time |
| --- | --- | --- | --- |
| Uptime | 99.9% | <99.5% | Immediate |
| Response Time (P95) | <200ms | >500ms | 5 minutes |
| Error Rate | <1% | >5% | Immediate |
| Database CPU | <70% | >90% | 5 minutes |
| Queue Depth | <1000 | >5000 | 15 minutes |
| Email Delivery Rate | >98% | <95% | 30 minutes |

Service Level Objectives (SLOs)

  1. Availability SLO: 99.9% uptime measured monthly (see the error-budget sketch below)
  2. Performance SLO: 95% of requests under 200ms
  3. Data Durability: 99.999999% (eight nines)
  4. Email Delivery: 98% successful delivery rate
  5. Support Response: <4 hours for critical issues
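
A quick way to make the availability SLO operational is to express it as a monthly error budget. A minimal sketch in TypeScript (the helper below is illustrative, not part of the NudgeCampaign codebase):

// errorBudget.ts - convert an availability SLO into a monthly error budget
function errorBudgetMinutes(sloPercent: number, daysInMonth = 30): number {
  const totalMinutes = daysInMonth * 24 * 60
  return totalMinutes * (1 - sloPercent / 100)
}

// 99.9% over a 30-day month leaves ~43.2 minutes of allowable downtime
console.log(errorBudgetMinutes(99.9).toFixed(1)) // "43.2"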

Infrastructure Management

Cloud Infrastructure Overview

graph TB subgraph "AWS Infrastructure" subgraph "Compute" ECS[ECS Fargate] Lambda[Lambda Functions] EC2[EC2 Instances] end subgraph "Storage" RDS[(RDS PostgreSQL)] S3[S3 Buckets] EFS[EFS Volumes] end subgraph "Network" VPC[VPC] ALB[Load Balancer] CF[CloudFront] end subgraph "Services" SQS[SQS Queues] SNS[SNS Topics] SES[SES Email] end end CF --> ALB ALB --> ECS ECS --> RDS ECS --> S3 Lambda --> SQS

Infrastructure as Code

All infrastructure is managed through Terraform:

# terraform/production/main.tf
terraform {
  required_version = ">= 1.0"
  
  backend "s3" {
    bucket = "nudgecampaign-terraform-state"
    key    = "production/terraform.tfstate"
    region = "us-east-1"
    encrypt = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region
  
  default_tags {
    tags = {
      Environment = "production"
      ManagedBy   = "terraform"
      Application = "nudgecampaign"
    }
  }
}

module "vpc" {
  source = "../modules/vpc"
  
  cidr_block = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  
  public_subnet_cidrs  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  private_subnet_cidrs = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
  database_subnet_cidrs = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
}

module "ecs_cluster" {
  source = "../modules/ecs"
  
  cluster_name = "nudgecampaign-production"
  vpc_id = module.vpc.vpc_id
  subnets = module.vpc.private_subnet_ids
  
  services = {
    api = {
      cpu = 2048
      memory = 4096
      desired_count = 4
      min_count = 2
      max_count = 10
    }
    worker = {
      cpu = 1024
      memory = 2048
      desired_count = 3
      min_count = 1
      max_count = 6
    }
  }
}

module "database" {
  source = "../modules/rds"
  
  engine = "postgres"
  engine_version = "14.9"
  instance_class = "db.r6g.xlarge"
  allocated_storage = 100
  
  multi_az = true
  backup_retention_period = 30
  backup_window = "03:00-04:00"
  maintenance_window = "sun:04:00-sun:05:00"
  
  vpc_id = module.vpc.vpc_id
  subnet_ids = module.vpc.database_subnet_ids
}

Container Management

ECS task definitions for services:

{
  "family": "nudgecampaign-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "2048",
  "memory": "4096",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "nudgecampaign/api:latest",
      "portMappings": [
        {
          "containerPort": 3000,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "NODE_ENV",
          "value": "production"
        }
      ],
      "secrets": [
        {
          "name": "DATABASE_URL",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:xxx:secret:database-url"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/nudgecampaign-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 60
      }
    }
  ]
}

Kubernetes Operations

For Kubernetes deployments:

# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nudgecampaign-api
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: nudgecampaign-api
  template:
    metadata:
      labels:
        app: nudgecampaign-api
    spec:
      containers:
      - name: api
        image: nudgecampaign/api:v5.0.0
        ports:
        - containerPort: 3000
        env:
        - name: NODE_ENV
          value: "production"
        envFrom:
        - secretRef:
            name: api-secrets
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
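
Both the ECS health check and the probes above assume the API exposes /health and /ready endpoints. A minimal sketch of what those handlers might look like (the prisma and redis imports refer to the clients defined later in this manual; the real routes live in the application code):

// health.ts - liveness and readiness handlers matching the probes above
import express from 'express'
import { prisma } from './lib/prisma'   // Prisma client (see Performance Tuning)
import { redis } from './lib/redis'     // ioredis client (see Redis Caching)

export const healthRouter = express.Router()

// Liveness: the process is up and able to serve requests
healthRouter.get('/health', (_req, res) => {
  res.status(200).json({ status: 'ok' })
})

// Readiness: dependencies are reachable, so traffic can be routed here
healthRouter.get('/ready', async (_req, res) => {
  try {
    await prisma.$queryRaw`SELECT 1`
    await redis.ping()
    res.status(200).json({ status: 'ready' })
  } catch {
    res.status(503).json({ status: 'not ready' })
  }
})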

Monitoring & Alerting

Monitoring Stack Configuration

# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # placeholder; override via a secret in production
      - GF_INSTALL_PLUGINS=redis-datasource
    ports:
      - "3001:3000"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

volumes:
  prometheus_data:
  grafana_data:
  loki_data:

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'nudgecampaign-api'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/metrics'

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

Alert Rules

# alerts.yml
groups:
  - name: application
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: "Error rate is {{ $value }} errors per second"

      - alert: SlowResponseTime
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Slow response times
          description: "95th percentile response time is {{ $value }} seconds"

  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage
          description: "CPU usage is {{ $value }}%"

      - alert: LowDiskSpace
        expr: disk_free_percent < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Low disk space
          description: "Only {{ $value }}% disk space remaining"

      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_database_numbackends / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Database connection pool nearly exhausted
          description: "{{ $value }}% of connections in use"

Custom Metrics

Application metrics collection:

// src/lib/metrics.ts
import { register, Counter, Histogram, Gauge } from 'prom-client'

// HTTP metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5]
})

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status']
})

// Business metrics
export const emailsSent = new Counter({
  name: 'emails_sent_total',
  help: 'Total number of emails sent',
  labelNames: ['campaign_id', 'status']
})

export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of active users',
  labelNames: ['organization']
})

// Database metrics
export const dbConnectionPool = new Gauge({
  name: 'db_connection_pool_size',
  help: 'Database connection pool size',
  labelNames: ['status']
})

// Export metrics endpoint
export async function metricsHandler(req: Request): Promise<Response> {
  const metrics = await register.metrics()
  return new Response(metrics, {
    headers: {
      'Content-Type': register.contentType
    }
  })
}
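
The Queue Depth row in the operational metrics table needs a collector as well. One approach, sketched below under the assumption that campaign jobs flow through SQS (the queue URL variable and poll interval are placeholders):

// queueMetrics.ts - expose SQS queue depth as a Prometheus gauge
import { SQSClient, GetQueueAttributesCommand } from '@aws-sdk/client-sqs'
import { Gauge } from 'prom-client'

const sqs = new SQSClient({ region: 'us-east-1' })

export const queueDepth = new Gauge({
  name: 'queue_depth',
  help: 'Approximate number of messages waiting in the queue',
  labelNames: ['queue']
})

export async function collectQueueDepth(queueUrl: string): Promise<void> {
  const { Attributes } = await sqs.send(new GetQueueAttributesCommand({
    QueueUrl: queueUrl,
    AttributeNames: ['ApproximateNumberOfMessages']
  }))
  queueDepth.set(
    { queue: queueUrl },
    Number(Attributes?.ApproximateNumberOfMessages ?? 0)
  )
}

// Poll every 30 seconds; EMAIL_QUEUE_URL is a placeholder variable name
setInterval(() => {
  collectQueueDepth(process.env.EMAIL_QUEUE_URL ?? '').catch(() => {})
}, 30_000)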

Grafana Dashboards

Key dashboards for operations:

  1. System Overview Dashboard

    • Request rate and error rate
    • Response time percentiles
    • Active users and sessions
    • CPU and memory usage
    • Database connections
  2. Email Operations Dashboard

    • Emails sent per hour
    • Delivery success rate
    • Bounce and complaint rates
    • Queue depth and processing time
    • Postmark API errors
  3. Business Metrics Dashboard

    • New signups
    • Active organizations
    • Campaign creation rate
    • Revenue metrics
    • Churn indicators
  4. Infrastructure Dashboard

    • Container health
    • Database performance
    • Redis cache hit rate
    • Network throughput
    • Disk usage

Database Maintenance

Regular Maintenance Tasks

-- Daily maintenance script
-- Run at 2 AM UTC during low traffic

-- Update statistics
ANALYZE;

-- Reindex tables with high write activity
-- (CONCURRENTLY avoids blocking writes; available since PostgreSQL 12)
REINDEX TABLE CONCURRENTLY campaigns;
REINDEX TABLE CONCURRENTLY email_deliveries;
REINDEX TABLE CONCURRENTLY contacts;

-- Clean up old sessions
DELETE FROM sessions WHERE expires_at < NOW() - INTERVAL '7 days';

-- Archive old email delivery records atomically (a single statement,
-- so rows cannot be deleted without being archived)
WITH moved AS (
    DELETE FROM email_deliveries
    WHERE created_at < NOW() - INTERVAL '90 days'
    RETURNING *
)
INSERT INTO email_deliveries_archive
SELECT * FROM moved;

-- Vacuum to reclaim space
VACUUM ANALYZE;

Database Performance Monitoring

-- Monitor slow queries (PostgreSQL 13+ renamed the timing columns to *_exec_time)
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time,
    min_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 20;

-- Check table bloat
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
    n_live_tup,
    n_dead_tup,
    ROUND(n_dead_tup::numeric / NULLIF(n_live_tup, 0), 4) AS dead_ratio
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_ratio DESC;

-- Monitor connection usage
SELECT 
    datname,
    numbackends,
    ROUND(numbackends::numeric / 
          (SELECT setting::numeric FROM pg_settings WHERE name = 'max_connections'), 2) 
          AS connection_ratio
FROM pg_stat_database
WHERE datname NOT IN ('postgres', 'template0', 'template1')
ORDER BY numbackends DESC;

-- Index usage statistics
SELECT 
    schemaname,
    tablename,
    indexname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;

Database Backup Strategy

#!/bin/bash
# backup.sh - Database backup script
set -euo pipefail

# Configuration (DB_PASSWORD is expected in the environment,
# e.g. injected from AWS Secrets Manager)
DB_HOST="prod-db.nudgecampaign.com"
DB_NAME="nudgecampaign"
DB_USER="backup_user"
BACKUP_DIR="/backups/postgres"
S3_BUCKET="nudgecampaign-backups"
RETENTION_DAYS=30

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.sql.gz"

echo "Starting backup at $(date)"

# Perform backup with compression
PGPASSWORD=$DB_PASSWORD pg_dump \
    -h $DB_HOST \
    -U $DB_USER \
    -d $DB_NAME \
    --no-owner \
    --no-acl \
    --clean \
    --if-exists \
    | gzip > $BACKUP_FILE

# Check backup size
BACKUP_SIZE=$(du -h $BACKUP_FILE | cut -f1)
echo "Backup completed: $BACKUP_FILE ($BACKUP_SIZE)"

# Upload to S3
aws s3 cp $BACKUP_FILE s3://${S3_BUCKET}/daily/ \
    --storage-class STANDARD_IA

# Clean up old local backups
find $BACKUP_DIR -name "*.sql.gz" -mtime +7 -delete

# Clean up old S3 backups
aws s3 ls s3://${S3_BUCKET}/daily/ \
    | while read -r line; do
        createDate=$(echo $line | awk '{print $1" "$2}')
        createDate=$(date -d "$createDate" +%s)
        olderThan=$(date -d "$RETENTION_DAYS days ago" +%s)
        if [[ $createDate -lt $olderThan ]]; then
            fileName=$(echo $line | awk '{print $4}')
            echo "Deleting old backup: $fileName"
            aws s3 rm s3://${S3_BUCKET}/daily/$fileName
        fi
    done

echo "Backup process completed at $(date)"

Database Optimization

-- Create missing indexes based on query patterns
CREATE INDEX CONCURRENTLY idx_campaigns_org_status 
ON campaigns(organization_id, status) 
WHERE status IN ('draft', 'scheduled', 'sending');

CREATE INDEX CONCURRENTLY idx_email_deliveries_campaign_created 
ON email_deliveries(campaign_id, created_at DESC);

CREATE INDEX CONCURRENTLY idx_contacts_org_status_email 
ON contacts(organization_id, status, email) 
WHERE status = 'subscribed';

-- Partition large tables
-- (assumes email_deliveries was created with PARTITION BY RANGE (created_at))
CREATE TABLE email_deliveries_2024_01 PARTITION OF email_deliveries
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE email_deliveries_2024_02 PARTITION OF email_deliveries
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Configure autovacuum for high-activity tables
ALTER TABLE email_deliveries SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_analyze_scale_factor = 0.005
);

ALTER TABLE campaigns SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_analyze_scale_factor = 0.02
);

Performance Tuning

Application Performance

// Performance monitoring middleware (Express)
// Import paths are assumed; adjust to the project layout
import { Request, Response, NextFunction } from 'express'
import { httpRequestDuration, httpRequestTotal } from './metrics'
import { logger } from './logger'

export function performanceMiddleware(req: Request, res: Response, next: NextFunction) {
  const start = process.hrtime.bigint()
  
  res.on('finish', () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e6 // Convert to ms
    
    // Log slow requests
    if (duration > 1000) {
      logger.warn('Slow request detected', {
        method: req.method,
        path: req.path,
        duration,
        statusCode: res.statusCode
      })
    }
    
    // Update metrics (count every request and record its duration)
    const labels = {
      method: req.method,
      route: req.route?.path || 'unknown',
      status: res.statusCode.toString()
    }
    httpRequestTotal.inc(labels)
    httpRequestDuration.observe(labels, duration / 1000) // convert to seconds
  })
  
  next()
}

Database Connection Pooling

// Optimized Prisma configuration
import { PrismaClient } from '@prisma/client'

const globalForPrisma = global as unknown as { prisma: PrismaClient }

export const prisma = globalForPrisma.prisma || new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL
    }
  },
  log: process.env.NODE_ENV === 'development' 
    ? ['query', 'error', 'warn'] 
    : ['error'],
  // Connection pool settings are configured via the connection string,
  // e.g. DATABASE_URL="postgresql://...?connection_limit=20&pool_timeout=10"
  // Size the pool to the instance class (here a db.r6g.xlarge)
})

if (process.env.NODE_ENV !== 'production') {
  globalForPrisma.prisma = prisma
}

// Monitor pool metrics (requires the "metrics" preview feature in schema.prisma;
// prisma_pool_connections_open is reported as a gauge, not a counter)
setInterval(() => {
  prisma.$metrics.json().then(metrics => {
    dbConnectionPool.set({ status: 'active' }, metrics.gauges.find(
      m => m.key === 'prisma_pool_connections_open'
    )?.value || 0)
  })
}, 10000)

Redis Caching Optimization

// Cache configuration
import Redis from 'ioredis'
import { Histogram } from 'prom-client'
import { logger } from './logger' // assumed shared logger module

const redis = new Redis({
  host: process.env.REDIS_HOST,
  port: parseInt(process.env.REDIS_PORT || '6379'),
  password: process.env.REDIS_PASSWORD,
  maxRetriesPerRequest: 3,
  retryStrategy: (times) => Math.min(times * 50, 2000),
  enableOfflineQueue: false,
  lazyConnect: true
})

// Latency histogram for cache operations
const cacheMetrics = new Histogram({
  name: 'cache_operation_duration_ms',
  help: 'Duration of cache operations in milliseconds',
  labelNames: ['operation', 'hit'],
  buckets: [1, 5, 10, 50, 100]
})

// Cache wrapper with metrics
export async function cacheGet<T>(key: string): Promise<T | null> {
  const start = Date.now()
  try {
    const value = await redis.get(key)
    const duration = Date.now() - start
    
    cacheMetrics.observe({ operation: 'get', hit: value ? 'hit' : 'miss' }, duration)
    
    return value ? JSON.parse(value) : null
  } catch (error) {
    logger.error('Cache get error', { key, error })
    return null
  }
}

export async function cacheSet<T>(
  key: string, 
  value: T, 
  ttl: number = 300
): Promise<void> {
  const start = Date.now()
  try {
    await redis.setex(key, ttl, JSON.stringify(value))
    const duration = Date.now() - start
    
    cacheMetrics.observe({ operation: 'set', hit: 'n/a' }, duration)
  } catch (error) {
    logger.error('Cache set error', { key, error })
  }
}
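
The two helpers combine naturally into a cache-aside wrapper. A minimal sketch (the campaign lookup in the usage comment is hypothetical):

// Cache-aside: return the cached value, or load, cache, and return it
export async function cacheGetOrSet<T>(
  key: string,
  loader: () => Promise<T>,
  ttl: number = 300
): Promise<T> {
  const cached = await cacheGet<T>(key)
  if (cached !== null) return cached

  const fresh = await loader()
  await cacheSet(key, fresh, ttl) // best-effort: errors are logged, not thrown
  return fresh
}

// Hypothetical usage: cache a campaign lookup for five minutes
// const campaign = await cacheGetOrSet(
//   `campaign:${id}`,
//   () => prisma.campaign.findUnique({ where: { id } }),
//   300
// )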

CDN and Static Asset Optimization

# nginx.conf for static assets
server {
    listen 80;
    server_name cdn.nudgecampaign.com;
    
    # Enable gzip compression
    gzip on;
    gzip_vary on;
    gzip_min_length 1024;
    gzip_types text/css application/javascript application/json image/svg+xml;
    
    # Cache headers for static assets
    # (add_header inside a location replaces headers inherited from the
    #  server block, so the security headers are repeated here)
    location ~* \.(jpg|jpeg|png|gif|ico|css|js|svg|woff|woff2)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        add_header Vary "Accept-Encoding";
        add_header X-Content-Type-Options "nosniff" always;
        add_header X-Frame-Options "DENY" always;
        add_header X-XSS-Protection "1; mode=block" always;
    }
    
    # Security headers
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-Frame-Options "DENY" always;
    add_header X-XSS-Protection "1; mode=block" always;
}

Backup & Recovery

Backup Strategy Overview

graph TB subgraph "Backup Types" Full[Full Backup
Weekly] Incremental[Incremental
Daily] Snapshot[Snapshots
Hourly] Continuous[WAL Archiving
Real-time] end subgraph "Storage Tiers" Hot[S3 Standard
Last 7 days] Warm[S3 IA
8-30 days] Cold[Glacier
31+ days] end Full --> Hot Incremental --> Hot Snapshot --> Hot Continuous --> Hot Hot --> Warm Warm --> Cold

Automated Backup System

#!/bin/bash
# comprehensive-backup.sh

# Function to perform database backup
backup_database() {
    echo "Starting database backup..."
    
    # Create logical backup
    pg_dump $DATABASE_URL \
        --format=custom \
        --verbose \
        --file=/tmp/db-backup-$(date +%Y%m%d-%H%M%S).dump
    
    # Upload to S3
    aws s3 cp /tmp/db-backup-*.dump \
        s3://nudgecampaign-backups/database/ \
        --storage-class STANDARD_IA
    
    # Clean up local file
    rm /tmp/db-backup-*.dump
}

# Function to backup application files
backup_application() {
    echo "Starting application backup..."
    
    # Create tarball of uploads
    tar -czf /tmp/uploads-$(date +%Y%m%d-%H%M%S).tar.gz \
        /var/www/nudgecampaign/uploads/
    
    # Upload to S3
    aws s3 cp /tmp/uploads-*.tar.gz \
        s3://nudgecampaign-backups/uploads/ \
        --storage-class STANDARD_IA
    
    # Clean up
    rm /tmp/uploads-*.tar.gz
}

# Function to backup configurations
backup_configs() {
    echo "Starting configuration backup..."
    
    # Backup environment variables
    aws secretsmanager get-secret-value \
        --secret-id nudgecampaign/production \
        --query SecretString \
        --output text > /tmp/env-backup-$(date +%Y%m%d).json
    
    # Encrypt and upload
    gpg --encrypt --recipient ops@nudgecampaign.com \
        /tmp/env-backup-*.json
    
    aws s3 cp /tmp/env-backup-*.json.gpg \
        s3://nudgecampaign-backups/configs/
    
    # Clean up
    rm /tmp/env-backup-*
}

# Main execution
main() {
    backup_database
    backup_application
    backup_configs
    
    # Verify today's backups exist (aws s3 ls does not support --query,
    # so use the s3api command instead)
    aws s3api list-objects-v2 \
        --bucket nudgecampaign-backups \
        --query "Contents[?LastModified>=\`$(date -u +%Y-%m-%d)\`].Key"
    
    # Send notification
    aws sns publish \
        --topic-arn arn:aws:sns:us-east-1:xxx:backup-notifications \
        --message "Backup completed successfully at $(date)"
}

main

Recovery Procedures

#!/bin/bash
# recovery.sh - Disaster recovery script

# Recovery point objective (RPO): 1 hour
# Recovery time objective (RTO): 4 hours

recover_database() {
    BACKUP_FILE=$1
    
    echo "Recovering database from $BACKUP_FILE"
    
    # Download backup from S3
    aws s3 cp s3://nudgecampaign-backups/database/$BACKUP_FILE /tmp/
    
    # Stop application
    kubectl scale deployment nudgecampaign-api --replicas=0
    
    # Restore database
    pg_restore \
        --dbname=$DATABASE_URL \
        --clean \
        --if-exists \
        --verbose \
        /tmp/$BACKUP_FILE
    
    # Verify restoration
    psql $DATABASE_URL -c "SELECT COUNT(*) FROM organizations;"
    
    # Restart application
    kubectl scale deployment nudgecampaign-api --replicas=3
}

recover_point_in_time() {
    TARGET_TIME=$1
    
    echo "Recovering to point in time: $TARGET_TIME"
    
    # PostgreSQL 12+ reads recovery settings from postgresql.conf and
    # enters recovery mode when a recovery.signal file is present
    # (recovery.conf was removed in PostgreSQL 12)
    cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
restore_command = 'aws s3 cp s3://nudgecampaign-backups/wal/%f %p'
recovery_target_time = '$TARGET_TIME'
recovery_target_action = 'promote'
EOF
    touch /var/lib/postgresql/data/recovery.signal
    
    # Restart PostgreSQL to begin point-in-time recovery
    systemctl restart postgresql
}

Incident Response

Incident Response Plan

flowchart TB
  Start[Incident Detected] --> Assess{Severity?}
  Assess -->|Critical| Page[Page On-Call]
  Assess -->|High| Notify[Notify Team]
  Assess -->|Medium| Ticket[Create Ticket]
  Assess -->|Low| Log[Log Issue]
  Page --> War[War Room]
  Notify --> Investigate
  War --> Mitigate[Mitigate Impact]
  Investigate --> Mitigate
  Mitigate --> Fix[Implement Fix]
  Fix --> Verify[Verify Resolution]
  Verify --> PostMortem[Post-Mortem]
  PostMortem --> Document[Update Runbooks]

Incident Severity Levels

| Level | Description | Response Time | Examples |
| --- | --- | --- | --- |
| P1 - Critical | Complete service outage | Immediate | Site down, data loss, security breach |
| P2 - High | Major feature unavailable | 30 minutes | Email sending failed, payment processing down |
| P3 - Medium | Degraded performance | 2 hours | Slow response times, partial feature failure |
| P4 - Low | Minor issue | Next business day | UI glitch, non-critical bug |

Incident Response Runbooks

Runbook: Database Connection Exhaustion

Symptoms:

  • Error: "too many connections"
  • Application timeouts
  • Slow response times

Immediate Actions:

  1. Check connection pool metrics:

    SELECT count(*) FROM pg_stat_activity;

  2. Identify problematic connections:

    SELECT pid, usename, application_name, state, query_start
    FROM pg_stat_activity
    WHERE state != 'idle'
    ORDER BY query_start;

  3. Kill long-running queries:

    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE state != 'idle'
    AND query_start < now() - interval '10 minutes';

  4. Restart application pods:

    kubectl rollout restart deployment nudgecampaign-api

Root Cause Analysis

  • Check for connection leaks in code
  • Review recent deployments
  • Analyze query patterns

Prevention

  • Implement connection pooling limits
  • Add query timeouts
  • Monitor connection metrics

Runbook: High Error Rate

Symptoms:

  • Error rate > 5%
  • 5xx status codes
  • Customer complaints

Immediate Actions:

  1. Check error logs:

    kubectl logs -n production -l app=api --tail=100

  2. Check recent deployments:

    kubectl rollout history deployment nudgecampaign-api

  3. Roll back if necessary:

    kubectl rollout undo deployment nudgecampaign-api

  4. Scale up if load-related:

    kubectl scale deployment nudgecampaign-api --replicas=6

Investigation

  • Review error tracking (Sentry)
  • Check dependency services
  • Analyze traffic patterns

Communication

  • Update status page
  • Notify affected customers
  • Post in #incidents Slack channel

Security Operations

Security Monitoring

# security-alerts.yml
groups:
  - name: security
    rules:
      - alert: SuspiciousLoginActivity
        expr: rate(failed_login_attempts[5m]) > 10
        labels:
          severity: warning
        annotations:
          summary: Suspicious login activity detected
          
      - alert: UnauthorizedAPIAccess
        expr: sum(rate(api_unauthorized_requests[5m])) > 50
        labels:
          severity: critical
        annotations:
          summary: High rate of unauthorized API requests
          
      - alert: DataExfiltration
        expr: sum(increase(data_export_size_bytes[1h])) > 1073741824  # 1 GiB in one hour
        labels:
          severity: critical
        annotations:
          summary: Unusual amount of data being exported
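
These rules only fire if the application actually exports the referenced metrics. A sketch of that instrumentation (metric names must match the PromQL expressions above; the label sets are illustrative):

// securityMetrics.ts - counters backing the security alert rules above
import { Counter } from 'prom-client'

export const failedLoginAttempts = new Counter({
  name: 'failed_login_attempts',
  help: 'Total failed login attempts',
  labelNames: ['organization']
})

export const unauthorizedRequests = new Counter({
  name: 'api_unauthorized_requests',
  help: 'Requests rejected with 401 or 403',
  labelNames: ['route']
})

export const dataExportBytes = new Counter({
  name: 'data_export_size_bytes',
  help: 'Bytes exported through data export endpoints',
  labelNames: ['organization']
})

// Example: record a failed login attempt
// failedLoginAttempts.inc({ organization: orgId })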

Security Audit Procedures

#!/bin/bash
# security-audit.sh

# Check for vulnerable dependencies
npm audit
pip check
bundle audit

# Scan Docker images
docker scan nudgecampaign/api:latest

# Check SSL certificates
echo | openssl s_client -connect api.nudgecampaign.com:443 2>/dev/null | \
  openssl x509 -noout -dates

# Review IAM permissions
aws iam get-account-authorization-details --output json | \
  jq '.UserDetailList[] | {UserName: .UserName, Policies: .AttachedManagedPolicies}'

# Check for exposed secrets
git secrets --scan

# Database security check: list table privileges granted to PUBLIC
# (pg_tables has no nspname column; information_schema is the portable way
#  to find publicly accessible tables)
psql $DATABASE_URL -c "
SELECT 
    table_schema,
    table_name,
    privilege_type
FROM information_schema.role_table_grants
WHERE grantee = 'PUBLIC'
ORDER BY table_schema, table_name;
"

Scaling Operations

Auto-Scaling Configuration

# k8s/autoscaling.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nudgecampaign-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nudgecampaign-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # requires a custom-metrics adapter (e.g. Prometheus Adapter)
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60

Database Scaling

-- Read replica via logical replication: publication on the primary
CREATE PUBLICATION nudgecampaign_replica FOR ALL TABLES;

-- On replica
CREATE SUBSCRIPTION nudgecampaign_replica
CONNECTION 'host=primary-db.nudgecampaign.com dbname=nudgecampaign user=replicator'
PUBLICATION nudgecampaign_replica;

-- Monitor replication lag
SELECT 
    client_addr,
    state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
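
The same query can back a scheduled check that pages before lag becomes user-visible. A hedged sketch using node-postgres (the 30-second threshold and PRIMARY_DATABASE_URL variable are illustrative):

// replicationLagCheck.ts - alert when replica replay lag exceeds a threshold
import { Client } from 'pg'

export async function checkReplicationLag(maxLagSeconds = 30): Promise<void> {
  const client = new Client({ connectionString: process.env.PRIMARY_DATABASE_URL })
  await client.connect()
  try {
    const { rows } = await client.query(`
      SELECT client_addr, EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
      FROM pg_stat_replication
    `)
    for (const row of rows) {
      if (Number(row.lag_seconds) > maxLagSeconds) {
        // Route to AlertManager/PagerDuty in the real system
        console.error(`Replica ${row.client_addr} lagging ${row.lag_seconds}s`)
      }
    }
  } finally {
    await client.end()
  }
}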

Disaster Recovery

DR Plan Overview

graph TB subgraph "Primary Region (us-east-1)" Primary[Production System] PrimaryDB[(Primary Database)] PrimaryS3[S3 Buckets] end subgraph "DR Region (us-west-2)" Standby[Standby System] StandbyDB[(Standby Database)] StandbyS3[S3 Replica] end Primary -->|Continuous Replication| StandbyDB PrimaryS3 -->|Cross-Region Replication| StandbyS3 subgraph "Failover Process" Detect[Detect Failure] Promote[Promote Standby] DNS[Update DNS] Verify[Verify Services] end

Failover Procedures

#!/bin/bash
# failover.sh - Disaster recovery failover script

failover_to_dr() {
    echo "Starting failover to DR region..."
    
    # 1. Promote standby database
    aws rds promote-read-replica \
        --db-instance-identifier nudgecampaign-dr-db \
        --region us-west-2
    
    # 2. Update application configuration
    aws secretsmanager update-secret \
        --secret-id nudgecampaign/production \
        --secret-string '{"DATABASE_URL": "postgresql://dr-db.nudgecampaign.com/..."}'
    
    # 3. Scale up DR environment
    aws ecs update-service \
        --cluster nudgecampaign-dr \
        --service api \
        --desired-count 4 \
        --region us-west-2
    
    # 4. Update Route53 DNS
    aws route53 change-resource-record-sets \
        --hosted-zone-id Z123456789 \
        --change-batch '{
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.nudgecampaign.com",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z098765432",
                        "DNSName": "dr-alb.us-west-2.elb.amazonaws.com",
                        "EvaluateTargetHealth": true
                    }
                }
            }]
        }'
    
    # 5. Verify services
    for i in {1..10}; do
        if curl -f https://api.nudgecampaign.com/health; then
            echo "DR site is responding"
            break
        fi
        sleep 30
    done
    
    # 6. Send notifications
    aws sns publish \
        --topic-arn arn:aws:sns:us-east-1:xxx:critical-alerts \
        --message "Failover to DR completed successfully"
}

Maintenance Procedures

Scheduled Maintenance Windows

Production Maintenance Schedule:

  • Weekly: Sunday 2:00-4:00 AM UTC (Low traffic period)
  • Monthly: First Sunday 2:00-6:00 AM UTC (Extended window)
  • Quarterly: Announced 2 weeks in advance

Zero-Downtime Deployment

#!/bin/bash
# zero-downtime-deploy.sh

deploy_application() {
    VERSION=$1
    
    echo "Starting zero-downtime deployment of version $VERSION"
    
    # 1. Build and push new image
    docker build -t nudgecampaign/api:$VERSION .
    docker push nudgecampaign/api:$VERSION
    
    # 2. Update deployment with new image (--record is deprecated; record
    #    the change cause via the kubernetes.io/change-cause annotation)
    kubectl set image deployment/nudgecampaign-api \
        api=nudgecampaign/api:$VERSION
    kubectl annotate deployment/nudgecampaign-api \
        kubernetes.io/change-cause="deploy $VERSION" --overwrite
    
    # 3. Wait for rollout to complete
    kubectl rollout status deployment/nudgecampaign-api
    
    # 4. Run smoke tests
    npm run test:smoke
    
    # 5. Check metrics and roll back if they regress
    # (test the function's exit status directly rather than relying on $?)
    if ! check_deployment_metrics; then
        echo "Deployment failed, rolling back..."
        kubectl rollout undo deployment/nudgecampaign-api
    fi
}

check_deployment_metrics() {
    # Check error rate
    ERROR_RATE=$(curl -s http://prometheus:9090/api/v1/query \
        -d 'query=rate(http_requests_total{status=~"5.."}[5m])' \
        | jq -r '.data.result[0].value[1] // "0"')
    
    if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
        echo "Error rate too high: $ERROR_RATE"
        return 1
    fi
    
    # Check response time
    RESPONSE_TIME=$(curl -s http://prometheus:9090/api/v1/query \
        -d 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))' \
        | jq -r '.data.result[0].value[1] // "0"')
    
    if (( $(echo "$RESPONSE_TIME > 0.5" | bc -l) )); then
        echo "Response time too slow: $RESPONSE_TIME"
        return 1
    fi
    
    return 0
}

Troubleshooting Guide

Common Issues and Solutions

Issue: Slow Database Queries

Symptoms:

  • High response times
  • Database CPU > 80%
  • Slow query logs

Diagnosis:

-- Find slow queries (PostgreSQL 13+ column names)
SELECT 
    query,
    calls,
    mean_exec_time,
    total_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 100
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Check for missing indexes
SELECT 
    schemaname,
    tablename,
    attname,
    n_distinct,
    correlation
FROM pg_stats
WHERE tablename = 'problem_table';

Solution:

  1. Add missing indexes
  2. Optimize query structure
  3. Implement caching
  4. Consider read replicas

Issue: Memory Leaks

Symptoms:

  • Increasing memory usage over time
  • Container OOM kills
  • Performance degradation

Diagnosis:

# Check memory usage
kubectl top pods

# Trigger a heap snapshot (assumes the Node process was started with
# --heapsnapshot-signal=SIGUSR2; Node writes Heap.<timestamp>.heapsnapshot
# to its working directory)
kubectl exec -it pod-name -- kill -USR2 1
kubectl cp pod-name:/app/Heap.<timestamp>.heapsnapshot ./heap.heapsnapshot

# Analyze the snapshot in Chrome DevTools (Memory tab) or a profiler

Solution:

  1. Fix memory leaks in code
  2. Adjust container memory limits
  3. Implement proper cleanup
  4. Add memory monitoring

Issue: Email Delivery Failures

Symptoms:

  • Low delivery rates
  • Bounce notifications
  • Customer complaints

Diagnosis:

// Check Postmark outbound statistics
const stats = await postmarkClient.getOutboundOverview()
console.log('Bounce rate:', stats.BounceRate)
console.log('Spam complaint rate:', stats.SpamComplaintsRate)

// Domain settings live under the account-level API, so this assumes an
// AccountClient created with the account token
const { Domains } = await accountClient.getDomains()
Domains.forEach(domain => {
    console.log(domain.Name, domain.SPFVerified, domain.DKIMVerified)
})

Solution:

  1. Verify domain authentication
  2. Check IP reputation
  3. Review email content
  4. Implement list hygiene

Emergency Contacts

| Role | Name | Phone | Email |
| --- | --- | --- | --- |
| On-Call Engineer | Rotation | Via PagerDuty | oncall@nudgecampaign.com |
| VP Engineering | John Smith | +1-555-0100 | john@nudgecampaign.com |
| Database Admin | Jane Doe | +1-555-0101 | jane@nudgecampaign.com |
| Security Lead | Bob Wilson | +1-555-0102 | security@nudgecampaign.com |

Vendor Support

| Service | Support Level | Contact | Response Time |
| --- | --- | --- | --- |
| AWS | Enterprise | AWS Support Console | 15 minutes (critical) |
| Postmark | Priority | support@postmarkapp.com | 1 hour |
| Stripe | Premium | Stripe Dashboard | 4 hours |
| Datadog | Pro | support@datadoghq.com | 24 hours |

Conclusion

This operations and maintenance manual provides comprehensive procedures for managing NudgeCampaign in production. Regular maintenance, proactive monitoring, and well-practiced incident response procedures ensure high availability and performance.

Key operational priorities:

  1. Monitoring: Continuous observation of system health
  2. Maintenance: Regular optimization and cleanup
  3. Security: Ongoing threat detection and mitigation
  4. Scalability: Prepared for growth with auto-scaling
  5. Recovery: Tested backup and disaster recovery procedures

Remember to keep this documentation updated as systems and procedures evolve.