Production Deployment
This guide covers deploying Reflow's observability framework in production environments, including scalability considerations, security best practices, and operational procedures.
Architecture Overview
Production Architecture
graph TB
App1[Reflow App 1] --> LB[Load Balancer]
App2[Reflow App 2] --> LB
App3[Reflow App N] --> LB
LB --> TS1[Tracing Server 1]
LB --> TS2[Tracing Server 2]
TS1 --> DB[(PostgreSQL Primary)]
TS2 --> DB
DB --> Replica[(PostgreSQL Replica)]
TS1 --> Cache[(Redis Cache)]
TS2 --> Cache
Grafana[Grafana] --> DB
Grafana --> Cache
Monitor[Monitoring] --> TS1
Monitor --> TS2
Monitor --> DB
Component Responsibilities
- Reflow Applications: Generate trace events
- Load Balancer: Distribute connections across tracing servers
- Tracing Servers: Receive, process, and store trace data
- PostgreSQL: Primary data storage with replication
- Redis: Caching and real-time data
- Grafana: Visualization and dashboards
- Monitoring: Health checks and alerting
Infrastructure Requirements
Minimum Production Setup
Tracing Server:
- CPU: 2 cores
- Memory: 4GB RAM
- Storage: 50GB SSD
- Network: 1Gbps
Database (PostgreSQL):
- CPU: 4 cores
- Memory: 8GB RAM
- Storage: 200GB SSD (for data) + 100GB (for WAL)
- Network: 1Gbps
Cache (Redis):
- CPU: 2 cores
- Memory: 4GB RAM
- Storage: 20GB SSD
- Network: 1Gbps
High-Scale Production Setup
Tracing Server Cluster:
- 3+ instances
- CPU: 8 cores each
- Memory: 16GB RAM each
- Storage: 100GB SSD each
- Network: 10Gbps
Database Cluster:
- Primary + 2 replicas
- CPU: 16 cores each
- Memory: 64GB RAM each
- Storage: 1TB NVMe SSD each
- Network: 10Gbps
Cache Cluster:
- 3 instance Redis cluster
- CPU: 4 cores each
- Memory: 16GB RAM each
- Storage: 50GB SSD each
- Network: 10Gbps
Container Deployment
Docker Compose
# docker-compose.prod.yml
version: '3.8'
services:
tracing-server:
image: reflow/tracing-server:latest
deploy:
replicas: 3
resources:
limits:
cpus: '4'
memory: 8G
reservations:
cpus: '2'
memory: 4G
environment:
- RUST_LOG=info
- TRACING_DATABASE_URL=postgresql://user:pass@postgres:5432/tracing
- TRACING_REDIS_URL=redis://redis:6379
- TRACING_BIND_ADDRESS=0.0.0.0:8080
- TRACING_MAX_CONNECTIONS=1000
ports:
- "8080:8080"
networks:
- tracing-network
depends_on:
- postgres
- redis
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
postgres:
image: postgres:15
environment:
- POSTGRES_DB=tracing
- POSTGRES_USER=tracing_user
- POSTGRES_PASSWORD_FILE=/run/secrets/postgres_password
- POSTGRES_INITDB_ARGS=--auth-host=scram-sha-256
volumes:
- postgres_data:/var/lib/postgresql/data
- ./init.sql:/docker-entrypoint-initdb.d/init.sql
ports:
- "5432:5432"
networks:
- tracing-network
secrets:
- postgres_password
command: postgres -c shared_preload_libraries=pg_stat_statements
healthcheck:
test: ["CMD-SHELL", "pg_isready -U tracing_user -d tracing"]
interval: 30s
timeout: 5s
retries: 5
redis:
image: redis:7-alpine
volumes:
- redis_data:/data
ports:
- "6379:6379"
networks:
- tracing-network
command: redis-server --appendonly yes --maxmemory 2gb --maxmemory-policy allkeys-lru
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 30s
timeout: 5s
retries: 3
nginx:
image: nginx:alpine
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf
ports:
- "80:80"
- "443:443"
networks:
- tracing-network
depends_on:
- tracing-server
volumes:
postgres_data:
redis_data:
networks:
tracing-network:
driver: overlay
secrets:
postgres_password:
external: true
Kubernetes Deployment
# tracing-server-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: tracing-server
labels:
app: tracing-server
spec:
replicas: 3
selector:
matchLabels:
app: tracing-server
template:
metadata:
labels:
app: tracing-server
spec:
containers:
- name: tracing-server
image: reflow/tracing-server:latest
ports:
- containerPort: 8080
env:
- name: RUST_LOG
value: "info"
- name: TRACING_DATABASE_URL
valueFrom:
secretKeyRef:
name: tracing-secrets
key: database-url
- name: TRACING_REDIS_URL
value: "redis://redis:6379"
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: tracing-server
spec:
selector:
app: tracing-server
ports:
- protocol: TCP
port: 8080
targetPort: 8080
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: tracing-server-ingress
annotations:
nginx.ingress.kubernetes.io/rewrite-target: /
nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
tls:
- hosts:
- tracing.yourdomain.com
secretName: tracing-tls
rules:
- host: tracing.yourdomain.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: tracing-server
port:
number: 8080
PostgreSQL Configuration
# postgres-deployment.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: postgres
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:15
env:
- name: POSTGRES_DB
value: tracing
- name: POSTGRES_USER
value: tracing_user
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secrets
key: password
ports:
- containerPort: 5432
volumeMounts:
- name: postgres-storage
mountPath: /var/lib/postgresql/data
- name: postgres-config
mountPath: /etc/postgresql/postgresql.conf
subPath: postgresql.conf
resources:
requests:
memory: "4Gi"
cpu: "2000m"
limits:
memory: "8Gi"
cpu: "4000m"
volumes:
- name: postgres-config
configMap:
name: postgres-config
volumeClaimTemplates:
- metadata:
name: postgres-storage
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 200Gi
storageClassName: fast-ssd
Configuration Management
Environment-Specific Configuration
# config/production.toml
[server]
bind_address = "0.0.0.0:8080"
max_connections = 1000
worker_threads = 8
keep_alive_timeout = 30
[database]
url = "postgresql://user:pass@postgres-cluster:5432/tracing"
max_connections = 20
min_connections = 5
connection_timeout = 5000
statement_timeout = 30000
[redis]
url = "redis://redis-cluster:6379"
pool_size = 10
connection_timeout = 3000
[tracing]
batch_size = 100
batch_timeout_ms = 2000
max_event_size = 1048576 # 1MB
compression = true
[logging]
level = "info"
format = "json"
target = "stdout"
[metrics]
enabled = true
bind_address = "0.0.0.0:9090"
Secret Management
# Kubernetes secrets
kubectl create secret generic tracing-secrets \
--from-literal=database-url="postgresql://user:pass@postgres:5432/tracing" \
--from-literal=redis-url="redis://redis:6379" \
--from-literal=jwt-secret="your-jwt-secret"
kubectl create secret generic postgres-secrets \
--from-literal=password="secure-postgres-password"
# Docker secrets
echo "secure-postgres-password" | docker secret create postgres_password -
Security Configuration
TLS/SSL Setup
# nginx.conf
events {
worker_connections 1024;
}
http {
upstream tracing_backend {
server tracing-server:8080;
keepalive 32;
}
server {
listen 80;
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
server_name tracing.yourdomain.com;
ssl_certificate /etc/ssl/certs/tracing.crt;
ssl_certificate_key /etc/ssl/private/tracing.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-RSA-AES256-GCM-SHA512:DHE-RSA-AES256-GCM-SHA512;
location / {
proxy_pass http://tracing_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
}
Authentication Configuration
#![allow(unused)] fn main() { // Server configuration with authentication use reflow_tracing::auth::{AuthConfig, JwtAuth}; let auth_config = AuthConfig { jwt_secret: env::var("JWT_SECRET")?, token_expiry: Duration::from_hours(24), issuer: "reflow-tracing".to_string(), audience: "reflow-clients".to_string(), }; let server_config = ServerConfig { auth: Some(auth_config), require_auth: true, ..Default::default() }; }
Network Security
# Network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: tracing-network-policy
spec:
podSelector:
matchLabels:
app: tracing-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: reflow-client
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: postgres
ports:
- protocol: TCP
port: 5432
- to:
- podSelector:
matchLabels:
app: redis
ports:
- protocol: TCP
port: 6379
Monitoring and Observability
Prometheus Metrics
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'tracing-server'
static_configs:
- targets: ['tracing-server:9090']
metrics_path: /metrics
scrape_interval: 10s
- job_name: 'postgres'
static_configs:
- targets: ['postgres-exporter:9187']
- job_name: 'redis'
static_configs:
- targets: ['redis-exporter:9121']
Health Checks
#![allow(unused)] fn main() { // Health check endpoints use warp::Filter; let health = warp::path("health") .and(warp::get()) .map(|| { // Check database connectivity // Check Redis connectivity // Check disk space warp::reply::json(&json!({ "status": "healthy", "timestamp": Utc::now(), "checks": { "database": "ok", "redis": "ok", "disk_space": "ok" } })) }); let ready = warp::path("ready") .and(warp::get()) .map(|| { // Check if server is ready to accept traffic warp::reply::json(&json!({ "status": "ready", "timestamp": Utc::now() })) }); }
Alerting Rules
# alerting-rules.yaml
groups:
- name: tracing-server
rules:
- alert: TracingServerDown
expr: up{job="tracing-server"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Tracing server is down"
description: "Tracing server {{ $labels.instance }} has been down for more than 5 minutes"
- alert: HighLatency
expr: tracing_request_duration_seconds{quantile="0.95"} > 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s"
- alert: HighErrorRate
expr: rate(tracing_requests_total{status="error"}[5m]) > 0.1
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate"
description: "Error rate is {{ $value }} requests/second"
- alert: DatabaseConnectionsHigh
expr: pg_stat_activity_count > 80
for: 5m
labels:
severity: warning
annotations:
summary: "High number of database connections"
description: "{{ $value }} active connections to PostgreSQL"
Performance Tuning
PostgreSQL Optimization
-- postgresql.conf optimizations
shared_buffers = 2GB # 25% of RAM
effective_cache_size = 6GB # 75% of RAM
maintenance_work_mem = 512MB
work_mem = 16MB
max_connections = 200
wal_buffers = 16MB
checkpoint_completion_target = 0.9
random_page_cost = 1.1 # For SSDs
effective_io_concurrency = 200 # For SSDs
-- Enable query logging for optimization
log_min_duration_statement = 1000 # Log queries > 1s
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
Redis Optimization
# redis.conf
maxmemory 4gb
maxmemory-policy allkeys-lru
save 900 1
save 300 10
save 60 10000
tcp-keepalive 300
tcp-backlog 511
Application Tuning
#![allow(unused)] fn main() { // Server configuration for high performance let config = ServerConfig { worker_threads: num_cpus::get(), max_connections: 1000, connection_pool_size: 20, batch_size: 200, batch_timeout: Duration::from_millis(5000), compression: true, buffer_size: 65536, ..Default::default() }; }
Backup and Recovery
Database Backup
#!/bin/bash
# backup.sh
BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="tracing"
# Create backup
pg_dump -h postgres -U tracing_user -d $DB_NAME | gzip > "$BACKUP_DIR/tracing_$TIMESTAMP.sql.gz"
# Upload to S3
aws s3 cp "$BACKUP_DIR/tracing_$TIMESTAMP.sql.gz" s3://your-backup-bucket/database/
# Cleanup old backups (keep 30 days)
find $BACKUP_DIR -name "tracing_*.sql.gz" -mtime +30 -delete
Automated Backup with CronJob
# backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: postgres-backup
spec:
schedule: "0 2 * * *" # Daily at 2 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: postgres-backup
image: postgres:15
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: postgres-secrets
key: password
command:
- /bin/bash
- -c
- |
pg_dump -h postgres -U tracing_user tracing | gzip > /backup/tracing_$(date +%Y%m%d_%H%M%S).sql.gz
# Upload to cloud storage
aws s3 cp /backup/tracing_*.sql.gz s3://backup-bucket/
volumeMounts:
- name: backup-storage
mountPath: /backup
volumes:
- name: backup-storage
persistentVolumeClaim:
claimName: backup-pvc
restartPolicy: OnFailure
Disaster Recovery
#!/bin/bash
# restore.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup_file>"
exit 1
fi
# Download backup from S3
aws s3 cp "s3://your-backup-bucket/database/$BACKUP_FILE" ./
# Restore database
gunzip -c "$BACKUP_FILE" | psql -h postgres -U tracing_user -d tracing
echo "Database restored from $BACKUP_FILE"
Scaling Strategies
Horizontal Scaling
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: tracing-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: tracing-server
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Database Scaling
-- Read replicas configuration
-- Primary server
ALTER SYSTEM SET wal_level = replica;
ALTER SYSTEM SET max_wal_senders = 3;
ALTER SYSTEM SET max_replication_slots = 3;
SELECT pg_reload_conf();
-- Create replication slot
SELECT pg_create_physical_replication_slot('replica_1');
-- Replica server setup
standby_mode = 'on'
primary_conninfo = 'host=postgres-primary port=5432 user=replicator'
Maintenance Procedures
Rolling Updates
#!/bin/bash
# rolling-update.sh
kubectl set image deployment/tracing-server tracing-server=reflow/tracing-server:v2.0.0
kubectl rollout status deployment/tracing-server
kubectl rollout history deployment/tracing-server
Database Maintenance
-- Regular maintenance tasks
VACUUM ANALYZE tracing.events;
VACUUM ANALYZE tracing.traces;
REINDEX INDEX CONCURRENTLY idx_events_timestamp;
-- Partition maintenance
SELECT create_monthly_partitions('tracing.events', '2024-01-01'::date);
SELECT drop_old_partitions('tracing.events', interval '90 days');
Log Rotation
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /var/log/containers/*tracing-server*.log
Parser docker
Tag tracing.*
[OUTPUT]
Name es
Match tracing.*
Host elasticsearch.logging.svc.cluster.local
Port 9200
Index tracing-logs
Type _doc
Troubleshooting
Common Issues
High Memory Usage:
# Check memory usage
kubectl top pods
kubectl describe pod tracing-server-xxx
# Adjust memory limits
kubectl patch deployment tracing-server -p '{"spec":{"template":{"spec":{"containers":[{"name":"tracing-server","resources":{"limits":{"memory":"8Gi"}}}]}}}}'
Database Connection Issues:
-- Check active connections
SELECT count(*) FROM pg_stat_activity;
-- Kill long-running queries
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '10 minutes';
Performance Issues:
# Check metrics
curl http://tracing-server:9090/metrics | grep -E "latency|throughput|errors"
# Scale up
kubectl scale deployment tracing-server --replicas=10
This production deployment guide provides a comprehensive foundation for running Reflow's observability framework at scale with proper security, monitoring, and operational procedures.