Infrastructure in a Box: Dev, Staging, and Production Deployment Pipeline
October 27, 2024
10 min read
Modern infrastructure requires consistent, repeatable deployments across development, staging, and production environments. Learn how to build a complete "Infrastructure in a Box" solution with automated promotion pipelines.
The Infrastructure Pipeline
Environment Progression
Developer Laptop (Local Dev)
↓
Development Environment (Shared Dev)
↓
Staging Environment (Pre-Production)
↓
Production Environment (Live)
Architecture Overview
Complete Stack
Infrastructure Layers:
Compute:
- Kubernetes clusters (dev/staging/prod)
- VM infrastructure (VMware/KVM)
- Bare metal servers
Networking:
- VLANs per environment
- Load balancers
- Firewalls
- DNS
Storage:
- Persistent volumes
- Object storage (S3/MinIO)
- Databases
Observability:
- Prometheus + Grafana
- ELK Stack
- Distributed tracing
Security:
- Vault for secrets
- Network policies
- RBAC
- Certificate management
Environment Specifications
Development Environment
Purpose: Rapid iteration and testing
Scale: Minimal resources
Characteristics:
- Shared by development team
- Frequent deployments (10+ per day)
- Short-lived feature branches
- Relaxed security policies
- Mock external services
Infrastructure:
Kubernetes:
Nodes: 3 (small VMs)
CPU: 4 cores per node
Memory: 16GB per node
Storage: 100GB per node
Databases:
Type: Containerized
Persistence: Optional
Backups: None
Networking:
VLAN: 10
Subnet: 10.10.0.0/24
Internet: Restricted
Cost: ~$500/month
Staging Environment
Purpose: Pre-production validation
Scale: Production-like
Characteristics:
- Mirror of production
- Automated testing
- Performance testing
- Integration testing
- Security scanning
Infrastructure:
Kubernetes:
Nodes: 5 (medium VMs)
CPU: 8 cores per node
Memory: 32GB per node
Storage: 500GB per node
Databases:
Type: Dedicated instances
Persistence: Required
Backups: Daily
Networking:
VLAN: 20
Subnet: 10.20.0.0/24
Internet: Controlled
Cost: ~$2,000/month
Production Environment
Purpose: Live customer-facing services
Scale: Full redundancy
Characteristics:
- High availability
- Auto-scaling
- Disaster recovery
- Strict security
- Full monitoring
Infrastructure:
Kubernetes:
Nodes: 10+ (large VMs/bare metal)
CPU: 16+ cores per node
Memory: 64GB+ per node
Storage: 1TB+ per node
Databases:
Type: HA clusters
Persistence: Required
Backups: Hourly + continuous replication
Networking:
VLAN: 30
Subnet: 10.30.0.0/24
Internet: Full access (firewalled)
Cost: ~$10,000+/month
Infrastructure as Code
Terraform Workspace Structure
# Directory structure
infrastructure/
├── environments/
│ ├── dev/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── terraform.tfvars
│ ├── staging/
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── terraform.tfvars
│ └── production/
│ ├── main.tf
│ ├── variables.tf
│ └── terraform.tfvars
├── modules/
│ ├── kubernetes/
│ ├── networking/
│ ├── storage/
│ └── monitoring/
└── shared/
└── backend.tf
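With this layout, each environment is planned and applied from its own directory against its own state key. As a sketch, a small wrapper can assemble the per-environment Terraform command lines; the wrapper and its names are illustrative, and only the `environments/<env>` layout comes from the tree above:

```python
# Build the init + plan/apply command lines for one environment.
# Illustrative helper: it assumes the directory layout above and a
# `terraform` binary on PATH if you actually execute the commands.

ENVIRONMENTS = ("dev", "staging", "production")

def terraform_commands(env: str, action: str = "plan") -> list[list[str]]:
    """Return the command lines to run for a given environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    workdir = f"environments/{env}"
    return [
        # -chdir keeps the working directory explicit and CI-friendly
        ["terraform", f"-chdir={workdir}", "init", "-input=false"],
        ["terraform", f"-chdir={workdir}", action, "-input=false"],
    ]
```

Keeping the command construction in one place means dev, staging, and production are always invoked the same way, which is the point of the shared-module structure.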
# environments/dev/main.tf
terraform {
backend "s3" {
bucket = "terraform-state"
key = "dev/terraform.tfstate"
region = "us-east-1"
}
}
module "kubernetes" {
source = "../../modules/kubernetes"
environment = "dev"
cluster_name = "dev-cluster"
node_count = 3
node_size = "small"
node_cpu = 4
node_memory = 16384
network_cidr = "10.10.0.0/24"
vlan_id = 10
}
module "monitoring" {
source = "../../modules/monitoring"
environment = "dev"
retention_days = 7
alert_channels = ["slack-dev"]
}
# environments/dev/terraform.tfvars
environment = "dev"
region = "datacenter-1"
cost_center = "engineering"
# Resource limits for dev
max_cpu_per_pod = 2
max_memory_per_pod = 4096
max_pods_per_node = 50
Environment-Specific Configurations
# modules/kubernetes/variables.tf
variable "environment" {
description = "Environment name"
type = string
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Environment must be dev, staging, or production."
}
}
variable "node_count" {
description = "Number of Kubernetes nodes"
type = number
default = 3
}
# Environment-specific defaults
locals {
env_config = {
dev = {
node_size = "small"
backup_enabled = false
ha_enabled = false
monitoring_tier = "basic"
}
staging = {
node_size = "medium"
backup_enabled = true
ha_enabled = true
monitoring_tier = "standard"
}
production = {
node_size = "large"
backup_enabled = true
ha_enabled = true
monitoring_tier = "premium"
}
}
config = local.env_config[var.environment]
}
GitOps Deployment Pipeline
Repository Structure
gitops-infrastructure/
├── apps/
│ ├── dev/
│ │ ├── application-a/
│ │ ├── application-b/
│ │ └── kustomization.yaml
│ ├── staging/
│ │ ├── application-a/
│ │ ├── application-b/
│ │ └── kustomization.yaml
│ └── production/
│ ├── application-a/
│ ├── application-b/
│ └── kustomization.yaml
├── infrastructure/
│ ├── base/
│ │ ├── ingress/
│ │ ├── monitoring/
│ │ └── storage/
│ └── overlays/
│ ├── dev/
│ ├── staging/
│ └── production/
└── clusters/
├── dev-cluster.yaml
├── staging-cluster.yaml
└── production-cluster.yaml
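Because the three `apps/<env>` trees are meant to stay in lockstep, drift between them can be detected mechanically. A minimal parity check, assuming only the directory layout above (the helper name is hypothetical):

```python
from pathlib import Path

def parity_gaps(repo_root: str) -> dict[str, set[str]]:
    """Report apps present in apps/dev but missing from the other
    environment directories. Assumes the gitops tree shown above."""
    apps = Path(repo_root) / "apps"
    dev_apps = {p.name for p in (apps / "dev").iterdir() if p.is_dir()}
    gaps: dict[str, set[str]] = {}
    for env in ("staging", "production"):
        present = {p.name for p in (apps / env).iterdir() if p.is_dir()}
        missing = dev_apps - present
        if missing:
            gaps[env] = missing
    return gaps
```

Run as a CI check on the gitops repo, an empty result means every application that exists in dev also has a staging and production definition.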
ArgoCD Application Definitions
# clusters/dev-cluster.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: dev-applications
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/gitops-infrastructure
targetRevision: main
path: apps/dev
destination:
server: https://dev-cluster.local
namespace: default
syncPolicy:
automated:
prune: true
selfHeal: true
allowEmpty: false
syncOptions:
- CreateNamespace=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
---
# Application with environment-specific config
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: web-app-dev
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/web-app
targetRevision: develop
path: k8s/overlays/dev
kustomize:
images:
- company/web-app:dev-latest
destination:
server: https://dev-cluster.local
namespace: web-app
syncPolicy:
automated:
prune: true
selfHeal: true
Kustomize Overlays
# apps/base/web-app/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 1
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
spec:
containers:
- name: web-app
image: company/web-app:latest
ports:
- containerPort: 8080
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
---
# apps/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/web-app
replicas:
- name: web-app
count: 1
images:
- name: company/web-app
newTag: dev-latest
configMapGenerator:
- name: web-app-config
literals:
- ENVIRONMENT=development
- LOG_LEVEL=debug
- ENABLE_DEBUG=true
---
# apps/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/web-app
replicas:
- name: web-app
count: 2
images:
- name: company/web-app
newTag: staging-v1.2.3
configMapGenerator:
- name: web-app-config
literals:
- ENVIRONMENT=staging
- LOG_LEVEL=info
- ENABLE_DEBUG=false
---
# apps/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base/web-app
replicas:
- name: web-app
count: 5
images:
- name: company/web-app
newTag: v1.2.3
configMapGenerator:
- name: web-app-config
literals:
- ENVIRONMENT=production
- LOG_LEVEL=warn
- ENABLE_DEBUG=false
patches:
- path: production-resources.yaml
- path: production-hpa.yaml
CI/CD Pipeline
GitLab CI Configuration
# .gitlab-ci.yml
stages:
- build
- test
- deploy-dev
- deploy-staging
- deploy-production
variables:
DOCKER_REGISTRY: registry.company.local
APP_NAME: web-app
build:
stage: build
script:
- docker build -t $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA .
- docker push $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA
# Tag pipelines also push the release tag so staging/production deploys can reference it
- if [ -n "$CI_COMMIT_TAG" ]; then docker tag $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_TAG && docker push $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_TAG; fi
only:
- branches
- tags
test:
stage: test
script:
- docker run $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA npm test
- docker run $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA npm run lint
only:
- branches
- tags
deploy-dev:
stage: deploy-dev
script:
# Update image tag in GitOps repo
- git clone https://github.com/company/gitops-infrastructure
- cd gitops-infrastructure
- |
cd apps/overlays/dev
kustomize edit set image company/web-app=$DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA
- git config user.email "ci@company.local"
- git config user.name "ci-bot"
- git add .
- git commit -m "Update dev to $CI_COMMIT_SHA"
- git push
only:
- develop
environment:
name: development
url: https://dev.company.local
deploy-staging:
stage: deploy-staging
script:
- git clone https://github.com/company/gitops-infrastructure
- cd gitops-infrastructure
- |
cd apps/overlays/staging
kustomize edit set image company/web-app=$DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_TAG
- git config user.email "ci@company.local"
- git config user.name "ci-bot"
- git add .
- git commit -m "Update staging to $CI_COMMIT_TAG"
- git push
only:
- tags
when: manual
environment:
name: staging
url: https://staging.company.local
deploy-production:
stage: deploy-production
script:
- git clone https://github.com/company/gitops-infrastructure
- cd gitops-infrastructure
- |
cd apps/overlays/production
kustomize edit set image company/web-app=$DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_TAG
- git config user.email "ci@company.local"
- git config user.name "ci-bot"
- git add .
- git commit -m "Update production to $CI_COMMIT_TAG"
- git push
only:
- tags
when: manual
environment:
name: production
url: https://company.com
# Approval is enforced by `when: manual` together with a protected
# environment that requires named approvers in GitLab
Promotion Strategy
Automated Promotion Flow
Development:
Trigger: Push to develop branch
Deployment: Automatic
Testing: Unit tests, linting
Approval: None required
Rollback: Automatic on failure
Staging:
Trigger: Git tag creation
Deployment: Manual approval
Testing:
- Integration tests
- Performance tests
- Security scans
- Smoke tests
Approval: Tech lead
Rollback: Manual
Production:
Trigger: Staging validation passes
Deployment: Manual approval
Testing:
- Canary deployment (10%)
- Full deployment (100%)
Approval: Engineering manager + Product owner
Rollback: Automated on health check failure
Canary Deployment
# Production canary deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: web-app
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
service:
port: 8080
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
webhooks:
- name: load-test
url: http://flagger-loadtester/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://web-app-canary:8080/"
Environment Parity
Configuration Management
# Shared configuration (all environments)
shared_config:
app_name: web-app
port: 8080
health_check_path: /health
metrics_path: /metrics
# Environment-specific overrides
dev_config:
replicas: 1
cpu_request: 100m
memory_request: 128Mi
cpu_limit: 500m
memory_limit: 512Mi
log_level: debug
enable_profiling: true
database_url: postgresql://dev-db:5432/app
staging_config:
replicas: 2
cpu_request: 500m
memory_request: 512Mi
cpu_limit: 2000m
memory_limit: 2Gi
log_level: info
enable_profiling: false
database_url: postgresql://staging-db:5432/app
production_config:
replicas: 5
cpu_request: 1000m
memory_request: 1Gi
cpu_limit: 4000m
memory_limit: 4Gi
log_level: warn
enable_profiling: false
database_url: postgresql://prod-db:5432/app
autoscaling:
min_replicas: 5
max_replicas: 20
target_cpu: 70
Secrets Management
Vault Integration
# Vault secrets per environment
vault/
├── dev/
│ ├── database-credentials
│ ├── api-keys
│ └── certificates
├── staging/
│ ├── database-credentials
│ ├── api-keys
│ └── certificates
└── production/
├── database-credentials
├── api-keys
└── certificates
# Kubernetes External Secrets
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
namespace: production
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: SecretStore
target:
name: app-secrets
creationPolicy: Owner
data:
- secretKey: database-password
remoteRef:
key: production/database-credentials
property: password
- secretKey: api-key
remoteRef:
key: production/api-keys
property: external-api
Monitoring and Observability
Environment-Specific Dashboards
Grafana Dashboards:
Development:
- Application metrics
- Error rates
- Response times
- Resource usage
Staging:
- All dev metrics
- Load test results
- Performance benchmarks
- Cost tracking
Production:
- All staging metrics
- SLA compliance
- Business metrics
- Capacity planning
- Incident tracking
Alerting Strategy
Development:
Alerts: Slack only
Severity: Info
On-call: None
Staging:
Alerts: Slack + Email
Severity: Warning
On-call: Optional
Production:
Alerts: PagerDuty + Slack + Email
Severity: Critical
On-call: Required (24/7)
Escalation: 15 min → Manager → Director
Cost Management
Environment Cost Tracking
Monthly Infrastructure Costs:
Development:
Compute: $300
Storage: $100
Network: $50
Monitoring: $50
Total: ~$500
Staging:
Compute: $1,200
Storage: $400
Network: $200
Monitoring: $200
Total: ~$2,000
Production:
Compute: $6,000
Storage: $2,000
Network: $1,000
Monitoring: $500
DR/Backup: $500
Total: ~$10,000
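Each environment's total is just the sum of its line items, which makes the budget easy to verify mechanically. A quick sanity check with the figures copied from the breakdown above (the helper itself is illustrative):

```python
# Monthly line items per environment, mirroring the breakdown above.
COSTS = {
    "development": {"compute": 300, "storage": 100,
                    "network": 50, "monitoring": 50},
    "staging": {"compute": 1200, "storage": 400,
                "network": 200, "monitoring": 200},
    "production": {"compute": 6000, "storage": 2000, "network": 1000,
                   "monitoring": 500, "dr_backup": 500},
}

def monthly_total(env: str) -> int:
    """Sum one environment's line items into its monthly total."""
    return sum(COSTS[env].values())
```

Keeping the raw line items in version control alongside the infrastructure code means cost reviews can diff budgets the same way they diff configuration.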
Cost Optimization:
- Auto-shutdown dev/staging after hours
- Spot instances for non-critical workloads
- Reserved instances for production
- Storage lifecycle policies
Best Practices
Infrastructure in a Box Principles
Consistency:
- Same tools across all environments
- Infrastructure as Code for everything
- Automated testing at every stage
- Version control for all configs
Security:
- Least privilege access
- Secrets in Vault, never in Git
- Network segmentation
- Regular security scans
Reliability:
- Automated backups
- Disaster recovery testing
- Health checks and monitoring
- Graceful degradation
Efficiency:
- Resource right-sizing
- Auto-scaling policies
- Cost monitoring and alerts
- Regular optimization reviews
Conclusion
Infrastructure in a Box provides a complete, repeatable deployment pipeline from development to production. By treating infrastructure as code and implementing GitOps workflows, teams can deploy confidently and consistently across all environments.
Key Benefits:
- Consistent environments reduce "works on my machine" issues
- Automated promotion reduces human error
- GitOps provides audit trail and rollback capability
- Environment parity ensures production-like testing
- Infrastructure as Code enables rapid disaster recovery
Success Metrics:
- Deployment frequency: 10+ per day (dev), 5+ per week (staging), daily (production)
- Lead time: < 1 hour from commit to production
- Change failure rate: < 5%
- Mean time to recovery: < 15 minutes
References:
- GitOps Principles
- Terraform Best Practices
- Kubernetes Production Patterns
- The Twelve-Factor App
- Site Reliability Engineering (Google)