Infrastructure in a Box: Dev, Staging, and Production Deployment Pipeline

October 27, 2024
10 min read

Modern infrastructure requires consistent, repeatable deployments across development, staging, and production environments. Learn how to build a complete "Infrastructure in a Box" solution with automated promotion pipelines.

The Infrastructure Pipeline

Environment Progression

Developer Laptop (Local Dev)
          ↓
Development Environment (Shared Dev)
          ↓
Staging Environment (Pre-Production)
          ↓
Production Environment (Live)

Architecture Overview

Complete Stack

Infrastructure Layers:
  Compute:
    - Kubernetes clusters (dev/staging/prod)
    - VM infrastructure (VMware/KVM)
    - Bare metal servers
  Networking:
    - VLANs per environment
    - Load balancers
    - Firewalls
    - DNS
  Storage:
    - Persistent volumes
    - Object storage (S3/MinIO)
    - Databases
  Observability:
    - Prometheus + Grafana
    - ELK Stack
    - Distributed tracing
  Security:
    - Vault for secrets
    - Network policies
    - RBAC
    - Certificate management

Environment Specifications

Development Environment

Purpose: Rapid iteration and testing
Scale: Minimal resources

Characteristics:
  - Shared by development team
  - Frequent deployments (10+ per day)
  - Short-lived feature branches
  - Relaxed security policies
  - Mock external services

Infrastructure:
  Kubernetes:
    Nodes: 3 (small VMs)
    CPU: 4 cores per node
    Memory: 16GB per node
    Storage: 100GB per node
  Databases:
    Type: Containerized
    Persistence: Optional
    Backups: None
  Networking:
    VLAN: 10
    Subnet: 10.10.0.0/24
    Internet: Restricted

Cost: ~$500/month

Staging Environment

Purpose: Pre-production validation
Scale: Production-like

Characteristics:
  - Mirror of production
  - Automated testing
  - Performance testing
  - Integration testing
  - Security scanning

Infrastructure:
  Kubernetes:
    Nodes: 5 (medium VMs)
    CPU: 8 cores per node
    Memory: 32GB per node
    Storage: 500GB per node
  Databases:
    Type: Dedicated instances
    Persistence: Required
    Backups: Daily
  Networking:
    VLAN: 20
    Subnet: 10.20.0.0/24
    Internet: Controlled

Cost: ~$2,000/month

Production Environment

Purpose: Live customer-facing services
Scale: Full redundancy

Characteristics:
  - High availability
  - Auto-scaling
  - Disaster recovery
  - Strict security
  - Full monitoring

Infrastructure:
  Kubernetes:
    Nodes: 10+ (large VMs/bare metal)
    CPU: 16+ cores per node
    Memory: 64GB+ per node
    Storage: 1TB+ per node
  Databases:
    Type: HA clusters
    Persistence: Required
    Backups: Hourly + continuous replication
  Networking:
    VLAN: 30
    Subnet: 10.30.0.0/24
    Internet: Full access (firewalled)

Cost: ~$10,000+/month
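These per-environment limits can also be enforced inside the cluster at the namespace level. A minimal sketch for dev (the namespace name and aggregate quota values are illustrative, not from the specs above; only the pod count follows directly from 3 nodes × 50 pods/node):

```yaml
# Hypothetical ResourceQuota for the shared dev namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dev-quota
  namespace: dev-apps          # assumed namespace name
spec:
  hard:
    requests.cpu: "8"          # aggregate across the namespace (illustrative)
    requests.memory: 16Gi
    limits.cpu: "12"
    limits.memory: 32Gi
    pods: "150"                # 3 nodes x 50 pods/node from the dev spec
```

Production would get a larger quota in its own namespace; the point is that each environment's resource posture is declared, not assumed.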

Infrastructure as Code

Terraform Workspace Structure

# Directory structure
infrastructure/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── kubernetes/
│   ├── networking/
│   ├── storage/
│   └── monitoring/
└── shared/
    └── backend.tf

# environments/dev/main.tf
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "dev/terraform.tfstate"
    region = "us-east-1"
  }
}

module "kubernetes" {
  source = "../../modules/kubernetes"

  environment  = "dev"
  cluster_name = "dev-cluster"
  node_count   = 3
  node_size    = "small"
  node_cpu     = 4
  node_memory  = 16384
  network_cidr = "10.10.0.0/24"
  vlan_id      = 10
}

module "monitoring" {
  source = "../../modules/monitoring"

  environment    = "dev"
  retention_days = 7
  alert_channels = ["slack-dev"]
}

# environments/dev/terraform.tfvars
environment = "dev"
region      = "datacenter-1"
cost_center = "engineering"

# Resource limits for dev
max_cpu_per_pod    = 2
max_memory_per_pod = 4096
max_pods_per_node  = 50
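The staging and production roots reuse the same modules with only the inputs changed; that is what keeps the environments in parity. A sketch of what environments/staging/main.tf might look like, mirroring the dev root above with the values from the staging spec:

```hcl
# environments/staging/main.tf (illustrative; mirrors the dev root)
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "staging/terraform.tfstate"   # separate state per environment
    region = "us-east-1"
  }
}

module "kubernetes" {
  source = "../../modules/kubernetes"

  environment  = "staging"
  cluster_name = "staging-cluster"
  node_count   = 5
  node_size    = "medium"
  node_cpu     = 8
  node_memory  = 32768
  network_cidr = "10.20.0.0/24"
  vlan_id      = 20
}
```

Because only inputs differ, a diff between the two files shows exactly how staging deviates from dev.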

Environment-Specific Configurations

# modules/kubernetes/variables.tf
variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "node_count" {
  description = "Number of Kubernetes nodes"
  type        = number
  default     = 3
}

# Environment-specific defaults
locals {
  env_config = {
    dev = {
      node_size       = "small"
      backup_enabled  = false
      ha_enabled      = false
      monitoring_tier = "basic"
    }
    staging = {
      node_size       = "medium"
      backup_enabled  = true
      ha_enabled      = true
      monitoring_tier = "standard"
    }
    production = {
      node_size       = "large"
      backup_enabled  = true
      ha_enabled      = true
      monitoring_tier = "premium"
    }
  }

  config = local.env_config[var.environment]
}
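Inside the module, `local.config` is then read wherever a value differs by environment. Illustrative only — `example_node_pool` is a placeholder resource type; substitute whatever your actual provider (vSphere, libvirt, a cloud provider) exposes:

```hcl
# Hypothetical consumption of local.config inside the module body
resource "example_node_pool" "this" {
  name       = "${var.environment}-pool"
  node_count = var.node_count
  node_size  = local.config.node_size

  backup_enabled = local.config.backup_enabled
  ha_enabled     = local.config.ha_enabled
}
```

The design choice here is that callers pass only `environment`; everything derivable from it lives in one lookup table rather than being repeated in three tfvars files.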

GitOps Deployment Pipeline

Repository Structure

gitops-infrastructure/
├── apps/
│   ├── dev/
│   │   ├── application-a/
│   │   ├── application-b/
│   │   └── kustomization.yaml
│   ├── staging/
│   │   ├── application-a/
│   │   ├── application-b/
│   │   └── kustomization.yaml
│   └── production/
│       ├── application-a/
│       ├── application-b/
│       └── kustomization.yaml
├── infrastructure/
│   ├── base/
│   │   ├── ingress/
│   │   ├── monitoring/
│   │   └── storage/
│   └── overlays/
│       ├── dev/
│       ├── staging/
│       └── production/
└── clusters/
    ├── dev-cluster.yaml
    ├── staging-cluster.yaml
    └── production-cluster.yaml
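The per-environment kustomization.yaml files in this tree simply aggregate the application directories. A sketch of apps/dev/kustomization.yaml under this layout (the application names are the placeholders from the tree above):

```yaml
# apps/dev/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - application-a
  - application-b
```

Adding an application to an environment is then a one-line change in that environment's file, which shows up cleanly in review.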

ArgoCD Application Definitions

# clusters/dev-cluster.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: dev-applications
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops-infrastructure
    targetRevision: main
    path: apps/dev
  destination:
    server: https://dev-cluster.local
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# Application with environment-specific config
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app-dev
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/web-app
    targetRevision: develop
    path: k8s/overlays/dev
    kustomize:
      images:
        - company/web-app:dev-latest
  destination:
    server: https://dev-cluster.local
    namespace: web-app
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
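Both Applications above use the default project; to guarantee a dev Application can never target the production cluster, each environment can instead get its own ArgoCD AppProject with a restricted destination list. A hedged sketch, reusing the repository and cluster URLs from the examples above:

```yaml
# Hypothetical AppProject limiting dev applications to the dev cluster
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: dev
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/company/*        # assumed org-wide allowlist
  destinations:
    - server: https://dev-cluster.local   # only the dev cluster
      namespace: "*"
```

Applications then set `project: dev` instead of `project: default`, and ArgoCD rejects any sync to a cluster outside the list.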

Kustomize Overlays

# apps/base/web-app/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: company/web-app:latest
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
# apps/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ../../base/web-app
replicas:
  - name: web-app
    count: 1
images:
  - name: company/web-app
    newTag: dev-latest
configMapGenerator:
  - name: web-app-config
    literals:
      - ENVIRONMENT=development
      - LOG_LEVEL=debug
      - ENABLE_DEBUG=true
---
# apps/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ../../base/web-app
replicas:
  - name: web-app
    count: 2
images:
  - name: company/web-app
    newTag: staging-v1.2.3
configMapGenerator:
  - name: web-app-config
    literals:
      - ENVIRONMENT=staging
      - LOG_LEVEL=info
      - ENABLE_DEBUG=false
---
# apps/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
  - ../../base/web-app
replicas:
  - name: web-app
    count: 5
images:
  - name: company/web-app
    newTag: v1.2.3
configMapGenerator:
  - name: web-app-config
    literals:
      - ENVIRONMENT=production
      - LOG_LEVEL=warn
      - ENABLE_DEBUG=false
patches:
  - path: production-resources.yaml
  - path: production-hpa.yaml
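The production overlay references two patch files that are not shown. One plausible shape for production-resources.yaml is a strategic-merge patch that raises the base deployment's requests and limits (the values here are illustrative, roughly tracking the production configuration discussed later in this post):

```yaml
# apps/overlays/production/production-resources.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  template:
    spec:
      containers:
        - name: web-app
          resources:
            requests:
              cpu: 1000m
              memory: 1Gi
            limits:
              cpu: 4000m
              memory: 4Gi
```

Kustomize merges this onto the base deployment by name, so the dev and staging overlays keep the smaller base values untouched.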

CI/CD Pipeline

GitLab CI Configuration

# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy-dev
  - deploy-staging
  - deploy-production

variables:
  DOCKER_REGISTRY: registry.company.local
  APP_NAME: web-app

build:
  stage: build
  script:
    - docker build -t $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA .
    - docker push $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA
  only:
    - branches
    - tags

test:
  stage: test
  script:
    - docker run $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA npm test
    - docker run $DOCKER_REGISTRY/$APP_NAME:$CI_COMMIT_SHA npm run lint
  only:
    - branches
    - tags

deploy-dev:
  stage: deploy-dev
  script:
    # Update image tag in GitOps repo
    - git clone https://github.com/company/gitops-infrastructure
    - cd gitops-infrastructure
    - |
      cd apps/overlays/dev
      kustomize edit set image company/web-app:dev-$CI_COMMIT_SHA
    - git add .
    - git commit -m "Update dev to $CI_COMMIT_SHA"
    - git push
  only:
    - develop
  environment:
    name: development
    url: https://dev.company.local

deploy-staging:
  stage: deploy-staging
  script:
    - git clone https://github.com/company/gitops-infrastructure
    - cd gitops-infrastructure
    - |
      cd apps/overlays/staging
      kustomize edit set image company/web-app:staging-$CI_COMMIT_TAG
    - git add .
    - git commit -m "Update staging to $CI_COMMIT_TAG"
    - git push
  only:
    - tags
  when: manual
  environment:
    name: staging
    url: https://staging.company.local

deploy-production:
  stage: deploy-production
  before_script:
    # Require approval
    - echo "Deploying to production requires approval"
  script:
    - git clone https://github.com/company/gitops-infrastructure
    - cd gitops-infrastructure
    - |
      cd apps/overlays/production
      kustomize edit set image company/web-app:$CI_COMMIT_TAG
    - git add .
    - git commit -m "Update production to $CI_COMMIT_TAG"
    - git push
  only:
    - tags
  when: manual
  environment:
    name: production
    url: https://company.com

Promotion Strategy

Automated Promotion Flow

Development:
  Trigger: Push to develop branch
  Deployment: Automatic
  Testing: Unit tests, linting
  Approval: None required
  Rollback: Automatic on failure

Staging:
  Trigger: Git tag creation
  Deployment: Manual approval
  Testing:
    - Integration tests
    - Performance tests
    - Security scans
    - Smoke tests
  Approval: Tech lead
  Rollback: Manual

Production:
  Trigger: Staging validation passes
  Deployment: Manual approval
  Testing:
    - Canary deployment (10%)
    - Full deployment (100%)
  Approval: Engineering manager + Product owner
  Rollback: Automated on health check failure

Canary Deployment

# Production canary deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://web-app-canary:8080/"

Environment Parity

Configuration Management

# Shared configuration (all environments)
shared_config:
  app_name: web-app
  port: 8080
  health_check_path: /health
  metrics_path: /metrics

# Environment-specific overrides
dev_config:
  replicas: 1
  cpu_request: 100m
  memory_request: 128Mi
  cpu_limit: 500m
  memory_limit: 512Mi
  log_level: debug
  enable_profiling: true
  database_url: postgresql://dev-db:5432/app

staging_config:
  replicas: 2
  cpu_request: 500m
  memory_request: 512Mi
  cpu_limit: 2000m
  memory_limit: 2Gi
  log_level: info
  enable_profiling: false
  database_url: postgresql://staging-db:5432/app

production_config:
  replicas: 5
  cpu_request: 1000m
  memory_request: 1Gi
  cpu_limit: 4000m
  memory_limit: 4Gi
  log_level: warn
  enable_profiling: false
  database_url: postgresql://prod-db:5432/app
  autoscaling:
    min_replicas: 5
    max_replicas: 20
    target_cpu: 70
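The production autoscaling values translate directly into a HorizontalPodAutoscaler; a manifest like this could also serve as the production-hpa.yaml patch referenced in the Kustomize overlay. A sketch, assuming the cluster supports the autoscaling/v2 API:

```yaml
# HorizontalPodAutoscaler matching production_config.autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 5
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```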

Secrets Management

Vault Integration

# Vault secrets per environment
vault/
├── dev/
│   ├── database-credentials
│   ├── api-keys
│   └── certificates
├── staging/
│   ├── database-credentials
│   ├── api-keys
│   └── certificates
└── production/
    ├── database-credentials
    ├── api-keys
    └── certificates

# Kubernetes External Secrets
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: app-secrets
    creationPolicy: Owner
  data:
    - secretKey: database-password
      remoteRef:
        key: production/database-credentials
        property: password
    - secretKey: api-key
      remoteRef:
        key: production/api-keys
        property: external-api
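The ExternalSecret above references a SecretStore named vault-backend that is not shown. A sketch of what it might look like with Kubernetes auth against Vault — the server URL, KV mount path, Vault role, and service account name are all assumptions:

```yaml
# Hypothetical SecretStore backing the ExternalSecret above
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend
  namespace: production
spec:
  provider:
    vault:
      server: https://vault.company.local:8200   # assumed address
      path: secret                               # assumed KV mount
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: production-apps                  # assumed Vault role
          serviceAccountRef:
            name: external-secrets
```

Scoping one SecretStore per namespace (rather than a ClusterSecretStore) keeps each environment's Vault role from reading another environment's paths.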

Monitoring and Observability

Environment-Specific Dashboards

Grafana Dashboards:
  Development:
    - Application metrics
    - Error rates
    - Response times
    - Resource usage
  Staging:
    - All dev metrics
    - Load test results
    - Performance benchmarks
    - Cost tracking
  Production:
    - All staging metrics
    - SLA compliance
    - Business metrics
    - Capacity planning
    - Incident tracking

Alerting Strategy

Development:
  Alerts: Slack only
  Severity: Info
  On-call: None

Staging:
  Alerts: Slack + Email
  Severity: Warning
  On-call: Optional

Production:
  Alerts: PagerDuty + Slack + Email
  Severity: Critical
  On-call: Required (24/7)
  Escalation: 15 min → Manager → Director
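In a Prometheus-based stack, these tiers can be encoded as labels on the alert rules, which Alertmanager routing then maps to Slack, email, or PagerDuty. A hedged sketch — the metric name, threshold, and job label are illustrative, not taken from a real service:

```yaml
# Illustrative PrometheusRule; the severity label drives Alertmanager routing
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-app-alerts
  namespace: production
spec:
  groups:
    - name: web-app
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="web-app",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="web-app"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical        # pages via PagerDuty in production
            environment: production
          annotations:
            summary: "web-app 5xx rate above 5% for 5 minutes"
```

The same rule deployed to dev would carry `severity: info` and match only the Slack route.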

Cost Management

Environment Cost Tracking

Monthly Infrastructure Costs:
  Development:
    Compute: $300
    Storage: $100
    Network: $50
    Monitoring: $50
    Total: ~$500
  Staging:
    Compute: $1,200
    Storage: $400
    Network: $200
    Monitoring: $200
    Total: ~$2,000
  Production:
    Compute: $6,000
    Storage: $2,000
    Network: $1,000
    Monitoring: $500
    DR/Backup: $500
    Total: ~$10,000

Cost Optimization:
  - Auto-shutdown dev/staging after hours
  - Spot instances for non-critical workloads
  - Reserved instances for production
  - Storage lifecycle policies
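The after-hours shutdown for dev and staging can be as simple as a scheduled scale-down. One possible sketch using a Kubernetes CronJob — the image, service account, and namespace are assumptions, and teams on cloud providers may prefer native scheduler or autoscaler features instead:

```yaml
# Hypothetical nightly scale-down of the dev environment
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-shutdown
  namespace: kube-system
spec:
  schedule: "0 20 * * 1-5"             # 20:00 on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scaler   # assumed SA with patch rights on deployments
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest   # assumed image
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment --all --replicas=0 -n web-app
```

A matching morning job scales the deployments back up; with GitOps self-heal enabled, note that ArgoCD may need the replica field ignored or it will undo the scale-down.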

Best Practices

Infrastructure in a Box Principles

Consistency:
  - Same tools across all environments
  - Infrastructure as Code for everything
  - Automated testing at every stage
  - Version control for all configs

Security:
  - Least privilege access
  - Secrets in Vault, never in Git
  - Network segmentation
  - Regular security scans

Reliability:
  - Automated backups
  - Disaster recovery testing
  - Health checks and monitoring
  - Graceful degradation

Efficiency:
  - Resource right-sizing
  - Auto-scaling policies
  - Cost monitoring and alerts
  - Regular optimization reviews

Conclusion

Infrastructure in a Box provides a complete, repeatable deployment pipeline from development to production. By treating infrastructure as code and implementing GitOps workflows, teams can deploy confidently and consistently across all environments.

Key Benefits:

  • Consistent environments reduce "works on my machine" issues
  • Automated promotion reduces human error
  • GitOps provides audit trail and rollback capability
  • Environment parity ensures production-like testing
  • Infrastructure as Code enables rapid disaster recovery

Success Metrics:

  • Deployment frequency: 10+ per day (dev), 5+ per week (staging), daily (production)
  • Lead time: < 1 hour from commit to production
  • Change failure rate: < 5%
  • Mean time to recovery: < 15 minutes

References:

  • GitOps Principles
  • Terraform Best Practices
  • Kubernetes Production Patterns
  • The Twelve-Factor App
  • Site Reliability Engineering (Google)