Canary deployments are a strategic approach to software release where a new version of an application is gradually rolled out to a small subset of users. This allows for testing in a live production environment while minimizing the risk of widespread impact.
Flagger is an open-source progressive delivery tool that automates canary deployments and rollbacks in Kubernetes. It simplifies the process and provides advanced features for monitoring, alerting, and automatic rollbacks.
Key Components of Flagger Canary Deployments:
Traffic Splitting/Routing: Flagger creates a new deployment for the canary version, typically a replica of the stable deployment. A portion of the incoming traffic is directed to the canary deployment, while the rest goes to the stable one. Flagger uses a traffic routing mechanism, such as Istio or a Kubernetes ingress controller, to split traffic between the stable and canary deployments (see the sketch after this list).
Monitoring and Analysis/Metric-Based Scaling: The canary deployment is closely monitored for performance, errors, and user behavior. Flagger tracks key metrics like error rates, latency, and request rates. If the canary deployment meets the predefined criteria, Flagger automatically increases the traffic weight routed to it.
Automated Rollback: If the canary deployment fails to meet the criteria or experiences issues, Flagger automatically rolls back to the stable deployment.
Progressive Rollout: If the canary deployment performs satisfactorily, the traffic is gradually shifted to it, eventually reaching 100%.
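With the NGINX ingress controller, the traffic split is implemented through the controller's canary annotations: Flagger clones the app's ingress into a canary ingress and adjusts the weight on every analysis step. A minimal sketch of what such a generated canary ingress looks like for the http-echo example used later in this doc (names and the weight value are illustrative; Flagger creates and manages this object itself):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  # illustrative: Flagger derives the name from the referenced ingress
  name: http-echo-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # raised by stepWeight after each successful metric check
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  ingressClassName: nginx
  rules:
    - host: http-echo.localhost
      http:
        paths:
          - path: "/ping"
            pathType: Exact
            backend:
              service:
                # traffic at this weight goes to the canary service
                name: http-echo-canary
                port:
                  number: 80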
Flagger implements several deployment strategies (canary releases, A/B testing, Blue/Green, Blue/Green mirroring) using a service mesh (App Mesh, Istio, Linkerd, Kuma, Open Service Mesh) or an ingress controller (Contour, Gloo, NGINX, Skipper, Traefik, APISIX) for traffic routing. For release analysis, Flagger can query Prometheus, InfluxDB, Datadog, New Relic, CloudWatch, Stackdriver or Graphite, and for alerting it supports Slack, MS Teams, Discord and Rocket. More info can be found here: https://docs.flagger.app/
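For example, the release analysis can be extended with custom Prometheus queries via Flagger's MetricTemplate CRD. A minimal sketch against the Prometheus instance bundled with the Flagger chart (the metric labels, namespace and threshold are assumptions for the http-echo example below):
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency
  namespace: ingress-nginx
spec:
  provider:
    type: prometheus
    # Prometheus installed by the Flagger helm chart (see install steps below)
    address: http://flagger-prometheus.ingress-nginx:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          nginx_ingress_controller_request_duration_seconds_bucket{
            namespace="localhost",
            ingress="http-echo"
          }[1m]
        )
      ) by (le)
    )
Such a template is then referenced from the canary's analysis section, e.g.:
metrics:
  - name: latency
    templateRef:
      name: latency
      namespace: ingress-nginx
    # p99 latency threshold in seconds (assumed value)
    thresholdRange:
      max: 0.5
    interval: 1m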
Flagger canary deployment with NGINX:
The main doc describing how it works: https://docs.flagger.app/tutorials/nginx-progressive-delivery
Here is how Flagger can be installed on a cluster using NGINX (TODO: have a terraform module to install this chart on any cluster):
# scripts to run and create a test/sample canary deployment
# add the required helm repos
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add flagger https://flagger.app
helm repo update
kubectl create ns ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--set controller.metrics.enabled=true \
--set controller.podAnnotations."prometheus\.io/scrape"=true \
--set controller.podAnnotations."prometheus\.io/port"=10254
# metrics-server
helm upgrade --install metrics-server metrics-server/metrics-server \
--namespace kube-system \
--set "args[0]=--kubelet-insecure-tls"
# flagger
helm upgrade --install flagger flagger/flagger \
--namespace ingress-nginx \
--set prometheus.install=true \
--set meshProvider=nginx
# monitor canaries
watch kubectl get canaries --all-namespaces
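# during a rollout the watch output looks roughly like this (illustrative values):
# NAMESPACE   NAME        STATUS        WEIGHT   LASTTRANSITIONTIME
# localhost   http-echo   Progressing   5        2024-01-16T14:05:07Z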
# install test http-echo app
kubectl create namespace localhost
kubectl config set-context --current --namespace=localhost
kubectl apply -n localhost -f http-echo-manifest.yaml
# flagger load tester to load test the app
helm upgrade --install flagger-loadtester flagger/loadtester \
--namespace=localhost

# http-echo-manifest.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: http-echo
  template:
    metadata:
      labels:
        app: http-echo
    spec:
      containers:
        - name: http-echo
          image: mendhak/http-https-echo:34
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 5m
---
apiVersion: v1
kind: Service
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
      name: http
  selector:
    app: http-echo
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  ingressClassName: nginx
  rules:
    - host: http-echo.localhost
      http:
        paths:
          - path: "/ping"
            pathType: Exact
            backend:
              service:
                name: http-echo
                port:
                  number: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http-echo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # scale up if usage is above
          # 99% of the requested CPU (5m)
          averageUtilization: 99
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: http-echo
spec:
  provider: nginx
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http-echo
  # ingress reference
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: http-echo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: http-echo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 80
    # container port number or name
    targetPort: 8080
  analysis:
    # schedule interval (default 60s)
    interval: 10s
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # NGINX Prometheus checks
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.localhost/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://http-echo-canary/ping | grep ping"
      - name: load-test
        url: http://flagger-loadtester.localhost/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 3 -c 1 http://http-echo.localhost/ping"
Important notes:
Flagger supports load testing the canary deployment before switching traffic over to the primary (see the trigger sketch below).
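A rollout run can be triggered by updating the target deployment's container image: Flagger detects the change, runs the pre-rollout acceptance test, then starts shifting traffic while the load test runs. A sketch for the http-echo example (the new image tag is an assumption for illustration):
# trigger a canary analysis by bumping the app image (tag is illustrative)
kubectl -n localhost set image deployment/http-echo http-echo=mendhak/http-https-echo:35
# follow the rollout events
kubectl -n localhost describe canary/http-echo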
The Canary kind Kubernetes resource attaches to and overrides the app's ingress/deployment/service/HPA resources and creates custom ones (named with -canary and -primary suffixes), so subsequent helm/kubectl updates of the app can reset those overrides. We avoid this by not creating the Service resource explicitly and letting the Canary create and manage it. The new dasmeta base chart already supports Flagger canary deployments: https://github.com/dasmeta/helm/tree/main/examples/base/with-canary-deployment
The eks module supports enabling the Flagger package/operator, and there is also an example of how it can be configured for NGINX: https://github.com/dasmeta/terraform-aws-eks/tree/main/examples/eks-with-flagger
Flagger supports both global and custom notification setups. The dasmeta base helm chart, starting from version 0.2.8, supports setting custom alerting configs, and the flagger-metrics-and-alerts chart allows creating alert providers. The terraform aws eks module has also been updated to support alerting config. For Slack channel notification setups, a webhook needs to be created using the legacy Slack webhook integration; more info can be found here: https://medium.com/@life-is-short-so-enjoy-it/slack-post-message-with-incoming-webhooks-e8d588fdbe89
TODO: There is a need to create a detailed dashboard in Prometheus to see the canary deployment process in detail; the canary/primary pod/ingress metrics can be used, and Flagger also exposes custom metrics that provide additional information.
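Relatedly, per-canary (custom) alerting can be wired through Flagger's AlertProvider CRD; a minimal sketch using a Slack webhook stored in a secret (all names and the channel are illustrative):
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: on-call
  namespace: ingress-nginx
spec:
  type: slack
  channel: on-call-alerts
  username: flagger
  # secret holding the Slack webhook URL under the "address" key
  secretRef:
    name: on-call-url
---
apiVersion: v1
kind: Secret
metadata:
  name: on-call-url
  namespace: ingress-nginx
data:
  address: <base64 encoded Slack webhook URL>
# referenced from a canary like this:
# spec:
#   analysis:
#     alerts:
#       - name: "on-call Slack"
#         severity: error
#         providerRef:
#           name: on-call
#           namespace: ingress-nginx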