Canary deployments are a strategic approach to software release where a new version of an application is gradually rolled out to a small subset of users. This allows for testing in a live production environment while minimizing the risk of widespread impact.
Flagger is an open-source progressive delivery tool that automates canary deployments and rollbacks in Kubernetes. It simplifies the process and provides advanced features for monitoring, alerting, and automatic rollbacks.
Key Components of Flagger Canary Deployments:
Traffic Splitting/Routing: Flagger creates a new deployment for the canary version, typically a replica of the stable deployment. A portion of the incoming traffic is directed to the canary deployment, while the rest goes to the stable one. Flagger uses a traffic routing mechanism, such as Istio or a Kubernetes ingress controller, to split traffic between the stable and canary deployments (see the sketch after this list).
Monitoring and Analysis/Metric-Based Scaling: The canary deployment is closely monitored for performance, errors, and user behavior. Flagger tracks key metrics like error rates, latency, and request rates. If the canary deployment meets the predefined criteria, Flagger automatically increases the traffic weight routed to it.
Automated Rollback: If the canary deployment fails to meet the criteria or experiences issues, Flagger automatically rolls back to the stable deployment.
Progressive Rollout: If the canary deployment performs satisfactorily, the traffic is gradually shifted to it, eventually reaching 100%.
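With the NGINX ingress controller, the traffic split is implemented through the controller's canary annotations: Flagger clones the app's ingress into a canary ingress and adjusts the weight on every analysis step. A minimal sketch of what such a generated canary ingress looks like for the http-echo example used later in this doc (names and the weight value are illustrative; Flagger creates and manages this object itself):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  # illustrative: Flagger derives the name from the referenced ingress
  name: http-echo-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # raised by stepWeight after each successful metric check
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  ingressClassName: nginx
  rules:
    - host: http-echo.localhost
      http:
        paths:
          - path: "/ping"
            pathType: Exact
            backend:
              service:
                # traffic at this weight goes to the canary service
                name: http-echo-canary
                port:
                  number: 80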
Flagger implements several deployment strategies (canary releases, A/B testing, Blue/Green, Blue/Green mirroring) using a service mesh (App Mesh, Istio, Linkerd, Kuma, Open Service Mesh) or an ingress controller (Contour, Gloo, NGINX, Skipper, Traefik, APISIX) for traffic routing. For release analysis, Flagger can query Prometheus, InfluxDB, Datadog, New Relic, CloudWatch, Stackdriver or Graphite, and for alerting it supports Slack, MS Teams, Discord and Rocket. More info can be found here: https://docs.flagger.app/
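For example, the release analysis can be extended with custom Prometheus queries via Flagger's MetricTemplate CRD. A minimal sketch against the Prometheus instance bundled with the Flagger chart (the metric labels, namespace and threshold are assumptions for the http-echo example below):
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: latency
  namespace: ingress-nginx
spec:
  provider:
    type: prometheus
    # Prometheus installed by the Flagger helm chart (see install steps below)
    address: http://flagger-prometheus.ingress-nginx:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          nginx_ingress_controller_request_duration_seconds_bucket{
            namespace="localhost",
            ingress="http-echo"
          }[1m]
        )
      ) by (le)
    )
Such a template is then referenced from the canary's analysis section, e.g.:
metrics:
  - name: latency
    templateRef:
      name: latency
      namespace: ingress-nginx
    # p99 latency threshold in seconds (assumed value)
    thresholdRange:
      max: 0.5
    interval: 1m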
Flagger canary deployment with NGINX:
The main doc describing how it works: https://docs.flagger.app/tutorials/nginx-progressive-delivery
Here is how Flagger can be installed on a cluster using NGINX (TODO: have a terraform module to install this chart on any cluster):
# scripts to run and create a test/sample canary deployment
# add the required helm repos
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm repo add flagger https://flagger.app
helm repo update
kubectl create ns ingress-nginx
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
--namespace ingress-nginx \
--set controller.metrics.enabled=true \
--set controller.podAnnotations."prometheus\.io/scrape"=true \
--set controller.podAnnotations."prometheus\.io/port"=10254
# metrics-server
helm upgrade --install metrics-server metrics-server/metrics-server \
--namespace kube-system \
--set "args[0]=--kubelet-insecure-tls"
# flagger
helm upgrade --install flagger flagger/flagger \
--namespace ingress-nginx \
--set prometheus.install=true \
--set meshProvider=nginx
# monitor canaries
watch kubectl get canaries --all-namespaces
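# during a rollout the watch output looks roughly like this (illustrative values):
# NAMESPACE   NAME        STATUS        WEIGHT   LASTTRANSITIONTIME
# localhost   http-echo   Progressing   5        2024-01-16T14:05:07Z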
# install test http-echo app
kubectl create namespace localhost
kubectl config set-context --current --namespace=localhost
kubectl apply -n localhost -f http-echo-manifest.yaml
# flagger load tester to load test the app
helm upgrade --install flagger-loadtester flagger/loadtester \
--namespace=localhost

# http-echo-manifest.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: http-echo
  template:
    metadata:
      labels:
        app: http-echo
    spec:
      containers:
        - name: http-echo
          image: mendhak/http-https-echo:34
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: 5m
---
apiVersion: v1
kind: Service
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 8080
      name: http
  selector:
    app: http-echo
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  ingressClassName: nginx
  rules:
    - host: http-echo.localhost
      http:
        paths:
          - path: "/ping"
            pathType: Exact
            backend:
              service:
                name: http-echo
                port:
                  number: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: http-echo
  labels:
    app: http-echo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http-echo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          # scale up if usage is above
          # 99% of the requested CPU (5m)
          averageUtilization: 99
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: http-echo
spec:
  provider: nginx
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: http-echo
  # ingress reference
  ingressRef:
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    name: http-echo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: http-echo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 80
    # container port number or name
    targetPort: 8080
  analysis:
    # schedule interval (default 60s)
    interval: 10s
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # NGINX Prometheus checks
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.localhost/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://http-echo-canary/ping | grep ping"
      - name: load-test
        url: http://flagger-loadtester.localhost/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 3 -c 1 http://http-echo.localhost/ping"
Important notes:
Flagger supports load testing the canary deployment before switching traffic over to the primary (see the trigger sketch below).
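A rollout run can be triggered by updating the target deployment's container image: Flagger detects the change, runs the pre-rollout acceptance test, then starts shifting traffic while the load test runs. A sketch for the http-echo example (the new image tag is an assumption for illustration):
# trigger a canary analysis by bumping the app image (tag is illustrative)
kubectl -n localhost set image deployment/http-echo http-echo=mendhak/http-https-echo:35
# follow the rollout events
kubectl -n localhost describe canary/http-echo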
The Canary kind Kubernetes resource attaches to and overrides the app's ingress/deployment/service/HPA resources and creates custom ones (named with -canary and -primary suffixes), so subsequent helm/kubectl updates of the app can reset those overrides. We avoid this by not creating the Service resource explicitly and letting the Canary create and manage it. The new dasmeta base chart already supports Flagger canary deployments: https://github.com/dasmeta/helm/tree/main/examples/base/with-canary-deployment
The eks module supports enabling the Flagger package/operator, and there is also an example of how it can be configured for NGINX: https://github.com/dasmeta/terraform-aws-eks/tree/main/examples/eks-with-flagger
Flagger supports both global and custom notification setups. The dasmeta base helm chart, starting from version 0.2.8, supports setting custom alerting configs, and the flagger-metrics-and-alerts chart allows creating alert providers. The terraform aws eks module has also been updated to support alerting config. For Slack channel notification setups, a webhook needs to be created using the legacy Slack webhook integration; more info can be found here: https://medium.com/@life-is-short-so-enjoy-it/slack-post-message-with-incoming-webhooks-e8d588fdbe89
TODO: There is a need to create a detailed dashboard in Prometheus to see the canary deployment process in detail; the canary/primary pod/ingress metrics can be used, and Flagger also exposes custom metrics that provide additional information.
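Relatedly, per-canary (custom) alerting can be wired through Flagger's AlertProvider CRD; a minimal sketch using a Slack webhook stored in a secret (all names and the channel are illustrative):
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: on-call
  namespace: ingress-nginx
spec:
  type: slack
  channel: on-call-alerts
  username: flagger
  # secret holding the Slack webhook URL under the "address" key
  secretRef:
    name: on-call-url
---
apiVersion: v1
kind: Secret
metadata:
  name: on-call-url
  namespace: ingress-nginx
data:
  address: <base64 encoded Slack webhook URL>
# referenced from a canary like this:
# spec:
#   analysis:
#     alerts:
#       - name: "on-call Slack"
#         severity: error
#         providerRef:
#           name: on-call
#           namespace: ingress-nginx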