This project is under heavy development and has no tagged releases yet.
But I'd still appreciate it if you could help me by testing it and submitting pull requests, so that you can get the first release earlier!
We already have a thorough getting-started guide, a working Helm chart, and a container image published at mumoshu/okra:canary, so it shouldn't be that hard to give it a shot.
Okra is a Kubernetes controller and a set of CRDs which provide advanced multi-cluster application rollout capabilities, such as canary deployment of clusters.
okra eases managing a lot of ephemeral Kubernetes clusters.
If you've been using ephemeral Kubernetes clusters and employed blue-green or canary deployments for zero-downtime cluster updates, you might have suffered from a lot of manual steps required. okra is intended to automate all those steps.
In a standard scenario, a system update with okra would look like the below.
- You provision one or more new clusters with cluster tags like name=web-1-v2, role=web, version=v2
- Okra auto-imports the clusters into ArgoCD
- ArgoCD ApplicationSet deploys your apps onto the new clusters
- Okra updates the loadbalancer configuration to gradually migrate traffic to the new clusters, while running various checks to ensure application availability
ToC:
- How it works
- Getting Started
- Comparison with Flagger and Argo Rollouts
- CRDs
- CLI
- Why is it named "okra"?
okra (currently) integrates with AWS ALB and target groups for traffic management, and with CloudWatch Metrics and Datadog for canary analysis.
okra currently works on AWS only, but its design and implementation are generic enough to support more IaaSes. Any contribution around that is welcome.
Here's the list of possible additional IaaSes that the original author (@mumoshu) has thought of:
- Cluster API
- GKE
Here's the list of possible additional loadbalancers:
- AWS NLB
- Envoy
- Istio Ingress Gateway
- ingress-nginx
Okra manages cells for you. A cell can be compared to a few things.
A cell is like a Kubernetes pod of containers. A Kubernetes pod is an isolated set of containers, where each container usually runs a single application, and you can have two or more pods for availability and scalability. An Okra cell is a set of Kubernetes clusters, where each cluster runs your application and you can have two or more clusters behind a loadbalancer for horizontal scalability beyond the limit of a single cluster.
A cell is like a storage array but for Kubernetes clusters. You hot-swap a disk in a storage array while running. Similarly, with okra you hot-swap a cluster in a cell while keeping your application up and running.
Okra's cell-controller is responsible for managing the traffic shift across clusters.
You give each Cell a set of settings to discover AWS target groups and to configure loadbalancers and metrics.
The controller periodically discovers AWS target groups. Once there are enough new target groups, it compares them with the target groups currently associated with the loadbalancer. If there's any difference, it starts updating the ALB while checking various metrics for a safe rollout.
Okra uses Kubernetes CRDs and custom resources as a state store and uses the standard Kubernetes API to interact with resources.
Okra calls various AWS APIs to create and update AWS target groups and update AWS ALB and NLB forward config for traffic management.
Unlike Argo Rollouts and Flagger, in Okra there are no notions of "active" and "preview" services for a blue-green deployment, or "canary" and "stable" services for a canary deployment.
It assumes there are one or more target groups per cell. A cell basically does a canary deployment, where the old set of target groups is considered "stable" and the new set of target groups is considered "canary".
In Flagger or Argo Rollouts, you need to update its K8s resource to trigger a new rollout. In Okra you don't need to do so. You preconfigure its resource and Okra auto-starts a rollout once it discovers enough new target groups.
okra updates your Cell.
An okra Cell is composed of target groups and an AWS loadbalancer, and a set of metrics for canary analysis.
Each target group is tied to a cluster, where a cluster is a Kubernetes cluster that runs your container workloads.
An application is deployed onto clusters by ArgoCD. The traffic to the application is routed via an AWS ALB in front of clusters.
okra acts as an application traffic migrator.
It detects new target groups, and live-migrates traffic by hot-swapping old target groups serving the affected applications with the new target groups, while keeping the applications up and running.
- Install Okra
- Create Load Balancer
- Provision Kubernetes Clusters
- Deploy Applications onto Clusters
- Register Target Groups
- Create Cell
- Create and Rollout New Clusters
- Analyses and Experiments
First, you need to provision a Kubernetes cluster that is running ArgoCD, Argo Rollouts, and the ArgoCD ApplicationSet controller.
We call it the management cluster in the following guide.
To deploy required components onto the management cluster, use the following snippet:
# 1. Install ArgoCD and ApplicationSet
# https://argocd-applicationset.readthedocs.io/en/stable/Getting-Started/#b-install-applicationset-and-argo-cd-together
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj-labs/applicationset/v0.3.0/manifests/install-with-argo-cd.yaml
# 2. Install Argo Rollouts
# https://argoproj.github.io/argo-rollouts/installation/
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
Once your management cluster is up and running, install okra on it using Helm or Kustomize.
Option 1: Helm (the release name okra below is only an example):
$ helm upgrade --install okra charts/okra -f values.yaml
Option 2: Kustomize:
$ kustomize build config/manager | kubectl apply -f -
You can set okra's container image tag to anything that is available on https://hub.docker.com/r/mumoshu/okra/tags.
For Helm, you do it like helm upgrade --install okra charts/okra --set image.tag=$TAG.
Note that you need to provide AWS credentials to okra, as it calls various AWS APIs to list and describe EKS clusters, generate Kubernetes API tokens, and interact with loadbalancers.
For Helm, the simplest (but not recommended in production) way would be to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
values.yaml:
region: ap-northeast-1
image:
  tag: "canary"
additionalEnv:
- name: AWS_ACCESS_KEY_ID
  value: "..."
- name: AWS_SECRET_ACCESS_KEY
  value: "..."
For production environments, you'd better use IAM roles for service accounts for security reasons.
Create a loadbalancer in front of all the clusters you're going to manage with Okra.
Currently, only AWS Application LoadBalancer is supported.
You can use Terraform, AWS CDK, Pulumi, AWS Console, AWS CLI, or whatever tool to create the loadbalancer.
The only requirement to use that with Okra is to take note of "ALB Listener ARN", which is used to tell Okra which loadbalancer to use for traffic management.
Once okra is ready and you see no errors, add one or more EKS clusters to your AWS account.
- Tag your EKS clusters with Service=demo, as we use it to let okra auto-import those as ArgoCD cluster secrets.
- Create one or more target groups per EKS cluster and take note of the target group ARNs
Do either of the below to register clusters to ArgoCD and Okra:
- Run argocd cluster add on the new cluster and either (1) create a new ArgoCD Application custom resource per cluster or (2) let an ArgoCD ApplicationSet custom resource auto-deploy onto the clusters
- Use Okra's ClusterSet to auto-import EKS clusters to ArgoCD and use ApplicationSet to auto-deploy
See "Argocd cluster add - Argo CD - Declarative GitOps CD for Kubernetes" for more information on the argocd cluster add command.
Also see argoproj-labs/applicationset for more information on ArgoCD ApplicationSet and the controller.
Assuming your Okra instance has access to AWS EKS and STS APIs, you can use Okra's ClusterSet custom resources to auto-discover EKS clusters and create corresponding ArgoCD cluster secrets.
This, in combination with ArgoCD ApplicationSet, enables you to auto-deploy your applications onto any newly created EKS clusters, without ever touching ArgoCD or Okra at all.
The following ClusterSet auto-discovers AWS EKS clusters tagged with Service=demo and creates corresponding ArgoCD cluster secrets.
apiVersion: okra.mumo.co/v1alpha1
kind: ClusterSet
metadata:
name: cell1
spec:
generators:
- awseks:
selector:
matchTags:
Service: demo
template:
metadata:
labels:
service: demo
Note that template.metadata.labels.service instructs cluster secrets to get metadata.labels of service: demo, so that AWSTargetGroupSet can discover those clusters by labels.
Let's say you had an EKS cluster that looks like the below:
$ aws eks describe-cluster --name cdk1
{
"cluster": {
"name": "cdk1",
"arn": "arn:aws:eks:REGION:ACCOUNT:cluster/cdk1",
"createdAt": "2021-09-20T03:21:44.391000+00:00",
"version": "1.21",
"endpoint": "https://SOME_CLUSTER_ID.SOME_SHARD_ID.REGION.eks.amazonaws.com",
"roleArn": "arn:aws:iam::ACCOUNT:role/NAME",
"resourcesVpcConfig": {
"subnetIds": [
"subnet-aaa",
"subnet-bbb",
"subnet-ccc"
],
"securityGroupIds": [
"sg-ddd"
],
"clusterSecurityGroupId": "sg-eee",
"vpcId": "vpc-fff",
"endpointPublicAccess": true,
"endpointPrivateAccess": true,
"publicAccessCidrs": [
"0.0.0.0/0"
]
},
"kubernetesNetworkConfig": {
"serviceIpv4Cidr": "172.20.0.0/16"
},
"logging": {
"clusterLogging": [
{
"types": [
"api",
"audit",
"authenticator",
"controllerManager",
"scheduler"
],
"enabled": false
}
]
},
"identity": {
"oidc": {
...
}
},
"status": "ACTIVE",
"certificateAuthority": {
...
},
"platformVersion": "eks.2",
"tags": {
"Service": "demo"
}
}
}
Note that Okra is able to find this EKS cluster because:
- This cluster has tags of "Service": "demo", while
- The ClusterSet created above has generators[].awseks.selector.matchTags of Service: demo
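The tag matching described above can be sketched in a few lines of Python. This is only an illustration under the assumption that matchTags means "every listed tag must be present with exactly the same value"; the function name matches_tags and the second cluster are hypothetical, and Okra's actual implementation may differ.

```python
def matches_tags(cluster_tags: dict, match_tags: dict) -> bool:
    """Return True when every selector tag exists on the cluster with the same value.

    Assumption: matchTags uses exact, case-sensitive, AND semantics.
    """
    return all(cluster_tags.get(key) == value for key, value in match_tags.items())


# "cdk1" mirrors the describe-cluster output above; "other" is a made-up cluster.
clusters = [
    {"name": "cdk1", "tags": {"Service": "demo"}},
    {"name": "other", "tags": {"Service": "billing"}},
]
selector = {"Service": "demo"}  # generators[].awseks.selector.matchTags

selected = [c["name"] for c in clusters if matches_tags(c["tags"], selector)]
print(selected)
```

Only cdk1 is selected here, because it is the only cluster carrying the Service=demo tag.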
An ArgoCD cluster secret created by the above ClusterSet should look like the below, which is a regular ArgoCD cluster secret with the specified labels.
apiVersion: v1
kind: Secret
metadata:
name: cdk1
namespace: default
labels:
argocd.argoproj.io/secret-type: cluster
service: demo
type: Opaque
data:
config: <BASE64 ENCODED CONFIG JSON>
name: <BASE64 ENCODED CLUSTER NAME>
server: <BASE64 ENCODED HTTPS URL OF K8S API ENDPOINT>
Again, note that this cluster secret got metadata.labels of service: demo, because ClusterSet had template.metadata.labels.service.
Okra works by gradually updating target group weights behind a loadbalancer. In order to do so, you first need to tell Okra which target groups to manage, by creating an AWSTargetGroup custom resource on your management cluster per target group.
An AWSTargetGroup custom resource is basically a target group ARN with a version number and labels.
apiVersion: okra.mumo.co/v1alpha1
kind: AWSTargetGroup
metadata:
name: default-web1
labels:
role: web
okra.mumo.co/version: 1.0.0
spec:
# Replace REGION, ACCOUNT, NAME, and ID with the actual values
arn: arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/NAME/ID
Assuming you've already created ArgoCD cluster secrets for your clusters, Okra's AWSTargetGroupSet can be used to auto-discover target groups associated with the clusters and register those as AWSTargetGroup resources.
The following AWSTargetGroupSet auto-discovers TargetGroupBinding resources labeled with role=web from clusters labeled with service=demo, to create corresponding AWSTargetGroup resources in the management cluster.
apiVersion: okra.mumo.co/v1alpha1
kind: AWSTargetGroupSet
metadata:
name: cell1
namespace: default
spec:
generators:
- awseks:
bindingSelector:
matchLabels:
role: web
clusterSelector:
matchLabels:
service: demo
template:
metadata: {}
Let's say you had the below TargetGroupBinding custom resource labeled with role: web in the new cluster labeled with service: demo:
# In the new cluster
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
name: web1
namespace: default
labels:
role: web
okra.mumo.co/version: 1.0.0
spec:
targetGroupARN: arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/NAME/ID
Okra is able to find the cluster thanks to clusterSelector.matchLabels.service=demo and is also able to find this target group binding thanks to bindingSelector.matchLabels.role=web.
The outcome is that Okra creates the below AWSTargetGroup in the management cluster. Note that its metadata.name is derived from the original TargetGroupBinding's metadata.namespace and metadata.name, concatenated with - in between.
# In the management cluster
apiVersion: okra.mumo.co/v1alpha1
kind: AWSTargetGroup
metadata:
name: default-web1
labels:
role: web
okra.mumo.co/version: 1.0.0
spec:
# Replace REGION, ACCOUNT, NAME, and ID with the actual values
arn: arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/NAME/ID
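The discovery and naming rule above can be sketched as follows. This is a simplified model, not Okra's code: clusters and bindings are represented as plain, flattened dictionaries (a hypothetical shape chosen for this example), and selectors are assumed to use exact-match AND semantics.

```python
def labels_match(labels: dict, selector: dict) -> bool:
    """Assumed matchLabels semantics: every selector label must match exactly."""
    return all(labels.get(k) == v for k, v in selector.items())


def generate_target_groups(clusters, cluster_selector, binding_selector):
    """Build AWSTargetGroup-like dicts from bindings found in selected clusters."""
    generated = []
    for cluster in clusters:
        if not labels_match(cluster["labels"], cluster_selector):
            continue
        for binding in cluster["bindings"]:
            if not labels_match(binding["labels"], binding_selector):
                continue
            generated.append({
                "apiVersion": "okra.mumo.co/v1alpha1",
                "kind": "AWSTargetGroup",
                "metadata": {
                    # <namespace>-<name> of the original TargetGroupBinding
                    "name": f'{binding["namespace"]}-{binding["name"]}',
                    "labels": dict(binding["labels"]),
                },
                "spec": {"arn": binding["arn"]},
            })
    return generated


# Hypothetical, flattened view of one cluster and its single binding.
clusters = [{
    "labels": {"service": "demo"},
    "bindings": [{
        "namespace": "default",
        "name": "web1",
        "labels": {"role": "web", "okra.mumo.co/version": "1.0.0"},
        "arn": "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/NAME/ID",
    }],
}]

result = generate_target_groups(clusters, {"service": "demo"}, {"role": "web"})
print(result[0]["metadata"]["name"])
```

The generated name is default-web1, matching the AWSTargetGroup shown above.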
The label role: web is later used by Cell to detect it as a candidate for a canary target group, and the okra.mumo.co/version: 1.0.0 label is used to group and sort all the detected target groups to finally see which set of target groups is considered a part of the next canary.
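To make the "group and sort" idea concrete, here is a minimal Python sketch. It assumes versions are plain dotted integers such as 1.0.0 and compares them numerically; how Okra actually parses and orders versions is not specified here, and the helper names and the sample target groups are hypothetical.

```python
from collections import defaultdict

VERSION_LABEL = "okra.mumo.co/version"


def parse_version(version: str) -> tuple:
    """Assumption: versions are dotted integers, e.g. "1.10.0" -> (1, 10, 0)."""
    return tuple(int(part) for part in version.split("."))


def latest_version_groups(target_groups):
    """Group target groups by version label and return the newest version's members."""
    by_version = defaultdict(list)
    for tg in target_groups:
        by_version[tg["labels"][VERSION_LABEL]].append(tg["name"])
    latest = max(by_version, key=parse_version)
    return latest, sorted(by_version[latest])


# Made-up AWSTargetGroup metadata for illustration.
target_groups = [
    {"name": "default-web1", "labels": {"role": "web", VERSION_LABEL: "1.0.0"}},
    {"name": "default-web2", "labels": {"role": "web", VERSION_LABEL: "1.10.0"}},
    {"name": "default-web3", "labels": {"role": "web", VERSION_LABEL: "1.9.0"}},
    {"name": "default-web4", "labels": {"role": "web", VERSION_LABEL: "1.10.0"}},
]

latest, members = latest_version_groups(target_groups)
print(latest, members)
```

Comparing the parsed tuples rather than the raw strings is what makes 1.10.0 sort after 1.9.0 in this sketch.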
Finally, create a Cell resource.
It specifies how it utilizes an existing AWS ALB in Spec.Ingress.AWSApplicationLoadBalancer, which listener rule is to be used for the rollout, and the information to detect target groups that serve your application.
An example Cell custom resource follows.
On each reconciliation loop, Okra looks for AWSTargetGroup resources labeled with role=web, and groups those by the version numbers saved under the okra.mumo.co/version labels.
As Spec.Replicas is set to 2, it waits until the 2 latest target groups appear, and starts a canary rollout only after that.
If your application is not that big and a single cluster suffices, you can safely set replicas: 1 or omit replicas altogether.
apiVersion: okra.mumo.co/v1alpha1
kind: Cell
metadata:
name: cell1
spec:
ingress:
type: AWSApplicationLoadBalancer
awsApplicationLoadBalancer:
listener:
rule:
forward: {}
hosts:
- example.com
priority: 10
listenerARN: arn:aws:elasticloadbalancing:ap-northeast-1:ACCOUNT:listener/app/...
targetGroupSelector:
matchLabels:
role: web
replicas: 2
updateStrategy:
canary:
steps:
- setWeight: 20
- analysis:
args:
- name: service-name
value: exampleapp
templates:
- templateName: success-rate
- pause:
duration: 5s
- setWeight: 40
type: Canary
The listener rule part is required in order to configure your ALB.
listener:
rule:
forward: {}
hosts:
- example.com
priority: 10
And this directly corresponds to the configuration of an ALB Listener Rule.
priority: 10 is the priority of the listener rule to be added to your ALB listener and hosts: [example.com] is the condition associated with the rule.
ALB supports both default and non-default listener rules. Every non-default listener rule requires a priority and non-empty rule conditions.
Okra is designed to not modify the default rule as it can be disruptive sometimes. That's why it requires priority
and a rule condition.
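As a rough illustration of that correspondence, the sketch below turns the listener rule settings into the parameter shape accepted by the ELBv2 CreateRule API (the keyword arguments of boto3's elbv2 create_rule). It only builds the dictionary and does not call AWS. The function name, the target group ARNs, and the 80/20 weights are hypothetical, and how Okra itself performs this translation is an assumption.

```python
def build_rule_params(listener_arn: str, rule: dict, weights: dict) -> dict:
    """Map priority/hosts and per-target-group weights to CreateRule parameters."""
    return {
        "ListenerArn": listener_arn,
        "Priority": rule["priority"],
        "Conditions": [
            # hosts: [...] becomes a host-header condition
            {"Field": "host-header", "HostHeaderConfig": {"Values": list(rule["hosts"])}},
        ],
        "Actions": [
            {
                "Type": "forward",
                "ForwardConfig": {
                    "TargetGroups": [
                        {"TargetGroupArn": arn, "Weight": weight}
                        for arn, weight in weights.items()
                    ],
                },
            },
        ],
    }


params = build_rule_params(
    "arn:aws:elasticloadbalancing:ap-northeast-1:ACCOUNT:listener/app/...",
    {"priority": 10, "hosts": ["example.com"]},
    {"arn:example:stable": 80, "arn:example:canary": 20},  # placeholder ARNs
)
print(params["Priority"], params["Conditions"][0]["HostHeaderConfig"]["Values"])
```

A non-default rule built this way always has both a priority and at least one condition, which is the requirement described above.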
Other rule conditions like headers, methods, pathPatterns, and so on are also supported. See the output of kubectl explain awsapplicationloadbalancerconfig.spec.listener --recursive for all the available conditions.
spec.updateStrategy.canary.steps contains a definition of canary rollout steps.
Each step can be any of the following:
- setWeight: updates the canary target groups' total weight to the given value. For example, when there's only one canary target group to be rolled out and it's setWeight: 20, the canary target group gets a weight of 20. If there were two canary target groups, each gets a weight of 10.
- analysis: runs an Argo Rollouts AnalysisRun with the given arguments. See Argo Rollouts' documentation on Analysis for more information on AnalysisRun and its template, AnalysisTemplate.
- pause: pauses the rollout for the duration.
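The setWeight arithmetic can be sketched as below. The even split across canary target groups follows the description above; giving the remaining 100 - setWeight to the stable target groups, and dropping any remainder through integer division, are assumptions made for this example rather than documented Okra behavior. All names are hypothetical.

```python
def split_weights(canary: list, stable: list, set_weight: int) -> dict:
    """Distribute set_weight evenly over canary groups and the rest over stable groups."""
    weights = {}
    for name in canary:
        weights[name] = set_weight // len(canary)
    for name in stable:
        # Assumption: stable groups share whatever the canary does not take.
        weights[name] = (100 - set_weight) // len(stable)
    return weights


one_canary = split_weights(["canary-a"], ["stable-a"], 20)
two_canaries = split_weights(["canary-a", "canary-b"], ["stable-a", "stable-b"], 20)
print(one_canary)
print(two_canaries)
```

With setWeight: 20, a single canary target group gets 20, while two canary target groups get 10 each, as in the example above.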
Now you're all set!
Every time you provision new clusters with a greater version number, Cell automatically discovers new target groups associated with the new clusters and gradually updates loadbalancer target group weights while running various analyses.
Need a Kubernetes version upgrade? Create new Kubernetes clusters with the new Kubernetes version and watch Cell automatically and safely roll out the clusters.
Need a host OS upgrade? Create new clusters with nodes with the new version of the host OS and watch Cell roll out the new clusters.
And you can do the same on every kind of cluster-wide change! Enjoy running your ephemeral Kubernetes clusters.
Okra provides almost the same features as Argo Rollouts' Analyses and Experiments.
A major difference between Argo Rollouts' and Okra's is that Okra's provides access to the status.desiredVersion field in an analysis and experiment query argument, so that you can analyze and experiment on your canary based on metrics specific to the version of your clusters.
In the below example, we have an analysis step that creates an analysis run from an analysis template named success-rate, whose argument cluster-version is set to the value obtained from cell.status.desiredVersion.
The desiredVersion status field contains the desired version number (obtained from e.g. EKS cluster tags and AWS target group tags) of the cluster replicas being rolled out, so that you can analyze based on metrics specific to the newly rolled out clusters.
A typical cell in which one of the canary steps is an analysis would look like the below. Notice the fieldPath: status.desiredVersion used to dynamically generate the cluster-version analysis run argument.
apiVersion: okra.mumo.co/v1alpha1
kind: Cell
metadata:
name: web
spec:
updateStrategy:
type: Canary
canary:
steps:
# ...
- analysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: guestbook-svc.default.svc.cluster.local
- name: cluster-version
valueFrom:
fieldRef:
fieldPath: status.desiredVersion
Similarly, an experiment step can include a fieldPath to have a dynamically generated argument:
apiVersion: okra.mumo.co/v1alpha1
kind: Cell
metadata:
name: web
spec:
updateStrategy:
type: Canary
canary:
steps:
# ...
- experiment:
duration: 5m
templates:
- name: wy
# references the wy replicaset defined below
specRef: wy
# This should default to 1 as defined by Argo Rollouts but
# the author observed that it doesn't work in practice.
#
replicas: 1
analyses:
- name: success-rate-dd
templateName: success-rate-dd
args:
- name: service-name
value: wy-serve
- name: cluster-version
valueFrom:
fieldRef:
fieldPath: status.desiredVersion
---
apiVersion: apps/v1
kind: ReplicaSet
metadata:
labels:
app: wy
name: wy
spec:
replicas: 0
selector:
matchLabels:
app: wy
template:
metadata:
creationTimestamp: null
labels:
app: wy
spec:
containers:
- image: mumoshu/wy:latest
name: wy
ports:
- containerPort: 8080
resources: {}
args:
- repeat
- get
- -forever
- -interval=5s
- -url=http://localhost:8080
- -argocd-cluster-secret=cdk1
- -service=wy-serve
- -remote-port=8080
- -local-port=8080
envFrom:
- secretRef:
name: wy
optional: true
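How a fieldPath such as status.desiredVersion might be turned into a concrete argument value can be sketched like this. It is a simplified model under the assumption that the path is resolved by walking dot-separated keys of the Cell object; the function name resolve_args and the sample version 1.0.1 are hypothetical.

```python
def resolve_args(args: list, cell: dict) -> dict:
    """Resolve literal values and valueFrom.fieldRef.fieldPath references."""
    resolved = {}
    for arg in args:
        if "value" in arg:
            resolved[arg["name"]] = arg["value"]
            continue
        node = cell
        # Walk the Cell object one dot-separated key at a time.
        for key in arg["valueFrom"]["fieldRef"]["fieldPath"].split("."):
            node = node[key]
        resolved[arg["name"]] = node
    return resolved


# A made-up Cell whose status reports the version being rolled out.
cell = {"metadata": {"name": "web"}, "status": {"desiredVersion": "1.0.1"}}
args = [
    {"name": "service-name", "value": "guestbook-svc.default.svc.cluster.local"},
    {"name": "cluster-version",
     "valueFrom": {"fieldRef": {"fieldPath": "status.desiredVersion"}}},
]

resolved = resolve_args(args, cell)
print(resolved)
```

The cluster-version argument ends up carrying whatever version the Cell is currently rolling out, so the analysis query can be scoped to the new clusters.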
As explained earlier, Okra relies on Argo Rollouts' Datadog support.
It works like this: you define an Okra Cell, so that the okra controller creates either an Argo Rollouts AnalysisRun or Experiment, which in turn instructs Argo Rollouts to periodically query Datadog metrics with the "Query timeseries points" API and update the AnalysisRun's or Experiment's status to be either Successful or Failed. As the final step, the okra controller gets notified about the status update and reacts to it by reconciling the parent Cell resource, incrementing the canary step.
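The last part of that flow, reacting to a status update, can be modelled with a tiny Python function. This is a sketch only: it advances the canary step when the run is Successful and otherwise leaves the step unchanged. What Okra actually does on Failed is not described here, so the "do not advance" branch is an assumption, and the names are hypothetical.

```python
def advance(step_index: int, steps: list, phase: str) -> int:
    """Return the next canary step index given an AnalysisRun/Experiment phase."""
    if phase == "Successful" and step_index < len(steps):
        return step_index + 1
    # Assumption: any other phase (Running, Failed, ...) does not advance the rollout.
    return step_index


# Step names loosely mirror the example Cell shown earlier.
steps = ["setWeight: 20", "analysis", "pause", "setWeight: 40"]

after_success = advance(1, steps, "Successful")
after_failure = advance(1, steps, "Failed")
print(after_success, after_failure)
```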
The only part specific to Datadog is that it queries Datadog, which has been implemented in argoproj/argo-rollouts#705 in Argo Rollouts.
If you're curious how you'd instrument your app so that its metrics can be used from Okra, you'd better get started by reading e.g. Mapping Prometheus Metrics to Datadog Metrics. There's nothing specific to Okra here.
Before authoring a complex Cell spec including Analysis and Experiment, the author recommends that you try browsing the Datadog dashboard, or use a simpler tool like curl to query metrics.
After you've done so, start tinkering with Okra, so that when it breaks you can be extra sure when and where it broke!
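If you prefer a script over curl for that sanity check, the following Python sketch prepares a request against Datadog's "Query timeseries points" endpoint (GET /api/v1/query) using only the standard library. The metric query string is a made-up example, the keys are placeholders, and the helper name is hypothetical; adjust the site host if your Datadog account lives in another region.

```python
import time
import urllib.parse
import urllib.request


def build_datadog_query(query, api_key, app_key, window_seconds=300,
                        now=None, site="api.datadoghq.com"):
    """Prepare (but do not send) a timeseries query for the last window_seconds."""
    now = int(time.time()) if now is None else now
    params = urllib.parse.urlencode(
        {"from": now - window_seconds, "to": now, "query": query}
    )
    return urllib.request.Request(
        f"https://{site}/api/v1/query?{params}",
        headers={"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key},
    )


# Hypothetical metric and tag names; replace them with your own.
query = "sum:myapp.requests.success{cluster_version:1.0.1}.as_count()"
request = build_datadog_query(query, "API_KEY", "APP_KEY", now=1000)
print(request.full_url)

# To actually run the query, uncomment the following lines:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode())
```

Running the same query by hand first makes it easier to tell a wrong metric name apart from a problem in the Cell spec.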
It is intended to be deployed onto a "control-plane" cluster where you usually deploy applications like ArgoCD.
It requires you to use:
- NLB or ALB to load-balance traffic "across" clusters
  - You bring your own LB and Listener, and tell okra the Listener ID, the number of Target Groups per Cell, and a label to group target groups by version.
- ArgoCD ApplicationSets to deploy your applications onto cluster(s)
In the future, it may add support for using Route 53 Weighted Routing instead of ALB.
Although we assume you use ApplicationSet for app deployments, it isn't really a strict requirement. Okra doesn't communicate with ArgoCD or ApplicationSet. All Okra does is discover EKS clusters, create and label target groups for the discovered clusters, and roll out the target groups. You can just bring your own tool to deploy apps onto the clusters today.
It supports complex configurations like below:
- One or more clusters per cell, or an ALB listener rule. Imagine a case that you need a pair of clusters to serve your service. okra is able to canary-deploy the pair of clusters, by periodically updating two target group weights as a whole.
The following situations are handled by Okra:
- When there are enough "new" target groups, Okra gradually updates target group weights for a rollout
- Okra automatically falls back to "old" target groups when there are only old target groups in the AWS account while the ALB points to "new" target groups that disappeared
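The two situations above can be expressed as a small decision function. This is a sketch under several assumptions: "enough" means at least replicas target groups sharing one version, versions are dotted integers, and a fallback picks the newest version that still has enough target groups. The function and variable names are hypothetical and do not come from Okra's code.

```python
from collections import defaultdict


def _version_key(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))


def decide(available: dict, current: set, replicas: int):
    """available: target group name -> version; current: names the ALB forwards to."""
    by_version = defaultdict(set)
    for name, version in available.items():
        by_version[version].add(name)
    complete = [v for v, names in by_version.items() if len(names) >= replicas]
    if not complete:
        return "noop", set(current)
    desired = by_version[max(complete, key=_version_key)]
    if not set(current) <= set(available):
        # The ALB points at target groups that no longer exist.
        return "fallback", desired
    if desired != set(current):
        return "rollout", desired
    return "noop", desired


rollout = decide({"a1": "1.0.0", "a2": "1.0.0", "b1": "1.1.0", "b2": "1.1.0"}, {"a1", "a2"}, 2)
fallback = decide({"a1": "1.0.0", "a2": "1.0.0"}, {"b1", "b2"}, 2)
waiting = decide({"a1": "1.0.0", "a2": "1.0.0", "b1": "1.1.0"}, {"a1", "a2"}, 2)
for action, names in (rollout, fallback, waiting):
    print(action, sorted(names))
```

The third case shows the "not enough yet" state: with only one target group at 1.1.0 and replicas set to 2, nothing changes until the second one appears.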
Okra provides several Kubernetes CustomResourceDefinitions (CRDs) to achieve its goal.
See crd.md for more documentation and details of each CRD.
okra provides 3 executables.
- okrad: the Kubernetes controller manager that consists of various Kubernetes controllers for Okra CRDs. Intended to be run in a Kubernetes cluster.
- okractl: A kubectl-like CLI application that is for interacting with okrad through the Kubernetes API server. Intended to be run on your machine or on a CI system for automation.
- okra: the standalone CLI application that does its best to provide every single piece of logic implemented in okrad's controllers. Intended to be run in CI to replicate okrad's functionality on a CI system, or to test each okra functionality in isolation.
The standard and author's recommended usage of Okra involves okrad and okractl.
For okra, we do our best to expose every single okrad + okractl functionality via respective okra CLI commands, so that you can test each functionality in isolation.
It may even be possible to build your own CI job that replaces okra out of those commands!
See CLI for more information and its usage.
Okra is inspired by various open-source projects listed below.
- ArgoCD is a continuous deployment system that embraces GitOps to sync desired state stored in Git with the Kubernetes cluster's state. okra integrates with ArgoCD and especially its ApplicationSet controller for application deployments. okra relies on the ArgoCD ApplicationSet controller's Cluster Generator feature.
- Flagger and Argo Rollouts enable canary deployments of apps running across pods. okra enables canary deployments of clusters running on IaaS.
- argocd-clusterset auto-discovers EKS clusters and turns those into ArgoCD cluster secrets. okra does the same with its ClusterSet CRD and argocdcluster-controller.
- terraform-provider-eksctl's courier_alb resource enables canary deployments on target groups behind AWS ALB with metrics analysis for Datadog and CloudWatch metrics. okra does the same with its AWSApplicationLoadBalancerConfig CRD and awsapplicationloadbalancerconfig-controller.
Initially it was named kubearray, but the original author wanted something more catchy and pretty.
In the beginning of this project, the author thought that hot-swapping a cluster while keeping your apps running looks like hot-swapping a drive while keeping a server running.
We tend to call a cluster of storage drives where each drive can be hot-swapped a "storage array", hence calling a tool to build a cluster of clusters where each cluster can be hot-swapped "kubearray" seemed like a good idea.
Later, he searched over the Internet for a prettier and catchier alternative. While browsing a list of cool Japanese terms with 3 syllables, he encountered "okra". "Okra" is a pod vegetable full of edible seeds. The term is relatively unique in that it sounds almost the same in both Japanese and English. The author thought that "okra" can be a good metaphor for a cluster of sub-clusters when each seed in an okra is compared to a sub-cluster.