Headless Kubernetes meets truly native GitOps. A set of Kubernetes controllers to configure/provision Kubernetes controllers to be as loosely coupled as human operators, with eventual consistency in every sense of the term.
- Triśaṅku - Fork the Heavens
- Content
- Why
- What
- How
- Take It For A Spin
- Next
Kubernetes controllers are typically control-loops that reconcile some specification
captured in a Kubernetes object with the corresponding downstream state.
The Kubernetes objects, which capture both the specifications and the status of some object the particular controller is responsible for,
act as a point of coordination between the specification, the status and the action required to reconcile the two.
Typically, the source of specification is external to the particular controller and the controller is responsible
for keeping the status up-to-date and taking actions to reconcile the status with the specification.
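The control-loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `WorkloadObject`, its fields and the in-memory `downstream` list are hypothetical stand-ins, not Kubernetes types or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadObject:
    """Hypothetical object that holds both the specification and the status."""
    spec_replicas: int                               # desired state, set by an external source
    status_replicas: int = 0                         # last observed state, set by the controller
    downstream: list = field(default_factory=list)   # stand-in for the real downstream system


def reconcile(obj: WorkloadObject) -> bool:
    """Run one pass of the control loop; return True if an action was taken."""
    # Observe the downstream state and record it in the status.
    obj.status_replicas = len(obj.downstream)
    if obj.status_replicas < obj.spec_replicas:
        obj.downstream.append(f"replica-{obj.status_replicas}")  # scale up by one
    elif obj.status_replicas > obj.spec_replicas:
        obj.downstream.pop()                                     # scale down by one
    else:
        return False  # desired and actual states already agree
    obj.status_replicas = len(obj.downstream)
    return True


obj = WorkloadObject(spec_replicas=3)
while reconcile(obj):
    pass  # keep reconciling until the status matches the specification
```

Changing `obj.spec_replicas` later and running the loop again converges the downstream state to the new specification, one small action at a time.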
The Kubernetes objects are documents hosted on the kube-apiserver, which in turn relies on storage components such as etcd.
This architecture is designed to enable eventual consistency.
Kubernetes controllers bring automated specialised components close to the way human specialists typically work in the real world. The operator pattern is explicitly modeled after human operators. But there remains a difference between the human operators and Kubernetes controllers.
The way human operators act and coordinate amongst one another in the real world is eventually consistent (the real world is the ultimate eventually consistent system).
The way Kubernetes controllers act could be called eventually consistent in the sense that it treats the continual divergence and convergence of the desired and the actual states of the system as normal.
But it manages the consensus regarding the last seen desired and actual states in a central system, namely, the kube-apiserver (or possibly some extension server).
So, while the individual controllers may be eventually consistent, the coordination between them is
strongly consistent, or at least rigidly structured.
- With this centralised design for consensus, the individual controllers can at best idle around if they do not have access to the kube-apiserver.
- Also, because of this centralisation of consensus, any caching done by the controllers will have to be a write-through cache. I.e., the caches can be stale (do not have the latest data from the kube-apiserver) but not dirty (have changes not yet sent to the kube-apiserver).
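The stale-but-never-dirty constraint of a write-through cache can be illustrated with a toy model. The `CentralStore` and `WriteThroughCache` classes below are hypothetical and do not correspond to the kube-apiserver or client library APIs.

```python
class CentralStore:
    """Hypothetical stand-in for the central point of consensus."""

    def __init__(self):
        self.data = {}
        self.reachable = True

    def put(self, key, value):
        if not self.reachable:
            raise ConnectionError("central store unreachable")
        self.data[key] = value

    def get(self, key):
        if not self.reachable:
            raise ConnectionError("central store unreachable")
        return self.data.get(key)


class WriteThroughCache:
    def __init__(self, store):
        self.store = store
        self.local = {}

    def write(self, key, value):
        # The write goes to the central store first; the local copy is updated
        # only on success, so the cache never holds an unsent change (never dirty).
        self.store.put(key, value)
        self.local[key] = value

    def read(self, key):
        # Reads are served locally and may lag behind the central store (stale).
        return self.local.get(key)

    def refresh(self, key):
        self.local[key] = self.store.get(key)


server = CentralStore()
cache = WriteThroughCache(server)
cache.write("replicas", 1)
server.put("replicas", 2)              # another writer updates the central store
stale_value = cache.read("replicas")   # the cache still holds the old value
cache.refresh("replicas")              # now the cache has caught up
```

If `server.reachable` is set to `False`, `cache.write` raises and the local copy stays untouched, which is the "idle around" behaviour described in the first point above.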
It is interesting to think about ways to eliminate this remaining difference between the way human operators and controllers coordinate, if only as a thought experiment. But it is quite likely that there are real-world scenarios that might benefit from an eventually consistent model for coordination between specialised controllers.
Once upon a time, Triśaṅku, a king, wanted to ascend to the heavens while still in his mortal body. When he requested the sages to perform a yajña to make this happen, he was cursed and disgraced by them, including the royal preceptor, sage Vasiṣṭha, saying that only entering the heavens after shedding the mortal body was dharma and doing so before that was not. Sage Viśvāmitra, however, agreed to perform the yajña. As the yajña proceeded, Triśaṅku started ascending to the heavens, but he was kicked out at the gates by Indra, the king of Gods. Upon seeing Triśaṅku fall from the heavens head first, Viśvāmitra vowed to fulfil his promise to Triśaṅku and proceeded to create an alternative heavens in the southern sky (where Triśaṅku had fallen) and install Triśaṅku as the rival Indra there. On a panicked Indra's pleading, Bṛhaspati, the preceptor of Gods, managed to convince Viśvāmitra to abandon this project lest the universe fall into chaos, but on the condition that the nascent alternative heavens around the upside-down Triśaṅku can remain in the southern sky. The modern-day constellation Crux (also known as the Southern Cross) forms a part of this abandoned alternative heavens.
Viśvāmitra's pull request (Triśaṅku) to the heavens was rejected by Indra. So, Viśvāmitra forked the heavens. On pleading from Indra, Bṛhaspati persuaded Viśvāmitra to stop working on the fork further, but not without the existing changes from the fork being merged as a proof of concept for an alternative implementation.
Even the consensus in the heavens was negotiated!
The problem of eventual consistency for a general class of specialist human operators, namely, computer programmers,
has been solved quite interestingly and successfully by Git.
Git enables individual computer programmers to work independently (on their own local clones) at their own pace
and coordinate amongst one another (by pulling relevant changes) as and when required in a way they find convenient and productive.
Git does not mandate any particular structure for the coordination-flow or for managing consensus; any network of coordination or consensus, of any degree of simplicity or complexity, is supported. This enables groups of programmers not only to experiment with different coordination-flows or consensus management and home in on the flow that works best for them, but also to let a suitable modularity emerge for the solution to the problem they are trying to solve.
This treats the problem of consensus the same way the Kubernetes control-loop treats the problem of the desired and the actual state diverging. This is because consensus also is a kind of state. While the desired state of consensus may be that every part of the system sees the same consensus, the actual state might be that different parts of the system see different consensus at any given point in time and in the principle of control-loop, action is continually performed to converge the diverging consensus in the system.
This treats the continual divergence and convergence of the desired and the actual state (including the consensus) as normal.
- With this eventual consistency approach for consensus-management, the programmers can act even without access to a central consensus. Any action that diverges from an eventual consensus will be rectified by follow-up actions when the consensus emerges.
- The local Git clones can be, and often are, both stale (do not have the latest changes from the other programmers) and dirty (have changes not yet seen by the other programmers).
This maps closest to the way eventual consistency works in the real world.
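The difference between "stale" and "dirty" in the sense used here can be made concrete with a toy model in which a repository is just a set of commit ids. The `Clone` class and the commit names are hypothetical and are not a description of Git internals.

```python
class Clone:
    """Toy model of a repository as a set of commit ids."""

    def __init__(self, commits=()):
        self.commits = set(commits)

    def commit(self, commit_id):
        self.commits.add(commit_id)

    def is_stale(self, other):
        # Stale: the other side has commits this clone has not seen yet.
        return bool(other.commits - self.commits)

    def is_dirty(self, other):
        # Dirty: this clone has commits the other side has not seen yet.
        return bool(self.commits - other.commits)

    def pull(self, other):
        # Pulling merges the other side's history into this clone.
        self.commits |= other.commits


upstream = Clone({"c1"})
alice = Clone(upstream.commits)
bob = Clone(upstream.commits)

alice.commit("a1")      # alice works independently
bob.commit("b1")        # so does bob
upstream.pull(bob)      # bob's change reaches upstream first

# alice is now both stale (missing b1) and dirty (holding a1).
stale_and_dirty = alice.is_stale(upstream) and alice.is_dirty(upstream)

alice.pull(upstream)    # alice catches up whenever convenient
upstream.pull(alice)    # and her change eventually reaches upstream
```

After the last two pulls, `alice` and `upstream` hold the same set of commits, while `bob` remains stale until he pulls again. Nothing forced either of them to wait for the other.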
If Git can be used as the storage layer for the specification and status, the controllers can coordinate amongst themselves (and with human operators) in the same way computer programmers do, provided there is some additional support for setting up and automating a flexible network of coordination.
But it is not possible to implement this with binary-compatibility for existing Kubernetes controllers, though it would be possible with source-compatibility.
The Kubernetes apiserver implementation has a storage backend abstraction.
But making use of it to use Git as the storage backend would involve making changes to the kube-apiserver source code.
The Kubernetes apiserver uses etcd as the default storage backend.
An etcd shim that uses Git for storage can enable binary-compatibility even with kube-apiserver
while coordinating Kubernetes controllers using Git.
Gitcd is an etcd shim that uses a Git repository for storage.
The gitcd serve command serves a Git repository as an etcd shim (and continually pulls from an upstream branch if configured).
The gitcd pull command continually merges the changes from a local branch to an upstream branch.
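To give a feel for how a key-value interface can sit on top of a file/folder structure (which a Git repository can then version), here is a minimal sketch. It is an assumption-laden illustration, not Gitcd's actual layout or code: the `FileTreeKV` class is hypothetical, it ignores revisions, leases and watches, and it treats a prefix as a directory rather than as an arbitrary string prefix the way etcd does.

```python
import tempfile
from pathlib import Path


class FileTreeKV:
    """Toy key-value store that keeps each key as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        # Keys such as "/registry/pods/default/web" map naturally to file paths.
        return self.root / key.lstrip("/")

    def put(self, key: str, value: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def get(self, key: str):
        path = self._path(key)
        return path.read_bytes() if path.is_file() else None

    def range(self, prefix: str) -> dict:
        # A prefix read becomes a walk of the corresponding directory.
        base = self._path(prefix)
        if not base.is_dir():
            return {}
        return {
            "/" + p.relative_to(self.root).as_posix(): p.read_bytes()
            for p in sorted(base.rglob("*"))
            if p.is_file()
        }

    def delete(self, key: str) -> bool:
        path = self._path(key)
        if path.is_file():
            path.unlink()
            return True
        return False


store = FileTreeKV(tempfile.mkdtemp())
store.put("/registry/pods/default/web", b'{"kind":"Pod"}')
store.put("/registry/pods/default/db", b'{"kind":"Pod"}')
pods = store.range("/registry/pods")
```

One limitation of this naive mapping is that a key cannot be both a value and a prefix of another key (for example `/a` and `/a/b`), since a path cannot be a file and a directory at once.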
The TrishankuHeaven controller helps declaratively set up a TrishankuHeaven in a Kubernetes Pod,
i.e., coordination using Git like human computer programmers instead of a centralised Kubernetes control-plane,
for existing Kubernetes controllers with full binary-compatibility.
The host for the pod could be any Kubernetes cluster that has the
network connectivity that the target controller (and possibly the Git-based coordination) requires.
The TrishankuHeaven is a Kubernetes custom resource,
which captures, in its specification section,
the PodTemplate for the Kubernetes controller along with the required Git configuration to be used for coordination.
The TrishankuHeaven controller then acts on this object to create and maintain a Deployment for the controller with the specified PodTemplate,
but enhanced with additional containers (initial and normal) to act as a binary-compatible triśaṅku heaven,
so that the controller can continue to work with its own local sidecar kube-apiserver, with the other gitcd sidecar containers helping with the coordination with the other controllers via Git.
The additional init-containers help prepare the Git repository to be used as a backend for gitcd.
- The container git-pre initialises or clones the Git repo if necessary.
- The container gitcd-init initialises the gitcd data and metadata branches in the Git repository.
The additional containers help create a local Kubernetes environment for the target controller which is backed by the Git repository.
- The container gitcd acts as an etcd shim which is backed by the Git repository.
- The container kube-apiserver uses the gitcd container as the storage backend and acts as a local Kubernetes control-plane for the target controller. The target controller is configured to talk to this local kube-apiserver instead of the cluster's kube-apiserver.
- The container events-etcd hosts a single-member etcd cluster for the high-traffic and somewhat transient event objects, so that the gitcd instance is not overwhelmed.
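The way a pod template gets enhanced with the init-containers and sidecar containers listed above can be sketched as a pure function over a dictionary. The container names come from the description above; everything else (the function name, the `example.invalid` image references, the omission of arguments, volumes and ports) is a hypothetical simplification and not the controller's actual implementation.

```python
import copy

INIT_CONTAINERS = ["git-pre", "gitcd-init"]
SIDECAR_CONTAINERS = ["gitcd", "kube-apiserver", "events-etcd"]


def add_heaven_containers(pod_template: dict, image_for: dict) -> dict:
    """Return a copy of pod_template with the additional containers appended."""
    result = copy.deepcopy(pod_template)  # leave the user's template untouched
    spec = result.setdefault("spec", {})
    spec.setdefault("initContainers", []).extend(
        {"name": name, "image": image_for[name]} for name in INIT_CONTAINERS
    )
    spec.setdefault("containers", []).extend(
        {"name": name, "image": image_for[name]} for name in SIDECAR_CONTAINERS
    )
    return result


# Hypothetical template for a target controller and placeholder images.
template = {
    "spec": {
        "containers": [
            {"name": "kube-scheduler", "image": "example.invalid/kube-scheduler"}
        ]
    }
}
images = {
    name: f"example.invalid/{name}"
    for name in INIT_CONTAINERS + SIDECAR_CONTAINERS
}
augmented = add_heaven_containers(template, images)
container_names = [c["name"] for c in augmented["spec"]["containers"]]
```

The target controller's own container stays first and unchanged; only the surrounding environment is added around it, which is what allows binary-compatibility.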
This way, existing Kubernetes controllers can coordinate amongst one another while working independently without the need for a central control-plane.
Since Git does not mandate any particular way to structure the coordination, it is possible to set up any suitable structure, for example, via an upstream Git repository as shown above.
In this scenario, an additional controller AutomatedMerge is used to automate
the merging of the changes from the controller into the upstream repository.
This controller generates a separate Deployment for this purpose.
This approach provides maximal flexibility in designing the network of coordination-flow between the controllers
(from a central upstream branch to completely decentralised branches pulling changes from one another).
The gitcd container in the Deployment generated by the TrishankuHeaven controller already fetches and merges the changes continually from the main
data and metadata branches into the controller
data and metadata branches and pushes to the upstream Git repository.
This completes the circle of coordination.
This approach for coordinating Kubernetes controllers without the need for a central control-plane creates the possibility of a fully decentralised Kubernetes cluster where each component/controller works independently while coordinating amongst one another via Git in such a way that the phenomenon of a Kubernetes cluster emerges even without a central control-plane. That is, the phenomenon of a Kubernetes cluster emerges from weakly interacting autonomous controllers. Perhaps such a fully decentralised Kubernetes cluster could be called a headless Kubernetes cluster. This could potentially form the basis for achieving the functionality of autonomous Kubernetes clusters.
The individual steps of setting up such a headless cluster can be seen here.
A simplified sequence diagram of the same steps can be seen below.
- As already noted above, the part about merging and pushing controller branches into the upstream coordination branch is now redesigned to be managed via the AutomatedMerge resource, which maintains a dedicated Deployment resource for this purpose.
- The above setup assumes network connectivity between the headless control-plane and the headless worker nodes if it is required. Ideally, each headless component (control-plane and worker nodes) needs network access only to the upstream Git repository apart from what it needs to perform its normal duties.
- The above setup leaves out the details of setting up the headless virtual machine and making sure that it joins as a node of the headless cluster. Ideally, this should also be automated declaratively in a control-loop along the lines of gardener/machine-controller-manager, which also can be hosted as another headless control-plane controller in the bootstrap cluster.
The above example used a host Kubernetes cluster to host the headless control-plane for the headless cluster. Alternatively, two headless clusters could be configured to host the headless control-planes of each other (or three headless clusters hosting the control-planes of one another in closed sequence as seen in this proposal). The high level steps for this can be as below.
- Set up a blue headless cluster using a bootstrap Kubernetes cluster to host its headless control-plane.
- Set up a green headless cluster using the blue headless cluster to host its headless control-plane.
- Prepare the green headless cluster to host a new replica of the blue headless control-plane.
- Scale down the original blue headless control-plane replica in the bootstrap cluster to avoid racing with the new replica to be set up next in the green headless cluster.
- Set up the new replica of the blue headless control-plane in the green headless cluster by pointing the corresponding trishankuheavens to the blue upstream Git repository.
- In these depictions, the headless controllers are shown simplistically communicating with the upstream Git repositories (instead of a central apiserver), omitting the details of the sidecar containers or separate deployments that make such communication happen.
- As mentioned above, this setup leaves out control-loop automation of provisioning headless nodes for either of the headless clusters. So, this setup is not self-healing if either of the headless nodes is lost. This can be remedied by setting up something like gardener/machine-controller-manager as a headless control-plane component to provision and manage the headless nodes.
  - The sample setup shows how the gardener/machine-controller-manager can be set up in such a headless Kubernetes cluster. More work is needed to configure the TrishankuHeaven for the kubelet running in the provisioned machines so that it can join the headless cluster as a node.
- This setup also ignores the complications involved in transitioning the control-plane of the blue headless cluster from the bootstrap cluster to the green headless cluster. Such a transition is eased considerably by the fact that merely pointing the blue headless cluster's control-plane to the same upstream Git repo solves the data migration problem. Leader-election or other such mechanisms would be required to avoid/mitigate two replicas of the blue headless cluster (one each in the bootstrap and green clusters) racing with each other.
- A GitHub account with a personal access token with permission to create private repos and push and pull from them.
- A GCP account with a service account with access to create compute instances.
- A Kubernetes cluster with access to GitHub and GCP API endpoints.
The secret git-cred should contain the personal access token details for access to the GitHub account to create a private repo and push and pull from it.
The secret should have the following information.
- data.url: The base URL for the GitHub account.
- data.username: The username for the GitHub account.
- data.password: The GitHub personal access token.
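As an illustration, a manifest for such a secret could be generated as below. This is a sketch under assumptions: the function name and the example values are hypothetical, the exact form expected for the base URL should be checked against the sample YAML files, and the namespace is left to the current context. The only firm point it relies on is that values under `data` in a Kubernetes Secret are base64-encoded.

```python
import base64
import json


def git_cred_secret(url: str, username: str, token: str) -> dict:
    """Build a Secret manifest named git-cred with the three expected keys."""

    def b64(text: str) -> str:
        # Values under `data` in a Secret must be base64-encoded strings.
        return base64.b64encode(text.encode("utf-8")).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "git-cred"},
        "type": "Opaque",
        "data": {
            "url": b64(url),
            "username": b64(username),
            "password": b64(token),
        },
    }


# Placeholder values; substitute the real account details.
manifest = git_cred_secret("https://github.com", "example-user", "example-token")
manifest_json = json.dumps(manifest, indent=2)
```

Since kubectl accepts JSON as well as YAML, the contents of `manifest_json` can be written to a file and applied with `kubectl apply -f`. Avoid committing the generated file anywhere, as base64 is an encoding and not encryption.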
The secret gcp-cred should contain the service account JSON for the GCP account where the compute instance will be created by the MCM of the headless cluster.
The secret should have the following information.
- data.serviceAccountJSON: The GCP service account JSON with access to create compute instances.
- data.userData: The user data to be passed in the GCP compute instance creation request. Ideally, this should be configured to set up the kubelet in the compute instance to join the headless cluster as a node. The work for such configuration is pending. For now, an empty user data such as #cloud-config would do. With this, the compute instance will be provisioned but it will not join the headless cluster as a node.
The container images for the trishanku components such as gitcd, heaven (or for that matter the gardener components like gardener/machine-controller-manager and gardener/machine-controller-manager-provider-gcp)
are not available publicly.
So, the images might have to be built from source and pushed to a suitable container image registry by customising the image variable specified in the corresponding Makefile in these projects before running make docker-build or make docker-image and then pushing the docker image.
If the selected container image registry requires access permission to pull images from, such access permissions need to be specified in a secret gar-cred so that the sidecar containers created for the trishankuheavens can pull these images.
The sample YAML files create a private GitHub repo sample-k8s under the GitHub organisation trishanku-org.
If this is to be customised, the changes have to be done consistently in all the YAML files.
Since all the controllers in the headless cluster, including the gardener/machine-controller-manager controllers,
use the above-mentioned private repo as the coordination point for their individual Git repos,
the secret gcp-cred gets copied into this private repo.
So, it is important to keep even a customised repo as a private repo (and to delete the repo after this exercise) to avoid leaking GCP credentials.
Please make sure to clone the trishanku heaven repo and to cd into the cloned directory before proceeding.
Also, please make sure that the KUBECONFIG is pointing to the chosen Kubernetes cluster.
make run &
Alternatively, please run the trishanku heaven as a controller in a pod in the cluster with the required permissions.
$ kubectl apply -f config/samples
automatedmerge.controllers.trishanku.org.trishanku.org/sample-k8s-kcm created
githubrepository.controllers.trishanku.org.trishanku.org/sample-k8s created
trishankuheaven.controllers.trishanku.org.trishanku.org/sample-k8s created
trishankuheaven.controllers.trishanku.org.trishanku.org/sample-k8s-kcm created
trishankuheaven.controllers.trishanku.org.trishanku.org/sample-k8s-kube-scheduler created
configmap/sample-k8s-kube-scheduler created
serviceaccount/sample-k8s-mcm-crds created
role.rbac.authorization.k8s.io/sample-k8s-mcm-crds created
rolebinding.rbac.authorization.k8s.io/sample-k8s-mcm-crds created
job.batch/sample-k8s-mcm-crds created
configmap/sample-k8s-mcm-crds-entrypoint created
trishankuheaven.controllers.trishanku.org.trishanku.org/sample-k8s-mcm created
trishankuheaven.controllers.trishanku.org.trishanku.org/sample-k8s-mc created
$ watch kubectl get pods
NAME READY STATUS RESTARTS AGE
sample-k8s-kcm-automerge-554c8fff8f-8kgwv 1/1 Running 0 2m9s
sample-k8s-kcm-heaven-7d4f66667f-w9x79 4/4 Running 0 2m15s
sample-k8s-kube-scheduler-heaven-b6459f8d7-jlbfc 4/4 Running 0 32s
sample-k8s-mc-heaven-7ffc78d454-lr7q2 4/4 Running 0 2m15s
sample-k8s-mcm-crds-swl6m 0/1 Completed 2 2m16s
sample-k8s-mcm-heaven-7499c45fff-mng2x 4/4 Running 0 2m15s
This might take a couple of minutes.
# Please make sure to authenticate gcloud to the right GCP account and to customise the region and zone correctly.
$ gcloud compute instances list
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
test-machine asia-south1-a n1-standard-1 10.160.0.6 34.93.143.120 RUNNING
$ kubectl delete -f config/samples
automatedmerge.controllers.trishanku.org.trishanku.org "sample-k8s-kcm" deleted
githubrepository.controllers.trishanku.org.trishanku.org "sample-k8s" deleted
trishankuheaven.controllers.trishanku.org.trishanku.org "sample-k8s" deleted
trishankuheaven.controllers.trishanku.org.trishanku.org "sample-k8s-kcm" deleted
trishankuheaven.controllers.trishanku.org.trishanku.org "sample-k8s-kube-scheduler" deleted
configmap "sample-k8s-kube-scheduler" deleted
serviceaccount "sample-k8s-mcm-crds" deleted
role.rbac.authorization.k8s.io "sample-k8s-mcm-crds" deleted
rolebinding.rbac.authorization.k8s.io "sample-k8s-mcm-crds" deleted
job.batch "sample-k8s-mcm-crds" deleted
configmap "sample-k8s-mcm-crds-entrypoint" deleted
trishankuheaven.controllers.trishanku.org.trishanku.org "sample-k8s-mcm" deleted
trishankuheaven.controllers.trishanku.org.trishanku.org "sample-k8s-mc" deleted
The steps for deprovisioning the compute instance created by the gardener/machine-controller-manager are yet to be updated in this documentation.
For the time being, please deprovision the compute instance via the gcloud command.
# Please make sure to authenticate gcloud to the right GCP account and to customise the region and zone correctly.
$ gcloud compute instances delete test-machine --zone asia-south1-a
The following instances will be deleted. Any attached disks configured to be auto-deleted will be deleted unless they are
attached to any other instances or the `--keep-disks` flag is given and specifies them for keeping. Deleting a disk is
irreversible and any data on the disk will be lost.
- [test-machine] in [asia-south1-a]
Do you want to continue (Y/n)? Y
Deleted [https://www.googleapis.com/compute/v1/projects/trishanku/zones/asia-south1-a/instances/test-machine].
This project is a proof of concept.
As noted above, the sample setup sets up a headless Kubernetes cluster with the kube-controller-manager, the kube-scheduler and the controllers of the
gardener/machine-controller-manager to run independently while coordinating with one another only by communicating changes via Git.
- Work is pending to configure a TrishankuHeaven for the kubelet inside the machines provisioned by the gardener/machine-controller-manager to help it join and participate in the headless cluster as a node.
- Also, a lot more work is required to make it efficient and productive.
The above sample uses a private GitHub repo as a point of coordination amongst the controllers of a headless cluster.
- It is possible to do such coordination with a locally accessible Git repo, but work is pending to document this.
- Work is also pending to support other Git-hosting platforms.
The kube-apiserver currently stores resources defined by CustomResourceDefinition as a single-line JSON string in the storage.
This could be a problem for the normal merge-conflict resolution mechanism of Git, which is designed to detect conflicts at
the granularity of lines and not in individual fields of unformatted, deeply structured text like JSON.
- Work is pending to support automation of resolving merge-conflict in such cases optimally by enabling JSON-aware conflict detection and resolution.
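To illustrate what JSON-aware conflict detection could look like, here is a minimal sketch of a three-way merge that works at the granularity of fields. It is not the project's implementation: `merge3` and `MergeConflict` are hypothetical names, lists are treated as atomic values, and a real resolver would need a policy for genuine conflicts.

```python
_MISSING = object()  # marks a field that is absent on one side


class MergeConflict(Exception):
    """Raised with the path of the field that both sides changed differently."""


def merge3(base, ours, theirs, path=""):
    """Three-way merge of JSON values, detecting conflicts per field."""
    if ours == theirs:
        return ours       # both sides agree (or neither changed anything)
    if ours == base:
        return theirs     # only their side changed
    if theirs == base:
        return ours       # only our side changed
    if all(isinstance(v, dict) for v in (base, ours, theirs)):
        merged = {}
        for key in sorted(set(base) | set(ours) | set(theirs)):
            value = merge3(
                base.get(key, _MISSING),
                ours.get(key, _MISSING),
                theirs.get(key, _MISSING),
                f"{path}/{key}",
            )
            if value is not _MISSING:   # a deleted field stays deleted
                merged[key] = value
        return merged
    raise MergeConflict(path or "/")


base = {"spec": {"replicas": 1, "image": "app:v1"}, "status": {"ready": 0}}
ours = {"spec": {"replicas": 3, "image": "app:v1"}, "status": {"ready": 0}}
theirs = {"spec": {"replicas": 1, "image": "app:v1"}, "status": {"ready": 1}}

# Serialised on a single line, both edits touch the same (only) line and would
# conflict in a line-based merge; at field granularity they merge cleanly.
merged = merge3(base, ours, theirs)
```

If both sides change the same field to different values, for example `spec.replicas` to 3 and 5, `merge3` raises `MergeConflict` with the path `/spec/replicas`, which narrows the conflict down to the one field that needs a decision.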
There are different possible applications for such an approach of loosely coordinating independent controllers. Please reach out at @AmshumanKR (Twitter) or here in the GitHub issues if interested in collaborating.
Git was picked for this project because it enables unlimited forking and multi-way merging with unconstrained conflict resolution. But the inefficiency of using Git as a database is obvious (though seeing the history of change made by the individual controller can be quite interesting in itself and might even have some diagnostic value). Some of the inefficiency is mitigated by the fact that each fork of the Git repo serves only a single controller and the eventually consistent coordination naturally lends itself to subdividing the problem space (and hence, the data space) to any suitable granularity.
However, the real inefficiency lies in Gitcd using a file/folder structure as a key-value store and not so much in Git being used to version track such a file/folder structure. In principle, this can be remedied by using a more conventional database which supports eventually consistent coordination with unlimited forks and multi-way merging the way Git does. Some candidates are as follows.