Always run a changing system - The CAOS approach


The habit of being lazy

We at CAOS are lazy people. We like to focus on building our Cloud Native IAM ZITADEL and simultaneously eliminate the pains we came to know throughout our individual careers. In my definition, being lazy is “having the time to be able to focus on things I find more interesting” (credits to Stefan Bechtold @bechte, that sentence really had an impact on me). To me, that means things like writing this blog post, so I hope it is interesting to you.

Being lazy is not the same as being irresponsible; it is about getting rid of boring habits. Once you find yourself doing the same thing for the fifth time, wouldn’t you say it gets "boring"? Automate it and invest your time in things more valuable than human repetition. (As a musician, I strongly believe in improvement through practice, repetition and focus on “skill”, but in IT, things work differently.) Isn’t it irresponsible to always do things in the same manner without questioning them?

“If you always do what you always did, you’ll always get what you always got.”
Albert Einstein

Most of you have been through certain software development and software delivery processes and habits in your past roles. To name a few:

  • a fixed release date (let's say 4 times a year)
  • different release stream branches (for example, one known as master)
  • updates of dependencies (software/subsystems like databases, web servers, etc)
  • the unplanned but important system update
  • the quarterly operations maintenance window for operating system updates
  • configuration updates done by the ops department
  • the yearly High Availability test
  • release “installation/update” manuals
  • ...

I chose the examples above for a good reason: they have the potential for mistakes and, even more often, the potential to over-complicate things.

How we “think” software and operations

What I want to share with you today is our current approach to building our software and deploying it anywhere we need it, along with some of the things we have learned along the way.

  • accept the fact that there WILL BE mistakes
  • don’t blame others for making mistakes; every mistake that doesn’t happen in production is a good one, because somebody learned something
  • create an environment that makes mistakes visible as early as possible
  • make all technical decisions at least with a Pull/Merge Request to ensure a minimum of 2 people that have agreed
  • KEEP THINGS SIMPLE

And of course we do have some (technical) arrangements:

  • no direct commits to main/master
  • no manual build steps
  • version control drives automatic semantic versioning based on commit messages
  • deploy fast
  • deploy often
  • deploy automatically (who wants to repeat the same installation instructions each time anyway?)
  • treat infrastructure as a replaceable element, don’t create vendor lock-in
  • deliver our complete stack as fast and as easy as possible
  • TRANSPARENCY in each configuration, piece of code and the way we build/deploy software

Obviously, we didn’t sit down at the first official CAOS meeting and create tons of rules, definitions of done, system guidelines, code formatting guidelines and so on. The fun of the startup spirit is that you just DO and see what happens. Sure, we made mistakes; some of them were very painful, some just very educational. Some of them led to intensive discussions, some to a couple of beers. What I want to emphasize is that this is (and hopefully always will be) an evolving process. We learned that the current solution can be the best for today, but we shouldn’t take that for granted, as it can change tomorrow when our (new) goals demand it.

Our infrastructure journey

Containers and their orchestrator

I am a former OPS guy, so of course I will always start with the infrastructure, sorry coders. We are Kubernetes nerds. We like containers and we are happy to ship them as often as possible. Never heard of Kubernetes? Read about it at kubernetes.io, I promise you it IS worth your time. To me it always feels like having Aladdin's lamp. First, you think carefully about what should happen (if you had just one wish) and how, then you rub the lamp, place your wish and abracadabra… things happen magically. That is basically what Kubernetes does with containers when you place your wish in YAML files; you just need to be very precise in what you ask for.
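To make the “wish” a bit more concrete, here is a minimal sketch of such a YAML file: a plain Kubernetes Deployment. The names, image and port are illustrative placeholders, not taken from our actual manifests.

```yaml
# A minimal "wish": keep three copies of a container running at all times.
# Names, image and port are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3                  # the precise part of the wish: always run three pods
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:v1.0.0   # hypothetical image
          ports:
            - containerPort: 8080                          # hypothetical port
```

Apply it with kubectl apply -f and Kubernetes keeps rubbing the lamp for you: if a pod dies, a new one is started so the cluster matches the described state again.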

The downside: it is really easy to learn how to rub a lamp and place a wish, but learning how to control a complex container orchestration platform takes time. Accept that. Things behave differently than on a classic VM infrastructure or an old-school HA cluster. As with every new technology, you need to get used to it. It took us some years to find the way we use it now, and we are still learning each and every day.

To explain it a bit more technically: we abstract the underlying compute power such as virtual machines, datacenter machines, even computers (“the larger ones”, Jen Barber/The IT Crowd) and put a container runtime on them (docker/containerd/cri-o). We then use Kubernetes to control that container runtime and orchestrate/manipulate our containers in the way we want.

The infrastructure itself has its configuration files in its own git repository and is therefore version tracked and reproducible. That makes life easy for us, as we are able to trace each change back to a single commit.

There are many options to get a Kubernetes cluster. They vary in cost, speed of delivery and reliability. Google, AWS, Azure, Rancher, OpenShift, etc. would be the “old kids on the block”. We started with a managed cluster based on Google's GKE: a few simple clicks and we were up and running with our (self-)scaling infrastructure.

There you have it: logging, monitoring, scaling, load balancing, storage, backups, fully managed out of the box. Voilà!

Customer needs

Pretty soon we found out that customers would like to run our (or any) software within their own data centers to have control over their data and the compute power.

This is where things get interesting. It was not really a big deal to orchestrate our software on a local Kubernetes (or any reseller) cluster. But what about observability, traces, load balancers, backups, etc.? We relied on Google’s cloud infrastructure for monitoring, logging and dynamic load balancers (which is pretty good), and that convenience doesn't apply to an on-premise setup.

There it was, a developer's classic: the make-or-buy decision. Our primary operations goal is to deliver everything you need to run our software, fast and easily.

We sat down, discussed and decided to ship our software with a complete monitoring stack. Prometheus, Loki and Grafana were our weapons of choice, as they are cloud native, easy to use and pluggable/scalable.

That left us with the need for load balancing, storage and some other subsystems. Call it “classical operations stuff”. Most datacenter and customer IT departments were able to provide us with one, the other or both, but that had a massive impact on our “deliver everything in a short timeframe” strategy.

The birth of ORBOS

With various providers, customers and an evolving IAM, we regularly faced the same issues with infrastructure. Most of the time it was that “one little bit” that was missing and held us back. It is like building a complete house, painting it and lighting the oven to warm it up, but not having a key to get in, because someone still needs to drive by and deliver it to you.

From a business owner's perspective, the promise to deliver fast is binding once you have declared it. They don't care about technical details such as “there is no load balancer IP address available”. Therefore we chose a “ship everything if needed” approach to fit into the individual environments of our customers.

That foremost included ourselves, as we needed infrastructure too. Our ZITADEL development team wanted to deploy the software on a fast and reliable system, without having to take care of the underlying infrastructure, rebuilding database indexes, backup, restore, etc.

With tools like kubeadm and a couple of virtual machines, we were capable of deploying a Kubernetes infrastructure pretty fast. This is when the development and operations aspects were put into a melting pot and heated up by our discussions.

The output: we want to be independent!

We thought of infrastructure with the mindset of a product owner. If there were an “infrastructure owner”, what would their demands be?

Some examples:

  • a specific operating system
  • always up to date packages
  • installation and configuration of specific tools
    • Loadbalancer
    • Floating IP
    • Webserver/API Gateway
    • Firewall
  • transparent configuration management
  • up to date certificates
  • encrypted secrets
  • a reliable monitoring stack
  • etc.

We took our time and several iterations to evaluate and test a set of tools, focusing on things like:

  • keep things as simple as possible (that is the hard bit)
  • everything needs to be automated 100%
  • no central component
  • it needs to be as transparent as possible

Once we knew what our desired output was and we knew how to get there, we:
automated, tested, automated, destroyed, automated, tested, automated, reviewed,.....

What had started as a solution to provide ourselves and our customers with an out-of-the-box setup of an automated Kubernetes infrastructure soon showed its potential to become a product of its own.

ORBOS

As we profit from Kubernetes' strong open source community, we decided to open source our ideas and give something back to the community. That is how it should work; we strongly believe in the open source culture.

Powered by GitOps, Golang and containers, we managed to build our “maintenance” ecosystem for Kubernetes clusters. ORBOS not only installs infrastructure as code, it manages it. As a stubborn OPS guy, I know it is easy to build an infrastructure; the more expensive bit is maintaining it. Of course there is a trade-off between “simple things” and “fully automated” magic, and it is up to you to decide which pain you prefer: failing logic (traceable via a git commit or logfile), or a missing user/password combination in a Confluence document.

Sometimes it is a monster, but 99% of the time it does all the things we don't want to take care of ourselves on a regular basis.

Codebase and building software

Let's start with the elephant(s) in the room:
“but it works on my machine.”
“my frontend mock is fine, it has to do with the backend”
“the build pipeline does not work, but my local build works fine”
“what do I need to do to set up the project?”
“dev environment is broken, what do we need to do to get our data back in?”
-- A. Developer

Phrases like these are classics in software development. Most of us have heard or said them once in a while. We wanted to collaborate efficiently, so we tried to take care of them.

Build ecosystem

We use Golang for our backend applications, while the customer-facing frontend is written in Angular. That is already two ecosystems and two sets of build logic for a single product.

We use Docker and Kubernetes for the orchestration of our software and of course it has to be configured somehow.

Our first fundamental decision was to use Docker itself to build our software. That automatically had a very positive impact on our development process:

  • our builds are completely independent of the underlying build system (GitHub, GitLab, Jenkins, etc.)
  • local builds behave EXACTLY the same as the builds on our build server/pipeline
  • developers can test the software binaries with docker compose, and they behave the same way as in development or production environments (see the compose sketch after this list)
  • a new developer just needs to run a docker build command to get ZITADEL up and running
  • multi-stage builds give developers the freedom to build just the frontend or the backend within the same docker build
  • the pipeline itself is under version control and gets tested every time a docker build is performed
  • Docker caching makes the build process pretty fast
  • docker builds run on every operating system and build server
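To illustrate the docker compose part mentioned above, a minimal compose file for local testing could look like the sketch below. The build target, service name and port are assumptions for illustration, not our actual setup.

```yaml
# docker-compose.yml: a minimal sketch for testing a locally built image.
# Build target, service name and port are hypothetical placeholders.
version: "3.8"
services:
  app:
    build:
      context: .            # reuses the exact same multi-stage docker build as the pipeline
      target: final         # hypothetical name of the last build stage
    ports:
      - "8080:8080"         # hypothetical application port
```

Running docker compose up then builds and starts the same artifact the pipeline would ship.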

Dependency Management

As described above, we use Golang and Angular: ecosystems that are under heavy development and come with a lot of dependencies. We decided to manage those dependencies and update them as frequently as possible.

Did you ever try to update a half-year-old software project to the latest versions?
Good luck!

Do it once and you will either never touch your versions again or decide to update dependencies frequently. The latter is the option we chose.

GitHub offers Dependabot, a plugin that creates pull requests to keep your dependencies secure and up to date. Handling the pull requests frequently is sometimes an overhead from a developer's perspective, but it is worth the effort to stay up to date, especially from a security perspective.
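For illustration, a minimal Dependabot configuration for a Go backend and an Angular frontend could look like this; the directories and intervals are assumptions, not our actual configuration.

```yaml
# .github/dependabot.yml: a minimal sketch, not our actual configuration.
version: 2
updates:
  - package-ecosystem: "gomod"     # Go module dependencies of the backend
    directory: "/"
    schedule:
      interval: "weekly"
  - package-ecosystem: "npm"       # dependencies of the Angular frontend
    directory: "/console"          # hypothetical frontend folder
    schedule:
      interval: "weekly"
```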

Release Management

With our builds and the infrastructure in place, we focused on the release process.
Let's recap what we have:

  • a self-maintaining platform
  • transparent containerized pipelines
  • up to date packages and code
  • “ready to deploy” containers and their configurations

A mandatory step is to run a code/security scan as part of every build. There are a lot of solutions out there; we went for CodeQL.
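As a rough sketch (assuming a GitHub Actions pipeline; this is not our actual workflow), a CodeQL scanning job can be as small as this:

```yaml
# A minimal CodeQL scanning job in GitHub Actions; illustrative only.
name: code-scanning
on: [push, pull_request]
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write          # needed to upload the scan results
      contents: read
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: go, javascript   # backend and frontend code
      - uses: github/codeql-action/analyze@v2
```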

To version our artifacts we use semantic-release, which we customized to our needs. What it basically does is create a semantic version tag based on the prefix of a commit message. It can actually do a lot more for you (see semantic release).
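A minimal configuration sketch relying on the plugin's default commit conventions (fix: leads to a patch, feat: to a minor, a BREAKING CHANGE to a major release) could look like this; our actual, customized configuration differs.

```yaml
# .releaserc.yml: a minimal semantic-release sketch, not our customized setup.
branches:
  - main
plugins:
  - "@semantic-release/commit-analyzer"         # derives the version bump from the commit message prefix
  - "@semantic-release/release-notes-generator" # turns the commit history into release notes
  - "@semantic-release/github"                  # publishes the release on GitHub
```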

As a team, we agreed on these mechanisms and there is no way around them. If you write a wrong commit message, you won't get a release. Simple, yet effective. Our branching strategy is simple as well: create a branch for your current work, then open a pull request to get it back into main/master. At least a second person handles the merge request and gives their approval. Always!

Deploy

There we are: updated sources, production-ready containerized software, Kubernetes clusters. But we still need to deploy things and glue them together, right?

Most of us have used build and deploy pipelines in our careers: one to create an artifact, the other to bring it to a certain environment. Instead of SSH-ing into each node of a cluster and manually installing a piece of software, the logic moved into pipelines that were capable of things like:

  • perform a backup
  • set a maintenance window in a monitoring system
  • inform the team that a deployment is running
  • deploy and configure the artifacts
  • perform smoke tests
  • update documentation
  • let the team/customer know that a certain release has been installed

Over the last couple of years the GitOps pattern has gained more and more popularity. The idea is to have a decentralized git repository per environment and describe the system there. A piece of logic (known as a reconciler or an operator) then watches that “ops repository” at a short interval and “does things” if something has changed in the repository, or if someone has changed something on the target system without updating the ops repository.

In our early days we started to “play around” with that pattern and have developed a feeling for it over the last couple of years. We tried many reconcilers and tools to decide what works best for us as a company.

It is obvious that the application you want to install determines how the installation has to be done. After a couple of discussions we agreed to use our own operator to install and maintain ZITADEL. It was not a “reinvent the wheel” mentality, but the reconciliation solutions we had tested always missed an important bit: “that other little piece we needed” :)
Instead of tweaking the available tools to an unhealthy extent, we decided to build an operator that matches our software 100%.

It is easy to describe the target state of a piece of software in a Kubernetes YAML fashion. But it is just as easy to end up with tons of scripts to fulfill that target state, even if they are executed automatically. I, as a perl/bash Linux type of guy, had to admit that fact. We decided to develop the operator in Golang to keep it as close as possible to the source code of the application itself. The operator then reconciles the desired state we describe in an ops repository into the environment it is running in.

The ZITADEL operator does everything we need to run and maintain a functional instance: DNS, certificates, S3 backups, installation of needed dependencies, API gateway mappings,... etc.

To wrap up: GitOps reverses the logic of a deployment. The target environment pulls its information out of an ops repository and some kind of logic ensures the desired state.

And that is what we do now: we change the versions in an ops repository, and our operator, which watches the repository, takes care of the changes.
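Purely as an illustration (the schema below is invented, it is not the actual format of ORBOS or the ZITADEL operator), a desired-state file in such an ops repository could look like this; a deployment then boils down to a commit that bumps the version field.

```yaml
# Hypothetical desired-state file in an ops repository; schema invented for illustration.
kind: ExampleApplication
spec:
  version: v1.2.0             # bumping this value in a commit is the whole "deployment"
  replicas: 3
  backups:
    bucket: example-backups   # hypothetical S3 bucket for database backups
```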

Depending on the software being described, it is no big deal to have multiple servers watching a single ops repository. Have you ever done a live migration with a single commit, duplicating the exact same infrastructure? We did, and it worked out pretty well.

Mistakes

Let’s speak about mistakes in particular. We make them; sometimes I think we invented them. They are okay. They are even good if someone learns something and they don’t break production. We invested a lot of time in our processes with a focus on mistakes.

The first thing is to ACCEPT mistakes. Don’t be afraid of mistakes. Don’t make people fear mistakes. As a team we grow with our individual successes, and that includes our mistakes.

You will make mistakes, your colleagues will make mistakes, your infrastructure provider will make mistakes, etc. The best process in the world won’t protect you from mistakes or from the unexpected behaviour of software in general. We build our daily work around the idea that basically everything could fail: Google, AWS, Azure, GitHub, the libraries and plugins we use. Kubernetes came in handy for a fully scalable, reliable infrastructure, but we had to think carefully about an infrastructure that can handle failures like:

  • a failing Kubernetes node
  • a corrupt database
  • failing code
  • a DDoS attack or unexpectedly high load
  • a complete datacenter failure
  • etc.

We accept that there are problems we have not thought of; call them unknown mistakes. But we build our processes to be as strong and reliable as possible so we can survive a mistake. Of course we hate making mistakes; each and every one is painful, some more, some less. The fact that our service has never been unavailable outside of a planned maintenance window means a lot to us. It is either a lot of luck, or solid planning with a certain amount of fault tolerance.

Lazy?

The huge amount of development time described above doesn't sound lazy to you? It is up to you to decide whether it is worth the effort to fully automate the development and delivery process of a piece of software. I wouldn't make the effort for a local Excel installation and a few sum functions in it.

In our world, we are too lazy to manually do all the little repetitive things that traditionally no developer seems to enjoy. The process of building or deploying software is the same on Monday at 7am as on Friday at 8pm (I still wouldn't deploy into a weekend without a REALLY good reason, call me stubborn).

We still have the startup mindset. We think, do, test and see what we come up with. After a review, we repeat. Our experience increases with each iteration. That is probably what agile collaboration is about: create rules where necessary and rely on trust and transparency as often as possible.

Most importantly, we decide as a team: that includes taking the fame, as well as the blame, as a team.

"always run a changing system" - the caos approach


Star our projects on GitHub.

Check out our blog if you want to learn more about our Dev/GitOps approach at CAOS.
