Config management in rust

Building a secure yaml api for kubernetes

At babylon health we have a ton of microservices running on kubernetes that are, in turn, controlled by hundreds of thousands of lines of autogenerated yaml.

So for our own sanity, we built shipcat - a standardisation tool (powered by rust-lang and serde) to control the declarative format and lifecycle of every microservice.

..but first, a bit about the problem:

Kubernetes API

Deploying services to kubernetes is no easy task. The abstraction might be nice once you’ve wrapped your head around it, but it’s a significant mental overhead for hundreds of engineers to have to understand. Try telling every engineer that they all need to hand craft their yaml for whatever they need of:

  • ConfigMap
  • Secrets
  • Deployment / ReplicaSet / Pod
  • Service
  • HorizontalPodAutoscaler
  • ServiceAccount
  • Role
  • RoleBinding
  • Ingress

and you’ll quickly realize that this does not scale. Your engineers may be able to handle it, but they shouldn’t all have to deal with this excessively verbose API, which lacks crucial validation. Instead, creating a standard gives your platform internal consistency and makes it easier to upgrade and manage.

Helm

One of the main abstraction attempts kubernetes has seen in this space is helm. A client side templating system (ignoring the bad server side part) that lets you abstract away much of the above into charts (a collection of yaml go templates) ready to be filled in with helm values; the more concise yaml that developers write directly.

Simplistic usage of helm would involve having a charts folder:

charts
└── base
    ├── Chart.yaml
    ├── templates
    │   ├── configmap.yaml
    │   ├── deployment.yaml
    │   ├── hpa.yaml
    │   ├── rbac.yaml
    │   ├── secrets.yaml
    │   ├── serviceaccount.yaml
    │   └── service.yaml
    └── values.yaml

and calling it with your substitute myvalues.yaml:

helm template charts/base myapp -f myvalues.yaml | \
    kubectl apply -lapp=myapp --prune -f -

which will garbage collect older kube resources with the myapp label, and start any necessary rolling upgrades in kubernetes.

Drawbacks

Even though you can avoid a lot of the common errors by re-using charts across apps, there’s still very little sanity on what helm values can contain. Here are some values you can pass through a helm chart to kube and still be accepted:

  • misspelled optional values (silently ignored)
  • resource requests exceeding largest node (cannot schedule nor vertically auto scale)
  • resource requests > resource limits (illogical)
  • out of date secrets (generally causing crashes)
  • missing health checks / readinessProbe (broken services can rollout)
  • images and versions that do not exist (fails to install/upgrade)

And that’s once you’ve gotten over how frustrating it can be to write helm templates in the first place.

Limitations

While validation is a fixable annoyance, a bigger observation is that these helm values files become a really interesting, but entirely accidental abstraction. These files become the canonical representation of your services, but you have no useful logic around them. You have very little validation, almost no definition of what’s allowed in there (helm lint is lackluster), and no process of standardisation. It’s hard to test sprawling automation scripts around the values files, and you do not have any sane way of evolving these charts.

Main idea: shipcat

What if we could take the general idea that developers just write simplified yaml manifests for their app, but we actually define that API instead? By defining the structs ourselves we can provide a bunch of security checking and validation on top of it, and we will have a well-defined boundary for automation / ci / dev tools.

By defining all our syntax in a library we can have cli tools for automation and executables running as kube operators using the same definitions. It effectively provides a way for us to version our platform.

It also allows us to solve the secret problem. We can extend the manifests with syntax that allows synchronising secrets from Vault at both deploy and validation time.

Disclaimer

This style of tool is not a revolutionary idea. Last kubecon pretty much everyone had their own wrappers around yaml to help with these problems. Some common examples these days are: kubecfg, ksonnet, flux, helmfile, which all try to help out in this space, but they were all missing most of the sanity we required when we started experimenting at the start of 2018.

Note that this was our first take on adding some validation around kube in a world that had a lot of external config management that didn’t plug nicely into kube. It’s heavily evolving and not general purpose as it stands. Extra parts of our validation will probably still move to a more declarative format (like OPAs) than the raw logic described herein.

Manifests

When migrating to kubernetes, the abstraction we settled on was service-level manifests:

name: webapp
image: clux/webapp-rs
version: 0.2.0
env:
  DATABASE_URL: IN_VAULT
resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 300m
    memory: 300Mi
replicaCount: 2
health:
  uri: /health
httpPort: 8000
regions:
- minikube
metadata:
  contacts:
  - name: "Eirik"
    slack: "@clux"
  team: Doves
  repo: https://github.com/clux/webapp-rs

This encapsulates the most important kube apis that developers should configure themselves, who’s responsible for it, what regions it’s deployed in, what secrets are needed (notice the IN_VAULT marker), and how resource intensive it is.
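To make the IN_VAULT marker a little more concrete, here is a minimal sketch of how such placeholders could be resolved at deploy time. It is not shipcat's actual implementation: the `resolve_secrets` function, the `lookup` closure (a stand-in for a real Vault client) and the error wording are all hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// Marker value used in the manifest above for secrets that live in Vault.
const VAULT_MARKER: &str = "IN_VAULT";

/// Replace every env var set to IN_VAULT with the value returned by `lookup`.
/// `lookup` is a hypothetical stand-in for a Vault client call, taking the
/// service name and the env var key. Fails on the first missing secret.
pub fn resolve_secrets<F>(
    service: &str,
    env: &BTreeMap<String, String>,
    lookup: F,
) -> Result<BTreeMap<String, String>, String>
where
    F: Fn(&str, &str) -> Option<String>,
{
    let mut resolved = BTreeMap::new();
    for (key, value) in env {
        let actual = if value == VAULT_MARKER {
            // Placeholder: the real value must come from the secret store.
            lookup(service, key)
                .ok_or_else(|| format!("secret {} for {} not found in vault", key, service))?
        } else {
            // Plain value: passed through untouched.
            value.clone()
        };
        resolved.insert(key.clone(), actual);
    }
    Ok(resolved)
}
```

Running the same function with a lookup that only checks for existence is one way the same manifest could be checked at validation time, which is where out of date secrets would surface before a deploy rather than after.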

Strict Syntax

Because these manifests were going to be the entry point for CI pipelines and handle platform specific validation (for medical software), we wanted maximum strictness everywhere and that includes the ability to catch errors before manifests are committed to master.

We lean heavily on serde’s customisable code generation to easily encapsulate even the most awkward kube apis - that we want our developers to take advantage of - and to auto-generate the boilerplate validation around types and spelling errors.

Here are our structs that encapsulate role-based access control for applications consuming kube apis:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(deny_unknown_fields)]
pub struct Rbac {
    /// API groups containing resources (defined below)
    pub apiGroups: Vec<AllowedApiGroups>,
    /// Resources on which to apply verbs / actions
    pub resources: Vec<AllowedResources>,
    /// Actions to be allowed
    pub verbs: Vec<AllowedVerbs>
}

#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(rename_all = "lowercase")]
pub enum AllowedApiGroups {
    #[serde(rename = "")]
    Empty,
    Extensions,
    Batch,
    #[serde(rename = "babylontech.co.uk")]
    Babylontech,
}

#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(rename_all = "lowercase")]
pub enum AllowedResources {
    Deployments,
    ReplicaSets,
    Jobs,
    CronJobs,
    Pods,
    #[serde(rename = "pods/log")]
    PodsSlashLog,
    ConfigMaps,
    Namespaces,
    HorizontalPodAutoscaler,
    Events,
    Nodes,
    RoleBindings,
    Roles,
    Secrets,
    ServiceAccounts,
    Services,
    ShipcatManifests,
    ShipcatConfigs,
}

#[derive(Serialize, Deserialize, Clone, Debug)]
#[serde(rename_all = "lowercase")]
pub enum AllowedVerbs {
    List,
    Get,
    Watch,
}

Notice the awkward empty string, which has an explicit meaning in the kube api (the core API group). Serde elegantly handles it with a variant-level rename attribute, while the container-level rename_all attribute acts as an overridable default.

In other words, serde takes care of most of our validation. It even catches spelling-errors and extraneous types/keys due to the #[serde(deny_unknown_fields)] instruction.
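For intuition about what the derive buys us, here is a rough hand-written, std-only approximation of the closed-set checking that happens for AllowedVerbs. This is not the code serde generates; the `FromStr` impl, the `parse_verbs` helper and the error wording are illustrative assumptions.

```rust
use std::str::FromStr;

#[derive(Clone, Debug, PartialEq)]
pub enum AllowedVerbs {
    List,
    Get,
    Watch,
}

impl FromStr for AllowedVerbs {
    type Err = String;

    /// Accept only the lowercase names, mirroring rename_all = "lowercase".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "list" => Ok(AllowedVerbs::List),
            "get" => Ok(AllowedVerbs::Get),
            "watch" => Ok(AllowedVerbs::Watch),
            // Anything else (typos, wrong case, verbs we do not allow) is rejected.
            other => Err(format!(
                "unknown verb `{}`, expected one of `list`, `get`, `watch`",
                other
            )),
        }
    }
}

/// Parse a whole list of verbs, failing on the first one outside the allowed set.
pub fn parse_verbs(raw: &[&str]) -> Result<Vec<AllowedVerbs>, String> {
    raw.iter().map(|v| v.parse::<AllowedVerbs>()).collect()
}
```

Writing this by hand for every enum and struct is exactly the boilerplate we avoid: with the derive, adding a new allowed verb is a one-line change to the enum.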

While this goes way beyond what you can normally do with a yaml validator, we can still do better. Every struct can implement a verify function that catches common mistakes that are clearly errors and should be caught before they are sent out to our kube clusters:

impl Rbac {
    pub fn verify(&self) -> Result<()> {
        if self.apiGroups.is_empty() {
            bail!("RBAC needs to have at least one item in apiGroups");
        }
        if self.resources.is_empty() {
            bail!("RBAC needs to have at least one item in resources");
        }
        if self.verbs.is_empty() {
            bail!("RBAC needs to have at least one item in verbs");
        }
        Ok(())
    }
}

We don’t put these functions behind a trait because we sometimes pass in context from other structures for cross-referencing validation, so the signatures differ.
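As an illustration of a verify function that needs outside context, here is a minimal sketch covering two of the mistakes listed under Drawbacks: requests above limits, and requests larger than the biggest node. The `ClusterLimits` and `CpuResources` types and the `parse_cpu_millis` helper are hypothetical, not shipcat's real structs, and the parser only handles whole cores ("2") and millicores ("100m").

```rust
/// Parse a CPU quantity such as "100m" or "2" into millicores.
/// Only these two simple forms are handled in this sketch.
pub fn parse_cpu_millis(s: &str) -> Result<u64, String> {
    let invalid = |e: std::num::ParseIntError| format!("invalid cpu `{}`: {}", s, e);
    if let Some(milli) = s.strip_suffix('m') {
        milli.parse::<u64>().map_err(invalid)
    } else {
        let cores = s.parse::<u64>().map_err(invalid)?;
        cores
            .checked_mul(1000)
            .ok_or_else(|| format!("cpu `{}` is too large", s))
    }
}

/// Hypothetical context owned by another structure (e.g. a cluster config).
pub struct ClusterLimits {
    /// CPU capacity of the largest node, in millicores.
    pub max_node_cpu_millis: u64,
}

/// Hypothetical, cpu-only slice of a manifest's resources section.
pub struct CpuResources {
    pub request: String,
    pub limit: String,
}

impl CpuResources {
    /// Takes the cluster context as an extra argument, which is why a single
    /// context-free `verify(&self)` trait signature would not fit.
    pub fn verify(&self, cluster: &ClusterLimits) -> Result<(), String> {
        let request = parse_cpu_millis(&self.request)?;
        let limit = parse_cpu_millis(&self.limit)?;
        if request > limit {
            return Err(format!(
                "cpu request {} exceeds limit {}",
                self.request, self.limit
            ));
        }
        if request > cluster.max_node_cpu_millis {
            return Err(format!(
                "cpu request {} exceeds the largest node",
                self.request
            ));
        }
        Ok(())
    }
}
```

With the values from the webapp manifest (100m requested, 300m limit) this passes for any node of at least 100 millicores, while swapping the two values is rejected before anything reaches the cluster.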

Finally, this Rbac struct is attached to our core Manifest so developers can take advantage of it by simply adding:

rbac:
- apiGroups: ["extensions"]
  resources: ["deployments"]
  verbs: ["get", "watch", "list"]

to their service manifest. That’s it. They now have a service that can watch kube Deployment objects via a generated in-cluster service account token.

In this case, this syntax is a straight translation of the kubernetes API (but leaving out the boilerplate static yaml), and this is often what we do in shipcat.

In some cases, however, we do provide our own simplifying abstractions, but this tends to be for other integrations.

All of our syntax is defined in shipcat/structs so that our developers can easily extend the syntax once we’ve reached code-review consensus.

Once a new version of shipcat is released, we bump its pin in our configuration repository with all our manifests, and the new syntax + feature becomes available to developers.

Developer / Pre-merge CI Usage

Developers can check that their manifests pass validation rules locally, or wait for pre-merge validation on CI:

shipcat validate myapp

which will run all the associated rules required by the manifest for this service.

To further check what kube yaml this is generating we have:

shipcat template myapp

which is roughly equivalent to:

shipcat values myapp | helm template charts/base

We do currently lean on helm charts for generating the complete kube yaml, but this is an implementation detail that only a handful of engineers need to touch as we follow the one chart to rule them all approach. Charts are also linted with kubeval against all services in all regions during chart upgrades.

Upgrade CI Usage

We currently provide a wrapper around the entire upgrade process with:

shipcat apply myapp

and a CI reconciliation wrapper:

shipcat cluster helm reconcile

which will wait for the helm release(s) (working around various tiller bugs to do so correctly).

This is an abstraction that is due for a change, however. The tiller dependency is being removed on our end, and meanwhile, helm 3 is rearchitecting away tiller entirely.

Conclusion

The ability to standardise, version, manage secrets, gradually introduce features to our kube clusters, and have a common point to write automation on top of has been invaluable to us. Huge thanks to rust and serde for making it this easy to do. <3

Going further

While the core part of shipcat is the standardised syntax, we also have a bunch of automation/integrations in the shipcat cli, as well as a kubernetes operator; raftcat (where we really see code-generation and generics shine). We’ll talk about these further another time.

In the meantime, feel free to dig around; shipcat is open source. There’s also our talk: Babylon Health - Leveraging Kubernetes for global scale from DoxLon2018 for a little more context.