Talk log from KubeCon LA

Notes from a week of pandemic browsing CNCF youtube

First KubeCon in a while I haven’t done anything for (didn’t even buy a ticket). This post is largely for myself, but I thought I’d make my notes public. All talks referenced were recently published on the CNCF youtube channel, and what follows is really just my notes (make of them what you will).

My interest areas this kubecon fall broadly into these categories:

  • observability related :: maintain a lot of metrics related tooling + do a lot of dev advocacy
  • community related :: am trying to donate kube-rs to cncf and grow that community
  • misc tech :: engineer likes shiny things

sorted in order of interest (grouped by category):

Observability

Using SLOs for Continuous Performance Optimizations

keptn and its evented automation system seem really good. Treats SLOs as first-class things. Higher-level abstraction than other CD systems; no need to write automation glue. Pretty new (cncf sandbox). I should try it.

Keptn Office Hours also goes into a lot of detail on this.

Evolving Prometheus for More Use Cases

Bartek on the latest news:

  • New config limits: sample_limit (samples per scrape) + body_size_limit + label_limit (number of labels) + label_name_length_limit (label name length) + target_limit (targets per scrape config). See the sketch below.
  • Configure scraping via labels/annotations, e.g. prometheus.io/scrape.
  • Exemplars with OpenMetrics format. Supported in java/golang/python. (NB: I closed my rust pr due to time constraints / lack of support)
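
A rough sketch of what those limits plus the annotation gate look like in a scrape config (values here are made up, not recommendations):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    # per-scrape safety limits (0 disables a limit)
    sample_limit: 10000            # max samples accepted per scrape
    label_limit: 30                # max number of labels per series
    label_name_length_limit: 200   # max label name length
    label_value_length_limit: 500  # max label value length
    target_limit: 500              # max targets for this scrape config
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```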

Thanos remote-read can help federated setups (via G-Research), but remote_write is more popular. You can set prometheus to only remote_write recording rule results!
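
That trick relies on recording rule outputs following the usual level:metric:operation naming convention; a minimal sketch, assuming your rules follow it (the url is a placeholder):

```yaml
remote_write:
  - url: https://thanos-receive.example.com/api/v1/receive  # placeholder endpoint
    write_relabel_configs:
      # recording rule results conventionally contain ':' in the metric name;
      # keep those and drop all raw scraped series
      - source_labels: [__name__]
        regex: ".+:.+"
        action: keep
```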

  • Prometheus Agent, based on the Grafana Agent (contributed by them): better disk usage, DaemonSet mode presumably.
  • Grafana Operator: dashboards as CRDs (can split up the configmap monorepo that normally needs sidecars)
  • prom-label-proxy: isolation; each team only sees their own metrics + resources.

Upcoming: ingestion scaling automation (HPA-scaled scraping via dynamically assigned scrape targets), and high density histograms.

What You Need to Know About OpenMetrics

prometheus + its exposition format is a de facto global standard. Now there is a big collaboration on a new standard.

Largely the same, but with some cleanups and new features:

  • counters require a _total suffix; the timestamp unit is now seconds (used to be ms)
  • added metadata (units in scrapes) and exemplar support
  • (minor breaking changes, opt in with a header)
  • push/pull considerations (you cannot emulate all of pull with push, though)
  • text format mandatory, protobuf optional
  • the python client is the reference impl (go/java too)

The prometheus conformance program (vendors need to do things to get the “Prometheus Compliant” logo) got a separate talk:

  • to use the mark (for a period of time) vendors have to sign LF paperwork
  • includes: good faith testing clauses, submitting tests to the prom team
  • monetary incentives, because they plan on iterating on the test suite quickly

eBPF Superpowers

  • cilium hubble works as a CNI and can help visualise traffic
  • falco can detect syscalls
  • pixie can show flamegraphs within containers

“observability / networking sidecars need yaml, but ebpf is kernel level.”

The linkerd people go into the limitations of ebpf as a “mesh” in a thread.

similar overview to rakyll’s eBPF in Microservices Observability, which additionally notes the distribution problem with ebpf at the end.

Understanding Service Mesh Metric Merging

How scraping works with istio (to ensure you get both app + proxy metrics), from meshday. Awkward, but ok.

Effortless Profiling on Kubernetes

kubectl flame: creates a container on the same node as the target container, with profiler binaries, sharing process ids + namespaces and the filesystem. => can use capture tools like py-spy/async-profiler to grab flamegraphs without touching running containers. It then kubectl cp’s the result out to disk and cleans up. (No rust support though.)
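
The rough shape of the trick, as I understand it: a throwaway pod pinned to the target’s node, sharing the host pid namespace, with ptrace rights. Everything below (names, pid, image) is hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flame-debug
spec:
  nodeName: node-1        # pin to the same node as the target pod
  hostPID: true           # makes the target's processes visible
  restartPolicy: Never
  containers:
    - name: profiler
      image: example.com/py-spy:latest   # hypothetical image bundling py-spy
      securityContext:
        capabilities:
          add: ["SYS_PTRACE"]            # required to attach to another process
      # sample the target process for 30s and write a flamegraph
      command: ["py-spy", "record", "--pid", "12345", "--duration", "30", "-o", "/tmp/flame.svg"]
```

Then kubectl cp the svg out and delete the pod; essentially what kubectl flame automates.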

Might be obsolete / rewritten with ephemeralContainers (no need to find the node and grab the pid/ns/fs stuff). prodfiler does something similar as a service.

Misc Tech

Leveraging WebAssembly to Write Kubernetes Admission Policies

Kubewarden! Rust dynamic admission controller using kube-rs with WASM. No DSL. OCI registry to publish policies. Runs all of them through the policy server.

  • Tracing support into policy wasms!
  • CRD now for policies: module (oci path) + rbac rules + constraints (sketch below).
  • can wasmify OPA policies via opa build -t wasm
  • testing: kwctl run -e gatekeeper --settings-json '{...}' --request-path some.json gatekeeper/policy.wasm
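
A minimal sketch of the policy CRD as I remember it from the docs (the module reference and version tag are placeholders):

```yaml
apiVersion: policies.kubewarden.io/v1alpha2
kind: ClusterAdmissionPolicy
metadata:
  name: privileged-pods
spec:
  # the wasm policy module, pulled from an OCI registry
  module: registry://ghcr.io/kubewarden/policies/pod-privileged:v0.1.9
  # which admission requests this policy applies to
  rules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      resources: ["pods"]
      operations: ["CREATE", "UPDATE"]
  mutating: false
  settings: {}   # per-policy configuration, validated by the wasm itself
```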

Should test this out properly. Looks like less of a hassle than OPA/gatekeeper.

Edge Computing using K3s on Raspberry Pi

Nice, up-to-date tutorial to look into in case of apocalypse.

Allocation Optimizer for Minimizing Power Consumption

Using science on cpu power usage: modelling power draw from cpu utilization %.

Shifting Spotify from Spreadsheets to Backstage

Great service catalog. Tons of plugins: costs, triggering incidents. Probably better than opslevel? But backstage needs to be in-cluster. Also wants to do things that keptn wants to do.

Building Catalogs of Operators for OLM the Declarative Way

OLM craziness on top of controllers. opm serves a registry of controllers in a catalog…

Faster Container Image Distribution

Tarball image distribution is problematic because you have to download all of it before you can start. So, two new systems:

  • eStargz: extension to OCI (backwards compatible); a subproject of containerd
  • pull times look like 20-40% of the original
  • can enable with k3s server --snapshotter=stargz (but needs lazy-pull-enabled images)
  • can build with buildkit via buildx build -o type=registry,name=org/repo:tag,oci-mediatypes=true,compression=estargz
  • also ways to convert existing images: nerdctl or ctr-remote
  • opencontainers/image-spec#815

and

  • nydus: future looking (incubator; a dragonfly sub-project)
  • next OCI image spec proposal
  • improved lazy pulling, better ecosystem integration
  • benchmarks look better than estargz?
  • harbor with auto-conversion

What We Learned from Reading 100+ Kubernetes Post-Mortems

Nice, quick failure stories:

  • cronjob concurrencyPolicy: Forbid, otherwise a crashing job causes pod-duplication “fork bombs” (sketch below)
  • incorrectly placed yaml gets silently discarded by bad CI
  • ingress: no * in rules[].host
  • pods: no limits on a 3rd party image -> took down the cluster when it leaked memory
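
The first and last bullets boil down to a few lines of yaml; a sketch (image and numbers made up):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-job
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid     # no new run while the previous one is still alive
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: example.com/batch:latest   # stand-in for the 3rd party image
              resources:
                limits:
                  memory: 256Mi   # a leak gets the pod OOM-killed, not the node
                  cpu: 500m
```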

TL;DR: use good validation and good CD.

From Storming to Performing: Growing Your Project’s Contributor Experience

Matt Butcher on the four stages of group development (Tuckman) and how they apply to open source:

  • FORM: deal with prs positively / identity / website / branding / communications / twitter (think early) / maintainer guide docs
  • STORM: conflicts (dispute resolution / CoC / governance / coding standards / contributors != employees (ask + thank))
  • NORM: sharing responsibilities (issue mgmt / triage / delegate (find volunteers) / standardising communication channels)
  • PERFORM: optimising for long haul (retaining maintainers / burnout / turnover / acquire new maintainers)

At all stages people are still volunteers: be kind, thank them, give them something (responsibility / status) if possible. Sometimes people need to step down. The stages are not hard-delineated.

  • adjourning could be the last step (nothing more to really do?)

triage maintainer could be a good idea.

Kubernetes SIG CLI: Intro and Updates

Scope: standardisation of the cli framework / posix compliance / conventions. Owns kubectl, kui, cli-runtime, cli-experimental, cli-utils, krew, and kustomize.

  • they are conceding that apply --prune is awful and has drawbacks (alpha, and probably won’t ever graduate). cli-utils has experiments for improvements.
  • everything uses cobra (they want to remove that); they want to pull apply out into something people can use as a library
  • kubectl has many imperative commands (like kubectl create) which are hard to maintain
  • kubectl performs badly: too much serialization (json -> yaml -> json -> go structs …); go is strictly typed without generics, so memory usage balloons
  • “kubectl is a very difficult codebase to work on” -_-

Measuring the Health of Your CNCF Project

Via CNCF project-health and the cncf devstats dashboards. Project health metrics:

  • Responsiveness (more likely to retain contributors)
    • First Response time on PRs (1 hour good, 3 days bad)
    • Resolution (time to close; dislike this metric, it encourages autoclose bots)
  • Contributor Activity (is the community toxic? clear contribution policies make it easier for new/episodic contribs)
    • Contributor activity
    • Contributors new and episodic (shows growth of contributors)
  • Contributor Risk (low risk; many contributors, org diversity)
  • Project Velocity (decrease => maturity or health issues)
  • Release Activity (regular cadence improves trust, quick security response)
  • Inclusivity (inclusive / welcoming projects attract + retain diverse contributors)
    • mentoring programs? Timeframe? Can be run sensibly if you have a regular release cadence; otherwise you have to pick a time frame. They have dashboards.

Turn Contributors Into Maintainers with TAG Contributor Strategy

Produces templates and a guide for governance (already used it!)

  • descriptive helps. goals need to align.
  • clarify what to do when making a PR - minimize manual steps
  • thank people, recognition programs (in releases), create a welcoming community
  • get people on the contribution ladder. linkerd has a linkerd hero. define the ladder (gamifies the task).
  • maintainers value code and are biased towards that. need people that have other skills. need someone to help with docs?
  • they have a contributor ladder
  • governance == membership. people want to belong to something. it proves to them that they are treated equally, and that they have ownership.
  • corporate contributors are shown they won’t be railroaded. investment ~~ influence.

Design Up Front: Socializing Ideas with Enhancement Proposals

On enhancement proposals / RFCs. The key takeaways were good:

  • taking time to communicate your ideas clearly and getting feedback / responding to that feedback makes your ideas better and makes you grow as an engineer.
  • helps improve stability, but can be intimidating.
  • need to invest in it, and follow up on reviewers and contributors.
  • the system dies if you don’t.

CNCF Technical Oversight at Scale

Creates TAGs (technical advisory groups), which help cncf projects incubate/graduate.

  • we might be in the runtime tag; https://github.com/cncf/tag-runtime

  • cncf project updates talk: crossplane/keda/cilium/flux/opentelemetry incubating

  • flux uses server-side apply, drift detection, stable apis (although their GA Roadmap talk had just docs/test/standardisation stuff)

  • prometheus high res histograms

  • keda: event driven autoscaler; listens to eventing systems -> translates events to metrics -> turns those into cpu/memory-style metrics, “tricking the system” (sketch below)
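
A minimal ScaledObject sketch to make that concrete (names and trigger metadata are made up):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler
spec:
  scaleTargetRef:
    name: consumer              # the Deployment to scale
  triggers:
    - type: kafka               # scale on consumer group lag
      metadata:
        bootstrapServers: kafka.svc:9092
        consumerGroup: consumer
        topic: events
        lagThreshold: "50"      # desired lag per replica
```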

Technical Oversight Committee

A public meeting; interesting just to get an overview of its goals. Good links and reasonable goals (discussion was ok).

CNCF Tag-Runtime

Useful because it’s the TAG that seems likely for kube-rs donation. dims is a liaison!

  • Scope areas limited so far, but “open to expanding”.
  • Contains: krustlet + Akri

Kubernetes SIG Docs

…is apparently mostly hugo + netlify. They have a contributor role of PR wrangler (and rotate it).
