Kubernetes Node Setup Jobs aka Run-Once DaemonSets

Sat, 19 Nov 2022

Managed Kubernetes restrictions

Managed Kubernetes takes a huge operational burden off your shoulders. You can have perfectly elastic infrastructure where nodes scale in and out of your cluster as workload requirements change. It’s convenient, but it does come with limitations.

One limitation I ran into recently is that there is no obvious way to run setup commands on the nodes in your cluster when they start.

For example, what if you need to configure a kernel parameter on your nodes for one of your workloads? Upstream Kubernetes has a short list of “safe” sysctls, and some managed clouds expand that list slightly. But if the sysctl you want to configure is not on that list, there doesn’t seem to be a way to configure it through the managed Kubernetes interface.
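
For comparison, this is roughly what the supported interface looks like when the sysctl is on the safe list. It is a minimal sketch: the pod name and image are placeholders of mine, and kernel.shm_rmid_forced is chosen only because it is on the upstream safe list. As far as I can tell, a node-level setting such as vm.max_map_count is not namespaced, so it cannot be expressed this way at all.

```yaml
# Sketch only: pod name and image are placeholders, not part of the solution below.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    # A namespaced sysctl from the upstream "safe" list.
    # It applies to this pod only, not to the node.
    - name: kernel.shm_rmid_forced
      value: "1"
  containers:
  - name: app
    image: busybox:1.36
    command: ["sleep", "3600"]
```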

Kubernetes issues and suggested workarounds

There are several Kubernetes issues describing a similar requirement, the oldest dating from 2015.

Some suggested solutions are:

  1. Allow DaemonSet pods to have a RestartPolicy: OnFailure. However, that is not implemented and it doesn’t look like it will be anytime soon. It’s also quite an inelegant solution, as the RestartPolicy and the historical exit status of a pod are not obviously exposed in the Kubernetes API.
  2. Run a script that waits for the DaemonSet pods to run once and then deletes the DaemonSet. But this requires the script to run outside the cluster.
  3. Run a DaemonSet with an initContainer to do the node setup, and then have the main container of the pod “pause”. This is better than the previous solutions because it doesn’t require any changes to Kubernetes and is an in-cluster solution. But it means that there are pause pods hanging around forever, taking up precious resources such as IP addresses on the node.
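
To make the third workaround concrete, here is a rough sketch of what such a DaemonSet might look like. The names, the images, and the choice of sysctl are my own assumptions for illustration; the issues themselves do not prescribe a particular manifest.

```yaml
# Sketch of workaround 3. All names and images here are illustrative assumptions.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-setup-pause
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-setup-pause
  template:
    metadata:
      labels:
        app.kubernetes.io/name: node-setup-pause
    spec:
      initContainers:
      # Runs to completion each time the pod starts, before the main container.
      - name: setup
        image: busybox:1.36
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true
      containers:
      # Does no work; it only keeps the pod Running so the DaemonSet is satisfied.
      # This is the pod that lingers on every node and holds an IP address.
      - name: pause
        image: registry.k8s.io/pause:3.9
```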

Another solution that I’ve seen is for workload pods that need special node configuration to run a privileged initContainer that checks the node and configures it if necessary. However, that isn’t ideal either, because now you need to allow privileged pods in your workload namespace. And with Pod Security Standards recently becoming the way to manage security restrictions at the namespace level, you don’t want to have to allow pods in the namespace to run using the Privileged profile just for the initContainer.

A better option

I recently stumbled across this blog post which describes a brilliant and simple solution to this problem: use a DaemonSet with a node affinity which stops it scheduling pods on a node with a node-configured label. Then have the pods in the DaemonSet set up the node, add the node-configured label to the node they are running on, and exit. The pods in the DaemonSet will no longer schedule on the same node due to the affinity.

This achieves the goals of running once, immediately on node boot, and not leaving pods running forever. Even better, because it uses the Kubernetes API only, this solution is both Kubernetes-native and cloud-agnostic! It also plays nicely with Pod Security Standards because it allows you to run the node configuration DaemonSet in a “Privileged” namespace, while keeping your workload pods with the special node requirements in a “Restricted” namespace.
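
That namespace split could look something like the sketch below, using the Pod Security admission labels. Both namespace names are hypothetical; my own manifest in this post simply uses the default namespace, so treat this as a variation rather than what I deployed.

```yaml
# Sketch only: both namespace names are hypothetical.
apiVersion: v1
kind: Namespace
metadata:
  name: node-setup
  labels:
    # The configuration DaemonSet needs a privileged container.
    pod-security.kubernetes.io/enforce: privileged
---
apiVersion: v1
kind: Namespace
metadata:
  name: opensearch
  labels:
    # Workload pods never need elevated privileges themselves.
    pod-security.kubernetes.io/enforce: restricted
```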

You could also extend this to ensure workloads don’t schedule on nodes until they are configured by adding/removing node Taints in the configuration pod, or adding an affinity to your workload pods requiring the node-configured label.
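
The affinity variant is the mirror image of the rule on the DaemonSet: the workload requires the label to exist instead of requiring it to be absent. A minimal sketch, where the pod name and image are placeholders and only the label key comes from my setup:

```yaml
# Sketch only: pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: opensearch-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # Only schedule on nodes that the configuration pod has already labelled.
          - key: k8s.amazee.io/node-configured
            operator: Exists
  containers:
  - name: opensearch
    image: opensearchproject/opensearch:2.4.0
```

With this in place the pod stays Pending on a fresh node until the configuration pod has labelled it.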

Here is my solution for configuring nodes in a cluster to run OpenSearch, which requires the sysctl vm.max_map_count >= 262144.

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aio-configure-node
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aio-configure-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aio-configure-node
subjects:
- kind: ServiceAccount
  name: aio-configure-node
  namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aio-configure-node
  namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aio-configure-node
  labels:
    app.kubernetes.io/name: aio-configure-node
    app.kubernetes.io/component: configurator
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aio-configure-node
      app.kubernetes.io/component: configurator
  template:
    metadata:
      name: aio-configure-node
      labels:
        app.kubernetes.io/name: aio-configure-node
        app.kubernetes.io/component: configurator
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: k8s.amazee.io/node-configured
                operator: DoesNotExist
      serviceAccountName: aio-configure-node
      containers:
      - name: sysctl
        image: alpine/k8s:1.25.3
        imagePullPolicy: IfNotPresent
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        command:
        - sh
        - -c
        - |
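          set -xe
          DESIRED="262144"
          CURRENT=$(sysctl -n vm.max_map_count)
          # Only ever raise the value; leave it alone if it is already high enough.
          if [ "$DESIRED" -gt "$CURRENT" ]; then
            sysctl -w vm.max_map_count="$DESIRED"
          fi
          # Label the node so the affinity above stops this DaemonSet scheduling here again.
          # --overwrite keeps this idempotent if the container restarts before the pod is removed.
          kubectl label node "$MY_NODE_NAME" "k8s.amazee.io/node-configured=$(date +%s)" --overwrite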
        securityContext:
          runAsUser: 0
          privileged: true

tags: k8s opensearch