Managed Kubernetes restrictions
Managed Kubernetes takes a huge operational burden off your shoulders. You can have perfectly elastic infrastructure where nodes scale in and out of your cluster as workload requirements change. It’s convenient, but it does come with limitations.
One limitation I ran into recently is running setup commands on the nodes in your cluster when they start.
For example, what if you need to configure a kernel parameter on your nodes for one of your workloads? Upstream Kubernetes has a short list of "safe" sysctls, and some managed clouds expand that list slightly. But if the sysctl you want to configure is not on that list, it doesn't seem possible to set it through the provided managed Kubernetes interface.
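For sysctls that are on the safe list, a pod can request them directly via its securityContext — no node access required. A minimal sketch (pod name and parameter value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example  # illustrative name
spec:
  securityContext:
    sysctls:
    # net.ipv4.ip_local_port_range is on the upstream "safe" list,
    # so the kubelet applies it without extra configuration.
    - name: net.ipv4.ip_local_port_range
      value: "1024 65535"
  containers:
  - name: app
    image: nginx
```

Anything not on the list is considered "unsafe" and must be explicitly allowed via the kubelet's --allowed-unsafe-sysctls flag, which is exactly the knob that managed Kubernetes providers typically don't expose.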
Kubernetes issues and suggested workarounds
There are several Kubernetes issues describing a similar requirement, the oldest from 2015:
Some suggested solutions are:
- Allow DaemonSet pods to have a RestartPolicy: OnFailure. However, that is not implemented, and it doesn't look like it will be anytime soon. Plus it's quite an inelegant solution, as the RestartPolicy and the historical exit status of a pod are not obviously exposed in the Kubernetes API.
- Run a script that waits for the DaemonSet pods to run once and then deletes the DaemonSet (link, link). But this requires the script to be run externally to the cluster.
- Run an initContainer to do the node setup, and then have the main container of the pod "pause". This is better than the previous solutions because it doesn't require any changes to Kubernetes and is an in-cluster solution. But it means that pause pods hang around forever, consuming precious resources such as IP addresses on the node.
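The pause approach described above might be sketched as follows (the names and image tags here are illustrative, not from the original post):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-setup-pause  # illustrative name
spec:
  selector:
    matchLabels:
      app: node-setup-pause
  template:
    metadata:
      labels:
        app: node-setup-pause
    spec:
      initContainers:
      # Privileged init container performs the one-off node setup.
      - name: setup
        image: busybox
        securityContext:
          privileged: true
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]
      containers:
      # The main container does nothing; it only keeps the pod in the
      # Running state so the kubelet never re-runs the init container.
      - name: pause
        image: registry.k8s.io/pause:3.9
```

The downside is visible right in the manifest: the pause container occupies a pod slot and an IP address on every node, forever.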
Another solution I've seen is for workload pods that need special node configuration to run a privileged initContainer that checks the node and configures it if necessary. However, that isn't ideal either, because now you need to allow privileged pods in your workload namespace. And with Pod Security Standards recently becoming the way to manage security restrictions at the namespace level, you don't want to have to allow pods in the namespace to run using the Privileged profile just for this one initContainer.
A better option
I recently stumbled across this blog post, which describes a brilliant and simple solution to this problem: use a DaemonSet with a node affinity that stops it scheduling pods on any node carrying a node-configured label. Then have the pods in the DaemonSet set up the node, add the node-configured label to the node they are running on, and exit. Due to the affinity, the DaemonSet will no longer schedule pods on that node.
This achieves the goals of running once, immediately on node boot, and not leaving pods running forever.
Even better, because it uses the Kubernetes API only, this solution is both Kubernetes-native and cloud-agnostic!
It also plays nicely with Pod Security Standards because it allows you to run the node configuration DaemonSet in a "Privileged" namespace, while keeping your workload pods with the special node requirements in a "Restricted" namespace.
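With Pod Security Standards enforced via namespace labels, that split might look like this (the namespace names are illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: node-config  # illustrative: runs the configuration DaemonSet
  labels:
    pod-security.kubernetes.io/enforce: privileged
---
apiVersion: v1
kind: Namespace
metadata:
  name: workload  # illustrative: runs the pods that need the sysctl
  labels:
    pod-security.kubernetes.io/enforce: restricted
```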
You could also extend this to ensure workloads don't schedule on nodes until they are configured, either by adding/removing node taints in the configuration pod, or by adding a node affinity to your workload pods requiring the node-configured label to exist.
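The affinity variant of that extension is just the mirror image of the DaemonSet's own affinity — a sketch of the fragment you would add to the workload pod spec (the label key matches the manifest below):

```yaml
# Illustrative fragment for the workload pod spec: only schedule on
# nodes the configuration DaemonSet has already labelled.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: k8s.amazee.io/node-configured
          operator: Exists
```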
Here is my solution for configuring nodes in a cluster to run OpenSearch, which requires the sysctl vm.max_map_count >= 262144.
```yaml
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aio-configure-node
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - patch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aio-configure-node
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: aio-configure-node
subjects:
- kind: ServiceAccount
  name: aio-configure-node
  namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: aio-configure-node
  namespace: default
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aio-configure-node
  labels:
    app.kubernetes.io/name: aio-configure-node
    app.kubernetes.io/component: configurator
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: aio-configure-node
      app.kubernetes.io/component: configurator
  template:
    metadata:
      name: aio-configure-node
      labels:
        app.kubernetes.io/name: aio-configure-node
        app.kubernetes.io/component: configurator
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # Skip nodes that have already been configured.
              - key: k8s.amazee.io/node-configured
                operator: DoesNotExist
      serviceAccount: aio-configure-node
      containers:
      - name: sysctl
        image: alpine/k8s:1.25.3
        imagePullPolicy: IfNotPresent
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        command:
        - sh
        - -c
        - |
          set -xe
          DESIRED="262144"
          CURRENT=$(sysctl -n vm.max_map_count)
          if [ "$DESIRED" -gt "$CURRENT" ]; then
            sysctl -w vm.max_map_count=$DESIRED
          fi
          kubectl label node "$MY_NODE_NAME" k8s.amazee.io/node-configured=$(date +%s)
        securityContext:
          runAsUser: 0
          privileged: true
```