Description of problem:
MCO uses openshift/kubernetes-drain for draining nodes before rebooting. The library had a serious bug in which the apiserver was flooded with eviction requests. That bug was fixed in https://github.com/openshift/kubernetes-drain/pull/3, and a follow-up deadlock fix was proposed in https://github.com/openshift/kubernetes-drain/pull/4. However, that PR never landed; instead, the library was forked into cluster-api/pkg/drain for supportability.
The MCO has to switch to the cluster-api fork to pick up the deadlock fix, and we need to verify the drain fix is in place.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
$ cat pdb.yaml
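The contents of pdb.yaml were not captured in the report. A minimal example consistent with the scenario described below (a budget of 1 covering the nginx pods; the name and labels here are assumptions) would be:

```yaml
# Hypothetical pdb.yaml: with minAvailable=1 and only one matching pod,
# the eviction API will refuse every eviction attempt.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: nginx
```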
$ oc create -f pdb.yaml
$ cat nginxrs.yaml
- name: nginx
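Only a fragment of nginxrs.yaml survives above. A minimal single-replica ReplicaSet consistent with the report (labels and image are assumptions) would be:

```yaml
# Hypothetical nginxrs.yaml: one replica guarded by a PDB with
# minAvailable=1, so its single pod can never be evicted.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
```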
$ oc create -f nginxrs.yaml
$ oc edit mcp/worker   # set "maxUnavailable: 3" under "spec"
# find the node the nginx pod landed on
$ oc describe pod nginx-h7q5b | grep -i node
# grab the MCD for that node
$ oc get pods -l k8s-app=machine-config-daemon --field-selector spec.nodeName=<NODE> -n openshift-machine-config-operator
# start streaming the logs
$ oc logs -f <MCD_FOR_NODE>
# create a test MC to trigger a drain
$ cat file.yaml
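The contents of file.yaml were not captured either; any MachineConfig targeting the worker role will trigger a drain. A hypothetical example (the name and file path are invented):

```yaml
# Hypothetical file.yaml: a trivial worker MachineConfig whose rollout
# forces the MCD to drain each worker node.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-test-file
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - path: /etc/test-drain
        filesystem: root
        mode: 0644
        contents:
          source: data:,hello
```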
$ oc create -f file.yaml
# now look at the logs from the point above
Actual results:
eviction is tried for the nginx pod over and over without any timeout between requests

Expected results:
eviction is tried for the nginx pod over and over but with a 5s delay between requests
Note that a PDB of 1 combined with a ReplicaSet of 1 replica is itself a logical misconfiguration, so it is expected that the drain retries the eviction over and over; the retrying is not the bug, only the missing delay is.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.