Bug 1749070

Summary: cluster creating too many eviction requests a second
Product: OpenShift Container Platform
Reporter: Antonio Murdaca <amurdaca>
Component: Machine Config Operator
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED ERRATA
QA Contact: Michael Nguyen <mnguyen>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 4.1.z
CC: mnguyen
Target Milestone: ---
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1748844
Environment:
Last Closed: 2019-09-25 07:27:53 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1748844
Bug Blocks:

Description Antonio Murdaca 2019-09-04 19:38:58 UTC
+++ This bug was initially created as a clone of Bug #1748844 +++

Description of problem:

MCO uses openshift/kubernetes-drain to drain nodes before rebooting them. The library had a serious bug that caused the apiserver to be flooded with eviction requests. That bug was fixed in https://github.com/openshift/kubernetes-drain/pull/3, and a follow-up deadlock fix was proposed in https://github.com/openshift/kubernetes-drain/pull/4. However, https://github.com/openshift/kubernetes-drain/pull/4 never landed; instead, the library was forked to cluster-api/pkg/drain for supportability.
The MCO has to switch to the cluster-api fork to pick up the deadlock fix, and we need to verify that the drain fix is in place there as well.
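
To illustrate the behavior the drain fix is meant to provide, here is a minimal sketch of a retry loop that pauses between eviction attempts instead of retrying immediately. This is not the code from openshift/kubernetes-drain or cluster-api/pkg/drain (both are Go libraries); the names `EvictionBlocked` and `evict_with_retry`, and the `max_attempts` parameter, are hypothetical and exist only for this example.

```python
import time
from typing import Callable, Optional


class EvictionBlocked(Exception):
    """Hypothetical error: the eviction was rejected, e.g. by a PDB."""


def evict_with_retry(
    evict: Callable[[], None],
    delay_s: float = 5.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call evict() until it succeeds, pausing delay_s between attempts.

    Returns the number of attempts made. If max_attempts is set and
    reached, the last EvictionBlocked error is re-raised.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            evict()
            return attempts
        except EvictionBlocked:
            if max_attempts is not None and attempts >= max_attempts:
                raise
            # Without this pause, a blocked eviction turns into a tight
            # loop that sends requests as fast as the server answers.
            sleep(delay_s)
```

The `sleep` parameter is injectable so the loop can be exercised in a test without actually waiting.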

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      "app": "nginx"

$ oc create -f pdb.yaml

$ cat nginxrs.yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx

$ oc create -f nginxrs.yaml


$ oc edit mcp/worker   # set "maxUnavailable: 3" in "spec"

# find the node that the nginx pod landed on

$ oc describe pod nginx-h7q5b | grep -i node

# grab the MCD for that node

$ oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon --field-selector spec.nodeName=<NODE>

# start streaming the logs

$ oc logs -f -n openshift-machine-config-operator <MCD_FOR_NODE>


# create a test MC to trigger a drain

$ cat file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test

$ oc create -f file.yaml

# now watch the MCD logs being streamed by the "oc logs -f" command above

Actual results:

Eviction of the nginx pod is retried over and over with no delay between requests.

Expected results:

Eviction of the nginx pod is retried over and over, but with a 5s delay between requests.

Additional info:

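A PDB with minAvailable: 1 on a ReplicaSet with replicas: 1 is itself a configuration error, because the single pod can never be evicted without violating the budget. It is therefore expected that the drain retries over and over; that part is not a bug. The bug is only the missing delay between retries.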
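
The arithmetic behind that statement can be sketched as follows. For an integer minAvailable, the number of pods that may be disrupted is the count of currently healthy pods minus minAvailable, floored at zero. The function name `disruptions_allowed` is made up for this example, and the sketch ignores percentage values and maxUnavailable.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Pods that may be evicted without dropping below min_available."""
    return max(0, current_healthy - min_available)


# The reproducer: one healthy nginx pod, minAvailable: 1.
print(disruptions_allowed(current_healthy=1, min_available=1))  # 0

# With a second replica, one pod could be evicted at a time.
print(disruptions_allowed(current_healthy=2, min_available=1))  # 1
```

With zero allowed disruptions, every eviction request for the nginx pod is rejected, which is why the drain never completes in this setup.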

Comment 2 Michael Nguyen 2019-09-18 14:12:52 UTC
Verified on 4.1.0-0.nightly-2019-09-16-165032

Eviction of the nginx pod is retried over and over, with a 5 second delay between attempts:

I0918 14:10:28.275984    2730 update.go:89] pod "grafana-bf8f7bdf5-phr4c" removed (evicted)
I0918 14:10:29.244958    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:10:34.249058    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:10:39.253445    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:10:44.257521    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:10:49.261931    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:10:54.266075    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:10:59.271507    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:11:04.276535    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:11:09.280714    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:11:14.284921    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:11:19.289207    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I0918 14:11:24.292810    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
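
One way to check the spacing of the retries from such a log, rather than by eye, is to parse the timestamps and compute the gaps between consecutive eviction errors. The following is a rough sketch: the helper name `retry_gaps` is hypothetical, the timestamp pattern assumes the log prefix format shown above, and the sample uses the first three error lines from this comment.

```python
import re
from datetime import datetime

# Matches the log prefix, e.g. "I0918 14:10:29.244958", and captures the time.
LOG_TIME = re.compile(r"^[IWEF]\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})")


def retry_gaps(lines, needle="error when evicting pod"):
    """Return the seconds elapsed between consecutive lines containing needle.

    Only the time of day is parsed, so a log that crosses midnight
    would need extra handling.
    """
    stamps = []
    for line in lines:
        match = LOG_TIME.match(line)
        if match and needle in line:
            stamps.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


sample = [
    'I0918 14:10:29.244958    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod\'s disruption budget.',
    'I0918 14:10:34.249058    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod\'s disruption budget.',
    'I0918 14:10:39.253445    2730 update.go:89] error when evicting pod "nginx-pdwfg" (will retry after 5s): Cannot evict pod as it would violate the pod\'s disruption budget.',
]

gaps = retry_gaps(sample)
print([round(g, 3) for g in gaps])  # [5.004, 5.004]
```

Each gap is slightly over 5 seconds, which is consistent with a 5s sleep plus the time taken by the eviction request itself.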

Comment 4 errata-xmlrpc 2019-09-25 07:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.
