1748844 – cluster creating too many eviction requests a second

Summary: cluster creating too many eviction requests a second

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	4.2.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1749070

Reported:	2019-09-04 09:36 UTC by Antonio Murdaca
Modified:	2019-10-16 06:40 UTC
CC List:	0 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1749070
Environment:
Last Closed:	2019-10-16 06:40:32 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1098	0	None	closed	Bug 1748844: vendor: use cluster-api's drain lib	2020-04-13 03:44:18 UTC
Red Hat Product Errata	RHBA-2019:2922	0	None	None	None	2019-10-16 06:40:41 UTC

Description Antonio Murdaca 2019-09-04 09:36:57 UTC

Description of problem:

The MCO uses openshift/kubernetes-drain to drain nodes before rebooting them. The library had a serious bug that flooded the apiserver with eviction requests. That bug was fixed in https://github.com/openshift/kubernetes-drain/pull/3, and a separate deadlock fix was submitted as https://github.com/openshift/kubernetes-drain/pull/4. However, https://github.com/openshift/kubernetes-drain/pull/4 never landed; instead, the library was forked for supportability into cluster-api/pkg/drain.
The MCO therefore has to switch to the cluster-api fork to pick up the deadlock fix, and the drain fix must be verified to be in place.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.

$ cat pdb.yaml 
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      "app": "nginx"

$ oc create -f pdb.yaml

$ cat nginxrs.yaml 
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx

$ oc create -f nginxrs.yaml

2.

$ oc edit mcp/worker   # set "maxUnavailable: 3" under "spec"

# find the node where the nginx pod has landed on

$ oc describe pod nginx-h7q5b | grep -i node

# grab the MCD for that node

$ oc get pods -l k8s-app=machine-config-daemon --field-selector spec.nodeName=<NODE> -n openshift-machine-config-operator

# start streaming the logs

$ oc logs -f <MCD_FOR_NODE> -n openshift-machine-config-operator

3.

# create a test MC to trigger a drain

$ cat file.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test

$ oc create -f file.yaml

# now look at the logs from the point above

Actual results:

Eviction of the nginx pod is retried over and over with no delay between requests.

Expected results:

Eviction of the nginx pod is retried over and over, but with a 5s delay between requests.


Additional info:

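A PodDisruptionBudget with minAvailable: 1 on a ReplicaSet with replicas: 1 is itself a logical error, because the only pod can never be evicted. It is therefore expected that the drain retries over and over; that part is not a bug.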
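
Why the reproducer can never drain comes down to simple arithmetic, sketched below. This is a simplified model of a minAvailable budget, not the disruption controller's actual code; allowedDisruptions is a hypothetical helper name.

```go
package main

import "fmt"

// allowedDisruptions models a minAvailable budget: pods may be evicted
// only while the number of healthy pods stays at or above minAvailable.
func allowedDisruptions(healthyPods, minAvailable int) int {
	if d := healthyPods - minAvailable; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Reproducer from this report: replicas=1, minAvailable=1.
	fmt.Println(allowedDisruptions(1, 1)) // 0: every eviction is refused

	// For comparison: three replicas with the same budget.
	fmt.Println(allowedDisruptions(3, 1)) // 2
}
```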

Comment 3 errata-xmlrpc 2019-10-16 06:40:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922
