This bug was initially created as a copy of Bug #1968019

I am copying this bug because:

Description of problem:
The drain period and timeout that mark a pool as degraded are too short for an average cluster. This leads to reported upgrade failures and alerts when the user would just need a bit more time (especially given that any degraded pool is now surfaced as an upgrade blocker). It isn't uncommon for nodes to need between 15 minutes and 1 hour to drain, so bump the timeouts to only fire alerts and report a failure after at least 1 hour of drain attempts.

Actual results:
Nodes that need a reasonable amount of time to drain error out.

Expected results:
A node that needs an hour to drain should be able to complete the drain without causing an error.
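For clusters hitting this before the fix lands, a quick way to confirm a node is stuck on drain (rather than failing for some other reason) is to check the pool status and the machine-config-daemon logs on the affected node. This is only a sketch; <node-name> and the daemon pod name are placeholders to fill in for your cluster:

$ oc get mcp
$ oc get pods -A --field-selector spec.nodeName=<node-name> | grep machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon | grep 'Draining failed with'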
Doing a manual backport for this one.
OpenShift engineering has decided to not ship Red Hat OpenShift Container Platform 4.7.17 due to a regression https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-20-093308   True        False         30m     Cluster version is 4.7.0-0.nightly-2021-06-20-093308

$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     4s

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-143-129.us-west-2.compute.internal   Ready    master   52m   v1.20.0+87cc9a4
ip-10-0-145-198.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-162-222.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-173-218.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-220-204.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-222-240.us-west-2.compute.internal   Ready    worker   42m   v1.20.0+87cc9a4

$ oc get node/ip-10-0-222-240.us-west-2.compute.internal -o yaml | grep hostname
    kubernetes.io/hostname: ip-10-0-222-240
        f:kubernetes.io/hostname: {}

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-222-240"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created

$ oc get pod
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          4s

$ cat << EOF > file-ig3.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF

$ oc create file -f file-ig3.yaml
error: Unexpected args: [file]
See 'oc create -h' for help and examples.
$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
00-worker                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-ssh                                                                                 3.2.0             62m
99-worker-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-worker-ssh                                                                                 3.2.0             62m
rendered-master-13a8e238f99c2aae13c24eac159a2db2   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
rendered-worker-96cc244920a4ac522616e39024e1d35d   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
test-file                                                                                     3.1.0             3s

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-222-240.us-west-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator   machine-config-daemon-6gkv4   2/2   Running   0   45m

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-6gkv4 -c machine-config-daemon | grep 'Draining failed with'
I0621 14:15:24.136453    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:17:54.532931    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:20:24.923638    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:22:55.313792    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:25:25.750570    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:27:56.145450    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:34:26.537000    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:40:56.929332    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:47:27.320342    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:53:57.705077    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
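Once the new drain behavior has been verified, a minimal cleanup (assuming the PDB and test pod created in the steps above are still present in the default project) is to remove the PodDisruptionBudget and the blocking pod, then watch the worker pool converge:

$ oc delete pdb dontevict
$ oc delete pod dont-evict-this-pod
$ oc get mcp worker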
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2502
*** Bug 1906254 has been marked as a duplicate of this bug. ***