This bug was initially created as a copy of Bug #1968019

I am copying this bug because:

Description of problem:
The drain period and timeout that mark a pool as degraded are too short for an average cluster. This leads to reported upgrade failures and alerts when the user would just need a bit more time (especially given that any degraded pool is now surfaced as an upgrade blocker). It isn't uncommon for nodes to need between 15 minutes and 1 hour to drain, so bump the timeouts to only fire alerts and report a failure after at least 1 hour of drain attempts.

Actual results:
Nodes that need a reasonable amount of time to drain error out.

Expected results:
A node that needs an hour to drain should be able to complete the drain without causing an error.
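For clusters hitting this before the fix lands, a quick way to confirm a node is stuck on drain (rather than failing for some other reason) is to check the pool status and the machine-config-daemon logs on the affected node. This is only a sketch; <node-name> and the daemon pod name are placeholders to fill in for your cluster:

$ oc get mcp
$ oc get pods -A --field-selector spec.nodeName=<node-name> | grep machine-config-daemon
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon | grep 'Draining failed with'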
Doing a manual backport for this one.
OpenShift engineering has decided to not ship Red Hat OpenShift Container Platform 4.7.17 due to a regression https://bugzilla.redhat.com/show_bug.cgi?id=1973006. All the fixes that were part of 4.7.17 will now be part of 4.7.18, which is planned to be available in the candidate channel on June 23, 2021 and in the fast channel on June 28.
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-20-093308   True        False         30m     Cluster version is 4.7.0-0.nightly-2021-06-20-093308

$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     4s

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-143-129.us-west-2.compute.internal   Ready    master   52m   v1.20.0+87cc9a4
ip-10-0-145-198.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-162-222.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-173-218.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-220-204.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-222-240.us-west-2.compute.internal   Ready    worker   42m   v1.20.0+87cc9a4

$ oc get node/ip-10-0-222-240.us-west-2.compute.internal -o yaml | grep hostname
    kubernetes.io/hostname: ip-10-0-222-240
        f:kubernetes.io/hostname: {}

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-222-240"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created

$ oc get pod
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          4s

$ cat << EOF > file-ig3.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF

$ oc create file -f file-ig3.yaml
error: Unexpected args: [file]
See 'oc create -h' for help and examples.
$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
00-worker                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-ssh                                                                                 3.2.0             62m
99-worker-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-worker-ssh                                                                                 3.2.0             62m
rendered-master-13a8e238f99c2aae13c24eac159a2db2   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
rendered-worker-96cc244920a4ac522616e39024e1d35d   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
test-file                                                                                     3.1.0             3s

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-222-240.us-west-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator   machine-config-daemon-6gkv4   2/2   Running   0   45m

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-6gkv4 -c machine-config-daemon | grep 'Draining failed with'
I0621 14:15:24.136453    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:17:54.532931    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:20:24.923638    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:22:55.313792    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:25:25.750570    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:27:56.145450    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:34:26.537000    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:40:56.929332    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:47:27.320342    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:53:57.705077    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
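Once the new drain behavior has been verified, a minimal cleanup (assuming the PDB and test pod created in the steps above are still present in the default project) is to remove the PodDisruptionBudget and the blocking pod, then watch the worker pool converge:

$ oc delete pdb dontevict
$ oc delete pod dont-evict-this-pod
$ oc get mcp worker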
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:2502
*** Bug 1906254 has been marked as a duplicate of this bug. ***