This bug was initially created as a copy of Bug #1971080.

I am copying this bug because: we need to do a manual backport of the fix into 4.7.

tl;dr: The MCO's recent change to a 5m sleep between drains [4] occasionally causes drains that used to take ~1m to now take ~6m. That can cause some alerts to fire, but the main issue in this bug is that we should realize the drain is complete with lower latency, because delays on the scale of 5m will add up as we roll nodes.

Long form:

A 4.7.16 -> 4.8.0-0.ci-2021-06-11-170229 AWS update [1] failed on:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success  1h21m18s
Jun 11 19:11:28.654: Unexpected alerts fired or pending during the upgrade:

alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-apiserver", pod="apiserver-5966c9c56c-jn8tx", severity="warning"}

I'm not entirely sure why it says "fired" there instead of "pending". In PromeCIeus, I don't actually see the alert firing [2]; it is shown as pending from 18:54 through 19:09.

Looking at the e2e-interval chart, one thing that stands out is that control-plane nodes 44 and 164 both have slow drains, and the 164 reboot corresponds with that API server's alert. So:

1. MCD cordons the node.
2. MCD drains off the API-server pod.
3. MCD takes a while to drain something else.
4. Drain completes.
5. MCD reboots the node.
6. MCD uncordons the node.
7. Scheduler drops the API-server pods back on.

or something close to that. The slow drain in step 3 makes the 2-to-7 window too long, and KubePodNotReady trips into firing.

So why is the drain slow? Moving to Loki [3]:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "drain"

gives (~6m35s, versus the ~1m these drains used to take):

I0611 19:04:44.405866 106706 drain.go:120] Successful drain took 394.758357545 seconds

And:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "openshift-kube-controller-manager/revision-pruner"

gives:

I0611 18:58:11.465867 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465931 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465948 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465966 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
I0611 18:58:11.466004 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
I0611 18:58:33.702383 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
I0611 18:58:34.500684 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 18:58:34.901070 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
I0611 18:58:35.300499 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
I0611 18:58:36.100496 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
I0611 19:04:44.356203 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 19:04:44.398227 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal

So: a whole bunch of eviction chatter around 18:58, then a gap, and then the revision-pruner-11 pod comes up again ~6m later. Moving to the node/kubelet logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521 1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394 1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454 1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

So, expanding the timeline and guessing for some gaps:

1. MCD cordons the node.
2. MCD drains off the API-server pod.
3. MCD asks to drain revision-pruner-11-...
4. Kubelet/CRI-O takes a while, maybe because we were deleting a whole bunch of pods during the drain, and the eviction attempt times out.
5. Kubelet tells MCD, "yeah, still working on revision-pruner-11-...".
6. MCD asks again. Kubelet tells MCD, "uhh, yeah, no pods like that around here anymore".
7. Drain completes.
8. MCD reboots the node.
9. MCD uncordons the node.
10. Scheduler drops the API-server pods back on.

The recent change in this space is the MCD's 5m sleep between drain attempts [4]; we probably need to at least start with something shorter. It would also be nice if the kubelet/CRI-O didn't lose track of these pods, but that can be a separate bug.
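As a side note on the fired-vs-pending confusion above: one way to double-check in PromeCIeus is to query Prometheus's built-in ALERTS meta-metric, which labels each sample with alertstate, e.g. (a sketch; label set trimmed to the ones shown in the test failure):

ALERTS{alertname="KubePodNotReady", namespace="openshift-apiserver", alertstate="firing"}

If that returns nothing over the upgrade window while the alertstate="pending" variant covers 18:54 through 19:09, the "fired" wording would be coming from the test's own accounting rather than from Prometheus.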
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[3]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1403398609562505216%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-bjpgv%5C%22%20%7C%3D%20%5C%22drain%5C%22%22%7D%5D
[4]: https://github.com/openshift/machine-config-operator/pull/2597/files#diff-58db213cf6f6259a621f8d3421b5e8af55ed71c9f52c15b255f66f0b1e78a946R60
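The verification below wedges a drain on purpose with a PodDisruptionBudget: with minAvailable: 1 and exactly one matching pod, allowed disruptions is 0, so the eviction API keeps refusing to evict dont-evict-this-pod and the MCD is forced into its retry loop, which makes the retry spacing easy to read out of the daemon logs.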
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-20-093308   True        False         30m     Cluster version is 4.7.0-0.nightly-2021-06-20-093308

$ cat pdb.yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict

$ oc create -f pdb.yaml
poddisruptionbudget.policy/dontevict created

$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     4s

$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-143-129.us-west-2.compute.internal   Ready    master   52m   v1.20.0+87cc9a4
ip-10-0-145-198.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-162-222.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-173-218.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-220-204.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-222-240.us-west-2.compute.internal   Ready    worker   42m   v1.20.0+87cc9a4

$ oc get node/ip-10-0-222-240.us-west-2.compute.internal -o yaml | grep hostname
    kubernetes.io/hostname: ip-10-0-222-240
          f:kubernetes.io/hostname: {}

$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-222-240"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created

$ oc get pod
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          4s

$ cat << EOF > file-ig3.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF
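(Aside: the base64 payload in that MachineConfig decodes to three harmless chrony-style placeholder lines, so applying it only forces a worker rollout:)

$ echo c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK | base64 -d
server foo.example.net maxdelay 0.4 offline
server bar.example.net maxdelay 0.4 offline
server baz.example.net maxdelay 0.4 offline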
$ oc create file -f file-ig3.yaml
error: Unexpected args: [file]
See 'oc create -h' for help and examples.

$ oc create -f file-ig3.yaml
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
00-worker                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-ssh                                                                                 3.2.0             62m
99-worker-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-worker-ssh                                                                                 3.2.0             62m
rendered-master-13a8e238f99c2aae13c24eac159a2db2   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
rendered-worker-96cc244920a4ac522616e39024e1d35d   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
test-file                                                                                     3.1.0             3s

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-222-240.us-west-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator   machine-config-daemon-6gkv4   2/2   Running   0   45m

$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-6gkv4 -c machine-config-daemon | grep 'Draining failed with'
I0621 14:15:24.136453    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:17:54.532931    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:20:24.923638    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:22:55.313792    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:25:25.750570    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:27:56.145450    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:34:26.537000    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:40:56.929332    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:47:27.320342    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:53:57.705077    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
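Those timestamps line up with the backported behavior: the first several retries are spaced ~2m30s apart (the 1m30s global eviction timeout plus a short ~1m sleep), and from 14:27:56 onward the spacing grows to ~6m30s (the 1m30s timeout plus the original 5m sleep). A quick way to eyeball the spacing, assuming the timestamp is the second whitespace-separated field of the klog lines above:

$ oc -n openshift-machine-config-operator logs machine-config-daemon-6gkv4 -c machine-config-daemon | grep 'Draining failed with' | awk '{print $2}'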
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502