tl;dr: The MCO's recent change to a 5m sleep between drains [4] occasionally causes drains that used to take ~1m to now take ~6m. That can cause some alerts to fire, but the main issue in this bug is that we should realize the drain is complete with lower latency, because delays on the scale of 5m will add up as we roll nodes.

Long form:

A 4.7.16 -> 4.8.0-0.ci-2021-06-11-170229 AWS update [1] failed on:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade  success  1h21m18s
  Jun 11 19:11:28.654: Unexpected alerts fired or pending during the upgrade:

  alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-apiserver", pod="apiserver-5966c9c56c-jn8tx", severity="warning"}

I'm not entirely sure why it says "fired" there instead of "pending". In PromeCIeus, I don't actually see the alert firing [2]; it is shown as pending from 18:54 through 19:09.

Looking at the e2e-interval chart, one thing that stands out is that control-plane nodes 44 and 164 both have slow drains, and the 164 reboot corresponds with that API server's alert. So:

1. MCD cordons the node.
2. MCD drains off the API-server pod.
3. MCD takes a while to drain something else.
4. Drain completes.
5. MCD reboots the node.
6. MCD uncordons the node.
7. Scheduler drops the API-server pods back on.

or something close to that. The slow drain in step 3 makes the 2-to-7 window too long, and KubePodNotReady trips into firing.

So why is the drain slow? Moving to Loki [3]:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "drain"

gives:

  I0611 19:04:44.405866 106706 drain.go:120] Successful drain took 394.758357545 seconds

And:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "openshift-kube-controller-manager/revision-pruner"

gives:

  I0611 18:58:11.465867 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
  I0611 18:58:11.465931 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
  I0611 18:58:11.465948 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
  I0611 18:58:11.465966 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
  I0611 18:58:11.466004 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
  I0611 18:58:33.702383 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
  I0611 18:58:34.500684 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
  I0611 18:58:34.901070 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
  I0611 18:58:35.300499 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
  I0611 18:58:36.100496 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
  I0611 19:04:44.356203 106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
  I0611 19:04:44.398227 106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal

So there is a whole bunch of chatter around 18:58, then a gap, and then it mentions the revision-pruner-11 pod again ~6m later. Moving to the node/kubelet logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
  Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
  Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521 1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
  Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394 1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
  Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454 1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

So expanding the timeline and guessing for some gaps:

1. MCD cordons the node.
2. MCD drains off the API-server pod.
3. MCD asks to drain revision-pruner-11-...
4. Kubelet/CRI-O takes a while, maybe because we were deleting a whole bunch of pods during the drain, and the kubelet times out.
5. Kubelet tells MCD, "yeah, still working on revision-pruner-11-...".
6. MCD asks again. Kubelet tells MCD, "uhh, yeah, no pods like that around here anymore".
7. Drain completes.
8. MCD reboots the node.
9. MCD uncordons the node.
10. Scheduler drops the API-server pods back on.

A recent change in this space is the MCD having a 5m sleep between drain attempts [4]; we probably need to at least start with something shorter. It would also be nice if the kubelet/CRI-O didn't lose these pods, but that can be a separate bug.
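As a concrete sketch of what "start with something shorter" could look like, here is a minimal, illustrative Go loop where the wait between drain attempts starts at 1m and only steps up to the current 5m after a handful of failures. The names, constants, and retry threshold below are my own assumptions for illustration, not the MCO's actual code:

  package main

  import (
      "fmt"
      "time"
  )

  const (
      // Illustrative values, not the MCO's real constants.
      shortDrainSleep = 1 * time.Minute // assumed short wait for early retries
      longDrainSleep  = 5 * time.Minute // the current 5m wait from [4]
      shortRetries    = 5               // assumed number of attempts that get the short wait
  )

  // attemptDrain is a placeholder for the MCD's real drain call.
  func attemptDrain(attempt int) error {
      return fmt.Errorf("drain attempt %d still blocked (placeholder)", attempt)
  }

  // drainWithBackoff retries the drain, sleeping only a short time between early
  // attempts and falling back to the long 5m sleep after shortRetries failures.
  func drainWithBackoff(maxAttempts int) error {
      var err error
      for attempt := 1; attempt <= maxAttempts; attempt++ {
          if err = attemptDrain(attempt); err == nil {
              return nil
          }
          sleep := shortDrainSleep
          if attempt > shortRetries {
              sleep = longDrainSleep
          }
          fmt.Printf("drain attempt %d failed: %v; retrying in %s\n", attempt, err, sleep)
          time.Sleep(sleep)
      }
      return err
  }

  func main() {
      if err := drainWithBackoff(8); err != nil {
          fmt.Println("drain never succeeded:", err)
      }
  }

The exact values are up for debate; the point is just that the common case, a drain that succeeds on an early retry, shouldn't pay the full 5m penalty.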
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[3]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1403398609562505216%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-bjpgv%5C%22%20%7C%3D%20%5C%22drain%5C%22%22%7D%5D
[4]: https://github.com/openshift/machine-config-operator/pull/2597/files#diff-58db213cf6f6259a621f8d3421b5e8af55ed71c9f52c15b255f66f0b1e78a946R60
(In reply to W. Trevor King from comment #0)
> 19:04:44.411454 1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

This is apparently NOTABUG. I dunno if I agree, but bug 1948920 is the place to discuss.
Verified on 4.8.0-0.nightly-2021-06-13-101614. Created a PDB and a pod it protects, then initiated a drain. The time between drain retries went from roughly 2 minutes to roughly 7 minutes after 5 retries.

  $ oc get clusterversion
  NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
  version   4.8.0-0.nightly-2021-06-13-101614   True        False         97m     Cluster version is 4.8.0-0.nightly-2021-06-13-101614

  $ oc get nodes
  NAME                               STATUS   ROLES    AGE    VERSION
  mnguyen061410-7m2fq-master-0       Ready    master   120m   v1.21.0-rc.0+120883f
  mnguyen061410-7m2fq-master-1       Ready    master   118m   v1.21.0-rc.0+120883f
  mnguyen061410-7m2fq-master-2       Ready    master   120m   v1.21.0-rc.0+120883f
  mnguyen061410-7m2fq-worker-nsf26   Ready    worker   112m   v1.21.0-rc.0+120883f
  mnguyen061410-7m2fq-worker-qrhvq   Ready    worker   112m   v1.21.0-rc.0+120883f
  mnguyen061410-7m2fq-worker-s4jjd   Ready    worker   112m   v1.21.0-rc.0+120883f

  $ cat pdb.yaml
  apiVersion: policy/v1beta1
  kind: PodDisruptionBudget
  metadata:
    name: dontevict
  spec:
    minAvailable: 1
    selector:
      matchLabels:
        app: dontevict

  $ oc create -f pdb.yaml
  poddisruptionbudget.policy/dontevict created

  $ oc get pdb
  NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
  dontevict   1               N/A               0                     3s

  $ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "mnguyen061410-7m2fq-worker-nsf26"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
  pod/dont-evict-this-pod created

  $ oc get pods
  NAME                  READY   STATUS    RESTARTS   AGE
  dont-evict-this-pod   1/1     Running   0          15s

  $ cat file-ig3.yaml
  apiVersion: machineconfiguration.openshift.io/v1
  kind: MachineConfig
  metadata:
    labels:
      machineconfiguration.openshift.io/role: worker
    name: test-file
  spec:
    config:
      ignition:
        version: 3.1.0
      storage:
        files:
        - contents:
            source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
          filesystem: root
          mode: 0644
          path: /etc/test

  $ oc create -f file-ig3.yaml
  machineconfig.machineconfiguration.openshift.io/test-file created

  $ oc get pods -A --field-selector spec.nodeName=mnguyen061410-7m2fq-worker-nsf26 | grep machine-config-daemon
  openshift-machine-config-operator   machine-config-daemon-88wxz   2/2   Running   0   113m

== wait a long time ==

  $ oc -n openshift-machine-config-operator logs machine-config-daemon-88wxz -c machine-config-daemon | grep 'Draining failed with'
  I0614 16:13:03.399452 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:15:34.092940 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:18:04.734727 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:20:35.402593 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:23:06.035815 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:25:36.665004 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:32:07.296922 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
  I0614 16:38:37.957136 2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
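If you want to double-check the spacing between those retries, a small throwaway helper like the one below (my own sketch, not part of the MCO; the file name is arbitrary) can parse the klog timestamps on the "Draining failed with" lines and print the gap between successive retries. It only assumes the "I0614 16:13:03.399452 ..." header format shown above:

  // retry-gaps.go (name arbitrary): a throwaway helper, not part of the MCO.
  // It reads machine-config-daemon logs on stdin and prints the time between
  // successive "Draining failed with" retries, assuming klog headers like
  // "I0614 16:13:03.399452 2751 drain.go:67] ...".
  package main

  import (
      "bufio"
      "fmt"
      "os"
      "strings"
      "time"
  )

  func main() {
      var prev time.Time
      scanner := bufio.NewScanner(os.Stdin)
      scanner.Buffer(make([]byte, 1024*1024), 1024*1024) // some log lines are long
      for scanner.Scan() {
          line := scanner.Text()
          if !strings.Contains(line, "Draining failed with") {
              continue
          }
          fields := strings.Fields(line)
          if len(fields) < 2 || len(fields[0]) < 5 {
              continue
          }
          // fields[0] is like "I0614" (month/day), fields[1] like "16:13:03.399452".
          t, err := time.Parse("0102 15:04:05.000000", fields[0][1:]+" "+fields[1])
          if err != nil {
              continue
          }
          if prev.IsZero() {
              fmt.Println(fields[1])
          } else {
              fmt.Printf("%s  (+%s since previous retry)\n", fields[1], t.Sub(prev).Round(time.Second))
          }
          prev = t
      }
      if err := scanner.Err(); err != nil {
          fmt.Fprintln(os.Stderr, "reading log:", err)
      }
  }

Piping the daemon log through it, e.g.

  $ oc -n openshift-machine-config-operator logs machine-config-daemon-88wxz -c machine-config-daemon | go run retry-gaps.go

should show the early retries spaced roughly 2.5 minutes apart and the later ones roughly 6.5 minutes apart, presumably the 1m30s eviction timeout plus the sleep between attempts.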
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438