Bug 1971080 - 4.7->4.8 CI: KubePodNotReady due to MCD's 5m sleep between drain attempts
Summary: 4.7->4.8 CI: KubePodNotReady due to MCD's 5m sleep between drain attempts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1971374
 
Reported: 2021-06-11 21:44 UTC by W. Trevor King
Modified: 2021-08-04 15:41 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 23:12:40 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift machine-config-operator pull 2611 (open): [WIP] Bug 1971080: daemon/drain.go: bump initial drain sleeps down to 1min (last updated 2021-06-11 23:11:29 UTC)
Red Hat Product Errata RHSA-2021:2438 (last updated 2021-07-27 23:12:57 UTC)

Description W. Trevor King 2021-06-11 21:44:55 UTC
tl;dr: The MCO's recent change to a 5m sleep between drains [4] occasionally causes drains that used to take ~1m to now take ~6m.  That can cause some alerts to fire, but the main issue in this bug is that we should realize the drain is complete with lower latency, because delays on the scale of 5m will add up as we roll nodes.

Long form:

4.7.16 -> 4.8.0-0.ci-2021-06-11-170229 AWS update [1] failed on:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h21m18s
    Jun 11 19:11:28.654: Unexpected alerts fired or pending during the upgrade:

    alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-apiserver", pod="apiserver-5966c9c56c-jn8tx", severity="warning"}

I'm not entirely sure why it says "fired" there instead of "pending".  In PromeCIeus, I don't actually see the alert firing [2].  It is shown as pending from 18:54 through 19:09.

Looking at the e2e-interval chart, one thing that stands out is that control plane nodes 44 and 164 both have slow drains, and the 164 reboot corresponds with that API-server's alert.  So:

1. MCD cordons the node
2. MCD drains off the API-server pod.
3. MCD takes a while to drain something else.
4. Drain completes.
5. MCD reboots the node.
6. MCD uncordons the node.
7. Scheduler drops the API-server pods back on.

or something close to that.  The slow drain in step 3 makes the window from step 2 to step 7 too long, and KubePodNotReady trips into firing.  So why is the drain slow?  Moving to Loki [3]:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "drain"

gives:

  I0611 19:04:44.405866  106706 drain.go:120] Successful drain took 394.758357545 seconds

And:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "openshift-kube-controller-manager/revision-pruner"

gives:

I0611 18:58:11.465867  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465931  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465948  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465966  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
I0611 18:58:11.466004  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
I0611 18:58:33.702383  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
I0611 18:58:34.500684  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 18:58:34.901070  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
I0611 18:58:35.300499  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
I0611 18:58:36.100496  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
I0611 19:04:44.356203  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 19:04:44.398227  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal

So a whole bunch of chatter around 18:58, then a gap, and then it mentions the revision-pruner-11 pod again ~6m later.  Moving to the node/kubelet logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
  Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
  Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521    1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
  Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394    1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
  Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454    1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

So, expanding the timeline and guessing to fill some gaps:

1. MCD cordons the node
2. MCD drains off the API-server pod.
3. MCD asks to drain revision-pruner-11-...
4. Kubelet/CRI-O takes a while, maybe because we were deleting a whole bunch of pods during the drain, and the MCD times out waiting on the kubelet.
5. Kubelet tells MCD, "yeah, still working on revision-pruner-11-...".
6. MCD asks again.  Kubelet tells MCD, "uhh, yeah, no pods like that around here anymore".
7. Drain completes.
8. MCD reboots the node.
9. MCD uncordons the node.
10. Scheduler drops the API-server pods back on.

The recent change in this space is the MCD's 5m sleep between drain attempts [4]; we probably need to at least start with something shorter.  It would also be nice if the kubelet/CRI-O didn't lose these pods, but that can be a separate bug.
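
For illustration, a rough Go sketch of the kind of graduated backoff this is asking for.  The function name, structure, and the retry threshold are hypothetical, not the actual drain.go code; the only grounded numbers are the 1m initial sleep from the linked PR title and the existing 5m sleep from [4]:

package main

import (
	"fmt"
	"time"
)

// retrySleep is a hypothetical sketch, not the actual drain.go implementation:
// sleep briefly between the first few failed drain attempts, and only fall
// back to the existing 5m sleep once the drain keeps failing.
func retrySleep(attempt int) time.Duration {
	const (
		shortSleep   = 1 * time.Minute // per the PR title: "bump initial drain sleeps down to 1min"
		longSleep    = 5 * time.Minute // the existing sleep from [4]
		shortRetries = 5               // threshold chosen for illustration only
	)
	if attempt <= shortRetries {
		return shortSleep
	}
	return longSleep
}

func main() {
	for attempt := 1; attempt <= 8; attempt++ {
		fmt.Printf("after failed drain attempt %d, sleep %s before retrying\n",
			attempt, retrySleep(attempt))
	}
}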

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[3]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1403398609562505216%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-bjpgv%5C%22%20%7C%3D%20%5C%22drain%5C%22%22%7D%5D
[4]: https://github.com/openshift/machine-config-operator/pull/2597/files#diff-58db213cf6f6259a621f8d3421b5e8af55ed71c9f52c15b255f66f0b1e78a946R60

Comment 1 W. Trevor King 2021-06-11 21:59:03 UTC
(In reply to W. Trevor King from comment #0)
> 19:04:44.411454    1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

This is apparently NOTABUG.  I dunno if I agree, but bug 1948920 is the place to discuss.

Comment 4 Michael Nguyen 2021-06-14 16:46:32 UTC
Verified on 4.8.0-0.nightly-2021-06-13-101614.  Created a PDB and started a pod covered by it, then initiated a drain.  The interval between drain attempts went from roughly 2.5 minutes to roughly 6.5 minutes after 5 retries.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-13-101614   True        False         97m     Cluster version is 4.8.0-0.nightly-2021-06-13-101614
$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
mnguyen061410-7m2fq-master-0       Ready    master   120m   v1.21.0-rc.0+120883f
mnguyen061410-7m2fq-master-1       Ready    master   118m   v1.21.0-rc.0+120883f
mnguyen061410-7m2fq-master-2       Ready    master   120m   v1.21.0-rc.0+120883f
mnguyen061410-7m2fq-worker-nsf26   Ready    worker   112m   v1.21.0-rc.0+120883f
mnguyen061410-7m2fq-worker-qrhvq   Ready    worker   112m   v1.21.0-rc.0+120883f
mnguyen061410-7m2fq-worker-s4jjd   Ready    worker   112m   v1.21.0-rc.0+120883f
$ cat pdb.yaml 
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
$ oc create -f pdb.yaml 
poddisruptionbudget.policy/dontevict created
$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     3s
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "mnguyen061410-7m2fq-worker-nsf26"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created
$ oc get pods
NAME                  READY   STATUS    RESTARTS   AGE
dont-evict-this-pod   1/1     Running   0          15s

$ cat file-ig3.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-file
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
        filesystem: root
        mode: 0644
        path: /etc/test

$ oc create -f file-ig3.yaml 
machineconfig.machineconfiguration.openshift.io/test-file created

$ oc get pods -A --field-selector spec.nodeName=mnguyen061410-7m2fq-worker-nsf26 | grep machine-config-daemon
openshift-machine-config-operator        machine-config-daemon-88wxz                   2/2     Running   0          113m

== wait a long time ==

$ oc -n openshift-machine-config-operator logs  machine-config-daemon-88wxz -c machine-config-daemon | grep 'Draining failed with'
I0614 16:13:03.399452    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:15:34.092940    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:18:04.734727    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:20:35.402593    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:23:06.035815    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:25:36.665004    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:32:07.296922    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0614 16:38:37.957136    2751 drain.go:67] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
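
For reference, the spacing of those retry lines matches the 1m30s drain timeout plus the sleep between attempts: about 2m30s while the initial short sleep is in effect, and about 6m30s once the 5m sleep kicks in.  A quick sanity check of that arithmetic (durations are taken from the logs above and the linked PR; the snippet is illustrative only):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Durations from the logs above ("global timeout reached: 1m30s") and the
	// linked PR (1m initial sleep, 5m sleep after the early retries).
	drainTimeout := 90 * time.Second
	shortSleep := 1 * time.Minute
	longSleep := 5 * time.Minute

	// Expected gap between consecutive "Draining failed" log lines.
	fmt.Println("early retries:", drainTimeout+shortSleep) // ~2m30s, e.g. 16:13:03 -> 16:15:34
	fmt.Println("later retries:", drainTimeout+longSleep)  // ~6m30s, e.g. 16:25:36 -> 16:32:07
}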

Comment 7 errata-xmlrpc 2021-07-27 23:12:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

