Bug 1971374 - 4.7->4.8 CI: KubePodNotReady due to MCD's 5m sleep between drain attempts
Summary: 4.7->4.8 CI: KubePodNotReady due to MCD's 5m sleep between drain attempts
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Kirsten Garrison
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1971080
Blocks:
 
Reported: 2021-06-13 21:36 UTC by Kirsten Garrison
Modified: 2021-06-29 04:20 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-06-29 04:20:27 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/machine-config-operator pull 2612 (closed): [release-4.7] Bug 1971374: daemon/drain.go: bump initial drain sleeps down to 1min (last updated 2021-07-19 22:20:59 UTC)
Red Hat Product Errata RHBA-2021:2502 (last updated 2021-06-29 04:20:35 UTC)

Description Kirsten Garrison 2021-06-13 21:36:10 UTC
This bug was initially created as a copy of Bug #1971080

I am copying this bug because: 
We need to do a manual backport of the fix into 4.7.



tl;dr: The MCO's recent change to a 5m sleep between drains [4] occasionally causes drains that used to take ~1m to now take ~6m.  That can cause some alerts to fire, but the main issue in this bug is that we should realize the drain is complete with lower latency, because delays on the scale of 5m will add up as we roll nodes.

Long form:

4.7.16 -> 4.8.0-0.ci-2021-06-11-170229 AWS update [1] failed on:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h21m18s
    Jun 11 19:11:28.654: Unexpected alerts fired or pending during the upgrade:

    alert KubePodNotReady fired for 60 seconds with labels: {namespace="openshift-apiserver", pod="apiserver-5966c9c56c-jn8tx", severity="warning"}

I'm not entirely sure why it says "fired" there instead of "pending".  In PromeCIeus, I don't actually see the alert firing [2].  It is shown as pending from 18:54 through 19:09.

Looking at the e2e-interval chart, one thing that stands out is that control plane nodes 44 and 164 both have slow drains, and the 164 reboot corresponds with that API-server's alert.  So:

1. MCD cordons the node
2. MCD drains off the API-server pod.
3. MCD takes a while to drain something else.
4. Drain completes.
5. MCD reboots the node.
6. MCD uncordons the node.
7. Scheduler drops the API-server pods back on.

or something close to that.  The slow drain in step 3 makes the step-2-to-step-7 window too long, and KubePodNotReady trips into firing.  So why is the drain slow?  Moving to Loki [3]:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "drain"

gives:

  I0611 19:04:44.405866  106706 drain.go:120] Successful drain took 394.758357545 seconds

And:

  {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216"} | unpack | pod_name="machine-config-daemon-bjpgv" |= "openshift-kube-controller-manager/revision-pruner"

gives:

I0611 18:58:11.465867  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465931  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465948  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
I0611 18:58:11.465966  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
I0611 18:58:11.466004  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
I0611 18:58:33.702383  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-6-ip-10-0-150-164.ec2.internal
I0611 18:58:34.500684  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 18:58:34.901070  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-3-ip-10-0-150-164.ec2.internal
I0611 18:58:35.300499  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal
I0611 18:58:36.100496  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-7-ip-10-0-150-164.ec2.internal
I0611 19:04:44.356203  106706 daemon.go:330] evicting pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal
I0611 19:04:44.398227  106706 daemon.go:315] Evicted pod openshift-kube-controller-manager/revision-pruner-11-ip-10-0-150-164.ec2.internal

So a whole bunch of chatter around 18:58, then a gap, and then it mentions the 11 pod again ~6m later.  Moving to the node/kubelet logs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216/artifacts/e2e-aws-upgrade/gather-extra/artifacts/nodes/ip-10-0-150-164.ec2.internal/journal | gunzip | grep '^Jun 11 \(18:5[89]\|19:0[1-4]\):.*revision-pruner-' | grep openshift-kube-controller-manager | tail -n4
  Jun 11 18:59:09.145453 ip-10-0-150-164 crio[1429]: time="2021-06-11 18:59:09.145380970Z" level=info msg="Removed container 8f7d498e1eacac92c086bf46996d8098ce072f2442137d7ef8be5271cb4e921c: openshift-kube-controller-manager/revision-pruner-5-ip-10-0-150-164.ec2.internal/pruner" id=a25faefd-9673-4fda-bd6e-20dd156a8069 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
  Jun 11 19:04:44.382551 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.382521    1465 kubelet.go:1931] SyncLoop (DELETE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
  Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411394    1465 kubelet.go:1925] SyncLoop (REMOVE, "api"): "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)"
  Jun 11 19:04:44.411420 ip-10-0-150-164 hyperkube[1465]: I0611 19:04:44.411454    1465 kubelet.go:2115] Failed to delete pod "revision-pruner-11-ip-10-0-150-164.ec2.internal_openshift-kube-controller-manager(6eae1b36-09f5-4a04-bcff-f61c40f2d92e)", err: pod not found

So expanding the timeline and guessing for some gaps:

1. MCD cordons the node
2. MCD drains off the API-server pod.
3. MCD asks to drain revision-pruner-11-...
4. Kubelet/CRI-O takes a while, maybe because we were deleting a whole bunch of pods during the drain, and something in the kubelet times out.
5. Kubelet tells MCD, "yeah, still working on revision-pruner-11-...".
6. MCD asks again.  Kubelet tells MCD, "uhh, yeah, no pods like that around here anymore".
7. Drain completes.
8. MCD reboots the node.
9. MCD uncordons the node.
10. Scheduler drops the API-server pods back on.

The recent change in this space is the MCD's 5m sleep between drain attempts [4]; we probably need to at least start with something shorter (a sketch of the idea follows the links below).  It would also be nice if the kubelet/CRI-O didn't lose track of these pods, but that can be a separate bug.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[2]: https://promecieus.dptools.openshift.org/?search=https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1403398609562505216
[3]: https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%22now-24h%22,%22now%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade%2F1403398609562505216%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D%5C%22machine-config-daemon-bjpgv%5C%22%20%7C%3D%20%5C%22drain%5C%22%22%7D%5D
[4]: https://github.com/openshift/machine-config-operator/pull/2597/files#diff-58db213cf6f6259a621f8d3421b5e8af55ed71c9f52c15b255f66f0b1e78a946R60
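
To make the proposed fix concrete, here is a minimal, hypothetical sketch of the retry shape we are after. This is NOT the actual drain.go code; drainNode and every constant below are illustrative placeholders. The idea: the first few retries sleep 1 minute, and only later retries fall back to the old 5-minute sleep, so a drain that succeeds on the second or third attempt costs ~1m of extra latency instead of ~5m.

// Illustrative only: not the MCD's drain.go. Shows short initial sleeps
// between drain attempts that back off to the old 5-minute ceiling.
package main

import (
    "errors"
    "fmt"
    "time"
)

// drainNode is a stand-in for one drain attempt (the real daemon wraps the
// kubectl drain helper); here it "succeeds" on the third try to mimic a pod
// that briefly blocks eviction and then goes away.
func drainNode(attempt int) error {
    if attempt < 3 {
        return errors.New("error when evicting pods: global timeout reached: 1m30s")
    }
    return nil
}

func drainWithBackoff() error {
    const (
        shortSleep   = 1 * time.Minute // proposed: start with short sleeps
        longSleep    = 5 * time.Minute // old fixed 5m sleep becomes the fallback
        shortRetries = 5               // early attempts that use the short sleep
        maxRetries   = 10
    )
    start := time.Now()
    for attempt := 1; attempt <= maxRetries; attempt++ {
        err := drainNode(attempt)
        if err == nil {
            fmt.Printf("Successful drain took %v\n", time.Since(start))
            return nil
        }
        fmt.Printf("Draining failed with: %v, retrying\n", err)
        if attempt <= shortRetries {
            time.Sleep(shortSleep)
        } else {
            time.Sleep(longSleep)
        }
    }
    return fmt.Errorf("failed to drain node after %d attempts", maxRetries)
}

func main() {
    if err := drainWithBackoff(); err != nil {
        fmt.Println(err)
    }
}

The change that actually landed is openshift/machine-config-operator pull 2612 ("[release-4.7] daemon/drain.go: bump initial drain sleeps down to 1min"), linked above.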

Comment 3 Michael Nguyen 2021-06-21 15:03:27 UTC
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-06-20-093308   True        False         30m     Cluster version is 4.7.0-0.nightly-2021-06-20-093308
$ cat pdb.yaml 
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: dontevict
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dontevict
$ oc create -f pdb.yaml 
poddisruptionbudget.policy/dontevict created
$ oc get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
dontevict   1               N/A               0                     4s
$ oc get node
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-143-129.us-west-2.compute.internal   Ready    master   52m   v1.20.0+87cc9a4
ip-10-0-145-198.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-162-222.us-west-2.compute.internal   Ready    worker   47m   v1.20.0+87cc9a4
ip-10-0-173-218.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-220-204.us-west-2.compute.internal   Ready    master   53m   v1.20.0+87cc9a4
ip-10-0-222-240.us-west-2.compute.internal   Ready    worker   42m   v1.20.0+87cc9a4
$ oc get node/ip-10-0-222-240.us-west-2.compute.internal -o yaml | grep hostname
    kubernetes.io/hostname: ip-10-0-222-240
          f:kubernetes.io/hostname: {}
$ oc run --restart=Never --labels app=dontevict --overrides='{ "spec": { "nodeSelector": { "kubernetes.io/hostname": "ip-10-0-222-240"} } }' --image=quay.io/prometheus/busybox dont-evict-this-pod -- sleep 4h
pod/dont-evict-this-pod created
$ oc get pod
NAME                  READY   STATUS              RESTARTS   AGE
dont-evict-this-pod   0/1     ContainerCreating   0          4s
$ cat << EOF > file-ig3.yaml
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfig
> metadata:
>   labels:
>     machineconfiguration.openshift.io/role: worker
>   name: test-file
> spec:
>   config:
>     ignition:
>       version: 3.1.0
>     storage:
>       files:
>       - contents:
>           source: data:text/plain;charset=utf;base64,c2VydmVyIGZvby5leGFtcGxlLm5ldCBtYXhkZWxheSAwLjQgb2ZmbGluZQpzZXJ2ZXIgYmFyLmV4YW1wbGUubmV0IG1heGRlbGF5IDAuNCBvZmZsaW5lCnNlcnZlciBiYXouZXhhbXBsZS5uZXQgbWF4ZGVsYXkgMC40IG9mZmxpbmUK
>         filesystem: root
>         mode: 0644
>         path: /etc/test
> EOF
$ oc create file -f file-ig3.yaml 
error: Unexpected args: [file]
See 'oc create -h' for help and examples.
$ oc create  -f file-ig3.yaml 
machineconfig.machineconfiguration.openshift.io/test-file created
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
00-worker                                          8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-master-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-container-runtime                        8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
01-worker-kubelet                                  8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-master-ssh                                                                                 3.2.0             62m
99-worker-generated-registries                     8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
99-worker-ssh                                                                                 3.2.0             62m
rendered-master-13a8e238f99c2aae13c24eac159a2db2   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
rendered-worker-96cc244920a4ac522616e39024e1d35d   8530c27d3d9b6155923d348058bc025a6a98ec3c   3.2.0             53m
test-file                                                                                     3.1.0             3s
$ oc get pods -A --field-selector spec.nodeName=ip-10-0-222-240.us-west-2.compute.internal | grep machine-config-daemon
openshift-machine-config-operator         machine-config-daemon-6gkv4                2/2     Running             0          45m



$ oc -n openshift-machine-config-operator logs -f machine-config-daemon-6gkv4 -c machine-config-daemon | grep 'Draining failed with'
I0621 14:15:24.136453    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:17:54.532931    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:20:24.923638    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:22:55.313792    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:25:25.750570    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:27:56.145450    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:34:26.537000    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:40:56.929332    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:47:27.320342    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
I0621 14:53:57.705077    1747 update.go:241] Draining failed with: error when evicting pods/"dont-evict-this-pod" -n "default": global timeout reached: 1m30s, retrying
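
As a quick cross-check (illustrative only, not part of the product), the gaps between the "Draining failed with" timestamps above match the backported behavior: the first five gaps are ~2m30s (the 1m30s drain timeout plus the new 1m sleep), and the later gaps are ~6m30s (1m30s plus the old 5m sleep). A throwaway Go snippet to compute the deltas:

package main

import (
    "fmt"
    "time"
)

func main() {
    // Timestamps copied from the "Draining failed with" lines above.
    stamps := []string{
        "14:15:24", "14:17:54", "14:20:24", "14:22:55", "14:25:25",
        "14:27:56", "14:34:26", "14:40:56", "14:47:27", "14:53:57",
    }
    var prev time.Time
    for i, s := range stamps {
        t, err := time.Parse("15:04:05", s)
        if err != nil {
            panic(err)
        }
        if i > 0 {
            // First five gaps print ~2m30s, the remaining four ~6m30s.
            fmt.Printf("attempt %d -> %d: %v\n", i, i+1, t.Sub(prev))
        }
        prev = t
    }
}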

Comment 5 errata-xmlrpc 2021-06-29 04:20:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.18 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2502

