Description of problem:

In MCO CI gcp-op jobs we are seeing a drain error when evicting openshift-ingress/router pods.

From the MCD logs (https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1689/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2179/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-285mk_machine-config-daemon.log):

```
I0512 15:41:18.323584 2050 daemon.go:321] evicting pod openshift-ingress/router-default-5fdf7f869b-4xr8t
I0512 15:41:38.390385 2050 update.go:173] Draining failed with: error when waiting for pod "router-default-5fdf7f869b-4xr8t" terminating: global timeout reached: 20s, retrying
E0512 15:41:58.454983 2050 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-tqjp9, openshift-dns/dns-default-wx4q8, openshift-image-registry/node-ca-xgj7k, openshift-machine-config-operator/machine-config-daemon-285mk, openshift-monitoring/node-exporter-r2mqj, openshift-multus/multus-b8j4k, openshift-sdn/ovs-6chm8, openshift-sdn/sdn-9bb72
I0512 15:41:58.459153 2050 daemon.go:321] evicting pod openshift-ingress/router-default-5fdf7f869b-4xr8t
```

When looking at the worker journal I see:

```
May 12 15:41:04.376038 ci-op-l4dwv-w-c-lb6zt.c.openshift-gce-devel-ci.internal hyperkube[1506]: I0512 15:41:04.376024 1506 prober.go:124] Readiness probe for "router-default-5fdf7f869b-4xr8t_openshift-ingress(03f4313d-f784-441b-96e9-c4faf0835398):router" failed (failure): HTTP probe failed with statuscode: 500
May 12 15:41:04.376477 ci-op-l4dwv-w-c-lb6zt.c.openshift-gce-devel-ci.internal hyperkube[1506]: I0512 15:41:04.376117 1506 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-ingress", Name:"router-default-5fdf7f869b-4xr8t", UID:"03f4313d-f784-441b-96e9-c4faf0835398", APIVersion:"v1", ResourceVersion:"36000", FieldPath:"spec.containers{router}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: HTTP probe failed with statuscode: 500
```

```
May 12 15:41:14.391468 ci-op-l4dwv-w-c-lb6zt.c.openshift-gce-devel-ci.internal hyperkube[1506]: I0512 15:41:14.391280 1506 status_manager.go:570] Patch status for pod "router-default-5fdf7f869b-4xr8t_openshift-ingress(03f4313d-f784-441b-96e9-c4faf0835398)" with "{\"metadata\":{\"uid\":\"03f4313d-f784-441b-96e9-c4faf0835398\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-05-12T15:41:14Z\",\"message\":\"containers with unready status: [router]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-05-12T15:41:14Z\",\"message\":\"containers with unready status: [router]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://da19a43c44747e261a02c55828e3b3f3f4a67b8eebca44ba8eb05b1de950cd5e\",\"image\":\"registry.svc.ci.openshift.org/ci-op-0jbg37l7/stable@sha256:d776ac0bdb6265a3b11c04034845187d5f6738d30dd4e8e8625f7eb215457327\",\"imageID\":\"registry.svc.ci.openshift.org/ci-op-0jbg37l7/stable@sha256:d776ac0bdb6265a3b11c04034845187d5f6738d30dd4e8e8625f7eb215457327\",\"lastState\":{},\"name\":\"router\",\"ready\":false,\"restartCount\":0,\"started\":true,\"state\":{\"running\":{\"startedAt\":\"2020-05-12T15:26:54Z\"}}}]}}"
```

Example runs:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1689/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2179
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1724/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2165
Is there any connection to this bug? https://bugzilla.redhat.com/show_bug.cgi?id=1834989
It doesn't look the same, since that bug is mostly about terminating the pod. I also don't see "updated router deployment" anywhere in the logs above.
I'm wondering if the 20s default timeout is enough here, assuming I read this correctly as allowing 20s for a graceful eviction. Looking at https://github.com/openshift/router/pull/94, a graceful shutdown of the router can take up to 45s: when the main process receives SIGTERM or SIGINT, it waits ROUTER_GRACEFUL_SHUTDOWN_DELAY seconds (default: 45).
Is the expectation really that the pod can be evicted in 20 seconds? That definitely seems too short a timeout. The "router-default" deployment has spec.replicas: 2 and uses the rolling update strategy with maxSurge: 25% and maxUnavailable: 50%, which means the deployment controller tries to scale up a new pod before deleting the old one, and the new pod can take a couple of minutes to schedule, pull, initialize, sync routes, and start reporting ready.
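For illustration only, this is roughly how those rollout parameters look when expressed with the upstream Kubernetes Go types (k8s.io/api/apps/v1 and k8s.io/apimachinery/pkg/util/intstr). It is a minimal sketch of just the strategy fields named above, not the actual router Deployment manifest:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Rollout parameters described above: with replicas=2, maxSurge=25%
	// and maxUnavailable=50%, the deployment controller brings up a
	// replacement pod and waits for it to report ready before the old
	// router pod goes away, which can take a couple of minutes.
	maxSurge := intstr.FromString("25%")
	maxUnavailable := intstr.FromString("50%")

	strategy := appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
	fmt.Printf("strategy: %+v\n", strategy)
}
```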
Based on comments #3 and #4 and the timing description in https://github.com/openshift/router/pull/94, we need more than 20s. Moving back to the MCO team.
```
I0512 15:41:38.390385 2050 update.go:173] Draining failed with: error when waiting for pod "router-default-5fdf7f869b-4xr8t" terminating: global timeout reached: 20s, retrying
```

This message *should* mean "I waited 20 seconds, and I'm going to wait longer." It should not be an error (if that is still true). But if someone is explicitly saying "wait only 20s" (which the "retrying" makes me think is not happening), then yes, this would be a bug there.
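As a rough illustration of that retry behavior, here is a minimal sketch of a drain loop built on the upstream k8s.io/kubectl/pkg/drain helper with a bounded per-attempt timeout. The drainNode function, its helper settings, and the retry interval are illustrative assumptions, not the MCD's actual implementation:

```go
package drainexample

import (
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode keeps retrying the drain. Each attempt uses a bounded
// per-attempt timeout, so a "global timeout reached" error is not fatal:
// the next attempt simply starts evicting and waiting again.
func drainNode(client kubernetes.Interface, nodeName string, perAttempt time.Duration) {
	helper := &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true, // matches the "ignoring DaemonSet-managed Pods" warning
		GracePeriodSeconds:  -1,   // use each pod's own terminationGracePeriodSeconds
		Timeout:             perAttempt,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	for attempt := 1; ; attempt++ {
		err := drain.RunNodeDrain(helper, nodeName)
		if err == nil {
			return
		}
		// Log and retry, mirroring the "Draining failed with: ..., retrying" message.
		fmt.Fprintf(os.Stderr, "Draining failed with: %v, retrying (attempt %d)\n", err, attempt)
		time.Sleep(10 * time.Second)
	}
}
```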
@Andrew So I bumped this to 45s, but I still see the router pod not getting evicted on the first try:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2310/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-nwsnn_machine-config-daemon.log

from https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2310

Is this expected?
*** Bug 1833284 has been marked as a duplicate of this bug. ***
To clarify what is happening as well: the MCO waits 20s (or 45s in my PR), hits the error, retries, and keeps retrying, basically forever.
Bumped it up to 60s and still seeing:

```
I0520 19:32:05.132252 4522 update.go:173] Draining failed with: error when waiting for pod "router-default-d8786759b-6nmcg" terminating: global timeout reached: 1m0s, retrying
```

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log
@Andrew any idea why this is happening?
Moving back to Routing for further investigation as we still see errors even with 60s.
(In reply to Kirsten Garrison from comment #11)
> @Andrew any idea why this is happening?

We still believe this will need a couple of minutes at least. Can we please try with 5min? Looking at the logs, it looks like the 60s attempt does not retry beyond that.
We do in fact retry, as stated above. I tried 5min (which is obviously too long) and it worked; it seems that it's taking ~90s. I updated the PR and will look into this further tomorrow to figure out how we'll approach this. @Andrew, do these numbers seem correct to you? Logs at 5min for your reference: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2330/artifacts/e2e-gcp-op/pods/..

The test is running again with 90s.
(In reply to Kirsten Garrison from comment #14)
> We do in fact retry as stated above.

Since we retry, is there really a problem here, or does the operation eventually succeed (albeit with scary log messages about timeouts)?

> Tried 5min (which is obviously too long) and it worked seems that it's
> taking ~90s, I updated PR will look into this further tomorrow to figure out
> how we'll approach this.

90 seconds is reasonable. We expect a replacement pod to report ready before we delete the old pod, and it can easily take 90 seconds to schedule the replacement pod, pull the image, and synchronize routes.
@Miciah The operation does succeed, but I wanted to verify that this timing was correct so as not to paper over any potential issue here. Sounds like it is expected, so I will update the timeout to ~90s to avoid the foreseeable errors, as 20s is too low. Thanks!
Verified with 4.5.0-0.nightly-2020-06-22-201506 and the issue has been fixed.

```
$ oc adm drain hongli-pl506-7pbxp-worker-a-zvk6p.c.openshift-qe.internal
...
$ oc -n openshift-machine-config-operator logs machine-config-daemon-n6dnt -c machine-config-daemon | grep router
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409