Description of problem:

In MCO CI gcp-op jobs we are seeing a drain error when evicting openshift-ingress/router pods.

From the MCD logs (https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1689/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2179/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-285mk_machine-config-daemon.log):

```
I0512 15:41:18.323584 2050 daemon.go:321] evicting pod openshift-ingress/router-default-5fdf7f869b-4xr8t
I0512 15:41:38.390385 2050 update.go:173] Draining failed with: error when waiting for pod "router-default-5fdf7f869b-4xr8t" terminating: global timeout reached: 20s, retrying
E0512 15:41:58.454983 2050 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-tqjp9, openshift-dns/dns-default-wx4q8, openshift-image-registry/node-ca-xgj7k, openshift-machine-config-operator/machine-config-daemon-285mk, openshift-monitoring/node-exporter-r2mqj, openshift-multus/multus-b8j4k, openshift-sdn/ovs-6chm8, openshift-sdn/sdn-9bb72
I0512 15:41:58.459153 2050 daemon.go:321] evicting pod openshift-ingress/router-default-5fdf7f869b-4xr8t
```

When looking at the worker journal I see:

```
May 12 15:41:04.376038 ci-op-l4dwv-w-c-lb6zt.c.openshift-gce-devel-ci.internal hyperkube[1506]: I0512 15:41:04.376024 1506 prober.go:124] Readiness probe for "router-default-5fdf7f869b-4xr8t_openshift-ingress(03f4313d-f784-441b-96e9-c4faf0835398):router" failed (failure): HTTP probe failed with statuscode: 500
May 12 15:41:04.376477 ci-op-l4dwv-w-c-lb6zt.c.openshift-gce-devel-ci.internal hyperkube[1506]: I0512 15:41:04.376117 1506 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-ingress", Name:"router-default-5fdf7f869b-4xr8t", UID:"03f4313d-f784-441b-96e9-c4faf0835398", APIVersion:"v1", ResourceVersion:"36000", FieldPath:"spec.containers{router}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: HTTP probe failed with statuscode: 500
```

```
May 12 15:41:14.391468 ci-op-l4dwv-w-c-lb6zt.c.openshift-gce-devel-ci.internal hyperkube[1506]: I0512 15:41:14.391280 1506 status_manager.go:570] Patch status for pod "router-default-5fdf7f869b-4xr8t_openshift-ingress(03f4313d-f784-441b-96e9-c4faf0835398)" with "{\"metadata\":{\"uid\":\"03f4313d-f784-441b-96e9-c4faf0835398\"},\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-05-12T15:41:14Z\",\"message\":\"containers with unready status: [router]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-05-12T15:41:14Z\",\"message\":\"containers with unready status: [router]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"containerID\":\"cri-o://da19a43c44747e261a02c55828e3b3f3f4a67b8eebca44ba8eb05b1de950cd5e\",\"image\":\"registry.svc.ci.openshift.org/ci-op-0jbg37l7/stable@sha256:d776ac0bdb6265a3b11c04034845187d5f6738d30dd4e8e8625f7eb215457327\",\"imageID\":\"registry.svc.ci.openshift.org/ci-op-0jbg37l7/stable@sha256:d776ac0bdb6265a3b11c04034845187d5f6738d30dd4e8e8625f7eb215457327\",\"lastState\":{},\"name\":\"router\",\"ready\":false,\"restartCount\":0,\"started\":true,\"state\":{\"running\":{\"startedAt\":\"2020-05-12T15:26:54Z\"}}}]}}"
```

Example runs:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1689/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2179
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1724/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2165
Is there any connection to this bug? https://bugzilla.redhat.com/show_bug.cgi?id=1834989
It doesn't look the same, since that bug is mostly about terminating the pod. I also don't see "updated router deployment" anywhere in the logs above.
I'm wondering if the 20s default timeout is enough here, assuming I read this correctly as allowing 20s for a graceful eviction. Looking at https://github.com/openshift/router/pull/94, a graceful shutdown of the router can take up to 45s: when the main process receives SIGTERM or SIGINT, it waits ROUTER_GRACEFUL_SHUTDOWN_DELAY seconds (default: 45).
Is the expectation really that the pod can be evicted in 20 seconds? That definitely seems too short a timeout. The "router-default" deployment has spec.replicas: 2 and uses the rolling update strategy with maxSurge: 25% and maxUnavailable: 50%, which means the deployment controller tries to scale up a new pod before deleting the old one, and the new pod can take a couple of minutes to schedule, pull, initialize, sync routes, and start reporting ready.
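For illustration only, this is roughly how those rollout parameters look when expressed with the upstream Kubernetes Go types (k8s.io/api/apps/v1 and k8s.io/apimachinery/pkg/util/intstr). It is a minimal sketch of just the strategy fields named above, not the actual router Deployment manifest:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Rollout parameters described above: with replicas=2, maxSurge=25%
	// and maxUnavailable=50%, the deployment controller brings up a
	// replacement pod and waits for it to report ready before the old
	// router pod goes away, which can take a couple of minutes.
	maxSurge := intstr.FromString("25%")
	maxUnavailable := intstr.FromString("50%")

	strategy := appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxSurge:       &maxSurge,
			MaxUnavailable: &maxUnavailable,
		},
	}
	fmt.Printf("strategy: %+v\n", strategy)
}
```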
Based on comments #3 and #4 and the timing description in https://github.com/openshift/router/pull/94, we need more than 20s. Moving back to the MCO team.
```
I0512 15:41:38.390385 2050 update.go:173] Draining failed with: error when waiting for pod "router-default-5fdf7f869b-4xr8t" terminating: global timeout reached: 20s, retrying
```

This message *should* mean "I waited 20 seconds, and I'm going to wait longer." It should not be an error (if that is still true). But if someone is explicitly saying "wait only 20s" (which the "retrying" makes me think is not happening), then yes, this would be a bug there.
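As a rough illustration of that retry behavior, here is a minimal sketch of a drain loop built on the upstream k8s.io/kubectl/pkg/drain helper with a bounded per-attempt timeout. The drainNode function, its helper settings, and the retry interval are illustrative assumptions, not the MCD's actual implementation:

```go
package drainexample

import (
	"fmt"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode keeps retrying the drain. Each attempt uses a bounded
// per-attempt timeout, so a "global timeout reached" error is not fatal:
// the next attempt simply starts evicting and waiting again.
func drainNode(client kubernetes.Interface, nodeName string, perAttempt time.Duration) {
	helper := &drain.Helper{
		Client:              client,
		Force:               true,
		IgnoreAllDaemonSets: true, // matches the "ignoring DaemonSet-managed Pods" warning
		GracePeriodSeconds:  -1,   // use each pod's own terminationGracePeriodSeconds
		Timeout:             perAttempt,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	for attempt := 1; ; attempt++ {
		err := drain.RunNodeDrain(helper, nodeName)
		if err == nil {
			return
		}
		// Log and retry, mirroring the "Draining failed with: ..., retrying" message.
		fmt.Fprintf(os.Stderr, "Draining failed with: %v, retrying (attempt %d)\n", err, attempt)
		time.Sleep(10 * time.Second)
	}
}
```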
@Andrew So I bumped this to 45s, but I still see the router pod not getting evicted on the first try:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2310/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-nwsnn_machine-config-daemon.log

from https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2310

Is this expected?
*** Bug 1833284 has been marked as a duplicate of this bug. ***
To clarify what is happening as well: the MCO waits 20s (or 45s in my PR), hits the error, retries, and keeps retrying, basically forever.
Bumped it up to 60s and still seeing:

```
I0520 19:32:05.132252 4522 update.go:173] Draining failed with: error when waiting for pod "router-default-d8786759b-6nmcg" terminating: global timeout reached: 1m0s, retrying
```

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2315/artifacts/e2e-gcp-op/pods/openshift-machine-config-operator_machine-config-daemon-twlk5_machine-config-daemon.log
@Andrew any idea why this is happening?
Moving back to Routing for further investigation as we still see errors even with 60s.
(In reply to Kirsten Garrison from comment #11)
> @Andrew any idea why this is happening?

We still believe this will need a couple of minutes at least. Can we please try with 5min? Looking at the logs, it looks like the 60s attempt does not retry beyond that.
We do in fact retry, as stated above. I tried 5min (which is obviously too long) and it worked; it seems that it's taking ~90s. I updated the PR and will look into this further tomorrow to figure out how we'll approach this. @Andrew, do these numbers seem correct to you? Logs at 5min for your reference: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1739/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op/2330/artifacts/e2e-gcp-op/pods/..

The test is running again with 90s.
(In reply to Kirsten Garrison from comment #14)
> We do in fact retry as stated above.

Since we retry, is there really a problem here, or does the operation eventually succeed (albeit with scary log messages about timeouts)?

> Tried 5min (which is obviously too long) and it worked seems that it's
> taking ~90s, I updated PR will look into this further tomorrow to figure out
> how we'll approach this.

90 seconds is reasonable. We expect a replacement pod to report ready before we delete the old pod, and it can easily take 90 seconds to schedule the replacement pod, pull the image, and synchronize routes.
@Miciah The operation does succeed, but I wanted to verify that this timing was correct so as not to paper over any potential issue here. Sounds like it is expected, so I will update the timeout to ~90s to avoid the foreseeable errors, as 20s is too low. Thanks!
Verified with 4.5.0-0.nightly-2020-06-22-201506 and the issue has been fixed.

```
$ oc adm drain hongli-pl506-7pbxp-worker-a-zvk6p.c.openshift-qe.internal
...
$ oc -n openshift-machine-config-operator logs machine-config-daemon-n6dnt -c machine-config-daemon | grep router
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409