Bug 1843998 - node pool upgrade failed due to failure to evict pod
Summary: node pool upgrade failed due to failure to evict pod
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Aniket Bhat
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-04 14:38 UTC by Ben Parees
Modified: 2024-10-01 16:38 UTC
CC List: 19 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-27 00:12:11 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2020-06-04 14:38:00 UTC
Description of problem:
Node ip-10-0-156-130.us-east-2.compute.internal is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod \\\"router-default-d4cf759c6-tpgqp\\\": global timeout reached: 1m30s, error when evicting pod \\\"service-test-trx4c\\\": global timeout reached: 1m30s

from https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci/272/artifacts/e2e-aws-upgrade/machineconfigpools.json

in job:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci/272

Ultimately this caused:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]

to fail since the upgrade never completed:
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:138]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.5.0-0.ci-2020-05-29-203954
Unexpected error:
    <*errors.errorString | 0xc0023a6660>: {
        s: "Pools did not complete upgrade: timed out waiting for the condition",
    }
    Pools did not complete upgrade: timed out waiting for the condition
occurred



Version-Release number of selected component (if applicable):
4.4.6 upgrading to 4.5-ci

This has definitely occurred a few times lately, so it's not a one-off:

https://search.apps.build01.ci.devcluster.openshift.com/?search=Pools+did+not+complete+upgrade%3A+timed+out+waiting+for+the+condition&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Ryan Phillips 2020-06-04 19:20:21 UTC
In the worker's journal at 07:51:01, the router-default pod's liveness check fails with a header timeout exceeded.

07:51:01.089986 Starting openshift-sdn network plugin\\nI0530 07:51:00.793469  110813 ovs.go:169] Error executing ovs-ofctl: ovs-ofctl: /var/run/openvswitch/br0.mgmt: failed to open socket

07:51:01.089986 ip-10-0-156-130 hyperkube[1395]: I0530 07:51:00.952075  110813 ovs.go:169] Error executing ovs-vsctl: ovs-vsctl: cannot create a bridge named br0 because a bridge named br0 already exists 

The liveness check is failing because of this networking blip happening at the same time. Perhaps the SDN team can triage this further.

Comment 3 Aniket Bhat 2020-06-08 20:08:34 UTC
I am not sure the liveness failure is necessarily related. When the pods are being evicted anyway, why does the liveness check matter at all?

Comment 5 Lalatendu Mohanty 2020-06-17 13:49:45 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
  Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
  All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time
What is the impact?
  Up to 2 minute disruption in edge routing
  Up to 90 seconds of API downtime
  etcd loses quorum and you have to restore from backup
How involved is remediation?
  Issue resolves itself after five minutes
  Admin uses oc to fix things
  Admin must SSH to hosts, restore from backups, or other non standard admin activities
Is this a regression?
  No, it's always been like this; we just never noticed
  Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Comment 6 Scott Dodson 2020-06-23 19:11:06 UTC
Can we get an update on this bug, in particular the latest assessment of the presumed impact here and whether or not this is a significant negative contributor to upgrade success?

Comment 8 Antonio Murdaca 2020-06-23 23:18:14 UTC
> In any case, this does not look like a networking issue if the pod eviction is failing. 

That may be right, but I'm going to bounce this back again, as I think the router deployment or kube itself is misbehaving somehow. All the MCD does is retry the drain, so unless the drain logic itself is flawed this doesn't seem related to the MCO.
I'm not sure who owns the "router" pod, but I'm bringing this back to Networking to maybe find a better owner.

I'm pretty confident that whoever owns the "service-test-<hash>" pod faces the very same issue as the router pod (from the MCD logs on that node):

E0623 19:04:13.872536  123930 daemon.go:336] error when evicting pod "service-test-plr2j" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

The only thing we can see in the MCD's logs is:

E0623 18:28:30.614854  123930 daemon.go:336] error when evicting pod "router-default-97bfd9685-dc5ms" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

over and over, example job's log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1860/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1275471709129936896/artifacts/e2e-gcp-upgrade/

The worker machine-config-pools show that there are 3 nodes (https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1860/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1275471709129936896/artifacts/e2e-gcp-upgrade/machineconfigpools.json):

degradedMachineCount: 1,
machineCount: 3,
observedGeneration: 3,
readyMachineCount: 0,
unavailableMachineCount: 1,
updatedMachineCount: 0

The worker nodes look healthy as well https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/1860/pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade/1275471709129936896/artifacts/e2e-gcp-upgrade/nodes.json

So the MCO is doing its job by continuing to drain: we've always been told to retry draining over and over to avoid the disruption of hard reboots, which can cause data loss. The MCD and the drain logic are adhering to the PDBs, and we will never hard reboot in that situation.
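For reference, the MCD's drain here boils down to posting Eviction objects and retrying while the apiserver rejects them for PDB reasons. A minimal client-go sketch of that loop (not the MCD's actual code; the package/function names are illustrative, and the 429/5s handling just mirrors the log messages above):

package mcdsketch

import (
    "context"
    "fmt"
    "time"

    policyv1beta1 "k8s.io/api/policy/v1beta1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// evictWithRetry posts an Eviction for the pod and retries every 5 seconds while
// the apiserver answers 429 ("Cannot evict pod as it would violate the pod's
// disruption budget"), which is the message seen repeatedly in the MCD log above.
func evictWithRetry(ctx context.Context, cs kubernetes.Interface, ns, pod string) error {
    for {
        err := cs.CoreV1().Pods(ns).Evict(ctx, &policyv1beta1.Eviction{
            ObjectMeta: metav1.ObjectMeta{Name: pod, Namespace: ns},
        })
        switch {
        case err == nil || apierrors.IsNotFound(err):
            return nil // evicted, or already gone
        case apierrors.IsTooManyRequests(err):
            // A PDB currently blocks the eviction; keep retrying rather than hard-rebooting.
            fmt.Printf("error when evicting pod %q (will retry after 5s): %v\n", pod, err)
        default:
            return err
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(5 * time.Second):
        }
    }
}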

It's clear that there are 2 other Schedulable nodes the router pod could be rescheduled to (though I'm not the expert there), so why is the router failing to be evicted and/or rescheduled?

I'll move this back to Networking: what you see in the MCD is a symptom of something that doesn't look related to the MCD; we're just loud enough about it to make it noticeable.

Comment 9 Ben Bennett 2020-06-24 12:20:24 UTC
Net Edge -- Can you guys take a look and see if we can work out why the pod does not start on the other nodes?  Thanks.

Comment 12 Stephen Greene 2020-06-24 18:16:17 UTC
Found a CI run that matches this failure where a router-default pod is safely evicted (new pod is properly scheduled), but a service-test pod is not, causing the test to fail.

MCD Logs:

https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25185/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1275783093302595584/artifacts/e2e-gcp-upgrade/pods/openshift-machine-config-operator_machine-config-daemon-2hjhc_machine-config-daemon.log

I0624 14:55:01.141776   73486 daemon.go:322] Evicted pod openshift-ingress/router-default-67cd8b5d68-4r2nh


While a majority of the recent CI failures pertaining to this BZ show problems with evicting/scheduling router pods, this run does not, so it may be of interest.

Comment 13 Daneyon Hansen 2020-06-24 18:44:34 UTC
I'm working with Antonio to understand if/how the MCO respects the router's `terminationGracePeriodSeconds`, which was set to 60 minutes by [1]. This may be a regression introduced by [2], but I'm still verifying.

[1] https://github.com/openshift/cluster-ingress-operator/pull/366
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1809665

Comment 14 Michael Gugino 2020-06-24 21:19:36 UTC
This is a combination of things.

The primary problem is that the pod you're attempting to drain is not ready:

router-default-97bfd9685-dc5ms

{
    "lastProbeTime": null,
    "lastTransitionTime": "2020-06-23T18:08:48Z",
    "message": "containers with unready status: [router]",
    "reason": "ContainersNotReady",
    "status": "False",
    "type": "ContainersReady"
},


...

                "containerStatuses": [
                    {
                        "containerID": "cri-o://3ec61f3696b896731b791256eb0134735e4a6f971dbb0ecfbb65c9f6ce533538",
                        "image": "registry.svc.ci.openshift.org/ci-op-vll3k6yw/stable@sha256:81bae9207a6e0061fae165873e94213b0e7b6329d8055ef814c053bd97de35a4",
                        "imageID": "registry.svc.ci.openshift.org/ci-op-vll3k6yw/stable-initial@sha256:81bae9207a6e0061fae165873e94213b0e7b6329d8055ef814c053bd97de35a4",
                        "lastState": {
                            "terminated": {
                                "containerID": "cri-o://2a4d5dbb93ae1c76dcf6fe05bb4e6b5b40b04b4222b836f1a76d0bce2fb932c8",
                                "exitCode": 0,
                                "finishedAt": "2020-06-23T18:09:56Z",
                                "reason": "Completed",
                                "startedAt": "2020-06-23T18:00:12Z"
                            }
                        },
                        "name": "router",
                        "ready": false,
                        "restartCount": 1,
                        "started": true,
                        "state": {
                            "running": {
                                "startedAt": "2020-06-23T18:09:57Z"
                            }
                        }
                    }
                ],


This means your PDB's healthy count is already at N-1.  You might think the eviction API should take this into account: we're trying to evict the very pod that's not Ready, so the PDB shouldn't block it.  Unfortunately, that is not currently the case: https://github.com/kubernetes/kubernetes/issues/80389
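To make that arithmetic concrete, here is a toy sketch (illustrative replica and minAvailable values, not the upstream disruption controller code) of why one unready pod already consumes the whole budget and the eviction of that same pod is refused:

package pdbsketch

// disruptionsAllowed mirrors the idea behind PodDisruptionBudgetStatus: only
// Ready pods count as healthy, and the budget is healthy minus the required
// minimum, floored at zero.
func disruptionsAllowed(healthyPods, minAvailable int) int {
    if allowed := healthyPods - minAvailable; allowed > 0 {
        return allowed
    }
    return 0
}

// With 2 router replicas and minAvailable: 1 (illustrative values):
//   both pods Ready:      disruptionsAllowed(2, 1) == 1 -> eviction is allowed
//   one pod goes unready: disruptionsAllowed(1, 1) == 0 -> eviction is refused,
//   even though the pod we are trying to evict is the unready one.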

I've already done some of the work upstream to make PDBs more sane here: https://github.com/kubernetes/kubernetes/pull/83906  But that PR does not cover every case, only pods that are stuck in Pending.  We should also allow eviction to succeed for any non-ready pod.  Another PR that helps solve a related issue: https://github.com/kubernetes/kubernetes/pull/91342

Once that PR merges, I plan to expand the coverage to all non-ready pods.  Unfortunately, progress upstream is a little slow.

How to fix for now:  Determine why the router pod is going notReady on the node you're attempting to drain.
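As a starting point for that triage, something along these lines lists the pods on the drained node whose Ready condition is false. This is a client-go sketch, not part of any existing tooling, and it assumes a reachable kubeconfig; the node name is just the one from the description above.

package main

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// unreadyPodsOnNode returns "namespace/name" for every pod on the given node
// whose Ready condition is not True.
func unreadyPodsOnNode(ctx context.Context, cs kubernetes.Interface, node string) ([]string, error) {
    pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
        FieldSelector: "spec.nodeName=" + node,
    })
    if err != nil {
        return nil, err
    }
    var unready []string
    for _, p := range pods.Items {
        for _, c := range p.Status.Conditions {
            if c.Type == corev1.PodReady && c.Status != corev1.ConditionTrue {
                unready = append(unready, p.Namespace+"/"+p.Name)
            }
        }
    }
    return unready, nil
}

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    cs := kubernetes.NewForConfigOrDie(cfg)
    // Substitute the node currently being drained.
    unready, err := unreadyPodsOnNode(context.Background(), cs, "ip-10-0-156-130.us-east-2.compute.internal")
    if err != nil {
        panic(err)
    }
    for _, p := range unready {
        fmt.Println(p)
    }
}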

Comment 15 Daneyon Hansen 2020-06-24 21:43:30 UTC
@mgugino thanks for assisting with this BZ.

> How to fix for now:  Determine why the router pod is going notReady on the node you're attempting to drain.

router-default-d4cf759c6-tpgqp is not ready because it's unable to reach the api server ("dial tcp 172.30.0.1:443: connect: no route to host"):


                "containerStatuses": [
                    {
                        "containerID": "cri-o://29586123399c400b60ffb2f8f2ed2ddfbd8db9c0af1d24fa29ec70d0b7cde960",
                        "image": "registry.svc.ci.openshift.org/ocp/4.5-2020-05-29-203954@sha256:5908265eb0041cea9a9ec36ad7b2bc82dd45346fc9e0f1b34b0e38a0f43f9f18",
                        "imageID": "registry.svc.ci.openshift.org/ocp/4.5-2020-05-29-203954@sha256:5908265eb0041cea9a9ec36ad7b2bc82dd45346fc9e0f1b34b0e38a0f43f9f18",
                        "lastState": {
                            "terminated": {
                                "containerID": "cri-o://22b120829babdd63453fd75369b5503490275a9ce9c50e651f123de8078cf809",
                                "exitCode": 137,
                                "finishedAt": "2020-05-30T08:52:57Z",
                                "message": ".1:443/api/v1/endpoints?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:51:18.579501       1 reflector.go:178] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:51:27.731469       1 reflector.go:178] github.com/openshift/router/pkg/router/controller/factory/factory.go:116: Failed to list *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:51:27.731469       1 reflector.go:178] github.com/openshift/router/pkg/router/controller/factory/factory.go:116: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:52:07.155469       1 reflector.go:178] github.com/openshift/router/pkg/router/controller/factory/factory.go:116: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:52:07.155488       1 reflector.go:178] github.com/openshift/router/pkg/router/template/service_lookup.go:33: Failed to list *v1.Service: Get https://172.30.0.1:443/api/v1/services?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:52:10.227538       1 reflector.go:178] github.com/openshift/router/pkg/router/controller/factory/factory.go:116: Failed to list *v1.Route: Get https://172.30.0.1:443/apis/route.openshift.io/v1/routes?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\nE0530 08:52:52.723560       1 reflector.go:178] github.com/openshift/router/pkg/router/controller/factory/factory.go:116: Failed to list *v1.Endpoints: Get https://172.30.0.1:443/api/v1/endpoints?limit=500\u0026resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host\n",

Comment 16 Daneyon Hansen 2020-06-24 21:55:54 UTC
Reassigning to the SDN team for further investigation since the router pod is unable to reach the api server. The api server and service/pod networks should remain available until everything has drained.

Comment 21 Michael Gugino 2020-09-07 14:08:50 UTC
I'm not a fan of closing this bug until we determine how and why the particular component went not ready.  If this fails for a user, it will block an upgrade, which is something we never want.  I'd rather leave this BZ open until we have that info.  I don't own this particular component, so I don't have the final say here, but my recommendation is to try to figure out why it failed.

Comment 22 Aniket Bhat 2020-09-08 18:18:42 UTC
@Michael: I think your assessment is fair. However, at this point this doesn't seem to be networking related at all. I am assigning this to the API server team to triage whether there is a bug in this area, around some race condition between namespace deletion and PDBs.

Your comment https://bugzilla.redhat.com/show_bug.cgi?id=1843998#c19 indicates that the namespace deletion was not an issue. However, I don't know how else to interpret these errors in https://storage.googleapis.com/origin-ci-test/logs/endurance-upgrade-aws-4.5/1295148420012445696/build-log.txt

Aug 17 01:42:32.148: INFO: stdout: "{\"status\":\"success\",\"data\":{\"resultType\":\"vector\",\"result\":[{\"metric\":{\"alertname\":\"MCDDrainError\",\"alertstate\":\"firing\",\"drain_time\":\"150.535547114 sec\",\"endpoint\":\"metrics\",\"err\":\"5 tries: [error when evicting pod \\\"inline-volume-tester2-htdh4\\\": pods \\\"inline-volume-tester2-htdh4\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-3638 because it is being terminated, error when evicting pod \\\"inline-volume-tester-r9wqr\\\": pods \\\"inline-volume-tester-r9wqr\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-1901 because it is being terminated, error when evicting pod \\\"inline-volume-tester-2fl4v\\\": pods \\\"inline-volume-tester-2fl4v\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-3638 because it is being terminated, error when evicting pod \\\"pod-subpath-test-dynamicpv-xqxw\\\": pods \\\"pod-subpath-test-dynamicpv-xqxw\\\" is forbidden: unable to create new content in namespace e2e-provisioning-2371 because it is being terminated]\",\"instance\":\"10.0.212.130:9001\",\"job\":\"machine-config-daemon\",\"namespace\":\"openshift-machine-config-operator\",\"pod\":\"machine-config-daemon-vl22p\",\"service\":\"machine-config-daemon\",\"severity\":\"critical\"},\"value\":[1597628552.084,\"2\"]},{\"metric\":{\"alertname\":\"MCDDrainError\",\"alertstate\":\"firing\",\"drain_time\":\"150.799667369 sec\",\"endpoint\":\"metrics\",\"err\":\"5 tries: [error when evicting pod \\\"inline-volume-tester-2fl4v\\\": pods \\\"inline-volume-tester-2fl4v\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-3638 because it is being terminated, error when evicting pod \\\"inline-volume-tester2-htdh4\\\": pods \\\"inline-volume-tester2-htdh4\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-3638 because it is being terminated, error when evicting pod \\\"pod-subpath-test-dynamicpv-xqxw\\\": pods \\\"pod-subpath-test-dynamicpv-xqxw\\\" is forbidden: unable to create new content in namespace e2e-provisioning-2371 because it is being terminated, error when evicting pod \\\"inline-volume-tester-r9wqr\\\": pods \\\"inline-volume-tester-r9wqr\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-1901 because it is being terminated]\",\"instance\":\"10.0.212.130:9001\",\"job\":\"machine-config-daemon\",\"namespace\":\"openshift-machine-config-operator\",\"pod\":\"machine-config-daemon-vl22p\",\"service\":\"machine-config-daemon\",\"severity\":\"critical\"},\"value\":[1597628552.084,\"2\"]},{\"metric\":{\"alertname\":\"MCDDrainError\",\"alertstate\":\"firing\",\"drain_time\":\"150.821525219 sec\",\"endpoint\":\"metrics\",\"err\":\"5 tries: [error when evicting pod \\\"inline-volume-tester-2fl4v\\\": pods \\\"inline-volume-tester-2fl4v\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-3638 because it is being terminated, error when evicting pod \\\"inline-volume-tester2-htdh4\\\": pods \\\"inline-volume-tester2-htdh4\\\" is forbidden: unable to create new content in namespace e2e-ephemeral-3638 because it is being terminated, error when evicting pod \\\"pod-subpath-test-dynamicpv-xqxw\\\": pods \\\"pod-subpath-test-dynamicpv-xqxw\\\" is forbidden: unable to create new content in namespace e2e-provisioning-2371 because it is being terminated, error when evicting pod \\\"inline-volume-tester-r9wqr\\\": pods \\\"inline-volume-tester-r9wqr\\\" is forbidden: unable to create new 
content in namespace e2e-ephemeral-1901 because it is being terminated]\",\"instance\":\"10.0.212.130:9001\",\"job\":\"machine-config-daemon\",\"namespace\":\"openshift-machine-config-operator\",\"pod\":\"machine-config-daemon-vl22p\",\"service\":\"machine-config-daemon\",\"severity\":\"critical\"},\"value\":[1597628552.084,\"2\"]},{\"metric\":{\"alertname


API server team - can someone PTAL?

Comment 23 Michael Gugino 2020-09-09 01:09:20 UTC
Yes, an individual eviction request will fail if a namespace has been deleted.  However, once the pod is actually gone, the drain operation for that node will complete.  If a pod hangs around indefinitely after its namespace tries to delete it, that's a bug in itself, most likely in whatever is keeping that pod around.

In previous instances (as mentioned earlier in the comments in this BZ), the pod in question was unready but was not marked deleted.  It was failing to drain due to PDBs.  We need to sort out why that pod went unready; it wasn't because a namespace was deleted.

So, it seems this BZ is now covering two distinct issues.  First, we need to figure out why this happened: https://bugzilla.redhat.com/show_bug.cgi?id=1843998#c15  "No route to host" implies the networking failed for that pod.

The other issue is a flake in the e2e caused by the pod inline-volume-tester-r9wqr (and possibly others) never going away when the owning namespace is deleted.  Maybe these pods have a really high grace period.  I'm unsure, but the owner of those pods needs to investigate.

I don't think either condition is related to the API server.

Comment 24 Stefan Schimanski 2020-09-09 09:43:18 UTC
Not being able to create something in a namespace that's marked for deletion is expected and working as intended.

I don't see anything apiserver related here. Moving back.

Comment 27 Clayton Coleman 2020-09-14 14:25:54 UTC
Regarding apiserver and drain (stefan's comment).  A drain invocation to a pod (POST pod subresource evict) should be able to be invoked during namespace shutdown, OR if we decide to disallow it, it must return a clear error that the client code can use to handle that condition.  If we already return the clear error (namespace terminating, which I fixed in most places but not all) then the drain code should handle that specific reason silently instead of reporting an error (because namespace teardown results in pods being deleted anyway).

Comment 28 Michael Gugino 2020-09-14 14:33:56 UTC
(In reply to Clayton Coleman from comment #27)
> Regarding apiserver and drain (stefan's comment).  A drain invocation to a
> pod (POST pod subresource evict) should be able to be invoked during
> namespace shutdown, OR if we decide to disallow it, it must return a clear
> error that the client code can use to handle that condition.  If we already
> return the clear error (namespace terminating, which I fixed in most places
> but not all) then the drain code should handle that specific reason silently
> instead of reporting an error (because namespace teardown results in pods
> being deleted anyway).

Drain library code has been updated upstream (merged): https://github.com/kubernetes/kubernetes/pull/93947  This will address the symptom of drain appearing to fail; in reality, drains will complete just as fast as before.
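For anyone wiring this into their own drain callers, the shape of that upstream change is roughly the following. This is a sketch, not the exact patch: an eviction rejected only because the pod's namespace is terminating can be treated as non-fatal, since the pod is being deleted anyway.

package drainsketch

import (
    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// shouldIgnoreEvictionError reports whether the apiserver refused an eviction
// only because the pod's namespace is being deleted ("unable to create new
// content in namespace ... because it is being terminated"). Such pods are
// going away on their own, so drain can skip them instead of failing.
func shouldIgnoreEvictionError(err error) bool {
    if err == nil {
        return false
    }
    // The apiserver attaches a NamespaceTerminating cause to the Forbidden status.
    return apierrors.IsForbidden(err) && apierrors.HasStatusCause(err, corev1.NamespaceTerminatingCause)
}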

This patch (needs reviews, please chime in) will allow eviction to succeed when you have 0 disruption budget left but still have enough healthy pods (e.g. when draining an already-disrupted pod): https://github.com/kubernetes/kubernetes/pull/94381

The second patch will mask the types of issues outlined here.  We still need to determine *why* these pods are going unready on otherwise healthy nodes.  The patch will fix the UX for real-world clusters, but it will make finding these kinds of issues prior to release more difficult.

