The upgrade of a cluster in https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4/285 times out with:

    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is still updating

Diving into the root cause, this is what I found:

- the openshift-apiserver-operator updated its openshift-apiserver/apiserver deployment at 05:24:07
- the pod openshift-apiserver/apiserver-5b9677479-z84j7 on node ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal is marked with deletionTimestamp "2020-04-21T05:24:56Z"
- 10 minutes later the deployment reports that the new ReplicaSet apiserver-7c9bf658b has stopped progressing:

```
  lastTransitionTime: "2020-04-21T05:34:04Z"
  lastUpdateTime: "2020-04-21T05:34:04Z"
  message: ReplicaSet "apiserver-7c9bf658b" has timed out progressing.
  reason: ProgressDeadlineExceeded
  status: "False"
  type: Progressing
```

- the pod openshift-apiserver/apiserver-5b9677479-z84j7 has not terminated:

```
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-21T05:09:57Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-04-21T05:11:51Z"
    message: 'containers with unready status: [openshift-apiserver]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-04-21T05:11:51Z"
    message: 'containers with unready status: [openshift-apiserver]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-04-21T05:00:36Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://b5775923b72e47cd18fa007edcdada6d9088dfd0ca9b500ef2977a2aeacff89c
    image: registry.svc.ci.openshift.org/ocp/4.4-2020-04-17-081422@sha256:6f08359ab47e36269a7247a545ac2ef36999782a77108abba90ed03f735a0bcd
    imageID: registry.svc.ci.openshift.org/ocp/4.4-2020-04-17-081422@sha256:6f08359ab47e36269a7247a545ac2ef36999782a77108abba90ed03f735a0bcd
    lastState: {}
    name: openshift-apiserver
    ready: false
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2020-04-21T05:09:58Z"
```

- the new pod openshift-apiserver/apiserver-7c9bf658b-nvzpx cannot be scheduled due to anti-affinity with the non-terminated pod:

```
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-21T05:24:03Z"
    message: '0/6 nodes are available: 3 node(s) didn''t match node selector, 3 node(s)
      didn''t match pod affinity/anti-affinity.'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
```
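For anyone re-checking this from the CI artifacts or a live cluster, here is a minimal sketch of the queries behind the bullets above. The resource names are the ones from this particular run, and the jsonpath expressions are illustrative rather than taken from any tooling we ship:

```
# Minimal sketch of the checks behind the observations above; resource names
# are from this CI run, jsonpath expressions are illustrative.

# Deployment Progressing condition shows ProgressDeadlineExceeded:
oc -n openshift-apiserver get deployment apiserver \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

# Old pod is marked for deletion but never goes away:
oc -n openshift-apiserver get pod apiserver-5b9677479-z84j7 \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}'

# The anti-affinity rule that keeps two apiserver pods off the same master:
oc -n openshift-apiserver get deployment apiserver \
  -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}{"\n"}'

# The new pod's FailedScheduling events (events are at the end of describe output):
oc -n openshift-apiserver describe pod apiserver-7c9bf658b-nvzpx | tail -n 20
```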
I double-checked another test from https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.4: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4/280

Here ingress, storage, and openshift-controller-manager are not finishing their upgrades. I checked the latter:

- the openshift-controller-manager/controller-manager DaemonSet does not finish a rollout started around 23:08:45
- the pod controller-manager-b5nlc starts deletion at 2020-04-19T23:08:45Z:

```
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2020-04-19T22:51:19Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2020-04-19T23:08:46Z"
    message: 'containers with unready status: [controller-manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2020-04-19T23:08:46Z"
    message: 'containers with unready status: [controller-manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2020-04-19T22:51:11Z"
    status: "True"
    type: PodScheduled
```

- it never terminates
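The same kind of check works for the DaemonSet case; a rough sketch, again using the names from this run:

```
# Rough sketch for the DaemonSet case; names are from this run.

# Rollout that never completes:
oc -n openshift-controller-manager rollout status daemonset/controller-manager --timeout=60s

# Pod has a deletionTimestamp set but is still reported as Running:
oc -n openshift-controller-manager get pod controller-manager-b5nlc \
  -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.status.phase}{"\n"}'
```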
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the UpgradeBlocker flag has been added to this bug. It will be removed if the assessment indicates that this should not block upgrade edges. The expectation is that the assignee answers these questions.

Who is impacted?
- Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
- All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?
- Up to 2 minute disruption in edge routing
- Up to 90 seconds of API downtime
- etcd loses quorum and you have to restore from backup

How involved is remediation?
- Issue resolves itself after five minutes
- Admin uses oc to fix things
- Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression?
- No, it's always been like this, we just never noticed
- Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Could this be related to BZ1825946?
To be clear, this looks like the root cause breaking 90% of gcp 4.4 upgrades in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-origin-installer-e2e-gcp-upgrade-4.4.
We see i/o timeouts on GCP, which is the first thing that leads us into this state. We have opened a bug for early context cancellation of CNI calls (https://bugzilla.redhat.com/show_bug.cgi?id=1826075), but we still need an explanation for:

```
238090:Apr 21 05:10:19.741093 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal crio[1325]: 2020-04-21T05:10:19Z [error] SetNetworkStatus: failed to query the pod apiserver-5b9677479-z84j7 in out of cluster comm: Get https://api-int.ci-op-igpf6bw7-86ace.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-apiserver/pods/apiserver-5b9677479-z84j7: dial tcp 10.0.0.2:6443: i/o timeout
238091:Apr 21 05:10:19.741706 ci-op-prmz6-m-2.c.openshift-gce-devel-ci.internal crio[1325]: 2020-04-21T05:10:19Z [error] Multus: error unsetting the networks status: SetNetworkStatus: failed to query the pod apiserver-5b9677479-z84j7 in out of cluster comm: Get https://api-int.ci-op-igpf6bw7-86ace.origin-ci-int-gce.dev.openshift.com:6443/api/v1/namespaces/openshift-apiserver/pods/apiserver-5b9677479-z84j7: dial tcp 10.0.0.2:6443: i/o timeout
```

https://search.svc.ci.openshift.org/?search=dial+tcp+10.0.0.2%3A6443%3A+i%2Fo+timeout&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
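If anyone wants to look for the same signature in other runs or on a live node, a rough sketch (the masters-journal file name is an assumption about how this job lays out its gathered journals; adjust for your artifacts):

```
# Find the signature in a gathered node journal (file name is an assumption
# about the artifacts layout of this job):
grep -n 'dial tcp 10.0.0.2:6443: i/o timeout' masters-journal | head

# Or directly on a live node:
journalctl -u crio | grep 'SetNetworkStatus: failed to query the pod'
```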
*** Bug 1825946 has been marked as a duplicate of this bug. ***
We met earlier today and believe that fixing 1 below should fix the upgrade issue. 2 will prevent long i/o timeouts or other long API responses from getting us into such a state, and 3 is about the kubelet handling this state better when we do get into it. 2 may be fine for 4.4.z, and 3 may be fine for 4.5.0 with an eventual backport to 4.4.z. We will first have to see upgrade runs after 1 lands.

1. https://bugzilla.redhat.com/show_bug.cgi?id=1802534
2. https://bugzilla.redhat.com/show_bug.cgi?id=1826075
3. https://bugzilla.redhat.com/show_bug.cgi?id=1826498

I think we should continue using this bug as an overall tracker for GCP upgrade issues and create other bugs as needed.
Who is impacted?
- 90% of GCP 4.4 upgrade jobs in CI.

What is the impact?
- A node ends up in a bad state from which it doesn't recover automatically.

How involved is remediation?
- An admin will have to SSH to the affected node and run some commands to remove the pods/containers stuck in this state (a hedged sketch of what that might look like is below).

Is this a regression?
- See this comment from Stefan: https://bugzilla.redhat.com/show_bug.cgi?id=1822603#c12
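To make the remediation answer concrete, here is an illustrative sketch of the kind of cleanup an admin would run. This is not a validated support procedure for this bug, just the general shape of it; pod names and sandbox IDs obviously differ per cluster:

```
# Illustrative only - not a validated procedure; names/IDs differ per cluster.

# From a host with cluster-admin access: force-remove the pod object that is
# stuck terminating.
oc -n openshift-apiserver delete pod apiserver-5b9677479-z84j7 --grace-period=0 --force

# On the affected node (via SSH): find and remove the leftover pod sandbox.
sudo crictl ps -a | grep openshift-apiserver
sudo crictl stopp <pod-sandbox-id>
sudo crictl rmp <pod-sandbox-id>
```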
> Who is impacted?
> 90% of GCP 4.4 upgrade jobs in CI.

This doesn't tell us about the customer impact, which is what affects whether we pull edges from Cincinnati or not. Is this really 90% of 4.4 GCP clusters, or do we have some reason to think this is specific to CI and that customers will not be impacted?

> What is the impact?
> A node ends up in a bad state from which it doesn't recover automatically.

What is the impact of the bad state? Is it just blocking drain? Is it taking a node completely out of the scheduling pool? Is it leaving the node in the scheduling pool, but freezing some fraction of the pods scheduled there? Could this take down multiple control-plane machines and cause my etcd to lose quorum? If I were a customer and this bit my cluster, how afraid would I need to be?
> What is the impact of the bad state?

What I saw in CI would mean that the upgrade is blocked. Operators try to terminate the old-version pods and start new ones (the ones from the upgrade), and the new ones never launch, so the upgrade never finishes without manual intervention. I don't think this state means the cluster is unavailable, but it is at least degraded.
Also, it appears that we don't have a GCP 4.3-to-4.4 upgrade test, so we don't have data on whether this blocks that upgrade.
4.3.15 -> 4.4.0-rc.11 update job [1] (found via [2,3]). At least that job is green, but one job is not great statistics.

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/661
[2]: https://openshift-release.svc.ci.openshift.org/#4-stable
[3]: https://openshift-release.svc.ci.openshift.org/releasestream/4-stable/release/4.4.0-rc.11#upgrades-from
Checking via CI search over the past 14d (CI search history might not actually be that deep, depending on pod restarts) [1]:

    release-openshift-origin-installer-e2e-gcp-upgrade-4.4 - 52 runs, 77% failed, 10% of failures match

That's a lot better than 90%. Do we expect this to manifest differently than:

```
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:137]: during upgrade to registry.svc.ci.openshift.org/ocp/release:4.4.0-0.ci-2020-04-20-135811
Unexpected error:
    <*errors.errorString | 0xc001945ea0>: {
        s: "Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is still updating",
    }
    Cluster did not complete upgrade: timed out waiting for the condition: Cluster operator openshift-apiserver is still updating
occurred

failed: (1h18m40s) 2020-04-21T06:36:09 "[Disruptive] Cluster upgrade [Top Level] [Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Serial] [Suite:openshift]"
```

which is how the originally-linked [2] died?

[1]: https://search.svc.ci.openshift.org/?search=Cluster+did+not+complete+upgrade.*Cluster+operator+openshift-apiserver+is+still+updating&maxAge=336h&context=1&type=bug%2Bjunit&name=release-openshift-origin-installer-e2e-gcp-upgrade-4.4&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4/285
Out of band, Stefan said he'd also seen openshift-controller-manager, marketplace, and other operators time out from this same root cause, so maybe that explains my 10% vs. your 90%. You could extend the "timed out waiting for the condition..." search to include those other operators, or all operators, but then you'd probably pull in a few other failure modes that also end up surfacing as operator timeouts. So this is just one of the things that is not getting bubbled up into a place CI search indexes, and it's hard to put good numbers on those.
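For the record, a broadened query along those lines could look like the sketch below. The regex is illustrative (it assumes the other operators produce the same "is still updating" message), and the host and query parameters are reused from the CI search links earlier in this bug:

```
# Illustrative broadened CI-search query; the regex and parameters are an
# assumption based on the search links earlier in this bug.
curl -sG 'https://search.svc.ci.openshift.org/' \
  --data-urlencode 'search=Cluster did not complete upgrade.*Cluster operator (openshift-apiserver|openshift-controller-manager|ingress|storage|marketplace) is still updating' \
  --data-urlencode 'maxAge=336h' \
  --data-urlencode 'type=bug+junit' \
  --data-urlencode 'name=release-openshift-origin-installer-e2e-gcp-upgrade-4.4' \
  --data-urlencode 'groupBy=job'
```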
This bug has been serving as a tracking bug. The ones that needed to be resolved in 4.4.0 have merged, and the others are set to 4.4.z/4.5 as appropriate. Closing this one out.
*** Bug 1828031 has been marked as a duplicate of this bug. ***
Reopening due to https://bugzilla.redhat.com/show_bug.cgi?id=1828031.
I was looking at https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade/661, where in masters-journal:

There were 25 log entries indicating an addition attempt for an apiserver pod:

```
Apr 23 04:24:51.451805 ci-op-x5dhs-m-2.c.openshift-gce-devel-ci.internal hyperkube[1426]: I0423 04:24:51.451775 1426 kubelet.go:1913] SyncLoop (ADD, "api"): "apiserver-74fcb69cb5-p5qkp_openshift-apiserver(aa693864-82e3-4542-bade-b5ae9026c607)"
```

There were 25 log entries indicating that no sandbox could be found for an apiserver pod:

```
Apr 23 04:24:52.107027 ci-op-x5dhs-m-2.c.openshift-gce-devel-ci.internal hyperkube[1426]: I0423 04:24:52.107040 1426 kuberuntime_manager.go:424] No sandbox for pod "apiserver-74fcb69cb5-p5qkp_openshift-apiserver(aa693864-82e3-4542-bade-b5ae9026c607)" can be found. Need to start a new one
```
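For reference, the counts above can be reproduced with something like the following (the masters-journal file name is an assumption about how this job's artifacts are laid out):

```
# Reproduce the counts above from the gathered journal; the file name is an
# assumption about the artifacts layout.
grep 'SyncLoop (ADD, "api")' masters-journal | grep -c 'openshift-apiserver'
grep 'No sandbox for pod' masters-journal | grep -c 'openshift-apiserver'
```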
Two out of the 25 attempts encountered a CNI error of the following form:

```
Apr 23 03:27:53.309901 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: E0423 03:27:53.309301 1390 pod_workers.go:191] Error syncing pod 66360513-0108-40e3-92b1-d14a82561539 ("apiserver-d2z92_openshift-apiserver(66360513-0108-40e3-92b1-d14a82561539)"), skipping: failed to "CreatePodSandbox" for "apiserver-d2z92_openshift-apiserver(66360513-0108-40e3-92b1-d14a82561539)" with CreatePodSandboxError: "CreatePodSandbox for pod \"apiserver-d2z92_openshift-apiserver(66360513-0108-40e3-92b1-d14a82561539)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-d2z92_openshift-apiserver_66360513-0108-40e3-92b1-d14a82561539_0(515be0aefea2ba1c31f5cca44c84a76e22df4bbc6506a7db472dd02d02fabf16): Multus: [openshift-apiserver/apiserver-d2z92]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking DelegateAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'pods \"apiserver-d2z92\" not found\n'"

Apr 23 03:27:53.311157 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:27:53.310891 1390 event.go:255] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-apiserver", Name:"apiserver-d2z92", UID:"66360513-0108-40e3-92b1-d14a82561539", APIVersion:"v1", ResourceVersion:"6565", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-d2z92_openshift-apiserver_66360513-0108-40e3-92b1-d14a82561539_0(515be0aefea2ba1c31f5cca44c84a76e22df4bbc6506a7db472dd02d02fabf16): Multus: [openshift-apiserver/apiserver-d2z92]: error adding container to network "openshift-sdn": delegateAdd: error invoking DelegateAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'pods "apiserver-d2z92" not found

Apr 23 03:27:53.316285 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: E0423 03:27:53.316227 1390 pod_workers.go:191] Error syncing pod 4bb79676-b9f5-4cb8-ab34-fe10893b054a ("apiserver-grphr_openshift-apiserver(4bb79676-b9f5-4cb8-ab34-fe10893b054a)"), skipping: failed to "CreatePodSandbox" for "apiserver-grphr_openshift-apiserver(4bb79676-b9f5-4cb8-ab34-fe10893b054a)" with CreatePodSandboxError: "CreatePodSandbox for pod \"apiserver-grphr_openshift-apiserver(4bb79676-b9f5-4cb8-ab34-fe10893b054a)\" failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-grphr_openshift-apiserver_4bb79676-b9f5-4cb8-ab34-fe10893b054a_0(e6903ac3505e788a27f43aa59b75f77b28d9140d9d9e76eff2dc152d99b1ad63): Multus: [openshift-apiserver/apiserver-grphr]: error adding container to network \"openshift-sdn\": delegateAdd: error invoking DelegateAdd - \"openshift-sdn\": error in getting result from AddNetwork: CNI request failed with status 400: 'pods \"apiserver-grphr\" not found\n'"

Apr 23 03:27:53.316285 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:27:53.316256 1390 event.go:255] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-apiserver", Name:"apiserver-grphr", UID:"4bb79676-b9f5-4cb8-ab34-fe10893b054a", APIVersion:"v1", ResourceVersion:"7645", FieldPath:""}): type: 'Warning' reason: 'FailedCreatePodSandBox' Failed create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_apiserver-grphr_openshift-apiserver_4bb79676-b9f5-4cb8-ab34-fe10893b054a_0(e6903ac3505e788a27f43aa59b75f77b28d9140d9d9e76eff2dc152d99b1ad63): Multus: [openshift-apiserver/apiserver-grphr]: error adding container to network "openshift-sdn": delegateAdd: error invoking DelegateAdd - "openshift-sdn": error in getting result from AddNetwork: CNI request failed with status 400: 'pods "apiserver-grphr" not found
```

Checking what happened to the other 23 attempts.
For apiserver-xmnv6, here are some related log lines:

```
Apr 23 03:27:55.949836 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:27:55.949780 1390 event.go:255] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-apiserver", Name:"apiserver-xmnv6", UID:"f7ad4c71-51a9-408b-951b-de802c6be3ee", APIVersion:"v1", ResourceVersion:"7941", FieldPath:"spec.initContainers{fix-audit-permissions}"}): type: 'Normal' reason: 'Created' Created container fix-audit-permissions

Apr 23 03:27:56.065769 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal systemd[1]: crio-fa9cb99862cb516d0f3ffcfec12ab780699748727c488ad2e3b1a7258ed57678.scope: Consumed 57ms CPU time

Apr 23 03:28:10.993252 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:28:10.993142 1390 status_manager.go:575] Status for pod "apiserver-xmnv6_openshift-apiserver(f7ad4c71-51a9-408b-951b-de802c6be3ee)" updated successfully: (4, {Phase:Running Conditions:[{Type:Initialized Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 03:27:56 +0000 UTC Reason: Message:} {Type:Ready Status:True LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2020-04-23 03:28:10 +0000 UTC Reason: Message:} {Type:ContainersReady Status:True

Apr 23 03:28:18.843211 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:28:18.843169 1390 kubelet.go:1924] SyncLoop (DELETE, "api"): "apiserver-xmnv6_openshift-apiserver(f7ad4c71-51a9-408b-951b-de802c6be3ee)"

Apr 23 03:28:20.934208 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:28:20.934069 1390 event.go:255] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-apiserver", Name:"apiserver-xmnv6", UID:"f7ad4c71-51a9-408b-951b-de802c6be3ee", APIVersion:"v1", ResourceVersion:"7941", FieldPath:"spec.containers{openshift-apiserver}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed: Get https://10.129.0.15:8443/healthz: dial tcp 10.129.0.15:8443: connect: connection refused

Apr 23 03:28:21.423394 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal crio[1285]: 2020-04-23T03:28:21Z [verbose] Del: openshift-apiserver:apiserver-xmnv6:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}

Apr 23 03:28:23.953907 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:28:23.951231 1390 status_manager.go:568] Patch status for pod "apiserver-xmnv6_openshift-apiserver(f7ad4c71-51a9-408b-951b-de802c6be3ee)" with "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"Initialized\"},{\"type\":\"Ready\"},{\"type\":\"ContainersReady\"},{\"type\":\"PodScheduled\"}],\"conditions\":[{\"lastTransitionTime\":\"2020-04-23T03:28:23Z\",\"message\":\"containers with unready status: [openshift-apiserver]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"Ready\"},{\"lastTransitionTime\":\"2020-04-23T03:28:23Z\",\"message\":\"containers with unready status: [openshift-apiserver]\",\"reason\":\"ContainersNotReady\",\"status\":\"False\",\"type\":\"ContainersReady\"}],\"containerStatuses\":[{\"image\":\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:773f1aab586673797d55ee7379bbecd65df97d3301720f99e2b57c87fe67419c\",\"imageID\":\"\",\"lastState\":{},\"name\":\"openshift-apiserver\",\"ready\":false,\"restartCount\":0,\"started\":false,\"state\":{\"waiting\":{\"reason\":\"PodInitializing\"}}}],\"phase\":\"Pending\"}}"

Apr 23 03:28:27.715622 ci-op-x5dhs-m-0.c.openshift-gce-devel-ci.internal hyperkube[1390]: I0423 03:28:27.712227 1390 status_manager.go:588] Pod "apiserver-xmnv6_openshift-apiserver(f7ad4c71-51a9-408b-951b-de802c6be3ee)" fully terminated and removed from etcd
```

Please note the deletion at 03:28:18.843211 - I looked back until 03:28:10.993252, when the pod was still considered running. There is not much information in between. Since log level 4 was not enabled, we may not have enough information from this log file.
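In case it helps the follow-up, a trivial sketch for pulling the full timeline of a single pod out of the gathered journal (the masters-journal file name is again an assumption about the artifacts layout):

```
# Pull every journal line that mentions this pod, by name or UID.
grep -E 'apiserver-xmnv6|f7ad4c71-51a9-408b-951b-de802c6be3ee' masters-journal
```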
Mrunal, can we drop https://bugzilla.redhat.com/show_bug.cgi?id=1822603 from the "Depends on" list, since that bug was moved to 4.4.z?
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days