Bug 1791162

Summary:	OpenShift API stops responding to requests / is unreachable multiple times during z-upgrade
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	openshift-apiserver	Assignee:	Sam Batschelet <sbatsche>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Xingxing Xia <xxia>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.3.0	CC:	alpatel, anbhatta, aos-bugs, bparees, dmace, jkaur, joboyer, kewang, lszaszki, mfojtik, openshift-bugs-escalate, qiwan, sbatsche, scuppett, shurley, sttts, wking
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.5.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Enhancement
Doc Text:	The OpenShift API server should now remain available to clients during upgrades.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-06-05 14:45:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1845412, 1943804
Bug Blocks:

Description Clayton Coleman 2020-01-15 04:01:28 UTC

A pod on the pod network (in this case openshift-apiserver) is observed to repeatedly fail to answer API requests in a 4.3 to 4.3 upgrade (the only code change is that the upgrade test correctly fails the entire test if it detects this condition).

A key part of upgrades is that they do not disrupt user workflows - a number of other errors in these upgrade logs indicate that perhaps pod shutdown is not proceeding gracefully, or some other networking or host level condition is at play. Any evidence of workload impact is a release blocker.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14324
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14323

Both of the tests above fail AWS. 

On further investigation, this is failing even in normal 4.2 to 4.3 upgrades.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3/37

demonstrates similar (although not quite identical problems).

In 4.1 and 4.2 we see a dramatically smaller incidence (one request out of a thousand may fail, and only once, likely corresponding to other conditions).  In 4.3 and 4.4 this appears to be far more serious.

This is a release blocker for 4.3.0 GA.  Please triage and route as quickly as possible.

Comment 1 Clayton Coleman 2020-01-15 04:04:35 UTC

Example of disruption from 14324

Jan 14 23:30:49.237 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: oauth client for console does not exist and cannot be created (the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console))" to "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:49.281 I openshift-apiserver OpenShift API started failing: Get https://api.ci-op-c7wrw1i9-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jan 14 23:30:50.280 E openshift-apiserver OpenShift API is not responding to GET requests
Jan 14 23:30:50.280 - 59s   W node/ip-10-0-135-37.ec2.internal node is not ready
Jan 14 23:30:50.427 I ns/openshift-service-ca configmap/apiservice-cabundle-injector-lock de26dd07-8b25-4747-9c6c-a94818d2d42a became leader
Jan 14 23:30:50.444 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:51.353 I ns/openshift-service-catalog-controller-manager-operator configmap/svcat-controller-manager-operator-lock 726bce43-79d7-4712-9352-3c7e5a4e001e became leader
Jan 14 23:30:51.633 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:52.835 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" (2 times)
Jan 14 23:30:53.014 I ns/openshift-authentication-operator configmap/cluster-authentication-operator-lock e23c85d1-fb21-43c2-84b2-f1e2f51879e3 became leader
Jan 14 23:30:53.167 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)"
Jan 14 23:30:54.033 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to ""
Jan 14 23:30:54.390 I ns/openshift-kube-scheduler-operator configmap/openshift-cluster-kube-scheduler-operator-lock 364fc198-eee0-4937-9e32-880f79c50b7d became leader
Jan 14 23:30:54.413 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator not enough information provided, not all functionality is present
Jan 14 23:30:54.517 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7
Jan 14 23:30:54.527 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (2 times)
Jan 14 23:30:54.534 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: [configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7]"
Jan 14 23:30:54.539 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (3 times)
Jan 14 23:30:54.554 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (4 times)
Jan 14 23:30:54.592 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (5 times)
Jan 14 23:30:54.673 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (6 times)
Jan 14 23:30:54.832 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7
Jan 14 23:30:54.844 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7 (2 times)
Jan 14 23:30:54.850 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: [configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7]" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7"
Jan 14 23:30:55.164 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7 (3 times)
Jan 14 23:30:55.237 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)" (2 times)
Jan 14 23:30:55.298 I openshift-apiserver OpenShift API started responding to GET requests
Jan 14 23:30:55.669 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)" to "OperatorSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io openshift-browser-client)"

Comment 2 Clayton Coleman 2020-01-15 04:06:19 UTC

Symptoms are similar to 1791117 but without clear indication that an SDN upgrade is in progress.  May be related.  Observed on latest release-4.3 (post rc.1)

Comment 14 W. Trevor King 2020-04-07 22:50:21 UTC

Dup of bug 1809665?  Or maybe bug 1820266 (see [1])?  If not, can we update this bug to use a more-specific subject?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1818106#c10

Comment 16 W. Trevor King 2020-04-28 23:30:39 UTC

Also in this space, although not applying to AWS or GCP, is bug 1828382.

Comment 17 Stefan Schimanski 2020-05-05 07:45:57 UTC

*** Bug 1828866 has been marked as a duplicate of this bug. ***

Comment 18 Ben Parees 2020-05-05 17:30:09 UTC

This bug is believed to be the cause of test failures in: OpenShift APIs remain available

Comment 19 Michal Fojtik 2020-05-20 10:56:49 UTC

This bug is actively worked on.

Comment 20 Stefan Schimanski 2020-05-28 09:05:27 UTC

The suspicion is that this is around etcd graceful termination behaviour. The etcd team is working on this, but this is no 4.5 blocker as this behaviour is preexisting. Moving to 4.6, with a possible backport later.

Comment 21 Ben Parees 2020-06-03 17:39:59 UTC

We are seeing substantially more failures in this space in 4.5 than we did in 4.4.  I am moving this back to 4.5 as i think it is a 4.5 upgrade blocker.

see:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 22 Ben Parees 2020-06-03 17:46:39 UTC

putting the same comment here as i put in https://bugzilla.redhat.com/show_bug.cgi?id=1801885 (perhaps the two bugs should be combined as the cite the same apiserver not responding to GET message):

I question whether there is no regression here, our upgrades are failing more frequently in 4.5 than they were in 4.4, specifically with the "OpenShift API is not responding to GET requests" error:

4.5:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job

Across 33 runs and 6 jobs (75.76% failed), matched 80.00% of failing runs and 66.67% of jobs 

4.4:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.4.*&maxMatches=5&maxBytes=20971520&groupBy=job

Across 19 runs and 7 jobs (63.16% failed), matched 58.33% of failing runs and 57.14% of jobs 

Moving this back to 4.5 for reassessment.  If you can point to a different bug that explains why our upgrade test pass rate has gone from 60% in 4.4 to 42% in 4.5, then i can understand deferring this, but something has regressed.

Comment 23 Ben Parees 2020-06-03 18:00:26 UTC

To disambiguate https://bugzilla.redhat.com/show_bug.cgi?id=1801885 and https://bugzilla.redhat.com/show_bug.cgi?id=1791162, both of which are failures in the same "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" test, they have distinct failure modes/messages


https://bugzilla.redhat.com/show_bug.cgi?id=1801885 is for failures reported as:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  2 04:14:53.680: API was unreachable during disruption for at least 8m13s of 54m30s (15%!)(MISSING):


https://bugzilla.redhat.com/show_bug.cgi?id=1791162 is for:
Jun 02 04:16:51.466 - 194s  E openshift-apiserver OpenShift API is not responding to GET requests

Comment 24 Ben Parees 2020-06-03 18:04:14 UTC

recent example:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/58

Comment 25 Ben Parees 2020-06-03 18:13:14 UTC

(to be clear, i understand we expect to have some disruption in 4.5 and will be addressing that in 4.6, but this test is failing because we are exceeding the allowed amount of disruption)

Comment 26 Lukasz Szaszkiewicz 2020-06-04 13:22:29 UTC

It looks like the same upgrade job is in much better condition on AWS - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci?buildId=
For GCP we opened https://github.com/openshift/machine-config-operator/pull/1780 as we think it might be the issue.

Comment 27 Stefan Schimanski 2020-06-05 14:45:31 UTC

With https://bugzilla.redhat.com/show_bug.cgi?id=1844387 for ovirt, openstack, vsphere, bm platforms and the following:

- Azure IPI: https://bugzilla.redhat.com/show_bug.cgi?id=1828382
- Azure UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836016
- AWS UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836018
- vSphere UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836017

we have a number of platform specific BZs open which address exactly these problems. This BZ here on the other hand is not actionable. Please be precise in which conditions the issues appear, do analysis of the data beforehand in order to make these BZs actionable. We have platforms which are perfectly fine. So the chance is very high that the ugprade issues have root causes in the different deployments of different platforms.

Comment 28 W. Trevor King 2020-06-05 21:54:05 UTC

All of the fixes listed in comment 27 are around /readyz, but installer-provisioned AWS LBs have been using /readyz for ages.  The initial examples from comment 0 here were both installer-provisioned AWS.  So if we are going to effectively close this bug as a dup, can we at least point to a bug that improved API reachability on installer-provisioned AWS?

Comment 29 Stefan Schimanski 2020-06-08 07:35:38 UTC

@Trevor: I will create BZs today by platform. We are getting sent new ones every week and old ones are reopened with random observations from random platforms. That's not helpful at all.

Comment 30 Stefan Schimanski 2020-06-09 07:50:50 UTC

Created umbrella bug per platform, all linked by the top-level bug https://bugzilla.redhat.com/show_bug.cgi?id=1845411.

Comment 31 W. Trevor King 2021-04-05 17:45:59 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add keyword again.

[1]: https://github.com/openshift/enhancements/pull/475