Bug 1791162 - OpenShift API stops responding to requests / is unreachable multiple times during z-upgrade
Summary: OpenShift API stops responding to requests / is unreachable multiple times during z-upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Sam Batschelet
QA Contact: Xingxing Xia
URL:
Whiteboard:
Duplicates: 1828866
Depends On: 1845412
Blocks:
 
Reported: 2020-01-15 04:01 UTC by Clayton Coleman
Modified: 2020-06-30 20:22 UTC
CC: 16 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
The OpenShift API server should now remain available to clients during upgrades.
Clone Of:
Environment:
Last Closed: 2020-06-05 14:45:31 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift machine-config-operator pull 1523 None closed [wip][release-4.3] etcd-member: add terminationGracePeriodSeconds 2020-09-16 11:13:04 UTC

Description Clayton Coleman 2020-01-15 04:01:28 UTC
A pod on the pod network (in this case openshift-apiserver) is observed to repeatedly fail to answer API requests during a 4.3 to 4.3 upgrade (the only code change is that the upgrade test now correctly fails the entire run when it detects this condition).

A key requirement of upgrades is that they do not disrupt user workflows. A number of other errors in these upgrade logs indicate that pod shutdown may not be proceeding gracefully, or that some other networking or host-level condition is at play. Any evidence of workload impact is a release blocker.
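For context, the availability check behind this failure amounts to polling the API at a fixed interval and recording the windows in which GET requests fail. The sketch below is hypothetical illustration of that idea, not the actual origin test code; the `probe` and `outage_windows` helpers are invented names.

```python
import time
import urllib.request


def probe(url, timeout_s=3.0):
    """One availability sample: True if the endpoint answers a GET within timeout_s."""
    try:
        urllib.request.urlopen(url, timeout=timeout_s)
        return True
    except Exception:
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into a list of (start, end) outage windows."""
    windows, start = [], None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                   # outage begins
        elif ok and start is not None:
            windows.append((start, ts))  # outage ends
            start = None
    if start is not None:                # still down when sampling stopped
        windows.append((start, samples[-1][0]))
    return windows


# Illustrative sampling loop: poll once per second over the upgrade, e.g.
# samples = [(time.monotonic(), probe(api_url)) for each tick, with a sleep]
```

A "194s" entry in the logs corresponds to one such window spanning 194 seconds of consecutive failed probes.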

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14324
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/14323

Both of the tests above failed on AWS.

On further investigation, this is failing even in normal 4.2 to 4.3 upgrades.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3/37

demonstrates similar (although not quite identical) problems.

In 4.1 and 4.2 we see a dramatically smaller incidence (one request out of a thousand may fail, and only once, likely corresponding to other conditions).  In 4.3 and 4.4 this appears to be far more serious.

This is a release blocker for 4.3.0 GA.  Please triage and route as quickly as possible.

Comment 1 Clayton Coleman 2020-01-15 04:04:35 UTC
Example of disruption from 14324

Jan 14 23:30:49.237 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: oauth client for console does not exist and cannot be created (the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console))" to "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:49.281 I openshift-apiserver OpenShift API started failing: Get https://api.ci-op-c7wrw1i9-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jan 14 23:30:50.280 E openshift-apiserver OpenShift API is not responding to GET requests
Jan 14 23:30:50.280 - 59s   W node/ip-10-0-135-37.ec2.internal node is not ready
Jan 14 23:30:50.427 I ns/openshift-service-ca configmap/apiservice-cabundle-injector-lock de26dd07-8b25-4747-9c6c-a94818d2d42a became leader
Jan 14 23:30:50.444 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:51.353 I ns/openshift-service-catalog-controller-manager-operator configmap/svcat-controller-manager-operator-lock 726bce43-79d7-4712-9352-3c7e5a4e001e became leader
Jan 14 23:30:51.633 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:52.835 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" (2 times)
Jan 14 23:30:53.014 I ns/openshift-authentication-operator configmap/cluster-authentication-operator-lock e23c85d1-fb21-43c2-84b2-f1e2f51879e3 became leader
Jan 14 23:30:53.167 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)"
Jan 14 23:30:54.033 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to ""
Jan 14 23:30:54.390 I ns/openshift-kube-scheduler-operator configmap/openshift-cluster-kube-scheduler-operator-lock 364fc198-eee0-4937-9e32-880f79c50b7d became leader
Jan 14 23:30:54.413 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator not enough information provided, not all functionality is present
Jan 14 23:30:54.517 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7
Jan 14 23:30:54.527 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (2 times)
Jan 14 23:30:54.534 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: [configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7]"
Jan 14 23:30:54.539 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (3 times)
Jan 14 23:30:54.554 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (4 times)
Jan 14 23:30:54.592 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (5 times)
Jan 14 23:30:54.673 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (6 times)
Jan 14 23:30:54.832 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7
Jan 14 23:30:54.844 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7 (2 times)
Jan 14 23:30:54.850 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: [configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7]" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7"
Jan 14 23:30:55.164 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7 (3 times)
Jan 14 23:30:55.237 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)" (2 times)
Jan 14 23:30:55.298 I openshift-apiserver OpenShift API started responding to GET requests
Jan 14 23:30:55.669 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)" to "OperatorSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io openshift-browser-client)"

Comment 2 Clayton Coleman 2020-01-15 04:06:19 UTC
Symptoms are similar to bug 1791117, but without a clear indication that an SDN upgrade is in progress. May be related. Observed on the latest release-4.3 (post rc.1).

Comment 14 W. Trevor King 2020-04-07 22:50:21 UTC
Dup of bug 1809665?  Or maybe bug 1820266 (see [1])?  If not, can we update this bug to use a more-specific subject?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1818106#c10

Comment 16 W. Trevor King 2020-04-28 23:30:39 UTC
Also in this space, although not applying to AWS or GCP, is bug 1828382.

Comment 17 Stefan Schimanski 2020-05-05 07:45:57 UTC
*** Bug 1828866 has been marked as a duplicate of this bug. ***

Comment 18 Ben Parees 2020-05-05 17:30:09 UTC
This bug is believed to be the cause of failures in the "OpenShift APIs remain available" test.

Comment 19 Michal Fojtik 2020-05-20 10:56:49 UTC
This bug is actively worked on.

Comment 20 Stefan Schimanski 2020-05-28 09:05:27 UTC
The suspicion is that this is related to etcd graceful-termination behaviour. The etcd team is working on it, but this is not a 4.5 blocker, as the behaviour is preexisting. Moving to 4.6, with a possible backport later.
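The linked MCO pull request adds terminationGracePeriodSeconds to the etcd-member pod. To illustrate why that matters: if a server pod is killed before draining in-flight work, clients see exactly this kind of transient unavailability. Below is a toy sketch of the drain-on-SIGTERM pattern that the grace period is meant to leave time for; it is illustrative only, not etcd's actual shutdown code.

```python
import signal
import time


class GracefulServer:
    """Toy model of a server that drains in-flight requests on SIGTERM.

    Illustrative only. The pod's terminationGracePeriodSeconds must be long
    enough for drain() to finish, or remaining requests are cut off mid-flight.
    """

    def __init__(self):
        self.stopping = False   # set on SIGTERM; stop accepting new requests
        self.in_flight = 0      # requests currently being served

    def handle_sigterm(self, signum=None, frame=None):
        self.stopping = True

    def install(self):
        # Wire the handler up the way a real process would.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def drain(self, poll_s=0.1, grace_s=30.0):
        """Wait up to grace_s for in-flight requests to complete.

        grace_s plays the role of terminationGracePeriodSeconds: if it is too
        short, we return False and the remaining requests are dropped.
        """
        deadline = time.monotonic() + grace_s
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(poll_s)
        return self.in_flight == 0
```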

Comment 21 Ben Parees 2020-06-03 17:39:59 UTC
We are seeing substantially more failures in this space in 4.5 than we did in 4.4. I am moving this back to 4.5, as I think it is a 4.5 upgrade blocker.

see:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 22 Ben Parees 2020-06-03 17:46:39 UTC
Putting the same comment here as I put in https://bugzilla.redhat.com/show_bug.cgi?id=1801885 (perhaps the two bugs should be combined, as they cite the same apiserver-not-responding-to-GET message):

I question whether there is truly no regression here: our upgrades are failing more frequently in 4.5 than they were in 4.4, specifically with the "OpenShift API is not responding to GET requests" error:

4.5:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job

Across 33 runs and 6 jobs (75.76% failed), matched 80.00% of failing runs and 66.67% of jobs 

4.4:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.4.*&maxMatches=5&maxBytes=20971520&groupBy=job

Across 19 runs and 7 jobs (63.16% failed), matched 58.33% of failing runs and 57.14% of jobs 

Moving this back to 4.5 for reassessment. If you can point to a different bug that explains why our upgrade test pass rate has gone from 60% in 4.4 to 42% in 4.5, then I can understand deferring this, but something has regressed.

Comment 23 Ben Parees 2020-06-03 18:00:26 UTC
To disambiguate https://bugzilla.redhat.com/show_bug.cgi?id=1801885 and https://bugzilla.redhat.com/show_bug.cgi?id=1791162, both of which are failures of the same "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" test: they have distinct failure modes/messages.


https://bugzilla.redhat.com/show_bug.cgi?id=1801885 is for failures reported as:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  2 04:14:53.680: API was unreachable during disruption for at least 8m13s of 54m30s (15%!)(MISSING):


https://bugzilla.redhat.com/show_bug.cgi?id=1791162 is for:
Jun 02 04:16:51.466 - 194s  E openshift-apiserver OpenShift API is not responding to GET requests

Comment 25 Ben Parees 2020-06-03 18:13:14 UTC
(To be clear, I understand we expect to have some disruption in 4.5 and will be addressing that in 4.6, but this test is failing because we are exceeding the allowed amount of disruption.)
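The "allowed amount of disruption" check reduces to summing the outage windows and comparing the fraction of the run they cover against a budget, as in the "8m13s of 54m30s (15%)" failure message quoted in comment 23. A minimal sketch of that arithmetic follows; the 1% default budget is a made-up illustration, not the test's actual threshold.

```python
def disruption_fraction(outages_s, total_s):
    """Fraction of the run during which the API was unreachable."""
    return sum(outages_s) / total_s


def exceeds_budget(outages_s, total_s, budget=0.01):
    """True when observed disruption exceeds the allowed budget.

    The budget value here is illustrative, not the real test's limit.
    """
    return disruption_fraction(outages_s, total_s) > budget


# The quoted failure reported roughly 8m13s (493s) of outage in a
# 54m30s (3270s) run, i.e. about 15%:
print(round(disruption_fraction([493], 3270), 2))  # -> 0.15
```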

Comment 26 Lukasz Szaszkiewicz 2020-06-04 13:22:29 UTC
It looks like the same upgrade job is in much better condition on AWS - https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci?buildId=
For GCP we opened https://github.com/openshift/machine-config-operator/pull/1780, as we think it might address the issue.

Comment 27 Stefan Schimanski 2020-06-05 14:45:31 UTC
With https://bugzilla.redhat.com/show_bug.cgi?id=1844387 for ovirt, openstack, vsphere, bm platforms and the following:

- Azure IPI: https://bugzilla.redhat.com/show_bug.cgi?id=1828382
- Azure UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836016
- AWS UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836018
- vSphere UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836017

we have a number of platform-specific BZs open which address exactly these problems. This BZ, on the other hand, is not actionable. Please be precise about the conditions under which the issues appear, and analyze the data beforehand in order to make these BZs actionable. We have platforms which are perfectly fine, so the chance is very high that the upgrade issues have root causes in the different deployments of the different platforms.

Comment 28 W. Trevor King 2020-06-05 21:54:05 UTC
All of the fixes listed in comment 27 are around /readyz, but installer-provisioned AWS LBs have been using /readyz for ages.  The initial examples from comment 0 here were both installer-provisioned AWS.  So if we are going to effectively close this bug as a dup, can we at least point to a bug that improved API reachability on installer-provisioned AWS?

Comment 29 Stefan Schimanski 2020-06-08 07:35:38 UTC
@Trevor: I will create per-platform BZs today. We are sent new ones every week, and old ones are reopened with random observations from random platforms. That's not helpful at all.

Comment 30 Stefan Schimanski 2020-06-09 07:50:50 UTC
Created umbrella bug per platform, all linked by the top-level bug https://bugzilla.redhat.com/show_bug.cgi?id=1845411.

