Bug 1791162
| Summary: | OpenShift API stops responding to requests / is unreachable multiple times during z-upgrade | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | openshift-apiserver | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Xingxing Xia <xxia> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.3.0 | CC: | alpatel, anbhatta, aos-bugs, bparees, dmace, jkaur, joboyer, kewang, lszaszki, mfojtik, openshift-bugs-escalate, qiwan, sbatsche, scuppett, shurley, sttts, wking |
| Target Milestone: | --- | Keywords: | Upgrades |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | The OpenShift API server should now remain available to clients during upgrades. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-06-05 14:45:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1845412, 1943804 | | |
| Bug Blocks: | | | |
Description (Clayton Coleman, 2020-01-15 04:01:28 UTC)
Example of disruption from 14324:

Jan 14 23:30:49.237 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: oauth client for console does not exist and cannot be created (the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console))" to "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:49.281 I openshift-apiserver OpenShift API started failing: Get https://api.ci-op-c7wrw1i9-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=3s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jan 14 23:30:50.280 E openshift-apiserver OpenShift API is not responding to GET requests
Jan 14 23:30:50.280 - 59s W node/ip-10-0-135-37.ec2.internal node is not ready
Jan 14 23:30:50.427 I ns/openshift-service-ca configmap/apiservice-cabundle-injector-lock de26dd07-8b25-4747-9c6c-a94818d2d42a became leader
Jan 14 23:30:50.444 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:51.353 I ns/openshift-service-catalog-controller-manager-operator configmap/svcat-controller-manager-operator-lock 726bce43-79d7-4712-9352-3c7e5a4e001e became leader
Jan 14 23:30:51.633 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)"
Jan 14 23:30:52.835 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" (2 times)
Jan 14 23:30:53.014 I ns/openshift-authentication-operator configmap/cluster-authentication-operator-lock e23c85d1-fb21-43c2-84b2-f1e2f51879e3 became leader
Jan 14 23:30:53.167 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "" to "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)"
Jan 14 23:30:54.033 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)\nOAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)" to ""
Jan 14 23:30:54.390 I ns/openshift-kube-scheduler-operator configmap/openshift-cluster-kube-scheduler-operator-lock 364fc198-eee0-4937-9e32-880f79c50b7d became leader
Jan 14 23:30:54.413 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator not enough information provided, not all functionality is present
Jan 14 23:30:54.517 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7
Jan 14 23:30:54.527 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (2 times)
Jan 14 23:30:54.534 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: [configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7]"
Jan 14 23:30:54.539 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (3 times)
Jan 14 23:30:54.554 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (4 times)
Jan 14 23:30:54.592 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (5 times)
Jan 14 23:30:54.673 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7 (6 times)
Jan 14 23:30:54.832 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7
Jan 14 23:30:54.844 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7 (2 times)
Jan 14 23:30:54.850 I ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Degraded message changed from "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: [configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7, secrets: kube-scheduler-client-cert-key-7]" to "NodeControllerDegraded: All master node(s) are ready\nStaticPodsDegraded: nodes/ip-10-0-144-85.ec2.internal pods/openshift-kube-scheduler-ip-10-0-144-85.ec2.internal container=\"scheduler\" is not ready\nInstallerControllerDegraded: missing required resources: configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7"
Jan 14 23:30:55.164 W ns/openshift-kube-scheduler-operator deployment/openshift-kube-scheduler-operator configmaps: config-7,kube-scheduler-pod-7,scheduler-kubeconfig-7,serviceaccount-ca-7 (3 times)
Jan 14 23:30:55.237 I ns/openshift-console-operator deployment/console-operator Status for clusteroperator/console changed: Degraded message changed from "" to "RouteSyncDegraded: the server is currently unable to handle the request (get routes.route.openshift.io console)" (2 times)
Jan 14 23:30:55.298 I openshift-apiserver OpenShift API started responding to GET requests
Jan 14 23:30:55.669 I ns/openshift-authentication-operator deployment/authentication-operator Status for clusteroperator/authentication changed: Degraded message changed from "RouteStatusDegraded: the server is currently unable to handle the request (get routes.route.openshift.io oauth-openshift)" to "OperatorSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io openshift-browser-client)"

Symptoms are similar to bug 1791117, but without a clear indication that an SDN upgrade is in progress. May be related. Observed on latest release-4.3 (post rc.1).

Dup of bug 1809665? Or maybe bug 1820266 (see [1])? If not, can we update this bug to use a more specific subject?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1818106#c10

Also in this space, although not applying to AWS or GCP, is bug 1828382.

*** Bug 1828866 has been marked as a duplicate of this bug. ***

This bug is believed to be the cause of failures in the "OpenShift APIs remain available" test.

This bug is actively worked on. The suspicion is that this is around etcd graceful termination behaviour. The etcd team is working on this, but it is not a 4.5 blocker because the behaviour is preexisting. Moving to 4.6, with a possible backport later.

We are seeing substantially more failures in this space in 4.5 than we did in 4.4. I am moving this back to 4.5 as I think it is a 4.5 upgrade blocker. See: https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
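For context on where the "OpenShift API started failing" / "OpenShift API is not responding to GET requests" events in the log above come from: the upgrade suite runs an availability monitor that repeatedly issues a GET against the openshift-apiserver (the imagestream named "missing", with the ?timeout=3s seen in the URL above) and treats transport errors or timeouts as disruption. The sketch below is only an illustration of that shape, not the actual origin monitor code; the TARGET_URL and TOKEN environment variables, the one-second polling interval, and the TLS handling are assumptions made for the example.

```go
// disruptionpoll: a minimal, illustrative availability poller. It is NOT the
// origin test code; TARGET_URL and TOKEN are hypothetical placeholders.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// e.g. https://api.<cluster>:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing
	url := os.Getenv("TARGET_URL")
	token := os.Getenv("TOKEN") // bearer token allowed to issue the GET

	client := &http.Client{
		Timeout: 3 * time.Second, // mirrors the ?timeout=3s in the log above
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // illustration only
		},
	}

	available := true
	var downSince time.Time
	var totalDown time.Duration

	for {
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		req.Header.Set("Authorization", "Bearer "+token)

		resp, err := client.Do(req)
		// Any HTTP response (including a 404 for the intentionally missing
		// imagestream) counts as "the API answered"; only transport errors
		// and timeouts count as disruption.
		ok := err == nil
		if resp != nil {
			resp.Body.Close()
		}

		switch {
		case !ok && available:
			available = false
			downSince = time.Now()
			fmt.Printf("%s E OpenShift API is not responding to GET requests: %v\n",
				time.Now().Format(time.RFC3339), err)
		case ok && !available:
			available = true
			d := time.Since(downSince)
			totalDown += d
			fmt.Printf("%s I OpenShift API started responding to GET requests (down %s, total %s)\n",
				time.Now().Format(time.RFC3339), d, totalDown)
		}

		time.Sleep(1 * time.Second)
	}
}
```

The key point for triage is that the monitor only flags failures to get any response at all, so the 194s windows quoted later in this bug represent real client-visible unavailability, not error status codes.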
Putting the same comment here as I put in https://bugzilla.redhat.com/show_bug.cgi?id=1801885 (perhaps the two bugs should be combined, as they cite the same "apiserver not responding to GET" message): I question whether there really is no regression here; our upgrades are failing more frequently in 4.5 than they were in 4.4, specifically with the "OpenShift API is not responding to GET requests" error.

4.5: https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job
Across 33 runs and 6 jobs (75.76% failed), matched 80.00% of failing runs and 66.67% of jobs

4.4: https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.4.*&maxMatches=5&maxBytes=20971520&groupBy=job
Across 19 runs and 7 jobs (63.16% failed), matched 58.33% of failing runs and 57.14% of jobs

Moving this back to 4.5 for reassessment. If you can point to a different bug that explains why our upgrade test pass rate has gone from 60% in 4.4 to 42% in 4.5, then I can understand deferring this, but something has regressed.

To disambiguate https://bugzilla.redhat.com/show_bug.cgi?id=1801885 and https://bugzilla.redhat.com/show_bug.cgi?id=1791162, both of which are failures of the same "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" test: they have distinct failure modes/messages.

https://bugzilla.redhat.com/show_bug.cgi?id=1801885 is for failures reported as:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 2 04:14:53.680: API was unreachable during disruption for at least 8m13s of 54m30s (15%!)(MISSING):

https://bugzilla.redhat.com/show_bug.cgi?id=1791162 is for:
Jun 02 04:16:51.466 - 194s E openshift-apiserver OpenShift API is not responding to GET requests

Recent example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/58

(To be clear, I understand we expect to have some disruption in 4.5 and will be addressing that in 4.6, but this test is failing because we are exceeding the allowed amount of disruption.)

It looks like the same upgrade job is in much better condition on AWS: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci?buildId=

For GCP we opened https://github.com/openshift/machine-config-operator/pull/1780, as we think it might be the issue.

With https://bugzilla.redhat.com/show_bug.cgi?id=1844387 for the oVirt, OpenStack, vSphere, and bare-metal platforms, plus the following:

- Azure IPI: https://bugzilla.redhat.com/show_bug.cgi?id=1828382
- Azure UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836016
- AWS UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836018
- vSphere UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836017

we have a number of platform-specific BZs open which address exactly these problems. This BZ, on the other hand, is not actionable. Please be precise about the conditions under which the issues appear, and analyze the data beforehand, in order to make these BZs actionable.
We have platforms which are perfectly fine, so the chance is very high that the upgrade issues have root causes in the different deployments on different platforms.

All of the fixes listed in comment 27 are around /readyz, but installer-provisioned AWS LBs have been using /readyz for ages. The initial examples from comment 0 here were both installer-provisioned AWS. So if we are going to effectively close this bug as a dup, can we at least point to a bug that improved API reachability on installer-provisioned AWS?

@Trevor: I will create BZs today by platform. We are getting sent new ones every week, and old ones are reopened with random observations from random platforms. That's not helpful at all.

Created an umbrella bug per platform, all linked by the top-level bug https://bugzilla.redhat.com/show_bug.cgi?id=1845411.

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
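On the /readyz point above: the load-balancer health checks being discussed decide whether to keep a given kube-apiserver instance in rotation based on its /readyz endpoint. The probe below is a minimal sketch of that kind of check, assuming a hypothetical APISERVER_HOST pointing at one master instance (not the LB VIP) and simplified TLS handling; it is not taken from any installer's actual LB configuration.

```go
// readyzprobe: an illustrative check against a single kube-apiserver
// instance's /readyz endpoint, roughly what an LB health check evaluates.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	host := os.Getenv("APISERVER_HOST") // e.g. https://10.0.135.37:6443 (one master, not the VIP)

	client := &http.Client{
		Timeout: 2 * time.Second,
		Transport: &http.Transport{
			// Illustration only; a real check should trust the serving CA.
			TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
		},
	}

	// ?verbose lists the individual readiness checks, which helps when
	// diagnosing why an instance dropped out of the LB pool.
	resp, err := client.Get(host + "/readyz?verbose")
	if err != nil {
		fmt.Fprintf(os.Stderr, "instance unreachable: %v\n", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("HTTP %d\n%s", resp.StatusCode, body)
	if resp.StatusCode != http.StatusOK {
		// A non-200 is what tells a well-behaved LB to stop sending new
		// connections to this instance during its shutdown/rollout window.
		os.Exit(1)
	}
}
```

The general idea is that an apiserver instance begins failing /readyz during its shutdown window before it stops serving, so an LB health check pointed at /readyz can drain the instance in time; a check against a different endpoint, or one with a long unhealthy threshold, leaves a window where clients see exactly the kind of disruption reported in this bug.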