Bug 1721583
Summary: | [upgrade] Cluster upgrade should maintain a functioning cluster: replicaset "rs" never became ready | |
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Antonio Murdaca <amurdaca> |
Component: | openshift-apiserver | Assignee: | Stefan Schimanski <sttts> |
Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.2.0 | CC: | aos-bugs, bbennett, calfonso, ccoleman, deads, gblomqui, jokerman, kgarriso, lserven, maszulik, mawong, mfojtik, miabbott, mifiedle, mmccomas, mnguyen, mstaeble, pcameron, rkrawitz, schoudha, sponnaga, walters, wking |
Target Milestone: | --- | ||
Target Release: | 4.2.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | 1706082 | Environment: | |
Last Closed: | 2019-10-16 06:32:02 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1706082 | ||
Bug Blocks: | | |
Description Antonio Murdaca 2019-06-18 15:23:12 UTC
- Verified on 4.2.0-0.nightly-2019-06-27-041730
- Verified presence of status.Configuration and spec.Configuration
- Verified upgrade to 4.2.0-0.ci-2019-07-01-153500
- Verified machine counts in the upgrade process

Saw this again: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/4189

This really looks like an issue with Deployments: the MCC is ready and steady according to its logs, the master logs don't show anything wrong, and the MCO just reads from the Deployment object for machine-config-controller; despite the pod being there and running, the Deployment reports an unavailable replica. It doesn't look like it's MCO related.

Looking around, there also seem to be a bunch of route issues from other clusteroperators:

message: "Failed to rollout the stack. Error: running task Updating configuration sharing failed: failed to retrieve Prometheus host: getting Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s)"

message: "Available: "route.openshift.io.v1" is not ready: 0 (Get https://172.30.0.1:443/apis/route.openshift.io/v1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)) Available: "security.openshift.io.v1" is not ready: 0 (Get https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)) Available: "user.openshift.io.v1" is not ready: 0 (Get https://172.30.0.1:443/apis/user.openshift.io/v1?timeout=32s: context deadline exceeded (Client.Timeout exceeded while awaiting headers))"

message: "OAuthClientsDegraded: the server was unable to return a response in the time allotted, but may still be processing the request (get oauthclients.oauth.openshift.io openshift-browser-client)"

Not sure who owns routes; the network team? Again, I'm not sure why the Deployment reports unavailable despite the pod running, but this doesn't look MCO related.

The route API is provided by the OpenShift API server; passing it to them to take a look.

https://search.svc.ci.openshift.org/?search=never+became+ready&maxAge=336h&context=7&type=all shows zero evidence this is happening anymore, closing.

> https://search.svc.ci.openshift.org/?search=never+became+ready&maxAge=336h&context=7&type=all shows zero evidence this is happening anymore, closing.

The vanilla CI search is leaky; ci-search-next is more reliable. [1] turns it up in a few places; most recently in a PR two hours ago [2], and most recently in a release-promotion gate 4 days ago [3]. So I'm not sure we've fixed this completely, but the flake rate is certainly low, with that PR failure being the only one of the past 24 hours' 930 -e2e- jobs to see it.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=replicaset%20.*rs.*%20never%20became%20ready
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/516/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade/448
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/6013

https://github.com/openshift/origin/pull/23944 landed, which should help with debugging this in 4.2.z.
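For anyone chasing the Deployment symptom above: the availability numbers the operator sees come from the Deployment's status, not from the pod phase, so a Running pod can still be counted as unavailable until it reports Ready. Below is a minimal client-go sketch of that read; the namespace and Deployment name are assumptions for illustration, and this is not the MCO's actual code path.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Namespace and name are assumed here for illustration.
	d, err := client.AppsV1().Deployments("openshift-machine-config-operator").
		Get(context.TODO(), "machine-config-controller", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// A replica can be Running at the pod level but still counted as
	// unavailable in the Deployment status until it has been Ready long
	// enough to satisfy minReadySeconds.
	fmt.Printf("replicas=%d updated=%d available=%d unavailable=%d\n",
		d.Status.Replicas, d.Status.UpdatedReplicas,
		d.Status.AvailableReplicas, d.Status.UnavailableReplicas)
	for _, c := range d.Status.Conditions {
		fmt.Printf("%s=%s reason=%q message=%q\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

The same fields are visible under `.status` in `oc get deployment machine-config-controller -n openshift-machine-config-operator -o yaml` (same namespace assumption), which is usually the quicker check during an upgrade run.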
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922