Bug 1690043
| Summary: | APIServer should return a structured error and retry-after for graceful shutdown errors | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Master | Assignee: | Michal Fojtik <mfojtik> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.1.0 | CC: | aos-bugs, bparees, jokerman, mfojtik, mmccomas, xxia, yinzhou |
| Target Milestone: | --- | | |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:46:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1684547

Need to ensure all components are protected.

*** Bug 1690167 has been marked as a duplicate of this bug. ***

To mitigate: https://github.com/openshift/origin/pull/22355

Stefan believes we have a bug in the shutdown order, so we still need to look at that. The pick above should make the error less disturbing.

Still could see errors from: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6519/artifacts/e2e-aws/junit/junit_e2e_20190406-222605.xml

To match with upstream: https://github.com/openshift/origin/pull/22511

No 'apiserver is shutting down' error, but there is a related error: ClusterOperatorNotAvailable: Cluster operator openshift-apiserver has not yet reported success. Not sure whether it is the same issue or not.

That is a different error, and it was fixed today.

Checked the latest e2e test logs: no 'apiserver is shutting down' error. Will verify this.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
CMO reported a hard error (failing=true) on its cluster operator, and this should be an error it handles and ignores/retries:

Mar 18 14:54:32.602 E clusteroperator/monitoring changed Failing to True: Failed to rollout the stack. Error: running task Updating Prometheus-k8s failed: reconciling Prometheus ClusterRoleBinding failed: updating ClusterRoleBinding object failed: an error on the server ("apiserver is shutting down.") has prevented the request from succeeding (put clusterrolebindings.rbac.authorization.k8s.io prometheus-k8s)

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5804

Depends on https://github.com/kubernetes/kubernetes/pull/75368 which should make it automatic.

For 4.1 we want the server to return a structured error and have client stacks gracefully retry the error, to minimize the churn caused by API restarts.

Blocks GA
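
To make the requested behaviour concrete, here is a minimal, standard-library-only Go sketch of a server that answers with a structured error plus a Retry-After header while it is shutting down, and a client that retries instead of surfacing a hard failure. It is not the code from either linked pull request. The `status` struct is a simplified stand-in for a Kubernetes Status body, and `shuttingDownHandler`, `parseRetryAfter`, `putWithRetry`, and `runDemo` are hypothetical names; the 503 status code, the `Retry-After: 1` value, and the retry limits are assumptions chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"sync/atomic"
	"time"
)

// status is a simplified stand-in for a structured API error body.
// Assumption: field names loosely follow the Kubernetes Status shape.
type status struct {
	Kind    string `json:"kind"`
	Status  string `json:"status"`
	Message string `json:"message"`
	Reason  string `json:"reason"`
	Code    int    `json:"code"`
}

// shuttingDownHandler (hypothetical) rejects the first `failures` requests
// with a structured 503 and a Retry-After header, then answers 200.
func shuttingDownHandler(failures int32) http.Handler {
	var seen int32
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt32(&seen, 1) <= failures {
			w.Header().Set("Content-Type", "application/json")
			w.Header().Set("Retry-After", "1") // assumed value
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(status{
				Kind:    "Status",
				Status:  "Failure",
				Message: "apiserver is shutting down",
				Reason:  "ServiceUnavailable",
				Code:    http.StatusServiceUnavailable,
			})
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

// parseRetryAfter reads a Retry-After value given in whole seconds and
// returns fallback when the header is missing or not a non-negative integer.
func parseRetryAfter(header string, fallback time.Duration) time.Duration {
	secs, err := strconv.Atoi(header)
	if err != nil || secs < 0 {
		return fallback
	}
	return time.Duration(secs) * time.Second
}

// putWithRetry (hypothetical) issues a PUT and retries on 503 or 429,
// sleeping for the server-suggested delay capped at maxWait.
// It returns the number of attempts made.
func putWithRetry(client *http.Client, url string, maxAttempts int, maxWait time.Duration) (int, error) {
	lastMsg := ""
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequest(http.MethodPut, url, nil)
		if err != nil {
			return attempt, err
		}
		resp, err := client.Do(req)
		if err != nil {
			return attempt, err
		}
		code := resp.StatusCode
		retryable := code == http.StatusServiceUnavailable || code == http.StatusTooManyRequests
		wait := parseRetryAfter(resp.Header.Get("Retry-After"), maxWait)
		if retryable {
			var st status
			if json.NewDecoder(resp.Body).Decode(&st) == nil {
				lastMsg = st.Message
			}
		}
		resp.Body.Close()
		if !retryable {
			if code >= 400 {
				return attempt, fmt.Errorf("request failed with status %d", code)
			}
			return attempt, nil
		}
		if wait > maxWait {
			wait = maxWait
		}
		time.Sleep(wait)
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %s", maxAttempts, lastMsg)
}

// runDemo starts a test server that fails twice, then lets the client retry.
func runDemo() (int, error) {
	srv := httptest.NewServer(shuttingDownHandler(2))
	defer srv.Close()
	return putWithRetry(srv.Client(), srv.URL, 5, 10*time.Millisecond)
}

func main() {
	attempts, err := runDemo()
	fmt.Println("attempts:", attempts, "err:", err)
}
```

In this sketch the client treats the shutdown response as transient, so an operator such as CMO would not need to flip Failing to True on the first rejected write. The 10 ms cap on the wait exists only to keep the demo fast; a real client would honour the server's Retry-After value.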