Bug 1595997
| Summary: | Controller manager will not start when an aggregated API service is down | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | kube-controller-manager | Assignee: | David Eads <deads> |
| Status: | CLOSED DEFERRED | QA Contact: | Wang Haoran <haowang> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.10.0 | CC: | aos-bugs, deads, jokerman, mfojtik, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 3.10.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-11-20 18:58:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
There's a controller inside the aggregated apiserver that tries to make contact, fails, and pulls it out of rotation. Does that never work, or does it make you wait 30 seconds?

The controller restarted 5-6 times over 10 minutes and hit the fatal error every time.

OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift
It is possible to brick a cluster by:

1. Having an aggregated API service installed (in this case metrics)
2. Having the pod fail or stop (so that there are 0 instances running)
3. Restarting the controller manager

The controller manager then fails on startup with:

F0628 02:05:17.157813 1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": Post https://172.30.0.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 172.30.0.1:443: getsockopt: connection refused") has prevented the request from succeeding)

It will never start until the aggregated API service is removed, because the controller manager cannot start in order to schedule the pod that would provide the service.