It is possible to brick a cluster by:

1. Having an aggregated API service installed (in this case metrics)
2. Having its pod fail or stop, so that 0 instances are running
3. Restarting the controller manager

The controller manager then fails on startup with:

F0628 02:05:17.157813 1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": Post https://172.30.0.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 172.30.0.1:443: getsockopt: connection refused") has prevented the request from succeeding)

It then never starts until the aggregated API service is removed, because the controller manager can't start to schedule the pod that would provide the service.
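The log shows discovery treating one unreachable group/version (metrics.k8s.io/v1beta1) as fatal for the whole controller. As a rough illustration only, here is a stdlib-only Python model that contrasts that strict behavior with one that records per-group failures and carries on with the groups that did respond. This is not the OpenShift or client-go code. `DiscoveryResult`, `discover_all`, `start_controller_strict`, `start_controller_tolerant` and `fake_fetch` are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveryResult:
    """Hypothetical model of a discovery pass over all API group/versions."""
    resources: dict[str, list[str]] = field(default_factory=dict)
    failed: dict[str, str] = field(default_factory=dict)


def discover_all(group_versions, fetch):
    """Query each group/version, recording failures instead of aborting."""
    result = DiscoveryResult()
    for gv in group_versions:
        try:
            result.resources[gv] = fetch(gv)
        except Exception as err:  # e.g. an aggregated API with no running pod
            result.failed[gv] = str(err)
    return result


def start_controller_strict(result):
    """Models the reported behavior: any failed group aborts startup."""
    if result.failed:
        raise RuntimeError(
            "failed to discover resources: " + ", ".join(sorted(result.failed))
        )
    return sorted(result.resources)


def start_controller_tolerant(result):
    """Hypothetical alternative: start with the groups that responded."""
    return sorted(result.resources)


def fake_fetch(gv):
    """Simulated discovery where only the metrics API is unreachable."""
    if gv == "metrics.k8s.io/v1beta1":
        raise ConnectionRefusedError("connection refused")
    return ["placeholder-resource"]


if __name__ == "__main__":
    r = discover_all(["v1", "apps/v1", "metrics.k8s.io/v1beta1"], fake_fetch)
    print(start_controller_tolerant(r))
```

In this model the strict variant raises as soon as `failed` is non-empty, which matches the crash loop described above, while the tolerant variant still returns the healthy groups so a controller could start and retry the failed one later.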
There's a controller inside the aggregated apiserver that tries to contact the backing service, fails, and pulls the API service out of rotation. Does that never happen, or does it just make you wait 30 seconds?
The controller manager restarted 5-6 times over 10 minutes and hit the fatal error every time.
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed. [1]: https://access.redhat.com/support/policy/updates/openshift