Bug 1595997 - Controller manager will not start when an aggregated API service is down
Summary: Controller manager will not start when an aggregated API service is down
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.10.z
Assignee: David Eads
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-28 02:11 UTC by Clayton Coleman
Modified: 2019-11-20 18:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-20 18:58:34 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Bugzilla
ID: 1618873
Private: 0
Priority: unspecified
Status: CLOSED
Summary: cluster-quota-reconciler terminates controller process when apiservice is not available
Last Updated: 2021-02-22 00:41:40 UTC

Internal Links: 1618873

Description Clayton Coleman 2018-06-28 02:11:24 UTC
It is possible to brick a cluster by:

1. Having an aggregated API service installed (in this case metrics)
2. Having the pod fail or stop (so that there are 0 instances running)
3. Restarting the controller manager

The controller manager then fails on startup with:

F0628 02:05:17.157813       1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": Post https://172.30.0.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 172.30.0.1:443: getsockopt: connection refused") has prevented the request from succeeding)

It will never start until the aggregated API service is removed, because the controller manager cannot start and therefore cannot schedule the pod that would provide the service.
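
The failure mode above can be illustrated with a small simulation. This is only a sketch, not the controller manager's actual code or the eventual fix: `PartialDiscoveryError`, `discover_resources`, and `start_quota_controller` are hypothetical names, and the assumption is that discovery can return the groups it did resolve alongside the ones that failed, so that startup can log a warning instead of exiting fatally.

```python
class PartialDiscoveryError(Exception):
    """Hypothetical error: discovery succeeded for some groups and failed for others."""

    def __init__(self, partial, failed_groups):
        self.partial = dict(partial)              # groups that were discovered
        self.failed_groups = dict(failed_groups)  # group/version -> underlying error
        super().__init__(
            "unable to retrieve the complete list of server APIs: "
            + ", ".join(sorted(self.failed_groups))
        )


def discover_resources(groups):
    """Simulated discovery.

    `groups` maps a group/version string to a list of resource names, or to an
    Exception instance when the backing API service is down.
    """
    found, failed = {}, {}
    for group_version, result in groups.items():
        if isinstance(result, Exception):
            failed[group_version] = result
        else:
            found[group_version] = result
    if failed:
        raise PartialDiscoveryError(found, failed)
    return found


def start_quota_controller(groups, log=print):
    """Start with whatever could be discovered instead of treating a partial
    discovery failure as fatal (an assumed mitigation, labeled as such)."""
    try:
        return discover_resources(groups)
    except PartialDiscoveryError as err:
        log(f"warning: skipping undiscoverable groups: {sorted(err.failed_groups)}")
        return err.partial
```

With a healthy core group and a metrics group whose backend refuses connections, `start_quota_controller` returns only the core resources and logs one warning. That would leave the controller manager running, so the pod backing the aggregated service could be brought back.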

Comment 1 David Eads 2018-06-28 11:23:33 UTC
There's a controller inside of the aggregated apiserver which tries to make contact, fails, and pulls it out of rotation.  Does that never work or does it make you wait 30 seconds?
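
As a rough illustration of the mechanism described in this comment, the sketch below models a controller that probes each registered API service and takes failing ones out of rotation. The `Aggregator` class and its methods are hypothetical and only approximate the idea; the real check interval and behavior are not stated in this report.

```python
class Aggregator:
    """Hypothetical model of an aggregator that tracks backend availability."""

    def __init__(self, probes):
        # probes maps an API service name to a zero-argument callable that
        # returns True when the backend answers, or raises OSError when it does not.
        self.probes = probes
        # Every service is assumed to be in rotation until a check says otherwise.
        self.available = {name: True for name in probes}

    def check_availability(self):
        """Run one pass of the availability check over every registered service."""
        for name, probe in self.probes.items():
            try:
                self.available[name] = bool(probe())
            except OSError:
                # Contact failed: pull the service out of rotation.
                self.available[name] = False

    def served_groups(self):
        """Return the services that discovery would still list."""
        return sorted(name for name, ok in self.available.items() if ok)
```

In this model, a dead backend stays in `served_groups()` until `check_availability()` has run at least once, so a client that performs discovery before that pass still sees the failing group.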

Comment 2 Clayton Coleman 2018-06-28 16:19:55 UTC
The controller manager restarted 5-6 times over 10 minutes and hit the fatal error every time.

Comment 3 Stephen Cuppett 2019-11-20 18:58:34 UTC
OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift

