Bug 1595997

Summary:	Controller manager will not start when an aggregated API service is down
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	kube-controller-manager	Assignee:	David Eads <deads>
Status:	CLOSED DEFERRED	QA Contact:	Wang Haoran <haowang>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	3.10.0	CC:	aos-bugs, deads, jokerman, mfojtik, mmccomas
Target Milestone:	---
Target Release:	3.10.z
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2019-11-20 18:58:34 UTC	Type:	Bug

Description Clayton Coleman 2018-06-28 02:11:24 UTC

It is possible to brick a cluster by:

1. Having an aggregated API service installed (in this case metrics)
2. Having the pod fail or stop (so that there are 0 instances running)
3. Restarting the controller manager

The controller manager then fails on startup with:

F0628 02:05:17.157813       1 controller_manager.go:194] Error starting "openshift.io/cluster-quota-reconciliation" (failed to discover resources: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: an error on the server ("Internal Server Error: \"/apis/metrics.k8s.io/v1beta1?timeout=32s\": Post https://172.30.0.1:443/apis/authorization.k8s.io/v1beta1/subjectaccessreviews: dial tcp 172.30.0.1:443: getsockopt: connection refused") has prevented the request from succeeding)

It will never start until the aggregated API service is removed, because the controller manager must be running to schedule the pod that provides the service.

Comment 1 David Eads 2018-06-28 11:23:33 UTC

There's a controller inside of the aggregated apiserver which tries to make contact, fails, and pulls it out of rotation.  Does that never work or does it make you wait 30 seconds?

Comment 2 Clayton Coleman 2018-06-28 16:19:55 UTC

The controller restarted 5-6 times over 10 minutes and hit the fatal error every time.

Comment 3 Stephen Cuppett 2019-11-20 18:58:34 UTC

OCP 3.6-3.10 is no longer on full support [1]. Marking CLOSED DEFERRED. If you have a customer case with a support exception or have reproduced on 3.11+, please reopen and include those details. When reopening, please set the Target Release to the appropriate version where needed.

[1]: https://access.redhat.com/support/policy/updates/openshift