Bug 1618873

Summary: cluster-quota-reconciler terminates controller process when apiservice is not available
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Master
Assignee: David Eads <deads>
Status: CLOSED ERRATA
QA Contact: Xingxing Xia <xxia>
Severity: urgent
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, jokerman, mfojtik, mmccomas, wking, yinzhou
Keywords: TestBlocker
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-10-11 07:25:20 UTC
Type: Bug
Bug Blocks: 1582094    

Description Clayton Coleman 2018-08-17 20:56:47 UTC
On both the new installer code path (happening very frequently for us now) and the existing installer, if the controller can't find the apiservice, it errors out with a glog.Fatalf, which terminates the controller process and prevents forward progress:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/145/pull-ci-origin-installer-e2e-aws/526/ failed for this reason, and I was able to reproduce it when running the installer myself.

Was on master from August 17th:

F0817 20:18:33.902810       1 controller_manager.go:127] Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request)

The metrics server was up, and I didn't see anything unusual about its service status.  The openshift-controller manager pod was crash looping.  After I deleted the metrics server apiservice and the controller pod, the next instance loaded and ran successfully.

Comment 1 Xingxing Xia 2018-08-20 08:56:51 UTC
> The openshift-controller manager pod
From your description, you're reporting against Version 3.11 instead of 3.9?

I encountered the same error, but on my side it reported servicecatalog instead of metrics. My reproduction steps are:
Launch a 3.11 (v3.11.0-0.17.0) cluster in AWS.
Stop the master for a few minutes.
Start the master.
Check pods.
[root@ip-172-18-12-204 ~]# oc get po -w --all-namespaces
default                docker-registry-1-fjfqf                            1/1  Running           0  32m
default                router-1-l6h8l                                     1/1  Running           0  32m
kube-service-catalog   apiserver-gn9gc                                    1/1  Running           1  29m
kube-service-catalog   controller-manager-kxlmh                           0/1  CrashLoopBackOff  6  29m
kube-system            master-api-ip-172-18-12-204.ec2.internal           1/1  Running           1  39m
kube-system            master-controllers-ip-172-18-12-204.ec2.internal   0/1  CrashLoopBackOff  6  38m
kube-system            master-etcd-ip-172-18-12-204.ec2.internal          1/1  Running           1  38m

The controllers pod's logs (see full log in attachment):
...
I0820 08:35:59.814022       1 leaderelection.go:190] failed to acquire lease kube-system/kube-scheduler
I0820 08:35:59.823344       1 client_builder.go:233] Verified credential for cluster-quota-reconciliation-controller/openshift-infra
I0820 08:35:59.830345       1 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
F0820 08:35:59.849987       1 controller_manager.go:127] Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request)

In short, the controllers pod is stuck in CrashLoopBackOff with the log message "Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1". I found other related bugs: bug 1619116 has the same controllers log error, and bug 1595997 has a similar error. So I linked them.

Comment 3 Xingxing Xia 2018-08-22 05:49:46 UTC
As in bug 1595997, when the aggregated API service is removed, the controllers pod starts again.

Today the issue occurred in QE's public test environment and blocked testing there, so I am changing some fields of the bug.

Comment 4 Michal Fojtik 2018-08-22 11:59:47 UTC
Fixed here: https://github.com/openshift/origin/pull/20693

Comment 5 Xingxing Xia 2018-08-23 07:35:15 UTC
Verified in:
openshift v3.11.0-0.20.0
kubernetes v1.11.0+d4cacc0

The controllers pod issue is fixed.

Comment 7 errata-xmlrpc 2018-10-11 07:25:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652