Bug 1618873 - cluster-quota-reconciler terminates controller process when apiservice is not available
Summary: cluster-quota-reconciler terminates controller process when apiservice is not available
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Importance: unspecified / urgent
Target Milestone: ---
Target Release: 3.11.0
Assignee: David Eads
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks: 1582094
 
Reported: 2018-08-17 20:56 UTC by Clayton Coleman
Modified: 2018-10-11 07:25 UTC (History)
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-11 07:25:20 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1595997 0 unspecified CLOSED Controller manager will not start when an aggregated API service is down 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1619116 0 high CLOSED Upgrade failed at [openshift_web_console : Verify that the console is running] 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2018:2652 0 None None None 2018-10-11 07:25:45 UTC

Internal Links: 1595997 1619116

Description Clayton Coleman 2018-08-17 20:56:47 UTC
On both the new installer code path (happening very frequently for us now) and on the existing installer, if the controller can't find the apiservice, it errors out with a glog.Fatalf, which terminates the controller process and prevents forward progress:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/145/pull-ci-origin-installer-e2e-aws/526/ failed for this reason, and I was able to recreate when running the installer myself.

Was on master from August 17th:

F0817 20:18:33.902810       1 controller_manager.go:127] Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request)

The metrics server was up, but I didn't see anything unusual about its service status.  The openshift-controller manager pod was crash looping.  After deleting the metrics server apiservice and deleting the controller pod, the next instance loaded and ran successfully.
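The failure mode can be sketched in Go. This is a hypothetical, simplified stand-in for the real code, not the origin implementation: discoverAPIs and its group names are invented for illustration, mimicking how discovery against an aggregated API server can return partial results alongside an error, which the controller then treated as fatal (the glog.Fatalf at controller_manager.go:127 above):

```go
package main

import (
	"errors"
	"fmt"
)

// discoverAPIs is a stand-in for API discovery: when an aggregated group
// (e.g. metrics.k8s.io/v1beta1) is unreachable, discovery can return a
// *partial* list of groups together with an error.
func discoverAPIs() ([]string, error) {
	partial := []string{"apps/v1", "batch/v1"} // groups that did respond
	return partial, errors.New("unable to retrieve the complete list of server APIs: " +
		"metrics.k8s.io/v1beta1: the server is currently unable to handle the request")
}

func main() {
	groups, err := discoverAPIs()
	if err != nil {
		// The buggy behavior: treat a partial discovery failure as fatal,
		// terminating the whole controller process and causing the
		// CrashLoopBackOff described in this report.
		fmt.Println("FATAL:", err)
		return
	}
	fmt.Println("starting quota reconciliation with groups:", groups)
}
```

Because the process exits instead of retrying, the controller can never make progress until the dead apiservice is deleted by hand, which matches the workaround above.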

Comment 1 Xingxing Xia 2018-08-20 08:56:51 UTC
> The openshift-controller manager pod
From your description, you're reporting against Version 3.11 instead of 3.9?

I encountered the same error, but on my side it reported servicecatalog instead of metrics. My reproduction steps:
1. Launch a 3.11 (v3.11.0-0.17.0) cluster in AWS.
2. Stop the master for a few minutes.
3. Start the master.
4. Check the pods.
[root@ip-172-18-12-204 ~]# oc get po -w --all-namespaces
NAMESPACE              NAME                                               READY STATUS           RESTARTS AGE
default                docker-registry-1-fjfqf                            1/1  Running           0  32m
default                router-1-l6h8l                                     1/1  Running           0  32m
kube-service-catalog   apiserver-gn9gc                                    1/1  Running           1  29m
kube-service-catalog   controller-manager-kxlmh                           0/1  CrashLoopBackOff  6  29m
kube-system            master-api-ip-172-18-12-204.ec2.internal           1/1  Running           1  39m
kube-system            master-controllers-ip-172-18-12-204.ec2.internal   0/1  CrashLoopBackOff  6  38m
kube-system            master-etcd-ip-172-18-12-204.ec2.internal          1/1  Running           1  38m

The controllers pod's logs (see full log in attachment):
...
I0820 08:35:59.814022       1 leaderelection.go:190] failed to acquire lease kube-system/kube-scheduler
I0820 08:35:59.823344       1 client_builder.go:233] Verified credential for cluster-quota-reconciliation-controller/openshift-infra
I0820 08:35:59.830345       1 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
F0820 08:35:59.849987       1 controller_manager.go:127] Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request)

In short, the controllers pod is constantly in CrashLoopBackOff with the log "Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1". I found other bugs: bug 1619116 has the SAME controllers log error, and bug 1595997 has a SIMILAR error, so I linked them.

Comment 3 Xingxing Xia 2018-08-22 05:49:46 UTC
Like bug 1595997, once the aggregated API service is removed, the controllers pod starts up again.

Today the issue occurred in QE's public test environment and blocked it from being usable for testing, so I am changing some fields of the bug.

Comment 4 Michal Fojtik 2018-08-22 11:59:47 UTC
Fixed here: https://github.com/openshift/origin/pull/20693
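The actual change is in the linked PR. As a rough, hypothetical sketch of the general approach (discoverAPIs and the group names are invented stand-ins for the real discovery call), the idea is to log a partial discovery failure and continue with whatever did resolve, instead of calling glog.Fatalf:

```go
package main

import (
	"errors"
	"fmt"
)

// discoverAPIs again stands in for API discovery: it returns the groups
// that responded, plus an error when an aggregated group is down.
func discoverAPIs() ([]string, error) {
	return []string{"apps/v1", "batch/v1"},
		errors.New("servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request")
}

func main() {
	groups, err := discoverAPIs()
	if err != nil {
		// Tolerant behavior (sketch): warn and keep going with the
		// partial results; the unreachable aggregated group can be
		// picked up on a later resync once its apiserver recovers.
		fmt.Println("warning: partial discovery failure:", err)
	}
	fmt.Println("starting quota reconciliation with groups:", groups)
}
```

The trade-off is that quota reconciliation temporarily ignores resources in the unavailable group, but the controller process itself stays up rather than crash-looping.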

Comment 5 Xingxing Xia 2018-08-23 07:35:15 UTC
Verified in:
openshift v3.11.0-0.20.0
kubernetes v1.11.0+d4cacc0

The controllers pod issue is fixed.

Comment 7 errata-xmlrpc 2018-10-11 07:25:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

