On both the new installer code path (which we exercise very frequently now) and on the existing installer, if the controller can't reach an aggregated APIService it errors out with a glog.Fatalf, which terminates the controller process and prevents forward progress. https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/145/pull-ci-origin-installer-e2e-aws/526/ failed for this reason, and I was able to reproduce it when running the installer myself, on master from August 17th:

F0817 20:18:33.902810 1 controller_manager.go:127] Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request)

The metrics server was up, and I didn't see anything unusual about its service status. The openshift-controller-manager pod was crash-looping. After deleting the metrics-server APIService and deleting the controller pod, the next instance loaded and ran successfully.
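For context, here is a minimal sketch (not the actual origin code, and the tolerant branch is just the general client-go pattern, not necessarily what any fix does) of how a start-up path that treats every discovery error as fatal dies when a single aggregated group is unavailable, even though client-go returns the partial resource list alongside an ErrGroupDiscoveryFailed:

package main

import (
	"github.com/golang/glog"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		glog.Fatalf("unable to build client config: %v", err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		glog.Fatalf("unable to build discovery client: %v", err)
	}

	// ServerPreferredResources still returns the resources it *could* discover
	// when one aggregated group fails; the failure comes back as an
	// ErrGroupDiscoveryFailed alongside the partial result.
	resources, err := dc.ServerPreferredResources()
	if err != nil {
		if discovery.IsGroupDiscoveryFailedError(err) {
			// Tolerant behavior: log and continue with the groups we have.
			glog.Errorf("partial discovery failure, continuing: %v", err)
		} else {
			// Fatal behavior: this is the shape of the failure seen in this bug,
			// terminating the process and producing the CrashLoopBackOff.
			glog.Fatalf("Error starting %q (%v)", "openshift.io/cluster-quota-reconciliation", err)
		}
	}
	glog.Infof("discovered %d resource lists", len(resources))
}

Because the start-up code runs once per process, calling Fatalf here means one unavailable aggregated group keeps the whole controller pod in CrashLoopBackOff rather than letting it start with the groups it can see.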
> The openshift-controller manager pod

From your description, are you reporting against version 3.11 instead of 3.9?

I encountered the same error, but in my case it reported servicecatalog instead of metrics. My reproduction steps:

1. Launch a 3.11 (v3.11.0-0.17.0) cluster in AWS.
2. Stop the master for a few minutes.
3. Start the master.
4. Check pods:

[root@ip-172-18-12-204 ~]# oc get po -w --all-namespaces
default               docker-registry-1-fjfqf                            1/1   Running            0   32m
default               router-1-l6h8l                                     1/1   Running            0   32m
kube-service-catalog  apiserver-gn9gc                                    1/1   Running            1   29m
kube-service-catalog  controller-manager-kxlmh                           0/1   CrashLoopBackOff   6   29m
kube-system           master-api-ip-172-18-12-204.ec2.internal           1/1   Running            1   39m
kube-system           master-controllers-ip-172-18-12-204.ec2.internal   0/1   CrashLoopBackOff   6   38m
kube-system           master-etcd-ip-172-18-12-204.ec2.internal          1/1   Running            1   38m

The controllers pod's logs (see full log in attachment):

...
I0820 08:35:59.814022 1 leaderelection.go:190] failed to acquire lease kube-system/kube-scheduler
I0820 08:35:59.823344 1 client_builder.go:233] Verified credential for cluster-quota-reconciliation-controller/openshift-infra
I0820 08:35:59.830345 1 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
F0820 08:35:59.849987 1 controller_manager.go:127] Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request)

In short, the controllers pod is constantly in CrashLoopBackOff with the log "Error starting "openshift.io/cluster-quota-reconciliation" (unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1beta1 ...)". I found other bugs, one with the SAME controllers log error (bug 1619116) and one with a SIMILAR error (bug 1595997), so I linked them.
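For anyone triaging this, the discovery error maps directly to the Available condition of the aggregated APIService. Below is a minimal diagnostic sketch, not part of any fix, assuming a recent client-go with the dynamic client; the APIService name and the kubeconfig path are illustrative placeholders, not taken from this bug:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Kubeconfig path is illustrative; use whatever admin kubeconfig is on the master.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/origin/master/admin.kubeconfig")
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{
		Group:    "apiregistration.k8s.io",
		Version:  "v1",
		Resource: "apiservices",
	}
	// APIService name is illustrative; substitute the group named in the fatal log.
	obj, err := dyn.Resource(gvr).Get(context.TODO(), "v1beta1.servicecatalog.k8s.io", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Print the Available condition; anything other than status=True explains the
	// "the server is currently unable to handle the request" discovery error.
	conditions, _, _ := unstructured.NestedSlice(obj.Object, "status", "conditions")
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Available" {
			fmt.Printf("Available=%v reason=%v message=%v\n",
				cond["status"], cond["reason"], cond["message"])
		}
	}
}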
Like bug 1595997, when the offending aggregated APIService is removed, the controllers pod starts back up. Today the issue happened in QE's public test env and blocked the env from being usable for testing, so I'm changing some fields of the bug.
Fixed here: https://github.com/openshift/origin/pull/20693
Verified in:
openshift v3.11.0-0.20.0
kubernetes v1.11.0+d4cacc0

The controllers pod issue is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652