Description of problem:

If metrics-server is installed and the /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml playbook is run, it fails during service catalog certificate redeployment while verifying that the catalog controller-manager pods are up. The reason is that these pods end up in CrashLoopBackOff and, when examining their logs, we see errors accessing the OpenShift API caused by problems with the metrics-server extended API (metrics.k8s.io/v1beta1). The logs of the metrics-server pod are full of certificate-related errors, and this pod has not been restarted during certificate redeployment.

There is a workaround: if, during the controller check task, I open another shell and delete the pod in openshift-metrics-server, the catalog controller-manager pods recover and the task can continue, so the playbook ends successfully (a sketch of the workaround commands is at the end of this comment).

Version-Release number of the following components:

rpm -q openshift-ansible
openshift-ansible-3.11.129-1.git.0.11838de.el7.noarch

rpm -q ansible
ansible-2.6.16-1.el7ae.noarch

ansible --version
ansible 2.6.16
  config file = /root/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 12:19:05) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

How reproducible:

Always, as long as both metrics-server and service catalog are installed.

Steps to Reproduce:
1. ISSUE: Redeploy certificates in a cluster with both metrics-server and service catalog installed.
2. WORKAROUND: Delete the pods in the openshift-metrics-server project during the task "TASK [openshift_service_catalog : Verify that the controller-manager is running]".

Actual results (without workaround):

- Playbook fails at this task:

2019-07-25 12:53:37,355 p=34829 u=root | TASK [openshift_service_catalog : Verify that the controller-manager is running] *************************************************************************************************************************************************************
2019-07-25 12:53:38,777 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (60 retries left).
2019-07-25 12:53:49,385 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (59 retries left).
2019-07-25 12:53:59,830 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (58 retries left).
2019-07-25 12:54:10,286 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (57 retries left).
2019-07-25 12:54:20,805 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (56 retries left).
2019-07-25 12:54:31,583 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (55 retries left).
(...)
2019-07-25 13:03:38,963 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (3 retries left).
2019-07-25 13:03:49,385 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (2 retries left).
2019-07-25 13:03:59,861 p=34829 u=root | FAILED - RETRYING: Verify that the controller-manager is running (1 retries left).
2019-07-25 13:04:10,287 p=34829 u=root | fatal: [(omitted)]: FAILED! => (omitted although this is a test cluster)

- controller-manager pods in kube-service-catalog are in CrashLoopBackOff, showing errors while accessing the OpenShift API due to problems with the metrics-server API
- the openshift-metrics-server pod is not restarted and shows certificate-related errors

Expected results:

- Playbook ends successfully
- Catalog controller-manager pods running fine
- openshift-metrics-server pod restarted and running fine

Additional info:

(I will upload some attachments)
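For convenience, a minimal sketch of the workaround mentioned above, assuming the default openshift-metrics-server project and that deleting all pods in that project is sufficient (they are recreated by their deployment and pick up the redeployed certificates). While the "Verify that the controller-manager is running" task is retrying, run from a second shell:

  # delete the stale metrics-server pod(s) so they are recreated with the new certificates
  oc delete pod --all -n openshift-metrics-server

  # watch the catalog controller-manager pods recover once metrics.k8s.io is serving again
  oc get pods -n kube-service-catalog -w

Once the controller-manager pods report Running, the verification task stops retrying and the playbook finishes.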
*** Bug 1733422 has been marked as a duplicate of this bug. ***
Has this been resolved?
openshift-metrics-server failed to get restarted after the certificates were redeployed, so I would call this a failed test, even though the playbook ended successfully (controller-manager running) and the catalog controller-manager pods are running after redeploy-certificates.

playbooks/redeploy-certificates.yml

[root@qe-weinliu-311-146-master-etcd-1 ~]# oc version
oc v3.11.146
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu-311-146-master-etcd-1:8443
openshift v3.11.146
kubernetes v1.11.0+d4cacc0

[- Playbook ends successfully]

TASK [openshift_service_catalog : Verify that the controller-manager is running] **********************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/restart_pods.yml:40
FAILED - RETRYING: Verify that the controller-manager is running (60 retries left).
FAILED - RETRYING: Verify that the controller-manager is running (59 retries left).
FAILED - RETRYING: Verify that the controller-manager is running (58 retries left).
FAILED - RETRYING: Verify that the controller-manager is running (57 retries left).
ok: [ci-vm-10-0-150-202.hosted.upshift.rdu2.redhat.com] => {"attempts": 5, "changed": false, "module_results": {"cmd": "/usr/bin/oc get daemonset controller-manager -o json -n kube-service-catalog", "results": [{"apiVersion": "extensions/v1beta1", "kind": "DaemonSet", "metadata": {"creationTimestamp": "2019-09-20T03:35:59Z", "generation": 1, "labels": {"app": "controller-manager"}, "name": "controller-manager", "namespace": "kube-service-catalog", "resourceVersion": "28400", "selfLink": "/apis/extensions/v1beta1/namespaces/kube-service-catalog/daemonsets/controller-manager", "uid": "c3655904-db57-11e9-a047-fa163eca4cbb"}, "spec": {"revisionHistoryLimit": 10, "selector": {"matchLabels": {"app": "controller-manager"}}, "template": {"metadata": {"creationTimestamp": null, "labels": {"app": "controller-manager"}}, "spec": {"containers": [{"args": ["controller-manager", "--secure-port", "6443", "-v", "3", "--leader-election-namespace", "kube-service-catalog", "--leader-elect-resource-lock", "configmaps", "--cluster-id-configmap-namespace=kube-service-catalog", "--broker-relist-interval", "5m", "--feature-gates", "OriginatingIdentity=true", "--feature-gates", "AsyncBindingOperations=true", "--feature-gates", "NamespacedServiceBroker=true"], "command": ["/usr/bin/service-catalog"], "env": [{"name": "K8S_NAMESPACE", "valueFrom": {"fieldRef": {"apiVersion": "v1", "fieldPath": "metadata.namespace"}}}], "image": "brewregistry.stage.redhat.io/openshift3/ose-service-catalog:v3.11", "imagePullPolicy": "IfNotPresent", "livenessProbe": {"failureThreshold": 3, "httpGet": {"path": "/healthz", "port": 6443, "scheme": "HTTPS"}, "initialDelaySeconds": 30, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 5}, "name": "controller-manager", "ports": [{"containerPort": 6443, "protocol": "TCP"}], "readinessProbe": {"failureThreshold": 1, "httpGet": {"path": "/healthz/ready", "port": 6443, "scheme": "HTTPS"}, "initialDelaySeconds": 30, "periodSeconds": 5, "successThreshold": 1, "timeoutSeconds": 5}, "resources": {}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [{"mountPath": "/var/run/kubernetes-service-catalog", "name": "service-catalog-ssl", "readOnly": true}]}], "dnsPolicy": "ClusterFirst", "nodeSelector": {"node-role.kubernetes.io/master": "true"}, "restartPolicy": "Always", "schedulerName": "default-scheduler", "securityContext": {}, "serviceAccount": "service-catalog-controller", "serviceAccountName": "service-catalog-controller", "terminationGracePeriodSeconds": 30, "volumes": [{"name": "service-catalog-ssl", "secret": {"defaultMode": 420, "items": [{"key": "tls.crt", "path": "apiserver.crt"}, {"key": "tls.key", "path": "apiserver.key"}], "secretName": "controllermanager-ssl"}}]}}, "templateGeneration": 1, "updateStrategy": {"rollingUpdate": {"maxUnavailable": 1}, "type": "RollingUpdate"}}, "status": {"currentNumberScheduled": 1, "desiredNumberScheduled": 1, "numberAvailable": 1, "numberMisscheduled": 0, "numberReady": 1, "observedGeneration": 1, "updatedNumberScheduled": 1}}], "returncode": 0}, "state": "list"}

PLAY RECAP ********************************************************************************************************************************************************************************************************
ci-vm-10-0-148-56.hosted.upshift.rdu2.redhat.com  : ok=20   changed=2   unreachable=0   failed=0
ci-vm-10-0-150-202.hosted.upshift.rdu2.redhat.com : ok=294  changed=89  unreachable=0   failed=0
ci-vm-10-0-151-107.hosted.upshift.rdu2.redhat.com : ok=20   changed=2   unreachable=0   failed=0
localhost                                         : ok=15   changed=0   unreachable=0   failed=0

INSTALLER STATUS **************************************************************************************************************************************************************************************************
Initialization : Complete (0:00:25)

[- Catalog controller-manager pods running fine]

# oc get po --all-namespaces|grep cat
kube-service-catalog    apiserver-56drn            1/1   Running   0   7m
kube-service-catalog    controller-manager-bfrgv   1/1   Running   0   7m

[- openshift-metrics-server pod restarted and running fine]

[root@qe-weinliu-311-146-master-etcd-1 ~]# oc get pod -n openshift-metrics-server
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-d79b7d8d9-wsvw2   1/1     Running   0          56m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2816