Bug 1733327

Summary: redeploy-certificates crashes service catalog controllers and fails if metrics-server is installed
Product: OpenShift Container Platform
Reporter: Pablo Alonso Rodriguez <palonsor>
Component: Installer
Assignee: Joseph Callen <jcallen>
Installer sub component: openshift-ansible
QA Contact: Weinan Liu <weinliu>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: medium
CC: dsutherland1492, fshaikh, gpei, hkaneko, jcallen, nbhatt, rludva, tkimura, vlaad
Version: 3.11.0
Keywords: Reopened
Target Milestone: ---
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-24 08:08:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Pablo Alonso Rodriguez 2019-07-25 17:43:16 UTC
Description of problem:

If metrics-server is installed and the /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml playbook is run, the playbook fails during service catalog certificate redeployment, at the task that verifies whether the catalog controller-manager pods are up.

The reason is that these pods end up in CrashLoopBackOff. Their logs show failures accessing the OpenShift API, caused by problems with the metrics-server extended API (metrics.k8s.io/v1beta1).

The logs of the metrics-server pod show many certificate-related errors. I also note that this pod is not restarted during certificate redeployment.

There is a workaround: while the controller check task is running, open another shell and delete the pod in openshift-metrics-server. The catalog controller-manager pods then recover, the task can continue, and the playbook ends successfully.
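
The workaround can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names `restart_metrics_server` and `show_catalog_pods` and the `OC` override variable are illustrative and are not part of openshift-ansible, and it assumes `oc` is already logged in with permission to delete pods in the openshift-metrics-server project.

```shell
#!/bin/sh
# Sketch of the manual workaround; not part of openshift-ansible.
# OC can be overridden (for example OC=echo) to do a dry run.
OC="${OC:-oc}"

# Delete every pod in the openshift-metrics-server project so that
# the pods are recreated.
restart_metrics_server() {
    "$OC" delete pod --all -n openshift-metrics-server
}

# List the service catalog pods to see whether controller-manager recovers.
show_catalog_pods() {
    "$OC" get pods -n kube-service-catalog
}
```

Run `restart_metrics_server` from a second shell while the playbook is retrying the "Verify that the controller-manager is running" task, then use `show_catalog_pods` to watch the controller-manager pods leave CrashLoopBackOff.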

Version-Release number of the following components:

rpm -q openshift-ansible
openshift-ansible-3.11.129-1.git.0.11838de.el7.noarch

rpm -q ansible
ansible-2.6.16-1.el7ae.noarch

ansible --version
ansible 2.6.16
  config file = /root/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 12:19:05) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]


How reproducible:

Always, as long as both metrics-server and service catalog are installed

Steps to Reproduce:
1. ISSUE: Redeploy certificates in a cluster with both metrics-server and service catalog installed.
2. WORKAROUND: Delete the pods in the openshift-metrics-server project during the task "TASK [openshift_service_catalog : Verify that the controller-manager is running]".


Actual results (without workaround):

- Playbook fails at this task:

2019-07-25 12:53:37,355 p=34829 u=root |  TASK [openshift_service_catalog : Verify that the controller-manager is running] *************************************************************************************************************************************************************
2019-07-25 12:53:38,777 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (60 retries left).
2019-07-25 12:53:49,385 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (59 retries left).
2019-07-25 12:53:59,830 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (58 retries left).
2019-07-25 12:54:10,286 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (57 retries left).
2019-07-25 12:54:20,805 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (56 retries left).
2019-07-25 12:54:31,583 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (55 retries left).
(...)
2019-07-25 13:03:38,963 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (3 retries left).
2019-07-25 13:03:49,385 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (2 retries left).
2019-07-25 13:03:59,861 p=34829 u=root |  FAILED - RETRYING: Verify that the controller-manager is running (1 retries left).
2019-07-25 13:04:10,287 p=34829 u=root |  fatal: [(omitted)]: FAILED! => (omitted although this is a test cluster)

- controller-manager pods in kube-service-catalog are in CrashLoopBackOff, showing failures accessing the OpenShift API due to problems with the metrics-server API

- openshift-metrics-server pod is not restarted and shows certificate-related errors

Expected results:

- Playbook ends successfully

- Catalog controller-manager pods running fine

- openshift-metrics-server pod restarted and running fine
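
The expected state of the controller-manager can also be checked by hand with a polling loop similar in spirit to the playbook's retry task. This is a hedged sketch, not the openshift-ansible task itself: the readiness condition (numberReady equal to desiredNumberScheduled), the function names, and the `OC`, `RETRIES` and `DELAY` variables are assumptions made for illustration. The 60 retries come from the playbook log, and the delay of about 10 seconds is inferred from the log timestamps.

```shell
#!/bin/sh
# Sketch of a manual readiness check; not the openshift-ansible task itself.
OC="${OC:-oc}"
RETRIES="${RETRIES:-60}"   # the playbook log starts at 60 retries
DELAY="${DELAY:-10}"       # assumed: log timestamps are about 10 s apart

# Print one status field of the controller-manager DaemonSet.
ds_status() {
    "$OC" get daemonset controller-manager -n kube-service-catalog \
        -o "jsonpath={.status.$1}"
}

# Succeed once every scheduled controller-manager pod reports ready.
wait_for_controller_manager() {
    attempt=1
    while [ "$attempt" -le "$RETRIES" ]; do
        desired=$(ds_status desiredNumberScheduled)
        ready=$(ds_status numberReady)
        if [ "${desired:-0}" -gt 0 ] 2>/dev/null && [ "$ready" = "$desired" ]; then
            echo "controller-manager ready after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$DELAY"
    done
    echo "controller-manager not ready after $RETRIES attempts" >&2
    return 1
}
```

Calling `wait_for_controller_manager` after the certificate redeployment returns 0 once the DaemonSet is fully ready and 1 if it never becomes ready within the retry budget.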

Additional info:

(I will upload some attachments)

Comment 8 Jesus M. Rodriguez 2019-08-23 03:13:57 UTC
*** Bug 1733422 has been marked as a duplicate of this bug. ***

Comment 12 dsutherland1492 2019-09-19 18:14:22 UTC
Has this been resolved?

Comment 13 Weinan Liu 2019-09-20 07:37:42 UTC
openshift-metrics-server was not restarted after the certificates were redeployed. I would suggest treating this bug as a failed test.

The playbook ended successfully (controller-manager running), and the catalog controller-manager pods were running after redeploy-certificates.

playbooks/redeploy-certificates.yml

[root@qe-weinliu-311-146-master-etcd-1 ~]# oc version
oc v3.11.146
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://qe-weinliu-311-146-master-etcd-1:8443
openshift v3.11.146
kubernetes v1.11.0+d4cacc0



[- Playbook ends successfully]

TASK [openshift_service_catalog : Verify that the controller-manager is running] **********************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/restart_pods.yml:40
FAILED - RETRYING: Verify that the controller-manager is running (60 retries left).
FAILED - RETRYING: Verify that the controller-manager is running (59 retries left).
FAILED - RETRYING: Verify that the controller-manager is running (58 retries left).
FAILED - RETRYING: Verify that the controller-manager is running (57 retries left).
ok: [ci-vm-10-0-150-202.hosted.upshift.rdu2.redhat.com] => {"attempts": 5, "changed": false, "module_results": {"cmd": "/usr/bin/oc get daemonset controller-manager -o json -n kube-service-catalog", "results": [{"apiVersion": "extensions/v1beta1", "kind": "DaemonSet", "metadata": {"creationTimestamp": "2019-09-20T03:35:59Z", "generation": 1, "labels": {"app": "controller-manager"}, "name": "controller-manager", "namespace": "kube-service-catalog", "resourceVersion": "28400", "selfLink": "/apis/extensions/v1beta1/namespaces/kube-service-catalog/daemonsets/controller-manager", "uid": "c3655904-db57-11e9-a047-fa163eca4cbb"}, "spec": {"revisionHistoryLimit": 10, "selector": {"matchLabels": {"app": "controller-manager"}}, "template": {"metadata": {"creationTimestamp": null, "labels": {"app": "controller-manager"}}, "spec": {"containers": [{"args": ["controller-manager", "--secure-port", "6443", "-v", "3", "--leader-election-namespace", "kube-service-catalog", "--leader-elect-resource-lock", "configmaps", "--cluster-id-configmap-namespace=kube-service-catalog", "--broker-relist-interval", "5m", "--feature-gates", "OriginatingIdentity=true", "--feature-gates", "AsyncBindingOperations=true", "--feature-gates", "NamespacedServiceBroker=true"], "command": ["/usr/bin/service-catalog"], "env": [{"name": "K8S_NAMESPACE", "valueFrom": {"fieldRef": {"apiVersion": "v1", "fieldPath": "metadata.namespace"}}}], "image": "brewregistry.stage.redhat.io/openshift3/ose-service-catalog:v3.11", "imagePullPolicy": "IfNotPresent", "livenessProbe": {"failureThreshold": 3, "httpGet": {"path": "/healthz", "port": 6443, "scheme": "HTTPS"}, "initialDelaySeconds": 30, "periodSeconds": 10, "successThreshold": 1, "timeoutSeconds": 5}, "name": "controller-manager", "ports": [{"containerPort": 6443, "protocol": "TCP"}], "readinessProbe": {"failureThreshold": 1, "httpGet": {"path": "/healthz/ready", "port": 6443, "scheme": "HTTPS"}, "initialDelaySeconds": 30, "periodSeconds": 5, "successThreshold": 1, 
"timeoutSeconds": 5}, "resources": {}, "terminationMessagePath": "/dev/termination-log", "terminationMessagePolicy": "File", "volumeMounts": [{"mountPath": "/var/run/kubernetes-service-catalog", "name": "service-catalog-ssl", "readOnly": true}]}], "dnsPolicy": "ClusterFirst", "nodeSelector": {"node-role.kubernetes.io/master": "true"}, "restartPolicy": "Always", "schedulerName": "default-scheduler", "securityContext": {}, "serviceAccount": "service-catalog-controller", "serviceAccountName": "service-catalog-controller", "terminationGracePeriodSeconds": 30, "volumes": [{"name": "service-catalog-ssl", "secret": {"defaultMode": 420, "items": [{"key": "tls.crt", "path": "apiserver.crt"}, {"key": "tls.key", "path": "apiserver.key"}], "secretName": "controllermanager-ssl"}}]}}, "templateGeneration": 1, "updateStrategy": {"rollingUpdate": {"maxUnavailable": 1}, "type": "RollingUpdate"}}, "status": {"currentNumberScheduled": 1, "desiredNumberScheduled": 1, "numberAvailable": 1, "numberMisscheduled": 0, "numberReady": 1, "observedGeneration": 1, "updatedNumberScheduled": 1}}], "returncode": 0}, "state": "list"}

PLAY RECAP ********************************************************************************************************************************************************************************************************
ci-vm-10-0-148-56.hosted.upshift.rdu2.redhat.com : ok=20   changed=2    unreachable=0    failed=0
ci-vm-10-0-150-202.hosted.upshift.rdu2.redhat.com : ok=294  changed=89   unreachable=0    failed=0
ci-vm-10-0-151-107.hosted.upshift.rdu2.redhat.com : ok=20   changed=2    unreachable=0    failed=0
localhost                  : ok=15   changed=0    unreachable=0    failed=0
INSTALLER STATUS **************************************************************************************************************************************************************************************************
Initialization  : Complete (0:00:25)


[- Catalog controller-manager pods running fine]
# oc get po --all-namespaces|grep cat
kube-service-catalog                apiserver-56drn                                       1/1       Running     0          7m
kube-service-catalog                controller-manager-bfrgv                              1/1       Running     0          7m


[- openshift-metrics-server pod restarted and running fine]
[root@qe-weinliu-311-146-master-etcd-1 ~]# oc get pod -n openshift-metrics-server
NAME                             READY     STATUS    RESTARTS   AGE
metrics-server-d79b7d8d9-wsvw2   1/1       Running   0          56m
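
The AGE column is what shows whether a pod was recreated: here the metrics-server pod is 56m old while the catalog pods are 7m old. A minimal sketch for comparing exact creation times follows; the helper name `pod_start_times` and the `OC` override variable are illustrative assumptions, not part of any existing tooling.

```shell
#!/bin/sh
# Sketch only; the helper name is illustrative.
OC="${OC:-oc}"

# Print "<pod name><TAB><creation timestamp>" for each pod in a project.
pod_start_times() {
    "$OC" get pods -n "$1" \
        -o 'jsonpath={range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}'
}
```

Running `pod_start_times openshift-metrics-server` and `pod_start_times kube-service-catalog` after the playbook finishes makes the comparison explicit: a metrics-server pod whose creation timestamp predates the playbook run was not recreated by it.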

Comment 19 errata-xmlrpc 2019-09-24 08:08:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2816