Bug 1596557 - After running the redeploy-certificates.yml playbook in OCP 3.9, the ansible service broker stops working
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Vadim Rutkovsky
QA Contact: Yadan Pei
URL:
Whiteboard:
Depends On: 1592303 1596233 1667981
Blocks: 1623987
 
Reported: 2018-06-29 08:50 UTC by Yadan Pei
Modified: 2019-01-21 15:54 UTC (History)
CC: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1592303
Clones: 1623987
Environment:
Last Closed: 2018-09-22 04:53:09 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
ansiblelogs (90.01 KB, text/plain)
2018-09-11 02:54 UTC, Yadan Pei


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:2658 0 None None None 2018-09-22 04:53:57 UTC

Comment 1 Yadan Pei 2018-06-29 08:52:30 UTC
And here are some outputs
# oc get pods --all-namespaces   // the asb pod is in CrashLoopBackOff
NAMESPACE                           NAME                             READY     STATUS             RESTARTS   AGE
kube-service-catalog                apiserver-7vz8k                  1/1       Running            1          26m
kube-service-catalog                controller-manager-hh5pj         1/1       Running            4          26m
openshift-ansible-service-broker    asb-1-p8vh4                      0/1       CrashLoopBackOff   8          25m
openshift-ansible-service-broker    asb-etcd-1-tvq7q                 1/1       Running            2          25m
openshift-template-service-broker   apiserver-7j4n8                  1/1       Running            1          25m
openshift-template-service-broker   apiserver-9chdw                  1/1       Running            1          25m

# oc logs asb-1-p8vh4 -n openshift-ansible-service-broker
Using config file mounted to /etc/ansible-service-broker/config.yaml
2018/06/29 08:50:02 Unable to get log.logfile from config
============================================================
==           Starting Ansible Service Broker...           ==
============================================================
[2018-06-29T08:50:02.152Z] [NOTICE] - Initializing clients...
[2018-06-29T08:50:02.154Z] [INFO] - == ETCD CX ==
[2018-06-29T08:50:02.154Z] [INFO] - EtcdHost: asb-etcd.openshift-ansible-service-broker.svc
[2018-06-29T08:50:02.154Z] [INFO] - EtcdPort: 2379
[2018-06-29T08:50:02.154Z] [INFO] - Endpoints: [https://asb-etcd.openshift-ansible-service-broker.svc:2379]
[2018-06-29T08:50:02.169Z] [ERROR] - client: etcd cluster is unavailable or misconfigured; error #0: x509: certificate signed by unknown authority

We may need fixes for the ansible service broker too.
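The `x509: certificate signed by unknown authority` failure above is what a client reports when the server's certificate has been re-signed by a new CA while the client still trusts the old one, which is what happens to the broker's etcd client after the cert redeploy. A self-contained openssl sketch (not taken from this cluster; all names are illustrative) that reproduces the mismatch:

```shell
# Illustrative only: old-ca stands in for the CA the broker still trusts,
# new-ca for the CA minted by redeploy-certificates.yml.
tmp=$(mktemp -d) && cd "$tmp"

# Two independent, self-signed CAs
openssl req -x509 -newkey rsa:2048 -nodes -keyout old-ca.key \
  -out old-ca.crt -subj "/CN=old-ca" -days 1
openssl req -x509 -newkey rsa:2048 -nodes -keyout new-ca.key \
  -out new-ca.crt -subj "/CN=new-ca" -days 1

# Serving cert for asb-etcd, signed by the NEW CA
openssl req -newkey rsa:2048 -nodes -keyout server.key \
  -out server.csr -subj "/CN=asb-etcd"
openssl x509 -req -in server.csr -CA new-ca.crt -CAkey new-ca.key \
  -CAcreateserial -out server.crt -days 1

# Verification against the stale CA fails (a Go client reports this as
# "certificate signed by unknown authority"); against the new CA it passes.
openssl verify -CAfile old-ca.crt server.crt > old.out 2>&1 || true
openssl verify -CAfile new-ca.crt server.crt > new.out 2>&1
cat old.out new.out
```

A playbook fix therefore has to re-issue the broker/etcd secrets from the same CA and roll the pods so both sides pick up the new material.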

Comment 2 Yadan Pei 2018-06-29 08:54:24 UTC
The above output was captured after running openshift-ansible/playbooks/redeploy-certificates.yml

Comment 3 Yadan Pei 2018-06-29 09:07:52 UTC
Besides the ansible service broker, it seems the template service broker also needs a fix

Comment 4 Vadim Rutkovsky 2018-08-14 13:39:10 UTC
Created https://github.com/openshift/openshift-ansible/pull/9585

It also seems to fix TSB

Comment 8 Yadan Pei 2018-09-11 02:47:37 UTC
After running /usr/share/ansible/openshift-ansible/playbooks/redeploy-certificates.yml with openshift-ansible-3.9.43-1.git.0.d0bc600.el7.noarch

asb-* pods are running but the apiserver-* pods were not started correctly

# oc get pods --all-namespaces
NAMESPACE                           NAME                             READY     STATUS             RESTARTS   AGE
kube-service-catalog                apiserver-4kt52                  0/1       CrashLoopBackOff   9          29m
kube-service-catalog                controller-manager-qzp64         0/1       CrashLoopBackOff   5          29m
openshift-ansible-service-broker    asb-1-ph77j                      1/1       Running            0          15m
openshift-ansible-service-broker    asb-etcd-1-4dq8t                 1/1       Running            0          15m
openshift-template-service-broker   apiserver-4rchm                  1/1       Running            2          27m
openshift-template-service-broker   apiserver-mx4pc                  1/1       Running            1          27m
openshift-web-console               webconsole-7d7cbcf74c-7w64w      1/1       Running            0          13m

# oc logs -f apiserver-4kt52 -n kube-service-catalog
I0911 02:40:14.459147       1 feature_gate.go:184] feature gates: map[OriginatingIdentity:true]
I0911 02:40:14.459291       1 hyperkube.go:188] Service Catalog version v3.9.43 (built 2018-09-08T02:18:49Z)
W0911 02:40:14.751120       1 authentication.go:229] Unable to get configmap/extension-apiserver-authentication in kube-system.  Usually fixed by 'kubectl create rolebinding -n kube-system ROLE_NAME --role=extension-apiserver-authentication-reader --serviceaccount=YOUR_NS:YOUR_SA'
Error: Get https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication: dial tcp 172.30.0.1:443: connect: network is unreachable

Comment 9 Yadan Pei 2018-09-11 02:49:15 UTC
I don't see the TSB pods being re-created. Vadim, can you help confirm?

Comment 10 Yadan Pei 2018-09-11 02:54:16 UTC
Created attachment 1482254 [details]
ansiblelogs

Comment 11 Vadim Rutkovsky 2018-09-11 09:05:14 UTC
(In reply to Yadan Pei from comment #9)
> I don't see TSB pods are re-created, Vadim, can you help confirm?

These tasks have run:

>TASK [ansible_service_broker : Remove ASB pods] ********************************
>changed: [host-8-244-4.host.centralci.eng.rdu2.redhat.com] => (item=asb)
>changed: [host-8-244-4.host.centralci.eng.rdu2.redhat.com] => (item=asb-etcd)

Please attach the output of `ansible-playbook -vvv` for more information

>dial tcp 172.30.0.1:443: connect: network is unreachable

Some network problem? Is it reproducible? 
Can new APBs be provisioned?

Comment 12 Yadan Pei 2018-09-12 09:03:25 UTC
The above network error is reproducible; we will debug and open a separate bug if it turns out to be an issue.

Despite the network errors, what I can confirm is that the ASB secrets are re-created and the pods are recreated as well.

# oc get secret -n openshift-ansible-service-broker  //these secrets are re-created
NAME                         TYPE                                  DATA      AGE
asb-client                   kubernetes.io/service-account-token   4         16m
asb-tls                      kubernetes.io/tls                     2         16m
broker-etcd-auth-secret      Opaque                                2         16m
etcd-auth-secret             Opaque                                1         16m
etcd-tls                     kubernetes.io/tls                     2         16m
# oc get pods -n openshift-ansible-service-broker   // All ASB pods are running
NAME               READY     STATUS    RESTARTS   AGE
asb-1-mbhpg        1/1       Running   0          16m
asb-etcd-1-smn26   1/1       Running   0          16m
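To confirm that a re-created `kubernetes.io/tls` secret such as asb-tls actually chains to the new CA, the `tls.crt` it stores (base64-encoded PEM, as returned by `oc get secret -o jsonpath='{.data.tls\.crt}'`) can be decoded and its issuer inspected. A self-contained sketch, with a generated "cluster-ca" standing in for the real cluster CA:

```shell
# Illustrative only: in a real cluster the base64 PEM would come from the
# asb-tls secret via oc, not be generated locally like this.
tmp=$(mktemp -d) && cd "$tmp"

# A CA plus a cert signed by it, mimicking the asb-tls secret contents
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca.key -out ca.crt \
  -subj "/CN=cluster-ca" -days 1
openssl req -newkey rsa:2048 -nodes -keyout tls.key -out tls.csr \
  -subj "/CN=asb-tls"
openssl x509 -req -in tls.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -out tls.crt -days 1

# Secrets store the PEM base64-encoded as one long line
base64 < tls.crt | tr -d '\n' > tls.crt.b64

# Decode and read the issuer: it should name the current cluster CA
base64 -d < tls.crt.b64 | openssl x509 -noout -issuer
```

If the issuer still names the pre-redeploy CA, the secret was not regenerated and the pod consuming it will keep failing with the x509 error.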

Another point I need to confirm: I don't see the TSB secrets/pods being re-created. Do we need to recreate them as well?
# oc get pods -n openshift-template-service-broker
NAME              READY     STATUS             RESTARTS   AGE
apiserver-k5xl5   0/1       CrashLoopBackOff   9          2h
apiserver-t54c7   1/1       Running            1          2h

Comment 14 Vadim Rutkovsky 2018-09-12 09:06:01 UTC
Please attach the following info:

1) versions
2) inventory
3) apiserver container logs

Comment 15 Yadan Pei 2018-09-12 09:39:16 UTC
The network issue is not reproduced on EC2, so it's not an issue anymore.

The only remaining concern is whether we need to re-create the TSB secrets/pods

Comment 17 Yadan Pei 2018-09-12 09:43:10 UTC
openshift-ansible-3.9.43-1.git.0.d0bc600.el7.noarch

Comment 18 Vadim Rutkovsky 2018-09-12 10:28:12 UTC
API server container logs are still required to find out why it is broken

Comment 20 Vadim Rutkovsky 2018-09-12 10:45:28 UTC
One of the pods failed to connect to kube API server:

"dial tcp 172.30.0.1:443: connect: network is unreachable"

The other one works fine, so the fix worked, but network issues won't let the first pod start correctly.

Comment 21 Yadan Pei 2018-09-13 01:24:36 UTC
Moving to VERIFIED per comment 12 and comment 20

Comment 23 errata-xmlrpc 2018-09-22 04:53:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2658

