Bug 1585648

Summary: [upgrade] should stop creating asb-etcd-migration pods after the task 'wait for migration to complete' failed
Product: OpenShift Container Platform Reporter: Zihan Tang <zitang>
Component: Service Broker Assignee: Fabian von Feilitzsch <fabian>
Status: CLOSED ERRATA QA Contact: Zihan Tang <zitang>
Severity: high Docs Contact:
Priority: medium    
Version: 3.10.0 CC: aos-bugs, chezhang, fabian, ghuang, jiazha, jmatthew, wzheng, xtian, zhsun
Target Milestone: ---   
Target Release: 3.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-30 19:16:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Zihan Tang 2018-06-04 09:45:37 UTC
Description of problem:
When upgrading OCP from 3.9 to 3.10, if the task [ansible_service_broker : wait for migration to complete] fails, asb-etcd-migration pods are still being created in the openshift-ansible-service-broker namespace, which consumes a lot of resources.

Version-Release number of selected component (if applicable):
openshift-ansible: 3.10.0-0.58.0

How reproducible:
always

Steps to Reproduce:
1. Install OCP v3.9 where the etcd pod status is Error; this will cause the etcd task to fail when upgrading to 3.10.

# oc get pod 
NAME                       READY     STATUS              RESTARTS   AGE
asb-1-deploy               0/1       Error               0          1h
asb-etcd-1-deploy          0/1       Error               0          1h

2. Upgrade to 3.10.
3. Check TASK [ansible_service_broker : Migrate from etcd to CustomResources].

Actual results:
3. The task 'Migrate from etcd to CustomResources' fails, but asb-etcd-migration pods are still being generated even after the task stops retrying.
# oc logs -f asb-etcd-migration-wtfpv
time="2018-06-04T09:12:57Z" level=info msg="etcd configuration: {asb-etcd.openshift-ansible-service-broker.svc /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt /var/run/asb-etcd-auth/client.crt /var/run/asb-etcd-auth/client.key 2379}"
time="2018-06-04T09:12:57Z" level=info msg="== ETCD CX =="
time="2018-06-04T09:12:57Z" level=info msg="EtcdHost: asb-etcd.openshift-ansible-service-broker.svc"
time="2018-06-04T09:12:57Z" level=info msg="EtcdPort: 2379"
time="2018-06-04T09:12:57Z" level=info msg="Endpoints: [https://asb-etcd.openshift-ansible-service-broker.svc:2379]"
2018/06/04 09:12:57 Dao::BatchGetRaw
panic: Unable to get all specs from etcd - client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://asb-etcd.openshift-ansible-service-broker.svc:2379 exceeded header timeout

goroutine 1 [running]:
main.main()
	/builddir/build/BUILD/ansible-service-broker-1.2.16/cmd/migration/main.go:90 +0x3c16

[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod 
NAME                       READY     STATUS    RESTARTS   AGE
asb-1-deploy               0/1       Error     0          1h
asb-etcd-1-deploy          0/1       Error     0          1h
asb-etcd-migration-2lnqd   0/1       Error     0          3m
asb-etcd-migration-2z2cp   0/1       Error     0          51s
asb-etcd-migration-45k5g   0/1       Error     0          2m
asb-etcd-migration-4r6qb   0/1       Error     0          3m
asb-etcd-migration-4sp6q   0/1       Error     0          2m
asb-etcd-migration-585vc   0/1       Error     0          3m
asb-etcd-migration-5mgjn   0/1       Error     0          3m


[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod | wc -l 
81
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l 
84
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l 
136

After the job completed, migration pods are still being created.
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l 
305
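
Note that the `wc -l` counts above also include the header line and the two deploy pods. A minimal sketch for counting only the migration pods, assuming their names keep the `asb-etcd-migration-` prefix shown in the listing (the function name `count_migration_pods` is illustrative, not part of any tool):

```shell
# Count pods whose NAME column starts with "asb-etcd-migration-".
# Reads `oc get pod` style output on stdin, so it can be exercised offline.
count_migration_pods() {
    awk '$1 ~ /^asb-etcd-migration-/ { n++ } END { print n + 0 }'
}

# Usage against a live cluster (requires a logged-in `oc` client):
#   oc get pod -n openshift-ansible-service-broker | count_migration_pods
```

Running it repeatedly during and after the upgrade shows whether the number of migration pods keeps growing.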


Expected results:
It should stop creating asb-etcd-migration pods after the task stops retrying.

Additional info:

Comment 2 Fabian von Feilitzsch 2018-06-04 16:42:49 UTC
This looks like the result of an upstream kubernetes bug related to Jobs, where the backoffLimit is no longer respected: https://github.com/kubernetes/kubernetes/issues/62382

Comment 3 Fabian von Feilitzsch 2018-06-04 17:40:38 UTC
https://github.com/openshift/openshift-ansible/pull/8625

Comment 4 openshift-github-bot 2018-06-04 19:11:09 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/b3706ea39c192e306728358a365644d2db25419f
Bug 1585648- Set timeout for ASB migration job (workaround for kubernetes/kubernetes#62382)

https://github.com/openshift/openshift-ansible/commit/72428990fb8fa27cdda26238d49a27c7daf9ad3f
Merge pull request #8625 from fabianvf/bz1585648

Bug 1585648- Set timeout for ASB migration job
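
For context on this kind of workaround: a Kubernetes Job accepts an `activeDeadlineSeconds` field, and once that deadline passes the Job is marked failed and no further pods are created, independent of `backoffLimit`. The sketch below assumes the timeout in the commit above is expressed through that field; the job name, image, and deadline values are placeholders and are not taken from the actual change.

```shell
# Print a Job manifest with a hard deadline. All names and values here are
# illustrative placeholders, not copied from openshift-ansible.
render_migration_job() {
    deadline="${1:-600}"   # seconds; hypothetical default
    cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: example-migration
spec:
  backoffLimit: 5
  activeDeadlineSeconds: ${deadline}
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migration
        image: example/migration:latest
EOF
}

# Usage (requires a logged-in `oc` client):
#   render_migration_job 300 | oc create -n <namespace> -f -
```

With a deadline in place, a Job whose pods keep failing stops after a bounded time even if the retry limit is not honored.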

Comment 6 Zihan Tang 2018-06-06 05:46:37 UTC
Using the fix as a workaround, the migration job stops creating pods after the timeout.
# oc get pod  -n openshift-ansible-service-broker| wc -l
145

So marked as verified.
version:  openshift-ansible-3.10.0-0.60.0

Comment 8 errata-xmlrpc 2018-07-30 19:16:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816