Description of problem:
When upgrading OCP from 3.9 to 3.10, if the task [ansible_service_broker : wait for migration to complete] fails, asb-etcd-migration pods keep being created in the openshift-ansible-service-broker namespace, which consumes a lot of resources.

Version-Release number of selected component (if applicable):
openshift-ansible: 3.10.0-0.58.0

How reproducible:
Always

Steps to Reproduce:
1. Install OCP v3.9 with the etcd pod in Error status; this causes the etcd task to fail during the upgrade to 3.10.
# oc get pod
NAME                READY     STATUS    RESTARTS   AGE
asb-1-deploy        0/1       Error     0          1h
asb-etcd-1-deploy   0/1       Error     0          1h
2. Upgrade to 3.10.
3. Check TASK [ansible_service_broker : Migrate from etcd to CustomResources].

Actual results:
3. The task 'Migrate from etcd to CustomResources' fails, but asb-etcd-migration pods are still generated even after the task stops retrying.

# oc logs -f asb-etcd-migration-wtfpv
time="2018-06-04T09:12:57Z" level=info msg="etcd configuration: {asb-etcd.openshift-ansible-service-broker.svc /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt /var/run/asb-etcd-auth/client.crt /var/run/asb-etcd-auth/client.key 2379}"
time="2018-06-04T09:12:57Z" level=info msg="== ETCD CX =="
time="2018-06-04T09:12:57Z" level=info msg="EtcdHost: asb-etcd.openshift-ansible-service-broker.svc"
time="2018-06-04T09:12:57Z" level=info msg="EtcdPort: 2379"
time="2018-06-04T09:12:57Z" level=info msg="Endpoints: [https://asb-etcd.openshift-ansible-service-broker.svc:2379]"
2018/06/04 09:12:57 Dao::BatchGetRaw
panic: Unable to get all specs from etcd - client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://asb-etcd.openshift-ansible-service-broker.svc:2379 exceeded header timeout

goroutine 1 [running]:
main.main()
	/builddir/build/BUILD/ansible-service-broker-1.2.16/cmd/migration/main.go:90 +0x3c16

[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod
NAME                       READY     STATUS    RESTARTS   AGE
asb-1-deploy               0/1       Error     0          1h
asb-etcd-1-deploy          0/1       Error     0          1h
asb-etcd-migration-2lnqd   0/1       Error     0          3m
asb-etcd-migration-2z2cp   0/1       Error     0          51s
asb-etcd-migration-45k5g   0/1       Error     0          2m
asb-etcd-migration-4r6qb   0/1       Error     0          3m
asb-etcd-migration-4sp6q   0/1       Error     0          2m
asb-etcd-migration-585vc   0/1       Error     0          3m
asb-etcd-migration-5mgjn   0/1       Error     0          3m
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod | wc -l
81
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l
84
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l
136

After the job completed, migration pods were still being created.
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l
305

Expected results:
asb-etcd-migration pods should stop being created once the task stops retrying.

Additional info:
This looks like the result of an upstream Kubernetes bug related to Jobs, where the backoffLimit is no longer respected: https://github.com/kubernetes/kubernetes/issues/62382
https://github.com/openshift/openshift-ansible/pull/8625
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/b3706ea39c192e306728358a365644d2db25419f
Bug 1585648- Set timeout for ASB migration job (workaround for kubernetes/kubernetes#62382)

https://github.com/openshift/openshift-ansible/commit/72428990fb8fa27cdda26238d49a27c7daf9ad3f
Merge pull request #8625 from fabianvf/bz1585648

Bug 1585648- Set timeout for ASB migration job
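
For readers unfamiliar with how a Job can be time-bounded, here is a minimal sketch of the general idea. It assumes the timeout is expressed through the Kubernetes Job field activeDeadlineSeconds, which marks the Job as failed and stops it from creating pods once the deadline passes, regardless of backoffLimit. The actual change in the commits above may be implemented differently; the helper name, image, and numeric values below are hypothetical and chosen only for illustration.

```python
import json


def migration_job_manifest(name, namespace, image, deadline_seconds, backoff_limit):
    """Return a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Retry limit; the upstream bug means this may not be honored.
            "backoffLimit": backoff_limit,
            # Hard wall-clock limit for the whole Job, independent of retries.
            "activeDeadlineSeconds": deadline_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{"name": "migration", "image": image}],
                },
            },
        },
    }


# Illustrative values only; the image reference is a placeholder.
manifest = migration_job_manifest(
    name="asb-etcd-migration",
    namespace="openshift-ansible-service-broker",
    image="example.invalid/asb-migration:latest",
    deadline_seconds=600,
    backoff_limit=3,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON is a complete Job definition, so it could be fed to `oc create -f -` on a test cluster to observe the deadline behavior.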
With the fix applied as a workaround, the migration job stops creating pods after the timeout.
# oc get pod -n openshift-ansible-service-broker | wc -l
145
Marking as verified.
Version: openshift-ansible-3.10.0-0.60.0
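
Note that `oc get pod | wc -l` counts the header line and every pod in the namespace, not only the migration pods. As a rough sketch of a more targeted check, the snippet below tallies pods whose names start with asb-etcd-migration- by status, given captured `oc get pod` text. The function name is hypothetical, and the sample text is abbreviated from the listing in the description.

```python
def count_pods_by_status(oc_output, prefix="asb-etcd-migration-"):
    """Tally pods whose name starts with `prefix`, keyed by the STATUS column."""
    counts = {}
    for line in oc_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and pods that do not match the prefix.
        if len(fields) < 3 or not fields[0].startswith(prefix):
            continue
        status = fields[2]
        counts[status] = counts.get(status, 0) + 1
    return counts


# Abbreviated sample in the same column layout as `oc get pod`.
sample = """NAME READY STATUS RESTARTS AGE
asb-1-deploy 0/1 Error 0 1h
asb-etcd-1-deploy 0/1 Error 0 1h
asb-etcd-migration-2lnqd 0/1 Error 0 3m
asb-etcd-migration-2z2cp 0/1 Error 0 51s
"""
print(count_pods_by_status(sample))
```

Running the check twice a few minutes apart and comparing the totals shows whether new migration pods are still being created.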
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816