Bug 1585648 - [upgrade] should stop creating asb-etcd-migration pods after the task 'wait for migration to complete' fails
Summary: [upgrade] should stop creating asb-etcd-migration pod after the task 'wait for...
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Broker
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.10.0
Assignee: Fabian von Feilitzsch
QA Contact: Zihan Tang
Depends On:
Reported: 2018-06-04 09:45 UTC by Zihan Tang
Modified: 2018-07-30 19:17 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2018-07-30 19:16:54 UTC
Target Upstream Version:


System | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2018:1816 | None | None | None | 2018-07-30 19:17:31 UTC
Github (https://github.com/kubernetes) | kubernetes kubernetes issues 62382 | None | None | None | 2019-11-17 14:46:31 UTC

Description Zihan Tang 2018-06-04 09:45:37 UTC
Description of problem:
When OCP is upgraded from 3.9 to 3.10 and the task [ansible_service_broker : wait for migration to complete] fails, asb-etcd-migration pods keep being created in the openshift-ansible-service-broker namespace, which consumes a lot of resources.

Version-Release number of selected component (if applicable):
openshift-ansible: 3.10.0-0.58.0

How reproducible:

Steps to Reproduce:
1. Install OCP v3.9 with the asb-etcd deployment in Error status, as shown below. This causes the etcd migration task to fail when upgrading to 3.10.

# oc get pod 
NAME                       READY     STATUS              RESTARTS   AGE
asb-1-deploy               0/1       Error               0          1h
asb-etcd-1-deploy          0/1       Error               0          1h

2. Upgrade to 3.10.
3. Check TASK [ansible_service_broker : Migrate from etcd to CustomResources].

Actual results:
3. The task 'Migrate from etcd to CustomResources' fails, but asb-etcd-migration pods keep being created even after the task stops retrying. The migration pod panics because it cannot reach etcd:
# oc logs -f asb-etcd-migration-wtfpv
time="2018-06-04T09:12:57Z" level=info msg="etcd configuration: {asb-etcd.openshift-ansible-service-broker.svc /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt /var/run/asb-etcd-auth/client.crt /var/run/asb-etcd-auth/client.key 2379}"
time="2018-06-04T09:12:57Z" level=info msg="== ETCD CX =="
time="2018-06-04T09:12:57Z" level=info msg="EtcdHost: asb-etcd.openshift-ansible-service-broker.svc"
time="2018-06-04T09:12:57Z" level=info msg="EtcdPort: 2379"
time="2018-06-04T09:12:57Z" level=info msg="Endpoints: [https://asb-etcd.openshift-ansible-service-broker.svc:2379]"
2018/06/04 09:12:57 Dao::BatchGetRaw
panic: Unable to get all specs from etcd - client: etcd cluster is unavailable or misconfigured; error #0: client: endpoint https://asb-etcd.openshift-ansible-service-broker.svc:2379 exceeded header timeout

goroutine 1 [running]:
	/builddir/build/BUILD/ansible-service-broker-1.2.16/cmd/migration/main.go:90 +0x3c16

[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod 
NAME                       READY     STATUS    RESTARTS   AGE
asb-1-deploy               0/1       Error     0          1h
asb-etcd-1-deploy          0/1       Error     0          1h
asb-etcd-migration-2lnqd   0/1       Error     0          3m
asb-etcd-migration-2z2cp   0/1       Error     0          51s
asb-etcd-migration-45k5g   0/1       Error     0          2m
asb-etcd-migration-4r6qb   0/1       Error     0          3m
asb-etcd-migration-4sp6q   0/1       Error     0          2m
asb-etcd-migration-585vc   0/1       Error     0          3m
asb-etcd-migration-5mgjn   0/1       Error     0          3m

[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod | wc -l 
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l 
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l 

After the job completed, it was still creating migration pods:
[root@qe-zitang-39-3master-etcd-1 ~]# oc get pod -n openshift-ansible-service-broker | wc -l 
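The `oc get pod ... | wc -l` commands above count every output line, including the NAME header and unrelated pods such as `asb-1-deploy`. A hypothetical Go helper, `countPods`, that counts only rows whose name starts with `asb-etcd-migration-` (optionally filtered by STATUS) could look like the sketch below; it assumes the default `oc get pod` column layout (NAME, READY, STATUS, RESTARTS, AGE).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// countPods is a hypothetical helper that counts rows of `oc get pod` output
// whose NAME column starts with prefix. If status is non-empty, only rows with
// that STATUS value are counted. Unlike `wc -l`, it skips the header row and
// pods belonging to other workloads.
func countPods(output, prefix, status string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skip blank/short lines and the NAME READY STATUS ... header.
		if len(fields) < 3 || fields[0] == "NAME" {
			continue
		}
		// Assumed layout: fields[0] is NAME, fields[2] is STATUS.
		if strings.HasPrefix(fields[0], prefix) && (status == "" || fields[2] == status) {
			n++
		}
	}
	return n
}

func main() {
	// Excerpt of the pod listing captured in this report.
	sample := `NAME                       READY     STATUS    RESTARTS   AGE
asb-1-deploy               0/1       Error     0          1h
asb-etcd-1-deploy          0/1       Error     0          1h
asb-etcd-migration-2lnqd   0/1       Error     0          3m
asb-etcd-migration-2z2cp   0/1       Error     0          51s`
	fmt.Println(countPods(sample, "asb-etcd-migration-", "Error")) // 2
}
```

From a shell, piping `oc get pod -n openshift-ansible-service-broker --no-headers` through `grep -c '^asb-etcd-migration-'` should give a comparable count.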

Expected results:
asb-etcd-migration pods should stop being created once the task stops retrying.

Additional info:

Comment 2 Fabian von Feilitzsch 2018-06-04 16:42:49 UTC
This looks like the result of an upstream kubernetes bug related to Jobs, where the backoffLimit is no longer respected: https://github.com/kubernetes/kubernetes/issues/62382
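To make the failure mode concrete: a Job's `spec.backoffLimit` is meant to cap how many times a failing pod is replaced, with the Job marked failed roughly once failures exceed the limit. The Go sketch below is a hypothetical, simplified model (not the actual Job controller code) of a pod that always fails, like the panicking migration pod; the `honorLimit` flag stands in for the upstream bug, and the limit of 6 is the Kubernetes default, assumed here rather than taken from the openshift-ansible template.

```go
package main

import "fmt"

// podsCreated is a hypothetical model of how many pods a Job controller
// creates for a pod that fails on every run. With honorLimit=true the Job is
// treated as failed once failures exceed backoffLimit; with honorLimit=false
// (modelling kubernetes/kubernetes#62382) a replacement pod is created on
// every attempt.
func podsCreated(backoffLimit, attempts int, honorLimit bool) int {
	created, failures := 0, 0
	for i := 0; i < attempts; i++ {
		if honorLimit && failures > backoffLimit {
			break // Job considered failed: stop creating pods
		}
		created++
		failures++ // every pod exits with an error
	}
	return created
}

func main() {
	fmt.Println(podsCreated(6, 50, true))  // 7
	fmt.Println(podsCreated(6, 50, false)) // 50
}
```

In the buggy case the pod count grows with the number of attempts instead of stopping at a fixed bound, which matches the steadily growing list of asb-etcd-migration pods in the description.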

Comment 3 Fabian von Feilitzsch 2018-06-04 17:40:38 UTC

Comment 4 openshift-github-bot 2018-06-04 19:11:09 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

Bug 1585648- Set timeout for ASB migration job (workaround for kubernetes/kubernetes#62382)

Merge pull request #8625 from fabianvf/bz1585648

Bug 1585648- Set timeout for ASB migration job

Comment 6 Zihan Tang 2018-06-06 05:46:37 UTC
With the fix as a workaround, the migration job stops creating pods after the timeout:
# oc get pod  -n openshift-ansible-service-broker| wc -l

Marking as verified.
Version: openshift-ansible-3.10.0-0.60.0

Comment 8 errata-xmlrpc 2018-07-30 19:16:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

