Bug 1721847

Summary: Jenkins builds (source strategy) get intermittently stuck at the git-clone operation
Product: OpenShift Container Platform Reporter: Christian Koep <ckoep>
Component: Build Assignee: Adam Kaplan <adam.kaplan>
Status: CLOSED INSUFFICIENT_DATA QA Contact: wewang <wewang>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.11.0 CC: adam.kaplan, antgarci, aos-bugs, bmilne, cmarches, dtarabor, gmontero, jdesousa, jokerman, ksalunkh, mmccomas, pweil, rbost, rhowe, rkrawitz, rphillips, sburke, wzheng, zhigwang
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: x86_64   
OS: Linux   
Whiteboard: stale
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-02-06 18:13:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Christian Koep 2019-06-19 07:08:40 UTC
Description of problem:
- Running a Jenkins build in OpenShift Container Platform sometimes results in the build getting stuck indefinitely. As a result, subsequent builds are cancelled with an error similar to the following:

~~~
Error running start-build on at least one item: [buildconfig/<OMITTED>];
{reference={}, err=Uploading directory "/var/lib/jenkins/jobs/Monitoring jobs/jobs/<OMITTED>/workspace" as binary input for the build ...
Unable to connect to the server: net/http: HTTP/1.x transport connection broken: write tcp 1.2.3.4:45792->5.6.7.8:443: write: connection reset by peer, verb=start-build, cmd=oc --server=https://server.example.com --insecure-skip-tls-verify --namespace=<OMITTED> --token=<OMITTED> start-build buildconfig/<OMITTED> --follow --from-dir='/var/lib/jenkins/jobs/Monitoring jobs/jobs/<OMITTED>/workspace' -o=name , out=, status=1}
~~~

An analysis of the master logs showed the following error message:

~~~
Apr 29 14:00:06 omitted.example.com atomic-openshift-node[20057]: E0429 14:00:06.114257   20057 status_manager.go:335] Status update on pod <OMITTED>/<OMITTED>-668-build aborted: terminated container git-clone attempted illegal transition to non-terminated state
~~~
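To check how often this message occurs across node logs, a small filter along these lines may help. It is only a sketch: the pattern is inferred from the single log line quoted above, and the helper name `find_illegal_transitions` and the second sample line are made up for illustration.

```python
import re

# Pattern inferred from the one status_manager line quoted above; other
# kubelet versions may word the message differently (assumption).
ILLEGAL_TRANSITION = re.compile(
    r"Status update on pod (?P<namespace>[^/\s]+)/(?P<pod>\S+) aborted: "
    r"terminated container (?P<container>\S+) attempted illegal transition "
    r"to non-terminated state"
)


def find_illegal_transitions(lines):
    """Return (namespace, pod, container) for every matching log line."""
    hits = []
    for line in lines:
        match = ILLEGAL_TRANSITION.search(line)
        if match:
            hits.append((match["namespace"], match["pod"], match["container"]))
    return hits


sample = [
    # Line copied from the report.
    "Apr 29 14:00:06 omitted.example.com atomic-openshift-node[20057]: "
    "E0429 14:00:06.114257   20057 status_manager.go:335] Status update on pod "
    "<OMITTED>/<OMITTED>-668-build aborted: terminated container git-clone "
    "attempted illegal transition to non-terminated state",
    # Hypothetical non-matching line, included only to show the filtering.
    "Apr 29 14:00:07 omitted.example.com atomic-openshift-node[20057]: unrelated",
]

for namespace, pod, container in find_illegal_transitions(sample):
    print(f"{namespace}/{pod}: container {container}")
```

In practice the lines would come from the node journal rather than an in-memory list.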

Further analysis has shown that the build gets stuck during the git-clone operation:

~~~
docker_ps_-a:0471ca94f351        d3d2fbc373fb                                                                                                                     "openshift-git-clo..."   2 hours ago           Up 2 hours                                          k8s_git-clone_<OMITTED>-668-build_<OMITTED>_412b6b51-6a76-11e9-b041-00505698544b_0

docker_ps_-a:2fcf04946b3c        registry.access.redhat.com/openshift3/ose-pod:v3.11.88                                                                           "/usr/bin/pod"           2 hours ago           Up 2 hours                                          k8s_POD_<OMITTED>-668-<OMITTED>_412b6b51-6a76-11e9-b041-00505698544b_0
~~~

I will attach more data privately to this Bugzilla.

Version-Release number of selected component (if applicable):
- Red Hat OpenShift Container Platform 3.11.88

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:
- Jenkins builds (source strategy) get intermittently stuck at the git-clone operation

Expected results:
- Jenkins builds complete successfully.

Additional info:
- This issue was initially reported in RHBZ#1705557

Comment 36 Caden Marchese 2019-07-18 16:39:45 UTC
Customer is not able to exec or rsh into the pods:

error: unable to upgrade connection: container not found ("docker-build")

Comment 76 Ryan Phillips 2019-09-18 13:29:24 UTC
Reassigning to Adam, because there might be another issue with the git clone within the builder tool.

Comment 84 Tony Garcia 2019-10-15 21:36:53 UTC
Hi Adam,

Have you had a chance to review the customer output Ben provided the other day?

Comment 86 Adam Kaplan 2019-10-16 13:21:30 UTC
Issue summary (since the thread is very long):

An OpenShift build with the `Binary` source strategy is initiated from a Jenkins pipeline. The build pod's `bsdtar` process appears to hang while waiting for content to be uploaded. At present it is not clear why the build pod does not consider the upload complete; this requires further investigation.

As an immediate workaround, I recommend switching the Jenkins-initiated builds to clone source from a git-compatible repository (GitHub, GitLab, Bitbucket, etc.) [1]. This kind of source strategy is more fault-tolerant than `Binary` source builds. Note that this workaround is not feasible in all situations.

[1] https://docs.openshift.com/container-platform/3.11/dev_guide/builds/build_inputs.html#source-code
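
As a rough illustration of what the workaround changes on the Jenkins side: once the BuildConfig itself points at a git repository, the `oc start-build` call no longer needs `--from-dir`, so no workspace upload takes place. The sketch below uses only flags that appear in the command logged in the description. The helper name `start_build_argv` and the `example-*` values are hypothetical, and authentication flags are deliberately left out.

```python
def start_build_argv(build_config, namespace, server, from_dir=None):
    """Assemble an `oc start-build` argv from flags seen in the logged command."""
    argv = [
        "oc",
        f"--server={server}",
        f"--namespace={namespace}",
        "start-build",
        f"buildconfig/{build_config}",
        "--follow",
        "-o=name",
    ]
    if from_dir is not None:
        # Binary input: the local directory is uploaded to the build pod.
        argv.append(f"--from-dir={from_dir}")
    return argv


# Binary build, as in the failing Jenkins job (placeholder values).
binary = start_build_argv(
    "example-bc", "example-ns", "https://server.example.com",
    from_dir="/var/lib/jenkins/workspace/example",
)

# Git-sourced build: same call without --from-dir. This assumes the
# BuildConfig has already been changed to use a git source.
git_based = start_build_argv("example-bc", "example-ns", "https://server.example.com")

print(" ".join(git_based))
```

The resulting list can be passed to `subprocess.run` or translated into the equivalent pipeline step.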