Bug 1442875

Summary: Build stuck in Running status
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Build
Assignee: Jim Minter <jminter>
Status: CLOSED ERRATA
QA Contact: Vikas Laad <vlaad>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, jminter, mifiedle
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Source-to-image was not closing stdin/out/err pipes correctly in some error cases, causing a hang to occur. This was causing some OpenShift Builds to hang in Running status as a knock-on effect.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 21:53:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1436391, 1437121
Attachments:
Description              Flags
docker container logs    none
pod json                 none
build json               none

Description Vikas Laad 2017-04-17 20:13:09 UTC
Created attachment 1272172 [details]
docker container logs

Description of problem:
After running concurrent builds, some builds got stuck in Running status for a long time.

NAMESPACE   NAME                       TYPE      FROM          STATUS                        STARTED             DURATION
proj11      cakephp-mysql-example-12   Source    Git@0014dde   Failed (GenericBuildFailed)   43 minutes ago      34m24s
proj2       cakephp-mysql-example-19   Source    Git@0014dde   Running                       About an hour ago   
proj33      cakephp-mysql-example-14   Source    Git@0014dde   Running                       37 minutes ago      
proj48      cakephp-mysql-example-13   Source    Git@0014dde   Running                       About an hour ago   

I am also attaching the docker logs captured after stopping the docker container.

Version-Release number of selected component (if applicable):
openshift v3.6.27
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Steps to Reproduce:
1. Create 20 cakephp projects
2. Start concurrent builds in those projects
3. After some time, builds are stuck in Running state

Actual results:
Builds are stuck

Expected results:
Build should finish

Additional info:
Please see the attached docker container logs, captured when the container was stopped.

Comment 1 Jim Minter 2017-04-18 09:09:56 UTC
vendor/github.com/openshift/source-to-image/pkg/build/strategies/sti/sti.go:688: I can see that builder.docker.RunContainer(opts) has returned an error; the hang is happening while we're waiting for the container to close its stderr/stdout.  Also of note: the source upload ("starting the source uploading ...") has not completed.

Vikas, please could you set BUILD_LOGLEVEL on the builds so we can see the s2i logging?  Also, the docker state (daemon logs, docker ps -a, and container logs for the stuck build containers) would be useful.

Or, if you have a running environment that I can log into which is currently exhibiting this issue, please ping me on IRC (NB: I'm on GMT+1).

Comment 2 Jim Minter 2017-04-27 11:41:42 UTC
https://github.com/openshift/origin/pull/13817

Comment 3 Vikas Laad 2017-05-11 17:28:55 UTC
Hi Jim,

A build got stuck in Running state again while I was verifying this issue. I think the problem is something else; please let me know if I need to create another bug. I am attaching information for the build that is stuck.

root@ip-172-31-4-211: ~ # oc logs -n proj18 cakephp-mysql-example-119-build --follow
Cloning "https://github.com/redhat-performance/cakephp-ex.git" ...
        Commit: 0014ddebb91bc7dff3a1dabfbd7b51da762a6677 (made changes to enable database example)
        Author: ofthecure <robdean.smith>
        Date:   Mon Apr 25 14:33:06 2016 -0400
DEPRECATED: Use .s2i/bin instead of .sti/bin
---> Installing application source...
Pushing image 172.24.132.26:5000/proj18/cakephp-mysql-example:latest ...
error: Unable to update build status: Get https://172.24.0.1:443/oapi/v1/namespaces/proj18/builds/cakephp-mysql-example-119: dial tcp 172.24.0.1:443: getsockopt: connection refused
Registry server Address: 
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: Unable to update build status: Get https://172.24.0.1:443/oapi/v1/namespaces/proj18/builds/cakephp-mysql-example-119: dial tcp 172.24.0.1:443: getsockopt: connection refused
error: build error: Failed to push image: unauthorized: authentication required


root@ip-172-31-4-211: ~ # oc get builds -n proj18 | grep -v Complete
NAME                        TYPE      FROM          STATUS     STARTED        DURATION
cakephp-mysql-example-119   Source    Git@0014dde   Running    2 hours ago    
cakephp-mysql-example-120   Source    Git           New                       
cakephp-mysql-example-121   Source    Git           New                       


The logs show it has failed, but the list shows it is stuck in Running state. Attaching the JSON for the build and pod.

root@ip-172-31-4-211: ~ # openshift version
openshift v3.6.74
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
root@ip-172-31-4-211: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Comment 4 Vikas Laad 2017-05-11 17:29:27 UTC
Created attachment 1277984 [details]
pod json

Comment 5 Vikas Laad 2017-05-11 17:29:59 UTC
Created attachment 1277985 [details]
build json

Comment 6 Vikas Laad 2017-05-12 17:07:59 UTC
Please ignore comments #3, #4 and #5. I created another bug for that since it's a different problem: https://bugzilla.redhat.com/show_bug.cgi?id=1450466

Comment 7 Mike Fiedler 2017-05-25 14:33:18 UTC
I think this should be in ON_QA.  Looks like the PR merged to master over a month ago.

Comment 8 Jim Minter 2017-05-25 14:57:48 UTC
Sorry; perhaps it's because I forgot to set the target release?  Setting and moving to ON_QA.

Comment 9 Vikas Laad 2017-05-25 15:54:14 UTC
Verified on the following version; builds are still getting stuck.

openshift v3.6.74
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

Jim, please let me know if this is the same issue or if I need to create a new one.

Comment 11 Jim Minter 2017-05-25 16:08:36 UTC
That's not good.  Vikas, do you have an environment exhibiting this issue that I can take a look at?

Comment 12 Jim Minter 2017-05-25 16:49:35 UTC
Different bug.  Looking at the environment in question, all the stuck builds are stuck on the final image push.  In the sample in c10, s2i is pushing to the Docker daemon and is waiting for the Docker daemon to report completion.  I think this is most likely to be an OpenShift registry bug or a Docker daemon bug - I'm not sure which at this point.  Please open a new bz, and I suggest capturing:

- registry pod goroutines (SIGABRT)
- registry pod log
- docker daemon goroutines on a node hosting a failed build (SIGABRT)
- docker daemon log on same

Comment 13 Vikas Laad 2017-06-01 19:30:06 UTC
Verified in the following version:

openshift v3.6.79
kubernetes v1.6.1+5115d708d7
etcd 3.1.0


Completed 100 cycles of 30 concurrent builds. No build was stuck in Running state. Created another bug for the problem mentioned in Comment #12.

Comment 18 errata-xmlrpc 2017-11-28 21:53:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188