Bug 1425824 - Build stuck in Running status forever
Summary: Build stuck in Running status forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jim Minter
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-22 14:25 UTC by Vikas Laad
Modified: 2017-07-24 14:11 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A race condition, which could cause builds with short-running post-commit hooks to hang, was resolved.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:13:31 UTC
Target Upstream Version:
Embargoed:


Attachments
build logs (39.05 KB, text/plain)
2017-02-22 14:25 UTC, Vikas Laad
describe pod (3.16 KB, text/plain)
2017-02-22 14:48 UTC, Vikas Laad
go routine dump of stuck s2i builder (18.55 KB, text/plain)
2017-02-22 21:20 UTC, Cesar Wong
container log with goroutine dump (57.49 KB, text/plain)
2017-02-23 20:09 UTC, Cesar Wong
build logs on fork ami (58.27 KB, text/plain)
2017-02-27 18:27 UTC, Vikas Laad


Links
Red Hat Product Errata RHBA-2017:0884 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory, last updated 2017-04-12 22:50:07 UTC

Description Vikas Laad 2017-02-22 14:25:06 UTC
Created attachment 1256463 [details]
build logs

Description of problem:
Running the concurrent build test causes this problem: one of the quickstart app builds has been stuck in Running status for 11 hours.

root@ip-172-31-37-221: ~/svt # oc get pods
NAME                          READY     STATUS      RESTARTS   AGE
django-psql-example-1-build   0/1       Completed   0          12h
django-psql-example-2-build   0/1       Completed   0          12h
django-psql-example-3-build   0/1       Completed   0          11h
django-psql-example-4-build   0/1       Completed   0          11h
django-psql-example-5-build   1/1       Running     0          11h

Version-Release number of selected component (if applicable):
openshift v3.5.0.32-1+4f84c83
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:


Steps to Reproduce:
1. Create 50 django apps.
2. Run concurrent builds for the django apps.
3. The build got stuck when 40 builds were running.

The environment has 2 m4.xlarge worker nodes, 1 infra node, and 1 master.

Actual results:
Build stuck in Running status

Expected results:
Build should fail/pass

Additional info:
Build logs attached.

Comment 1 Cesar Wong 2017-02-22 14:30:21 UTC
Hi Vikas,
Would it be possible to get the state of the pod that corresponds to that build?

If the pod is in the running state, find the node where the pod is running and signal the pod's main process with -6 (SIGABRT). A goroutine dump will be written to the pod/build log.

That would give us a clue as to what's stuck.
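
For example, one possible sequence (illustrative only; the pod name is taken from the report above, and <container-id> and <pid> are placeholders):

oc get pod django-psql-example-5-build -o wide     # find the node running the pod

then, on that node:

docker ps | grep django-psql-example-5-build       # find the build container
docker inspect --format '{{.State.Pid}}' <container-id>
kill -ABRT <pid>
oc logs django-psql-example-5-build                # goroutine dump appears in the build log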

Comment 2 Vikas Laad 2017-02-22 14:48:36 UTC
Created attachment 1256490 [details]
describe pod

Comment 3 Cesar Wong 2017-02-22 16:07:10 UTC
Jim, copying you on this bug. Ben said you may have fixed this issue already. The build pod hangs while executing the post-commit hook. 

This is the hook:
"postCommit":{"script":"./manage.py test"}

Any info you can add is greatly appreciated.

Comment 4 Jim Minter 2017-02-22 17:28:55 UTC
It's possible, but I thought the problem I was looking at only happened on hooks that terminated very quickly.  Perhaps if the box is under sufficient load it's also possible to trigger it.

References:

https://github.com/openshift/origin/issues/12587
https://bugzilla.redhat.com/show_bug.cgi?id=1420147

Vikas, what version of docker is being used, please?  Does the problem recur with docker-1.12.6-10.el7?

Comment 5 Cesar Wong 2017-02-22 18:22:57 UTC
Jim, the node is running docker-1.12.6-8.el7.x86_64

Comment 6 Vikas Laad 2017-02-22 19:23:03 UTC
Jim,

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      ddff1c3/1.12.6
 Built:           Mon Feb 20 11:27:19 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      ddff1c3/1.12.6
 Built:           Mon Feb 20 11:27:19 2017
 OS/Arch:         linux/amd64


I will do another run with docker-1.12.6-8.el7.x86_64 when available in openshift latest repo.

Comment 7 Cesar Wong 2017-02-22 19:25:03 UTC
Vikas, you mean docker-1.12.6-10.el7.x86_64 ?

Comment 8 Vikas Laad 2017-02-22 19:34:38 UTC
Oh sorry, yes docker-1.12.6-10.el7.x86_64

Comment 9 Cesar Wong 2017-02-22 21:20:41 UTC
Created attachment 1256713 [details]
go routine dump of stuck s2i builder

Comment 10 Vikas Laad 2017-02-22 21:23:29 UTC
Raising severity since it's blocking SVT testing; I was able to reproduce it in the next run.

Will test again as soon as the new rpm is available.

Comment 11 Cesar Wong 2017-02-22 21:26:26 UTC
Jim, please see the attached goroutine dump. I believe this is the same issue you referenced above. The builder thread is stuck copying from the output stream of the post-commit hook container while that container has already finished.

If it is fixed in docker-1.12.6-10.el7.x86_64, then the issue should hopefully be resolved when Vikas upgrades.
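
For illustration, a minimal Go sketch of the stuck pattern (an assumed shape, not the actual origin/s2i builder code): the copy goroutine expects EOF when the container exits, and if the daemon never closes the attached stream, the receive blocks forever.

package builder

import (
	"io"
	"os"
)

// runPostCommitHook sketches the hang: attach is the hook container's
// output stream, waitExit blocks until the container exits.
func runPostCommitHook(attach io.ReadCloser, waitExit func() error) error {
	copyDone := make(chan error, 1)
	go func() {
		_, err := io.Copy(os.Stdout, attach) // expects EOF at container exit
		copyDone <- err
	}()

	if err := waitExit(); err != nil {
		return err
	}
	// If the daemon never closes the stream, this receive blocks
	// forever and the build stays "Running".
	return <-copyDone
}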

Comment 12 Jim Minter 2017-02-23 11:08:32 UTC
Cesar - agreed.

Comment 13 Cesar Wong 2017-02-23 20:09:03 UTC
Happened again, this time with the newer docker version:

root@ip-172-31-57-222: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-10.el7.x86_64
 Go version:      go1.7.4
 Git commit:      7f3e2af/1.12.6
 Built:           Tue Feb 21 15:24:45 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-10.el7.x86_64
 Go version:      go1.7.4
 Git commit:      7f3e2af/1.12.6
 Built:           Tue Feb 21 15:24:45 2017
 OS/Arch:         linux/amd64

Attaching full container log - goroutine dump is at the end.

Comment 14 Cesar Wong 2017-02-23 20:09:41 UTC
Created attachment 1257016 [details]
container log with goroutine dump

Comment 15 Jim Minter 2017-02-24 11:10:56 UTC
I think this is https://github.com/docker/docker/issues/31323, and I think we can work around it.  Let me see if I can get a patch together.

Comment 16 Cesar Wong 2017-02-24 12:14:23 UTC
Thanks Jim, assigning the bug to you for now.

Comment 17 Jim Minter 2017-02-24 12:32:36 UTC
https://github.com/openshift/origin/pull/13100

Comment 18 Jim Minter 2017-02-24 14:31:32 UTC
Vikas, ami-21479437 (fork_ami_openshift3_bz1425824_344) is a fork AMI which should contain the above PR.  It is built and going through post-build testing at the moment.  Please can you see if you can recreate the issue on that AMI?

Comment 19 Vikas Laad 2017-02-24 14:43:16 UTC
Sure, creating cluster with this AMI.

Comment 20 Jim Minter 2017-02-24 18:20:51 UTC
ami-b364b7a5 (fork_ami_openshift3_bz1425824_348) is now available.  Attempting to frustrate the overzealous AMI pruner, I'm making a copy of it under a different name, ami-c069bad6 (jminter_fork_ami_openshift3_bz1425824_348), which should also be available soon.

Comment 21 Vikas Laad 2017-02-24 22:04:41 UTC
Jim,
We started running into disk space issues after a few builds in the environment created from this fork AMI. Can we test this code after it merges?

Otherwise we can spend more time creating another cluster from this fork AMI.

Comment 22 Jim Minter 2017-02-27 12:35:49 UTC
Vikas, I'd rather know that the workaround solves the problem before committing it, if it is at all possible.  Ben, what do you think?

Comment 23 Ben Parees 2017-02-27 15:42:35 UTC
@Jim yeah, it would be nice to see it verified.  That said, it should be easy to recreate; were you able to recreate it locally and verify your fix?

Comment 24 Jim Minter 2017-02-27 15:58:38 UTC
I was able to recreate /a/ hang issue locally by modifying the docker daemon and adding a carefully located time.Sleep(), and my PR resolves that, but I don't know for sure if my issue is the same as this issue - hence my preference for Vikas to tell me if this solves the problem he's seeing or not.
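
For reference, a sketch of the general shape such a workaround could take (an assumption based on the issue description above, not the actual diff in the PR): once the container has exited, wait only a bounded grace period for the output copy to finish, then force-close the stream to unblock the copy goroutine.

package builder

import (
	"io"
	"os"
	"time"
)

func runPostCommitHookBounded(attach io.ReadCloser, waitExit func() error) error {
	copyDone := make(chan error, 1)
	go func() {
		_, err := io.Copy(os.Stdout, attach)
		copyDone <- err
	}()

	exitErr := waitExit() // returns once the hook container has exited

	select {
	case <-copyDone:
		// Normal case: the daemon closed the stream at container exit.
	case <-time.After(10 * time.Second): // grace period is an arbitrary choice
		attach.Close() // forces io.Copy to return with an error
		<-copyDone
	}
	return exitErr
}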

Comment 25 Vikas Laad 2017-02-27 18:27:18 UTC
Created attachment 1258182 [details]
build logs on fork ami

Tried again on the fork AMI and saw a similar error. I have attached the logs and will also keep the environment around for a few hours.

Comment 26 Vikas Laad 2017-02-27 18:54:05 UTC
Note: this env was created using fork_ami_openshift3_bz1425824_348.

Comment 27 Jim Minter 2017-02-28 09:45:33 UTC
Vikas, the logs suggest that the fork AMI version of openshift/origin-sti-builder was not being used.  This could be because OpenShift wasn't started with the --latest-images argument (see my e-mail to aos-devel).  Please can you double-check?

Comment 28 Vikas Laad 2017-03-01 18:20:54 UTC
I did not see this problem in the dev env created from this fork AMI. Completed 3 cycles of concurrent builds.

Comment 29 Troy Dawson 2017-03-06 19:04:08 UTC
This has been merged into OCP and is in OCP v3.5.0.40 or newer.

Comment 31 Vikas Laad 2017-03-07 18:15:20 UTC
Verified in the following version; completed 3 rounds of concurrent builds and did not see the issue.

openshift v3.5.0.40
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 33 errata-xmlrpc 2017-04-12 19:13:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

