Bug 1425824 - Build stuck in Running status forever
Summary: Build stuck in Running status forever
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jim Minter
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-02-22 14:25 UTC by Vikas Laad
Modified: 2017-07-24 14:11 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
A race condition, which could cause builds with short-running post-commit hooks to hang, was resolved.
Clone Of:
Environment:
Last Closed: 2017-04-12 19:13:31 UTC
Target Upstream Version:
Embargoed:


Attachments
build logs (39.05 KB, text/plain)
2017-02-22 14:25 UTC, Vikas Laad
describe pod (3.16 KB, text/plain)
2017-02-22 14:48 UTC, Vikas Laad
go routine dump of stuck s2i builder (18.55 KB, text/plain)
2017-02-22 21:20 UTC, Cesar Wong
container log with goroutine dump (57.49 KB, text/plain)
2017-02-23 20:09 UTC, Cesar Wong
build logs on fork ami (58.27 KB, text/plain)
2017-02-27 18:27 UTC, Vikas Laad


Links
Red Hat Product Errata RHBA-2017:0884 (normal, SHIPPED_LIVE): Red Hat OpenShift Container Platform 3.5 RPM Release Advisory, last updated 2017-04-12 22:50:07 UTC

Description Vikas Laad 2017-02-22 14:25:06 UTC
Created attachment 1256463 [details]
build logs

Description of problem:
Running the concurrent build test causes this problem: one of the quickstart app builds has been stuck in Running status for 11 hours.

root@ip-172-31-37-221: ~/svt # oc get pods
NAME                          READY     STATUS      RESTARTS   AGE
django-psql-example-1-build   0/1       Completed   0          12h
django-psql-example-2-build   0/1       Completed   0          12h
django-psql-example-3-build   0/1       Completed   0          11h
django-psql-example-4-build   0/1       Completed   0          11h
django-psql-example-5-build   1/1       Running     0          11h

Version-Release number of selected component (if applicable):
openshift v3.5.0.32-1+4f84c83
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:


Steps to Reproduce:
1. Create 50 django apps.
2. Run concurrent builds for the django apps.
3. The build got stuck when 40 builds were running.

The environment has 2 m4.xlarge worker nodes, 1 infra node, and 1 master.

Actual results:
Build stuck in Running status

Expected results:
Build should fail/pass

Additional info:
Build logs attached.

Comment 1 Cesar Wong 2017-02-22 14:30:21 UTC
Hi Vikas,
Would it be possible to get the state of the pod that corresponds to that build?

If the pod is in the running state, find the node where the pod is running and signal the pod's main process with -6 (SIGABRT). A goroutine dump will be written to the pod/build log.

That would give us a clue as to what's stuck.
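
For example, one possible sequence (illustrative only; the pod name is taken from the report above, and <container-id> and <pid> are placeholders):

oc get pod django-psql-example-5-build -o wide     # find the node running the pod

then, on that node:

docker ps | grep django-psql-example-5-build       # find the build container
docker inspect --format '{{.State.Pid}}' <container-id>
kill -ABRT <pid>
oc logs django-psql-example-5-build                # goroutine dump appears in the build log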

Comment 2 Vikas Laad 2017-02-22 14:48:36 UTC
Created attachment 1256490 [details]
describe pod

Comment 3 Cesar Wong 2017-02-22 16:07:10 UTC
Jim, copying you on this bug. Ben said you may have fixed this issue already. The build pod hangs while executing the post-commit hook. 

This is the hook:
"postCommit":{"script":"./manage.py test"}

Any info you can add is greatly appreciated.

Comment 4 Jim Minter 2017-02-22 17:28:55 UTC
It's possible, but I thought the problem I was looking at only happened on hooks that terminated very quickly.  Perhaps if the box is under sufficient load it's also possible to trigger it.

References:

https://github.com/openshift/origin/issues/12587
https://bugzilla.redhat.com/show_bug.cgi?id=1420147

Vikas, what version of docker is being used, please?  Does the problem recur with docker-1.12.6-10.el7?

Comment 5 Cesar Wong 2017-02-22 18:22:57 UTC
Jim, the node is running docker-1.12.6-8.el7.x86_64

Comment 6 Vikas Laad 2017-02-22 19:23:03 UTC
Jim,

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      ddff1c3/1.12.6
 Built:           Mon Feb 20 11:27:19 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      ddff1c3/1.12.6
 Built:           Mon Feb 20 11:27:19 2017
 OS/Arch:         linux/amd64


I will do another run with docker-1.12.6-8.el7.x86_64 when available in openshift latest repo.

Comment 7 Cesar Wong 2017-02-22 19:25:03 UTC
Vikas, you mean docker-1.12.6-10.el7.x86_64 ?

Comment 8 Vikas Laad 2017-02-22 19:34:38 UTC
Oh sorry, yes docker-1.12.6-10.el7.x86_64

Comment 9 Cesar Wong 2017-02-22 21:20:41 UTC
Created attachment 1256713 [details]
go routine dump of stuck s2i builder

Comment 10 Vikas Laad 2017-02-22 21:23:29 UTC
Raising severity since it's blocking SVT testing; I was able to reproduce it in the next run.

Will test again as soon as the new rpm is available.

Comment 11 Cesar Wong 2017-02-22 21:26:26 UTC
Jim, please see the attached goroutine dump. I believe this is the same issue you referenced above. The builder thread is stuck copying from the output stream of the post-commit hook container while that container has already finished.

If it is fixed in docker-1.12.6-10.el7.x86_64, then the issue should hopefully be resolved when Vikas upgrades.
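
For illustration, a minimal Go sketch of the stuck pattern (an assumed shape, not the actual origin/s2i builder code): the copy goroutine expects EOF when the container exits, and if the daemon never closes the attached stream, the receive blocks forever.

package builder

import (
	"io"
	"os"
)

// runPostCommitHook sketches the hang: attach is the hook container's
// output stream, waitExit blocks until the container exits.
func runPostCommitHook(attach io.ReadCloser, waitExit func() error) error {
	copyDone := make(chan error, 1)
	go func() {
		_, err := io.Copy(os.Stdout, attach) // expects EOF at container exit
		copyDone <- err
	}()

	if err := waitExit(); err != nil {
		return err
	}
	// If the daemon never closes the stream, this receive blocks
	// forever and the build stays "Running".
	return <-copyDone
}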

Comment 12 Jim Minter 2017-02-23 11:08:32 UTC
Cesar - agreed.

Comment 13 Cesar Wong 2017-02-23 20:09:03 UTC
Happened again, this time with the newer docker version:

root@ip-172-31-57-222: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-10.el7.x86_64
 Go version:      go1.7.4
 Git commit:      7f3e2af/1.12.6
 Built:           Tue Feb 21 15:24:45 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-10.el7.x86_64
 Go version:      go1.7.4
 Git commit:      7f3e2af/1.12.6
 Built:           Tue Feb 21 15:24:45 2017
 OS/Arch:         linux/amd64

Attaching full container log - goroutine dump is at the end.

Comment 14 Cesar Wong 2017-02-23 20:09:41 UTC
Created attachment 1257016 [details]
container log with goroutine dump

Comment 15 Jim Minter 2017-02-24 11:10:56 UTC
I think this is https://github.com/docker/docker/issues/31323, and I think we can work around it.  Let me see if I can get a patch together.

Comment 16 Cesar Wong 2017-02-24 12:14:23 UTC
Thanks Jim, assigning the bug to you for now.

Comment 17 Jim Minter 2017-02-24 12:32:36 UTC
https://github.com/openshift/origin/pull/13100

Comment 18 Jim Minter 2017-02-24 14:31:32 UTC
Vikas, ami-21479437 (fork_ami_openshift3_bz1425824_344) is a fork AMI which should contain the above PR.  It is built and going through post-build testing at the moment.  Please can you see if you can recreate the issue on that AMI?

Comment 19 Vikas Laad 2017-02-24 14:43:16 UTC
Sure, creating cluster with this AMI.

Comment 20 Jim Minter 2017-02-24 18:20:51 UTC
ami-b364b7a5 (fork_ami_openshift3_bz1425824_348) is now available.  Attempting to frustrate the overzealous AMI pruner, I'm making a copy of it under a different name, ami-c069bad6 (jminter_fork_ami_openshift3_bz1425824_348), which should also be available soon.

Comment 21 Vikas Laad 2017-02-24 22:04:41 UTC
Jim,
We started running into disk space issues after a few builds in the environment created from this fork AMI. Can we test this code after it merges?

Otherwise we can spend more time creating another cluster from this fork AMI.

Comment 22 Jim Minter 2017-02-27 12:35:49 UTC
Vikas, I'd rather know that the workaround solves the problem before committing it, if it is at all possible.  Ben, what do you think?

Comment 23 Ben Parees 2017-02-27 15:42:35 UTC
@Jim yeah, it would be nice to see it verified.  That said, it should be easy to recreate; were you able to recreate it locally and verify your fix?

Comment 24 Jim Minter 2017-02-27 15:58:38 UTC
I was able to recreate /a/ hang issue locally by modifying the docker daemon and adding a carefully located time.Sleep(), and my PR resolves that, but I don't know for sure if my issue is the same as this issue - hence my preference for Vikas to tell me if this solves the problem he's seeing or not.
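
For reference, a sketch of the general shape such a workaround could take (an assumption based on the issue description above, not the actual diff in the PR): once the container has exited, wait only a bounded grace period for the output copy to finish, then force-close the stream to unblock the copy goroutine.

package builder

import (
	"io"
	"os"
	"time"
)

func runPostCommitHookBounded(attach io.ReadCloser, waitExit func() error) error {
	copyDone := make(chan error, 1)
	go func() {
		_, err := io.Copy(os.Stdout, attach)
		copyDone <- err
	}()

	exitErr := waitExit() // returns once the hook container has exited

	select {
	case <-copyDone:
		// Normal case: the daemon closed the stream at container exit.
	case <-time.After(10 * time.Second): // grace period is an arbitrary choice
		attach.Close() // forces io.Copy to return with an error
		<-copyDone
	}
	return exitErr
}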

Comment 25 Vikas Laad 2017-02-27 18:27:18 UTC
Created attachment 1258182 [details]
build logs on fork ami

Tried again on the fork AMI and saw a similar error. I have attached the logs and will also keep the environment around for a few hours.

Comment 26 Vikas Laad 2017-02-27 18:54:05 UTC
Note: this env was created using fork_ami_openshift3_bz1425824_348.

Comment 27 Jim Minter 2017-02-28 09:45:33 UTC
Vikas, the logs suggest that the fork AMI version of openshift/origin-sti-builder was not being used.  This could be because OpenShift wasn't started with the --latest-images argument (see my e-mail to aos-devel).  Please can you double-check?

Comment 28 Vikas Laad 2017-03-01 18:20:54 UTC
I did not see this problem in the dev env created from this fork AMI. Completed 3 cycles of concurrent builds.

Comment 29 Troy Dawson 2017-03-06 19:04:08 UTC
This has been merged into OCP and is in OCP v3.5.0.40 or newer.

Comment 31 Vikas Laad 2017-03-07 18:15:20 UTC
Verified in the following version; completed 3 rounds of concurrent builds and did not see the issue.

openshift v3.5.0.40
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 33 errata-xmlrpc 2017-04-12 19:13:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

