Created attachment 1256463 [details]
build logs

Description of problem:
Running the concurrent build test causes this problem: one of the quickstart app builds has been stuck in Running status for 11 hours.

root@ip-172-31-37-221: ~/svt # oc get pods
NAME                          READY   STATUS      RESTARTS   AGE
django-psql-example-1-build   0/1     Completed   0          12h
django-psql-example-2-build   0/1     Completed   0          12h
django-psql-example-3-build   0/1     Completed   0          11h
django-psql-example-4-build   0/1     Completed   0          11h
django-psql-example-5-build   1/1     Running     0          11h

Version-Release number of selected component (if applicable):
openshift v3.5.0.32-1+4f84c83
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:

Steps to Reproduce:
1. Create 50 django apps
2. Run concurrent builds for the django apps
3. The problem happened when 40 builds were running

The env has 2 m4.xlarge worker nodes, 1 infra node and 1 master.

Actual results:
Build stuck in Running status

Expected results:
Build should fail/pass

Additional info:
Build logs attached.
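As a rough illustration of the steps above, the reproduction looks something like this (a sketch only; the template name, project layout, and loop counts are assumptions, not the exact SVT tooling):

# Create the quickstart apps (one per project here, purely for illustration).
for i in $(seq 1 50); do
  oc new-project "django-$i"
  oc new-app --template=django-psql-example -n "django-$i"
done

# Later, kick off a build for every app at roughly the same time.
for i in $(seq 1 50); do
  oc start-build django-psql-example -n "django-$i" &
done
wait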
Hi Vikas, Would it be possible to get the state of the pod that corresponds to that build? If the pod is in the running state, find the node where the pod is running and signal the pod's main process with -6 (SIGABRT). A goroutine dump would be output to the pod/build log. That would give us a clue as to what's stuck.
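A hedged sketch of how to do that, using the stuck pod name from the output above (the container id and PID are placeholders):

# Find the node the stuck build pod is scheduled on.
oc get pod django-psql-example-5-build -o wide

# On that node, locate the pod's main container, get its PID, and send SIGABRT
# so a goroutine dump is written to the pod/build log.
docker ps | grep django-psql-example-5-build
docker inspect --format '{{.State.Pid}}' <container-id>
kill -6 <pid>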
Created attachment 1256490 [details] describe pod
Jim, copying you on this bug. Ben said you may have fixed this issue already. The build pod hangs while executing the post-commit hook. This is the hook:

"postCommit": {"script": "./manage.py test"}

Any info you can add is greatly appreciated.
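For context, a quick way to see the hook on the BuildConfig (a sketch; the BuildConfig name is inferred from the pod names above):

# Show the post-commit hook configured on the quickstart's BuildConfig.
oc get bc django-psql-example -o yaml | grep -A 2 postCommit
#   postCommit:
#     script: ./manage.py test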
It's possible, but I thought the problem I was looking at only happened on hooks that terminated very quickly. Perhaps if the box is under sufficient load it's also possible to trigger it.

References:
https://github.com/openshift/origin/issues/12587
https://bugzilla.redhat.com/show_bug.cgi?id=1420147

Vikas, what version of docker is being used, please? Does the problem recur with docker-1.12.6-10.el7?
Jim, the node is running docker-1.12.6-8.el7.x86_64
Jim,

Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      ddff1c3/1.12.6
 Built:           Mon Feb 20 11:27:19 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-8.el7.x86_64
 Go version:      go1.7.4
 Git commit:      ddff1c3/1.12.6
 Built:           Mon Feb 20 11:27:19 2017
 OS/Arch:         linux/amd64

I will do another run with docker-1.12.6-8.el7.x86_64 when it is available in the latest OpenShift repo.
Vikas, you mean docker-1.12.6-10.el7.x86_64 ?
Oh sorry, yes docker-1.12.6-10.el7.x86_64
Created attachment 1256713 [details] go routine dump of stuck s2i builder
Raising sev since it's blocking SVT testing; I was able to reproduce it in the next run. Will test again as soon as the new RPM is available.
Jim, please see the attached goroutine dump. I believe this is the same issue you referenced above. The builder thread is stuck copying from the output stream of the post-commit hook container even though that container has already finished. If it is fixed in docker-1.12.6-10.el7.x86_64, then the issue should hopefully be resolved when Vikas upgrades.
Cesar - agreed.
Happened again, this time with the newer docker version:

root@ip-172-31-57-222: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-10.el7.x86_64
 Go version:      go1.7.4
 Git commit:      7f3e2af/1.12.6
 Built:           Tue Feb 21 15:24:45 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-10.el7.x86_64
 Go version:      go1.7.4
 Git commit:      7f3e2af/1.12.6
 Built:           Tue Feb 21 15:24:45 2017
 OS/Arch:         linux/amd64

Attaching full container log - goroutine dump is at the end.
Created attachment 1257016 [details] container log with goroutine dump
I think this is https://github.com/docker/docker/issues/31323 , and I think we can work around it. Let me see if I can get a patch together.
Thanks Jim, assigning the bug to you for now.
https://github.com/openshift/origin/pull/13100
Vikas, ami-21479437 (fork_ami_openshift3_bz1425824_344) is a fork AMI which should contain the above PR. It is built and going through post-build testing at the moment. Please can you see if you can recreate the issue on that AMI?
Sure, creating cluster with this AMI.
ami-b364b7a5 (fork_ami_openshift3_bz1425824_348) is now available. Attempting to frustrate the overzealous AMI pruner, I'm making a copy of it under a different name, ami-c069bad6 (jminter_fork_ami_openshift3_bz1425824_348), which should also be available soon.
Jim, we started running into space issues after a few builds in the env using this fork AMI. Can we test this code after it merges? Otherwise we can spend more time later creating a cluster using this fork AMI.
Vikas, I'd rather know that the workaround solves the problem before committing it, if it is at all possible. Ben, what do you think?
@Jim yeah, it would be nice to see it verified. That said, it should be easy to recreate; were you able to recreate it locally and verify your fix?
I was able to recreate /a/ hang issue locally by modifying the docker daemon and adding a carefully located time.Sleep(), and my PR resolves that, but I don't know for sure if my issue is the same as this issue - hence my preference for Vikas to tell me if this solves the problem he's seeing or not.
Created attachment 1258182 [details]
build logs on fork ami

Tried again on the fork AMI and saw a similar error. I have attached the logs and am also going to keep the env around for a few hours.
Note: this env was created using fork_ami_openshift3_bz1425824_348.
Vikas, the logs suggest that the fork AMI version of openshift/origin-sti-builder was not being used. This could be because OpenShift wasn't started with the --latest-images argument (see my e-mail to aos-devel). Please can you double-check?
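A quick way to double-check which builder image was actually used (a sketch; the exact image name/tag on the fork AMI may differ):

# Confirm which origin-sti-builder image the build pod ran with.
oc describe pod django-psql-example-1-build | grep -i image

# Or, on the node, list the builder images that are present.
docker images | grep origin-sti-builder

# Per the comment above, OpenShift also needs to be started with --latest-images
# for the fork AMI's :latest builder image to be picked up.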
I did not see this problem in the dev env created from this fork AMI. Completed 3 cycles of concurrent builds.
This has been merged into OCP and is in OCP v3.5.0.40 or newer.
Verified in the following version; completed 3 rounds of concurrent builds and did not notice it.

openshift v3.5.0.40
kubernetes v1.5.2+43a9be4
etcd 3.1.0
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884