| Summary: | Build stuck in Running state - logs show error | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> |
| Component: | Build | Assignee: | Gabe Montero <gmontero> |
| Status: | CLOSED ERRATA | QA Contact: | Wang Haoran <haowang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.3.0 | CC: | aos-bugs, bparees, cewong, dyan, pweil, rcarvalh |
| Target Milestone: | --- | ||
| Target Release: | 3.3.1 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: A build timeout exposed a path where a golang WaitGroup was not bumped when it should have been.
Consequence: The OpenShift Build object would stay stuck in running state.
Fix: The golang WaitGroup is now properly handled, including when a build timeout occurs.
Result: OpenShift Builds that experience unexpected timeouts will be appropriately marked as failed and terminated.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-10-04 12:43:45 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Vikas Laad
2016-08-25 18:41:57 UTC
Goroutine dump: http://pastebin.test.redhat.com/406235

root@ip-172-31-31-121: ~ # oc get builds

| NAME | TYPE | FROM | STATUS | STARTED | DURATION |
|---|---|---|---|---|---|
| cakephp-mysql-example-16 | Source | Git@701d706 | Complete | 13 hours ago | 25s |
| cakephp-mysql-example-17 | Source | Git@701d706 | Complete | 12 hours ago | 27s |
| cakephp-mysql-example-18 | Source | Git@701d706 | Complete | 12 hours ago | 24s |
| cakephp-mysql-example-19 | Source | Git@701d706 | Complete | 11 hours ago | 25s |
| cakephp-mysql-example-23 | Source | Git@701d706 | Complete | 9 hours ago | 23s |
| cakephp-mysql-example-24 | Source | Git@701d706 | Failed | 8 hours ago | 4m32s |
| cakephp-mysql-example-25 | Source | Git@701d706 | Running | 8 hours ago | 8h18m2s |
| cakephp-mysql-example-26 | Source | Git | New | | |
| cakephp-mysql-example-27 | Source | Git | New | | |
| cakephp-mysql-example-28 | Source | Git | New | | |
| cakephp-mysql-example-29 | Source | Git | New | | |
| cakephp-mysql-example-30 | Source | Git | New | | |
| cakephp-mysql-example-31 | Source | Git | New | | |
| cakephp-mysql-example-32 | Source | Git | New | | |

Multiple problems. Goroutine 1 is the center of attention.

- If you look at its stack trace, it calls panic; according to the golang source, this is because "sync: WaitGroup misuse: Add called concurrently with Wait" ... in any event, the wg.Done() call never returns, hence the hang.
- Goroutine 1 is down this path because the RunContainer timed out; is that because of a problem in the build container, or did we not wait long enough? ... btw, with the fsouza client version, you can see the attach/hijack thread still active, along with the container IO processing, where we have another thread blocked waiting for more IO from the container ... is the container still up?

I'm not super familiar with golang waitgroups, but in github.com/openshift/source-to-image/pkg/build/strategies/sti/sti.go, I see only 1 call to Add, and multiple threads that can call Done, so it seems conceivable that the count could go negative. Minimally, we can't blindly call wg.Done() on the timeout error. I'll have to think about what the tweak would be.
Also, per git blame, Rodolfo originally added the waitgroup in sti.go, and there was a good amount of effort / review around it ... perhaps we can pull him in as well.

Commit pushed to master at https://github.com/openshift/source-to-image

https://github.com/openshift/source-to-image/commit/3c8691b36481a6cc6d48d299b31998751341dc02
increase default session timeout (related to Bug 1370265)

This addresses the panic: https://github.com/openshift/source-to-image/pull/581

The PR https://github.com/openshift/origin/pull/10776 is the origin side update for the source-to-image PR noted in comment 6.

OK, the OSE pull has merged. Moving to modified.

Verified in openshift:
openshift v3.3.0.33
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

Steps:
1. Create an application
$ oc new-app cakephp-mysql-example
2. Edit the runPolicy field to Parallel in buildConfig
$ oc edit bc -o json
3. Trigger multiple builds
$ oc start-build cakephp-example

Actual results:
All builds are complete, no build stuck in running state

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1988 |