Bug 1370265 - Build stuck in Running state - logs show error
Summary: Build stuck in Running state - logs show error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: ---
: 3.3.1
Assignee: Gabe Montero
QA Contact: Wang Haoran
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-08-25 18:41 UTC by Vikas Laad
Modified: 2016-10-04 12:43 UTC (History)
6 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A build timeout exposed a path where a golang WaitGroup was not bumped when it should have. Consequence: The OpenShift Build object would stay stuck in running state. Fix: The golang WaitGroup is not properly handled, including when a build timeout occurs. Result: OpenShift Builds that experience unexpected timeouts will be appropriately marked as failed and terminated.
Clone Of:
Environment:
Last Closed: 2016-10-04 12:43:45 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:1988 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.3.0.34 bug fix update 2016-10-04 16:43:31 UTC

Description Vikas Laad 2016-08-25 18:41:57 UTC
Description of problem:
After running application reliability tests for 2 days I see a build is stuck in Running state. When I see the pod logs it shows error.

Version-Release number of selected component (if applicable):
openshift v3.3.0.24-dirty
kubernetes v1.3.0+507d3a7
etcd 2.3.0+git

Steps to Reproduce:
1. Create few projects
2. keep building apps 
3. Notice build is stuck in Running state

Actual results:
Build is stuck in Running state

Expected results:
Build state should be Error

Additional info: Please see the status of build # 25 in following info

root@ip-172-31-31-121: ~ # oc get pods
NAME                              READY     STATUS      RESTARTS   AGE
cakephp-mysql-example-16-build    0/1       Completed   0          13h
cakephp-mysql-example-17-build    0/1       Completed   0          12h
cakephp-mysql-example-18-build    0/1       Completed   0          12h
cakephp-mysql-example-19-8fgab    1/1       Running     0          11h
cakephp-mysql-example-19-build    0/1       Completed   0          11h
cakephp-mysql-example-20-deploy   0/1       Error       0          9h
cakephp-mysql-example-23-build    0/1       Completed   0          9h
cakephp-mysql-example-24-build    0/1       Error       0          8h
cakephp-mysql-example-25-build    1/1       Running     0          8h
mysql-1-nswnt                     1/1       Running     0          23h


root@ip-172-31-31-121: ~ # oc logs cakephp-mysql-example-25-build  
Cloning "https://github.com/openshift/cakephp-ex.git" ...
WARNING: timed out waiting for git server, will wait 1m4s
        Commit: 701d706b7f2b50ee972d0bf76990042f6c0cda5c (Merge pull request #42 from bparees/recreate)
        Author: Ben Parees <bparees.github.com>
        Date:   Mon Aug 22 14:44:49 2016 -0400

ERROR: Error copying file "/tmp/s2i-build084323527/upload/src/.git/objects/pack/pack-bf3ad2d17ca9b13710fd1bfa7c7f1868972c33ca.pack" to tar: io: read/write on closed pipe
ERROR: Error writing tar: io: read/write on closed pipe


root@ip-172-31-31-121: ~ # oc describe build cakephp-mysql-example-25 
Name:           cakephp-mysql-example-25
Namespace:      cakephp-mysql-example-user-2-22-2-55
Created:        8 hours ago
Labels:         app=cakephp-mysql-example
                buildconfig=cakephp-mysql-example
                openshift.io/build-config.name=cakephp-mysql-example
                openshift.io/build.start-policy=Serial
                template=cakephp-mysql-example
Annotations:    openshift.io/build-config.name=cakephp-mysql-example
                openshift.io/build.number=25
                openshift.io/build.pod-name=cakephp-mysql-example-25-build

Status:         Running
Started:        Thu, 25 Aug 2016 05:49:56 EDT
Duration:       running for 8h39m30s
Build Config:   cakephp-mysql-example
Build Pod:      cakephp-mysql-example-25-build

Strategy:               Source
URL:                    https://github.com/openshift/cakephp-ex.git
Commit:                 701d706 (Merge pull request #42 from bparees/recreate)
Author/Committer:       Ben Parees / GitHub
From Image:             DockerImage registry.access.redhat.com/rhscl/php-56-rhel7:latest
Output to:              ImageStreamTag cakephp-mysql-example:latest
Push Secret:            builder-dockercfg-wqv6n

Build trigger cause:    Manually triggered

No events.


root@ip-172-31-31-121: ~ # oc describe pod cakephp-mysql-example-25-build
Name:                   cakephp-mysql-example-25-build
Namespace:              cakephp-mysql-example-user-2-22-2-55
Security Policy:        privileged
Node:                   ip-172-31-19-54.us-west-2.compute.internal/172.31.19.54
Start Time:             Thu, 25 Aug 2016 05:47:47 -0400
Labels:                 openshift.io/build.name=cakephp-mysql-example-25
Status:                 Running
IP:                     172.20.2.2
Controllers:            <none>
Containers:
  sti-build:
    Container ID:       docker://9547c6180ec5767c5de148d938ccd040864f241ebcb32444c2ac5cdd86c2e719
    Image:              registry.ops.openshift.com/openshift3/ose-sti-builder:v3.3.0.24
    Image ID:           docker://sha256:41ca0e208e82c1fc19e9645d258ef3f995d554407d22767507c50678cec345dc
    Port:
    Args:
      --loglevel=0
    State:              Running
      Started:          Thu, 25 Aug 2016 05:49:20 -0400
    Ready:              True
    Restart Count:      0
    Volume Mounts:
      /var/run/docker.sock from docker-socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from builder-token-dmspg (ro)
      /var/run/secrets/openshift.io/push from builder-dockercfg-wqv6n-push (ro)
    Environment Variables:
      BUILD:    {"kind":"Build","apiVersion":"v1","metadata":{"name":"cakephp-mysql-example-25","namespace":"cakephp-mysql-example-user-2-22-2-55","selfLink":"/oapi/v1/namespaces/cakephp-mysql-example-user-2-22-2-55/builds/cakephp-mysql-example-25","uid
":"e662d3af-6aa8-11e6-97a1-02411b9952db","resourceVersion":"130925","creationTimestamp":"2016-08-25T09:47:13Z","labels":{"app":"cakephp-mysql-example","buildconfig":"cakephp-mysql-example","openshift.io/build-config.name":"cakephp-mysql-example","opensh
ift.io/build.start-policy":"Serial","template":"cakephp-mysql-example"},"annotations":{"openshift.io/build-config.name":"cakephp-mysql-example","openshift.io/build.number":"25"}},"spec":{"serviceAccount":"builder","source":{"type":"Git","git":{"uri":"ht
tps://github.com/openshift/cakephp-ex.git"}},"strategy":{"type":"Source","sourceStrategy":{"from":{"kind":"DockerImage","name":"registry.access.redhat.com/rhscl/php-56-rhel7:latest"},"env":[{"name":"COMPOSER_MIRROR"}]}},"output":{"to":{"kind":"DockerIma
ge","name":"172.26.148.88:5000/cakephp-mysql-example-user-2-22-2-55/cakephp-mysql-example:latest"},"pushSecret":{"name":"builder-dockercfg-wqv6n"}},"resources":{},"postCommit":{},"triggeredBy":[{"message":"Manually triggered"}]},"status":{"phase":"New",
"outputDockerImageReference":"172.26.148.88:5000/cakephp-mysql-example-user-2-22-2-55/cakephp-mysql-example:latest","config":{"kind":"BuildConfig","namespace":"cakephp-mysql-example-user-2-22-2-55","name":"cakephp-mysql-example"}}}

      SOURCE_REPOSITORY:        https://github.com/openshift/cakephp-ex.git
      SOURCE_URI:               https://github.com/openshift/cakephp-ex.git
      ORIGIN_VERSION:           v3.3.0.24-dirty
      ALLOWED_UIDS:             1-
      DROP_CAPS:                KILL,MKNOD,SETGID,SETUID,SYS_CHROOT
      PUSH_DOCKERCFG_PATH:      /var/run/secrets/openshift.io/push
Conditions:
  Type          Status
  Initialized   True 
  Ready         True 
  PodScheduled  True 
Volumes:
  docker-socket:
    Type:       HostPath (bare host directory volume)
    Path:       /var/run/docker.sock
  builder-dockercfg-wqv6n-push:
    Type:       Secret (a volume populated by a Secret)
    SecretName: builder-dockercfg-wqv6n
  builder-token-dmspg:
    Type:       Secret (a volume populated by a Secret)
    SecretName: builder-token-dmspg
QoS Tier:       BestEffort

Comment 1 Cesar Wong 2016-08-25 18:54:09 UTC
Goroutine dump: http://pastebin.test.redhat.com/406235

Comment 2 Vikas Laad 2016-08-25 18:54:23 UTC
root@ip-172-31-31-121: ~ # oc get builds
NAME                       TYPE      FROM          STATUS     STARTED        DURATION
cakephp-mysql-example-16   Source    Git@701d706   Complete   13 hours ago   25s
cakephp-mysql-example-17   Source    Git@701d706   Complete   12 hours ago   27s
cakephp-mysql-example-18   Source    Git@701d706   Complete   12 hours ago   24s
cakephp-mysql-example-19   Source    Git@701d706   Complete   11 hours ago   25s
cakephp-mysql-example-23   Source    Git@701d706   Complete   9 hours ago    23s
cakephp-mysql-example-24   Source    Git@701d706   Failed     8 hours ago    4m32s
cakephp-mysql-example-25   Source    Git@701d706   Running    8 hours ago    8h18m2s
cakephp-mysql-example-26   Source    Git           New                       
cakephp-mysql-example-27   Source    Git           New                       
cakephp-mysql-example-28   Source    Git           New                       
cakephp-mysql-example-29   Source    Git           New                       
cakephp-mysql-example-30   Source    Git           New                       
cakephp-mysql-example-31   Source    Git           New                       
cakephp-mysql-example-32   Source    Git           New

Comment 3 Gabe Montero 2016-08-25 19:22:42 UTC
Multiple problems.  Goroutine 1 is the center of attention.

- if you look at its stack trace, it calls panic; according to the golang source, this is because "sync: WaitGroup misuse: Add called concurrently with Wait" ... in any even, the wg.Done() call never returns, hence the hang.

- Goroutine 1 is down this path because the RunContainer timed out; is that because of a problem in the build container, or did we not wait long enough ? ... btw, with the fsouza client version, you can see the attach/hijack thread still active, alone with the container io processing, where we have another thread blocked waiting for more IO from the container ... is the container still up? 

I'm not super familiar with golang waitgroups, but in github.com/openshift/source-to-image/pkg/build/strategies/sti/sti.go, I see only 1 call to Add, and multiple threads that can call Done, so it seems conceivable that the count could go negative.

Minimally, we can't blindly call wg.Done() on the timeout error.  I'll have to think about what the tweak would be.  Also, per git blame Rodolfo originally added the waitgroup here, and there was a good amount of effort / review around it ... perhaps we can pull him in as well.

Comment 5 openshift-github-bot 2016-08-26 18:03:41 UTC
Commit pushed to master at https://github.com/openshift/source-to-image

https://github.com/openshift/source-to-image/commit/3c8691b36481a6cc6d48d299b31998751341dc02
increase default session timeout (related to Bug 1370265)

Comment 6 Rodolfo Carvalho 2016-09-01 18:15:46 UTC
This addresses the panic:

https://github.com/openshift/source-to-image/pull/581

Comment 7 Gabe Montero 2016-09-01 21:27:55 UTC
The PR https://github.com/openshift/origin/pull/10776 is the origin side update for the source-to-image PR noted n comment 6

Comment 9 Gabe Montero 2016-09-28 14:28:28 UTC
OK, the OSE pull has merged.  Moving to modified.

Comment 11 Dongbo Yan 2016-09-29 10:08:29 UTC
Verified in openshift
openshift v3.3.0.33
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

step:
1. Create an application
 $ oc new-app cakephp-mysql-example
2. Edit the runPolicy field to Parallel in buildConfig
 $ oc edit bc -o json
3. Trigger multi builds
 $ oc start-build cakephp-example

Actual results:
All builds are complete, no build stuck in running state

Comment 13 errata-xmlrpc 2016-10-04 12:43:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1988


Note You need to log in before you can comment on or make changes to this bug.