Bug 1450466 - Build status shows Running but Pod shows Error for Failed build
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Cesar Wong
QA Contact: Hongkai Liu
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-05-12 16:26 UTC by Vikas Laad
Modified: 2017-08-16 19:51 UTC

Fixed In Version:
Doc Text:
Cause: When running many concurrent builds, the build controller will not update a pending build to failed when the corresponding pod fails.
Consequence: The status of the build is not updated correctly.
Fix: The build controller code has been refactored to avoid race conditions and update build status correctly.
Result: The build status should no longer get out of sync with the corresponding pod.
Clone Of:
Environment:
Last Closed: 2017-08-14 18:45:16 UTC
Target Upstream Version:
Embargoed:


Attachments
pod json (7.55 KB, text/plain)
2017-05-12 16:28 UTC, Vikas Laad
build json (2.96 KB, text/plain)
2017-05-12 16:29 UTC, Vikas Laad

Description Vikas Laad 2017-05-12 16:26:40 UTC
Description of problem:
The build is stuck in the Running state, but its pod status shows Error.

root@ip-172-31-4-211: ~ # oc logs -n proj18 cakephp-mysql-example-119-build --follow
Cloning "https://github.com/redhat-performance/cakephp-ex.git" ...
        Commit: 0014ddebb91bc7dff3a1dabfbd7b51da762a6677 (made changes to enable database example)
        Author: ofthecure <robdean.smith>
        Date:   Mon Apr 25 14:33:06 2016 -0400
DEPRECATED: Use .s2i/bin instead of .sti/bin
---> Installing application source...
Pushing image 172.24.132.26:5000/proj18/cakephp-mysql-example:latest ...
error: Unable to update build status: Get https://172.24.0.1:443/oapi/v1/namespaces/proj18/builds/cakephp-mysql-example-119: dial tcp 172.24.0.1:443: getsockopt: connection refused
Registry server Address: 
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: Unable to update build status: Get https://172.24.0.1:443/oapi/v1/namespaces/proj18/builds/cakephp-mysql-example-119: dial tcp 172.24.0.1:443: getsockopt: connection refused
error: build error: Failed to push image: unauthorized: authentication required


root@ip-172-31-4-211: ~ # oc get builds -n proj18 | grep -v Complete
NAME                        TYPE      FROM          STATUS     STARTED        DURATION
cakephp-mysql-example-119   Source    Git@0014dde   Running    2 hours ago    
cakephp-mysql-example-120   Source    Git           New                       
cakephp-mysql-example-121   Source    Git           New                       


The logs show that the build failed, but the build list shows it stuck in the Running state. Attaching JSON for the build and the pod.

root@ip-172-31-4-211: ~ # oc get pods -n proj18 | grep -v Complete
NAME                              READY     STATUS      RESTARTS   AGE
cakephp-mysql-example-119-build   0/1       Error       0          1d
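
The mismatch above can also be detected mechanically. The following is only a minimal sketch, not part of any OpenShift tooling: it assumes the JSON comes from `oc get builds -n proj18 -o json` and `oc get pods -n proj18 -o json`, that both object types expose `status.phase`, and that the build pod is named `<build-name>-build` (as in the listings above). The function name `find_out_of_sync_builds` is hypothetical.

```python
import json

# Build phases that mean the build has not finished yet.
ACTIVE_BUILD_PHASES = {"New", "Pending", "Running"}
# Pod phases that mean the pod will do no more work.
TERMINAL_POD_PHASES = {"Succeeded", "Failed"}


def find_out_of_sync_builds(builds_json, pods_json):
    """Return (build_name, build_phase, pod_phase) for every build that is
    still active although its build pod has already terminated."""
    builds = json.loads(builds_json).get("items", [])
    pods = json.loads(pods_json).get("items", [])

    # Index pod phases by pod name for quick lookup.
    pod_phase_by_name = {
        pod["metadata"]["name"]: pod.get("status", {}).get("phase")
        for pod in pods
    }

    mismatches = []
    for build in builds:
        name = build["metadata"]["name"]
        build_phase = build.get("status", {}).get("phase")
        # Assumption: the build pod is named "<build-name>-build".
        pod_phase = pod_phase_by_name.get(name + "-build")
        if build_phase in ACTIVE_BUILD_PHASES and pod_phase in TERMINAL_POD_PHASES:
            mismatches.append((name, build_phase, pod_phase))
    return mismatches
```

Builds that have no pod yet (such as the New builds in the listing) are skipped, because their pod phase lookup returns nothing.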


Version-Release number of selected component (if applicable):

root@ip-172-31-4-211: ~ # openshift version
openshift v3.6.74
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
root@ip-172-31-4-211: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Steps to Reproduce:
1. Run the concurrent build test; the failure occurred after around 3000 successful builds.
2. The test runs 30 concurrent builds on 2 m4.xlarge nodes.
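
For anyone trying to reproduce this without the original test harness (which is not shown in this report), a rough sketch of submitting many builds at once might look like the following. It assumes the `oc start-build <buildconfig> -n <namespace>` command and a BuildConfig named `cakephp-mysql-example` in each project; the helper names and the `runner` parameter are hypothetical and exist only so the logic can be exercised without a cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def start_build_command(build_config, namespace):
    """Build the argument list for starting one build."""
    return ["oc", "start-build", build_config, "-n", namespace]


def start_builds_concurrently(targets, runner=None, max_workers=30):
    """Submit one build per (build_config, namespace) pair in parallel.

    Returns the runner's result for each target, in input order.
    By default the runner executes the command and returns its exit code.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    commands = [start_build_command(bc, ns) for bc, ns in targets]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

Note that this only submits the builds concurrently; whether 30 of them actually run at the same time depends on the cluster, so it approximates the test rather than replicating it.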

Actual results:
Build stuck in Running.

Expected results:
The build status should show Failed, matching the pod.
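
To make the expectation concrete, here is a simplified sketch of how the build phase could follow the pod phase. This is an illustration only and not the build controller's actual logic; the phase names are assumed to be the usual build phases (New, Pending, Running, Complete, Failed, Error, Cancelled) and pod phases (Pending, Running, Succeeded, Failed), and `expected_build_phase` is a hypothetical name.

```python
# Build phases that are final and should never be overwritten.
TERMINAL_BUILD_PHASES = {"Complete", "Failed", "Error", "Cancelled"}


def expected_build_phase(build_phase, pod_phase):
    """Return the phase a build should have, given its pod's phase."""
    # A finished build keeps its phase regardless of what the pod reports.
    if build_phase in TERMINAL_BUILD_PHASES:
        return build_phase
    # A failed pod means the build failed (the case reported here).
    if pod_phase == "Failed":
        return "Failed"
    # A pod that exited cleanly means the build completed.
    if pod_phase == "Succeeded":
        return "Complete"
    # A running pod means the build is running.
    if pod_phase == "Running":
        return "Running"
    # No pod yet, or pod still pending: leave the build as it is.
    return build_phase
```

For the build in this report, `expected_build_phase("Running", "Failed")` gives "Failed", whereas the cluster left the build at Running.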

Additional info:

Comment 1 Vikas Laad 2017-05-12 16:28:03 UTC
Created attachment 1278277
pod json

Comment 2 Vikas Laad 2017-05-12 16:29:44 UTC
Created attachment 1278278
build json

Comment 3 Ben Parees 2017-05-12 17:47:57 UTC
We're going to need master logs with level 5 tracing to be able to see what happened within the pod controller for this.

Comment 4 Ben Parees 2017-05-31 20:08:13 UTC
Marking as upcoming release, since Cesar's PR that reworks all this logic is going to land at the start of next sprint.

Comment 5 Ben Parees 2017-05-31 20:09:02 UTC
relevant PR: https://github.com/openshift/origin/pull/14289

Comment 7 Hongkai Liu 2017-07-07 19:14:06 UTC
Reran the test with 50 concurrent builds; all builds succeeded.

Comment 8 Hongkai Liu 2017-07-07 19:28:56 UTC
(In reply to Hongkai Liu from comment #7)
> Reran the test with 50 concurrent builds; all builds succeeded.

Verified on 3.6.133

