Bug 1450466

Summary: Build status shows Running but Pod shows Error for Failed build
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Build
Assignee: Cesar Wong <cewong>
Status: CLOSED NEXTRELEASE
QA Contact: Hongkai Liu <hongkliu>
Severity: high
Priority: unspecified
Version: 3.6.0
CC: aos-bugs, bparees, mifiedle, trankin, vlaad, xtian
Doc Text:
Cause: When running many concurrent builds, the build controller will not update a pending build to failed when the corresponding pod fails.
Consequence: The status of the build is not updated correctly.
Fix: The build controller code has been refactored to avoid race conditions and update build status correctly.
Result: The build status should no longer get out of sync with the corresponding pod.
Last Closed: 2017-08-14 18:45:16 UTC
Type: Bug
Attachments:
Description   Flags
pod json      none
build json    none

Description Vikas Laad 2017-05-12 16:26:40 UTC
Description of problem:
The build is stuck in the Running state, but the pod status shows Error.

root@ip-172-31-4-211: ~ # oc logs -n proj18 cakephp-mysql-example-119-build --follow
Cloning "https://github.com/redhat-performance/cakephp-ex.git" ...
        Commit: 0014ddebb91bc7dff3a1dabfbd7b51da762a6677 (made changes to enable database example)
        Author: ofthecure <robdean.smith>
        Date:   Mon Apr 25 14:33:06 2016 -0400
DEPRECATED: Use .s2i/bin instead of .sti/bin
---> Installing application source...
Pushing image 172.24.132.26:5000/proj18/cakephp-mysql-example:latest ...
error: Unable to update build status: Get https://172.24.0.1:443/oapi/v1/namespaces/proj18/builds/cakephp-mysql-example-119: dial tcp 172.24.0.1:443: getsockopt: connection refused
Registry server Address: 
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: Unable to update build status: Get https://172.24.0.1:443/oapi/v1/namespaces/proj18/builds/cakephp-mysql-example-119: dial tcp 172.24.0.1:443: getsockopt: connection refused
error: build error: Failed to push image: unauthorized: authentication required


root@ip-172-31-4-211: ~ # oc get builds -n proj18 | grep -v Complete                                                                                                                                                                                         
NAME                        TYPE      FROM          STATUS     STARTED        DURATION
cakephp-mysql-example-119   Source    Git@0014dde   Running    2 hours ago    
cakephp-mysql-example-120   Source    Git           New                       
cakephp-mysql-example-121   Source    Git           New                       


The logs show that the build failed, but the build list shows it stuck in the Running state. Attaching JSON for the build and pod.

root@ip-172-31-4-211: ~ # oc get pods -n proj18 | grep -v Complete
NAME                              READY     STATUS      RESTARTS   AGE
cakephp-mysql-example-119-build   0/1       Error       0          1d
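
The mismatch shown in the two listings above can also be detected mechanically. The following is only a rough sketch, not part of the original report: it assumes the default tabular `oc get` output pasted above (STATUS is the fourth column for builds and the third for pods, with no spaces inside the earlier columns) and the `<build name>-build` pod naming seen in this report. The function names `column` and `out_of_sync` are hypothetical.

```python
# Sample data copied from the listings in this report.
BUILDS = """
NAME                        TYPE      FROM          STATUS     STARTED        DURATION
cakephp-mysql-example-119   Source    Git@0014dde   Running    2 hours ago
cakephp-mysql-example-120   Source    Git           New
cakephp-mysql-example-121   Source    Git           New
"""

PODS = """
NAME                              READY     STATUS      RESTARTS   AGE
cakephp-mysql-example-119-build   0/1       Error       0          1d
"""


def column(table, index):
    """Map the NAME column to the whitespace-separated column at `index`.

    Assumes no column before `index` contains spaces.
    """
    lines = table.strip().splitlines()[1:]  # skip the header row
    rows = [line.split() for line in lines if line.strip()]
    return {row[0]: row[index] for row in rows}


def out_of_sync(builds_table, pods_table):
    """Return builds listed as Running whose build pod is listed as Error."""
    builds = column(builds_table, 3)  # STATUS column of `oc get builds`
    pods = column(pods_table, 2)      # STATUS column of `oc get pods`
    return sorted(
        name
        for name, status in builds.items()
        if status == "Running" and pods.get(name + "-build") == "Error"
    )


print(out_of_sync(BUILDS, PODS))
```

Builds 120 and 121 are not reported because they are still New and have no pod in the listing.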


Version-Release number of selected component (if applicable):

root@ip-172-31-4-211: ~ # openshift version
openshift v3.6.74
kubernetes v1.6.1+5115d708d7
etcd 3.1.0
root@ip-172-31-4-211: ~ # docker version
Client:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Server:
 Version:         1.12.6
 API version:     1.24
 Package version: docker-common-1.12.6-16.el7.x86_64
 Go version:      go1.7.4
 Git commit:      3a094bd/1.12.6
 Built:           Tue Mar 21 13:30:59 2017
 OS/Arch:         linux/amd64

Steps to Reproduce:
1. Run a concurrent build test with 30 concurrent builds on 2 m4.xlarge nodes.
2. The problem appeared after around 3000 successful builds.

Actual results:
Build stuck in Running.

Expected results:
The build status should also show Failed, matching the pod.
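
The expected behaviour can be illustrated with a small decision function. This is a simplified, hypothetical sketch and not the actual build controller code from openshift/origin: the function name and the exact mapping are assumptions, chosen so that a failed pod always moves a non-terminal build to Failed. The phase names are the standard Kubernetes pod phases and the OpenShift build phases.

```python
# Hypothetical sketch; not taken from the openshift/origin build controller.
TERMINAL_BUILD_PHASES = {"Complete", "Failed", "Error", "Cancelled"}


def reconcile_build_phase(build_phase, pod_phase):
    """Return the phase a build should have, given its pod's phase."""
    # A build that already reached a terminal phase is left alone.
    if build_phase in TERMINAL_BUILD_PHASES:
        return build_phase
    # The case from this report: the pod failed, so the build must not
    # stay New, Pending, or Running.
    if pod_phase == "Failed":
        return "Failed"
    # Simplifying assumption: a succeeded pod means a completed build.
    if pod_phase == "Succeeded":
        return "Complete"
    if pod_phase == "Running":
        return "Running"
    # Pending or Unknown pods leave the build phase unchanged.
    return build_phase


print(reconcile_build_phase("Running", "Failed"))
```

Written as a pure function of the two observed phases, the result does not depend on the order in which pod events arrive, which is the property a racy update path lacks.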

Additional info:

Comment 1 Vikas Laad 2017-05-12 16:28:03 UTC
Created attachment 1278277: pod json

Comment 2 Vikas Laad 2017-05-12 16:29:44 UTC
Created attachment 1278278: build json

Comment 3 Ben Parees 2017-05-12 17:47:57 UTC
We're going to need master logs with level 5 tracing to see what happened within the pod controller for this.

Comment 4 Ben Parees 2017-05-31 20:08:13 UTC
Marking this as upcoming release, since Cesar's PR that reworks all this logic is going to land at the start of next sprint.

Comment 5 Ben Parees 2017-05-31 20:09:02 UTC
Relevant PR: https://github.com/openshift/origin/pull/14289

Comment 7 Hongkai Liu 2017-07-07 19:14:06 UTC
Reran the test with 50 concurrent builds; all builds succeeded.

Comment 8 Hongkai Liu 2017-07-07 19:28:56 UTC
(In reply to Hongkai Liu from comment #7)
> Reran the test with 50 concurrent builds; all builds succeeded.

Verified on 3.6.133