Bug 1458884

Summary: Build status shows Pending for Failed pod
Product: OpenShift Container Platform Reporter: Vikas Laad <vlaad>
Component: BuildAssignee: Cesar Wong <cewong>
Status: CLOSED NEXTRELEASE QA Contact: Hongkai Liu <hongkliu>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 3.6.0CC: aos-bugs, bparees, decarr, mifiedle, trankin, vlaad, xtian
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: When running many concurrent builds, the build controller will not update a pending build to failed when the corresponding pod fails. Consequence: The status of the build is not updated correctly. Fix: The build controller code has been refactored to avoid race conditions and update build status correctly. Result: The build status should no longer get out of sync with the corresponding pod.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-14 18:42:46 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
build json
none
pod json
none
logs from master none

Description Vikas Laad 2017-06-05 18:12:33 UTC
Description of problem:
While running concurrent build I am running into this problem where build status shows Pending but the Pod shows failed.

oc get pods -n proj23
NAME                             READY     STATUS                                                 RESTARTS   AGE 
cakephp-mysql-example-30-build   0/1       rpc error: code = 4 desc = context deadline exceeded   0          2h 

oc get build -n proj23
cakephp-mysql-example-30   Source    Git           Pending 

Version-Release number of selected component (if applicable):
openshift v3.6.94
kubernetes v1.6.1+5115d708d7
etcd 3.1.0

How reproducible:
Run many concurrent build until docker starts running into context deadline exceeded problems. I was able to do this by running 50 concurrent builds on 4 m4.xlarge compute nodes around 30 cycles.

Actual results:
Build status is shown wrong, it should be Failed.

Expected results:
Build status is shown wrong, it should be Failed.


Additional info:
See pod, build json attached.

Comment 1 Vikas Laad 2017-06-05 18:13:23 UTC
Created attachment 1285094 [details]
build json

Comment 2 Vikas Laad 2017-06-05 18:13:47 UTC
Created attachment 1285095 [details]
pod json

Comment 3 Vikas Laad 2017-06-05 18:15:26 UTC
Created attachment 1285096 [details]
logs from master

Comment 4 Ben Parees 2017-06-13 12:31:11 UTC
Vikas did you open a separate bug for the problem that "oc get pods" is leaking the wrong information into the "status" column in the output?  (the pod json shows the correct value).

For the main issue (build shows pending) we expect this to be fixed when Cesar's build controller PR merges:
https://github.com/openshift/origin/pull/14596

Comment 5 Vikas Laad 2017-06-13 12:45:18 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1449820

Here it is.

Comment 6 Ben Parees 2017-06-13 12:47:38 UTC
Perfect, thanks Vikas!

Comment 8 Hongkai Liu 2017-07-07 19:06:02 UTC
catch "rpc error: code = 4 desc = context deadline exceeded" key words in /var/lib/messages file by doing 50 concurrent builds. It showed up:

[hongkliu@hongkliu ttt]$ grep "context deadline exceeded" conc.build.20170706/messages.node4 | grep proj25 | grep example-8
Jul  6 15:04:58 ip-172-31-8-32 atomic-openshift-node: E0706 15:04:58.717674   28712 generic.go:241] PLEG: Ignoring events for pod django-psql-example-8-build/proj25: rpc error: code = 4 desc = context deadline exceeded

=============
At the meantime, oc-get-pods every 2 mins and record the result in file /tmp/bbb.txt:

root@ip-172-31-31-187: ~/svt/manual.steps # cat /tmp/bbb.txt | grep django-psql-example-8 | grep proj25
proj25      django-psql-example-8-build               0/1       ContainerCreating   0          44s       <none>         ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Running                0          2m        172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Running                0          4m        172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Running                0          6m        172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Running                0          8m        172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          10m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          12m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          14m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          16m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          18m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          20m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          22m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Unknown                0          24m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Terminating            0          26m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Terminating            0          28m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Terminating            0          30m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal
proj25      django-psql-example-8-build               1/1       Terminating            0          32m       172.20.6.24    ip-172-31-8-139.us-west-2.compute.internal

===
It is confirmed the internal information does not show as the status of pod.

One question:
Is "Unknown" the expected status in this situation?

===
I will provide link of the log files in a private comment.

Comment 10 Cesar Wong 2017-07-07 19:14:08 UTC
Hi hongkliu,
This is more of a question for the Kubernetes team. The build controller is not responsible for the status of the pod itself. I assume that because the node is taking too long to communicate with the master, that the master updates the status of the pod to Unknown. 

Derek, do you know under what conditions the pod's status would be changed to Unknown?

Comment 11 Hongkai Liu 2017-07-07 19:27:08 UTC
(In reply to Hongkai Liu from comment #8)
> catch "rpc error: code = 4 desc = context deadline exceeded" key words in
> /var/lib/messages file by doing 50 concurrent builds. It showed up:
> 
> [hongkliu@hongkliu ttt]$ grep "context deadline exceeded"
> conc.build.20170706/messages.node4 | grep proj25 | grep example-8
> Jul  6 15:04:58 ip-172-31-8-32 atomic-openshift-node: E0706 15:04:58.717674 
> 28712 generic.go:241] PLEG: Ignoring events for pod
> django-psql-example-8-build/proj25: rpc error: code = 4 desc = context
> deadline exceeded
> 
> =============
> At the meantime, oc-get-pods every 2 mins and record the result in file
> /tmp/bbb.txt:
> 
> root@ip-172-31-31-187: ~/svt/manual.steps # cat /tmp/bbb.txt | grep
> django-psql-example-8 | grep proj25
> proj25      django-psql-example-8-build               0/1      
> ContainerCreating   0          44s       <none>        
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Running     
> 0          2m        172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Running     
> 0          4m        172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Running     
> 0          6m        172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Running     
> 0          8m        172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          10m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          12m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          14m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          16m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          18m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          20m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          22m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Unknown     
> 0          24m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Terminating 
> 0          26m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Terminating 
> 0          28m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Terminating 
> 0          30m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> proj25      django-psql-example-8-build               1/1       Terminating 
> 0          32m       172.20.6.24   
> ip-172-31-8-139.us-west-2.compute.internal
> 
> ===
> It is confirmed the internal information does not show as the status of pod.
> 
> One question:
> Is "Unknown" the expected status in this situation?
> 
> ===
> I will provide link of the log files in a private comment.

Verified on 3.6.133

Comment 14 Red Hat Bugzilla 2023-09-14 03:58:47 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days