Bug 1458884
| Summary: | Build status shows Pending for Failed pod | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> | ||||||||
| Component: | Build | Assignee: | Cesar Wong <cewong> | ||||||||
| Status: | CLOSED NEXTRELEASE | QA Contact: | Hongkai Liu <hongkliu> | ||||||||
| Severity: | unspecified | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 3.6.0 | CC: | aos-bugs, bparees, decarr, mifiedle, trankin, vlaad, xtian | ||||||||
| Target Milestone: | --- | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||
| Doc Text: |
Cause:
When running many concurrent builds, the build controller will not update a pending build to failed when the corresponding pod fails.
Consequence:
The status of the build is not updated correctly.
Fix:
The build controller code has been refactored to avoid race conditions and update build status correctly.
Result:
The build status should no longer get out of sync with the corresponding pod.
|
Story Points: | --- | ||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2017-08-14 18:42:46 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
|
||||||||||
|
Description
Vikas Laad
2017-06-05 18:12:33 UTC
Created attachment 1285094 [details]
build json
Created attachment 1285095 [details]
pod json
Created attachment 1285096 [details]
logs from master
Vikas did you open a separate bug for the problem that "oc get pods" is leaking the wrong information into the "status" column in the output? (the pod json shows the correct value). For the main issue (build shows pending) we expect this to be fixed when Cesar's build controller PR merges: https://github.com/openshift/origin/pull/14596 Perfect, thanks Vikas! catch "rpc error: code = 4 desc = context deadline exceeded" key words in /var/lib/messages file by doing 50 concurrent builds. It showed up: [hongkliu@hongkliu ttt]$ grep "context deadline exceeded" conc.build.20170706/messages.node4 | grep proj25 | grep example-8 Jul 6 15:04:58 ip-172-31-8-32 atomic-openshift-node: E0706 15:04:58.717674 28712 generic.go:241] PLEG: Ignoring events for pod django-psql-example-8-build/proj25: rpc error: code = 4 desc = context deadline exceeded ============= At the meantime, oc-get-pods every 2 mins and record the result in file /tmp/bbb.txt: root@ip-172-31-31-187: ~/svt/manual.steps # cat /tmp/bbb.txt | grep django-psql-example-8 | grep proj25 proj25 django-psql-example-8-build 0/1 ContainerCreating 0 44s <none> ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Running 0 2m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Running 0 4m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Running 0 6m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Running 0 8m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 10m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 12m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 14m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 16m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 18m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 20m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 22m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Unknown 0 24m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Terminating 0 26m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Terminating 0 28m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Terminating 0 30m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal proj25 django-psql-example-8-build 1/1 Terminating 0 32m 172.20.6.24 ip-172-31-8-139.us-west-2.compute.internal === It is confirmed the internal information does not show as the status of pod. One question: Is "Unknown" the expected status in this situation? === I will provide link of the log files in a private comment. Hi hongkliu, This is more of a question for the Kubernetes team. The build controller is not responsible for the status of the pod itself. I assume that because the node is taking too long to communicate with the master, that the master updates the status of the pod to Unknown. Derek, do you know under what conditions the pod's status would be changed to Unknown? (In reply to Hongkai Liu from comment #8) > catch "rpc error: code = 4 desc = context deadline exceeded" key words in > /var/lib/messages file by doing 50 concurrent builds. It showed up: > > [hongkliu@hongkliu ttt]$ grep "context deadline exceeded" > conc.build.20170706/messages.node4 | grep proj25 | grep example-8 > Jul 6 15:04:58 ip-172-31-8-32 atomic-openshift-node: E0706 15:04:58.717674 > 28712 generic.go:241] PLEG: Ignoring events for pod > django-psql-example-8-build/proj25: rpc error: code = 4 desc = context > deadline exceeded > > ============= > At the meantime, oc-get-pods every 2 mins and record the result in file > /tmp/bbb.txt: > > root@ip-172-31-31-187: ~/svt/manual.steps # cat /tmp/bbb.txt | grep > django-psql-example-8 | grep proj25 > proj25 django-psql-example-8-build 0/1 > ContainerCreating 0 44s <none> > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Running > 0 2m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Running > 0 4m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Running > 0 6m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Running > 0 8m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 10m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 12m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 14m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 16m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 18m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 20m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 22m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Unknown > 0 24m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Terminating > 0 26m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Terminating > 0 28m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Terminating > 0 30m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > proj25 django-psql-example-8-build 1/1 Terminating > 0 32m 172.20.6.24 > ip-172-31-8-139.us-west-2.compute.internal > > === > It is confirmed the internal information does not show as the status of pod. > > One question: > Is "Unknown" the expected status in this situation? > > === > I will provide link of the log files in a private comment. Verified on 3.6.133 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days |