Bug 1500498
| Summary: | Builds stuck in Pending status | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Vikas Laad <vlaad> |
| Component: | Node | Assignee: | Avesh Agarwal <avagarwa> |
| Status: | CLOSED DUPLICATE | QA Contact: | DeShuai Ma <dma> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.7.0 | CC: | aos-bugs, jokerman, mifiedle, mmccomas, vlaad |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-10-24 23:20:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Vikas Laad
2017-10-10 18:25:38 UTC
This seems like the most interesting event:

9m 14m 10 django-psql-example-1-build Pod Normal SandboxChanged kubelet, ip-172-31-62-163.us-west-2.compute.internal Pod sandbox changed, it will be killed and re-created.

But in general, any case where a pod is stuck in Pending should be looked at by the pod team.

Created attachment 1336886 [details]
pod yaml

Builds finally finish after around 2 hours. One of the builds will start after a few minutes, and once that finishes they will all be pending again for some time.

There are many non-bug explanations for why this might occur. Did all pods schedule at the same time? What were the pod resource limits? max-pods and pods-per-core in the node config? Node resource capacity/allocatable values?

My understanding is that the pods did eventually complete successfully. What exactly is the issue then?

Yes, all the pods were scheduled at the same time.

kubeletArguments:
  cloud-config:
  - /etc/origin/cloudprovider/aws.conf
  cloud-provider:
  - aws
  image-gc-high-threshold:
  - '80'
  image-gc-low-threshold:
  - '70'
  max-pods:
  - '510'
  maximum-dead-containers:
  - '20'
  maximum-dead-containers-per-container:
  - '1'
  minimum-container-ttl-duration:
  - 10s
  node-labels:
  - region=infra
  - zone=default
  pods-per-core:
  - '0'

The issue is that out of 50 builds, around 40 complete and then there is no activity for an hour. Then some of the builds will go to Running state and complete, and after that there is again no activity for some time. My question is why pods are remaining in Pending when both nodes have no Running pods? Please let me know if you want me to reproduce and give you access to the cluster.

This is very different behavior from previous releases, when all build pods would go to Running within a few minutes at most and run to completion. There have been bugs across releases where builds hang in various states and for various reasons, but they were eventually fixed. https://www.google.com/search?q=site%3Abugzilla.redhat.com+openshift+build+hangs

Vikas, I am going through the previous comments/history/logs. Meanwhile, whenever you observe/reproduce the issue, please give me access to the cluster.

From the stuck build pod yaml:
initContainerStatuses:
- image: registry.ops.openshift.com/openshift3/ose-sti-builder:v3.7.0-0.158.0
  imageID: ""
  lastState: {}
  name: git-clone
  ready: false
  restartCount: 0
  state:
    waiting:
      reason: PodInitializing
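
As a rough illustration of what this status fragment says, the sketch below lists every init container that is still in a waiting state. The function name `waiting_init_containers` is hypothetical, and the dictionary is simply the fragment above transcribed by hand; on a live cluster the same structure would presumably come from the `.status` field of `oc get pod <name> -o json`.

```python
def waiting_init_containers(pod_status):
    """Return (name, reason) for each init container still in a waiting state."""
    stuck = []
    for cs in pod_status.get("initContainerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting is not None:
            stuck.append((cs["name"], waiting.get("reason", "")))
    return stuck


# Status fragment transcribed by hand from the stuck build pod's YAML.
status = {
    "initContainerStatuses": [
        {
            "image": "registry.ops.openshift.com/openshift3/ose-sti-builder:v3.7.0-0.158.0",
            "imageID": "",
            "lastState": {},
            "name": "git-clone",
            "ready": False,
            "restartCount": 0,
            "state": {"waiting": {"reason": "PodInitializing"}},
        }
    ]
}

print(waiting_init_containers(status))  # [('git-clone', 'PodInitializing')]
```

Here the first init container, git-clone, has never started (empty imageID, restartCount 0), which matches a pod that has not progressed past its init phase.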
And the pod list:
# oc get pods --all-namespaces|grep -v Completed
# NAMESPACE NAME READY STATUS RESTARTS AGE
svt-proj-044 eap-app-3-build 0/1 Init:0/2 0 4h
The same behaviour is observed in https://bugzilla.redhat.com/show_bug.cgi?id=1486356 just manifested in a different way. So closing it as a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1486356.
*** This bug has been marked as a duplicate of bug 1486356 ***
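
For reference, here is a minimal sketch of how the two count-based kubelet settings asked about in this report (max-pods and pods-per-core) combine into a per-node pod cap. It relies on the documented kubelet behavior that a pods-per-core value of 0 disables the per-core limit and that the total can never exceed max-pods. The function name `effective_pod_limit` and the core count are hypothetical; the report does not state how many cores the nodes have. The sketch covers only the pod-count cap, not CPU or memory capacity.

```python
def effective_pod_limit(max_pods, pods_per_core, cores):
    """Count-based pod cap for one node.

    pods_per_core == 0 disables the per-core limit; otherwise the cap is
    the smaller of max_pods and pods_per_core * cores.
    """
    if pods_per_core <= 0:
        return max_pods
    return min(max_pods, pods_per_core * cores)


# max-pods and pods-per-core are taken from the kubeletArguments in this
# report; the core count is an assumed placeholder, not a reported value.
ASSUMED_CORES = 4
limit = effective_pod_limit(max_pods=510, pods_per_core=0, cores=ASSUMED_CORES)
print(limit)  # 510
```

With pods-per-core set to 0, the cap is 510 pods per node regardless of core count, well above the 50 builds mentioned in the report.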