Bug 1689061
Summary: | Evicted builds don't have a specific status reason; instead they report GenericBuildFailure
---|---
Product: | OpenShift Container Platform
Reporter: | Clayton Coleman <ccoleman>
Component: | Build
Assignee: | Adam Kaplan <adam.kaplan>
Status: | CLOSED ERRATA
QA Contact: | wewang <wewang>
Severity: | high
Docs Contact: |
Priority: | unspecified
Version: | 4.1.0
CC: | aos-bugs, hongkliu, skuznets, sponnaga, wzheng
Target Milestone: | ---
Target Release: | 4.1.0
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: |
Fixed In Version: |
Doc Type: | Bug Fix
Story Points: | ---
Clone Of: |
: | 1690066 (view as bug list)
Environment: |
Last Closed: | 2019-06-04 10:45:52 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | ---
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 1690066

Doc Text:

Cause: If a build pod was evicted, the build reported a GenericBuildFailure.
Consequence: Cluster administrators could not determine why a build failed when the node was under resource pressure.
Fix: A new failure reason, `BuildPodEvicted`, was added.
Result: Builds that fail because their pod was evicted report `BuildPodEvicted` in their status reason.
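For context, the kubelet marks an evicted pod with phase `Failed` and reason `Evicted`; that pod-level signal is what the build controller can now surface as `BuildPodEvicted`. A quick way to inspect it on an already-evicted build pod — a minimal sketch only; the pod name is reused from the reproduction attempts further down and is otherwise hypothetical:

```
# Print the pod's phase and eviction reason (pod name is illustrative).
$ oc get pod ruby-ex-1-build -o jsonpath='{.status.phase}/{.status.reason}{"\n"}'
Failed/Evicted
```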
Description
Clayton Coleman
2019-03-15 04:16:59 UTC
Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.

API PR: https://github.com/openshift/api/pull/255
Origin PR: https://github.com/openshift/origin/pull/22344

@Adam Kaplan, I tried to run `oc adm drain <worker-node> --ignore-daemonsets=true` to get the build pod into the evicted status, but I did not see any eviction info when checking the controller-manager pod logs in the openshift-controller-manager project. Am I missing something?

@wewang if the build pod was indeed evicted, the build object should report its status as Failed with the BuildPodEvicted reason code.

@Adam Kaplan I cannot get the pod into the evicted status with the steps below; could you help correct me?

Steps:
1. Create the app:
       $ oc new-app openshift/ruby~https://github.com/sclorg/ruby-ex.git
2. While the build pod on node1 is in Running status, add a NoExecute taint to node1:
       $ oc adm taint nodes node1 dedicated=special-user:NoExecute
3. The build pod then goes straight from Running to Terminating, never shows an evicted status, and is finally deleted:
       $ oc get pods --watch
       NAME              READY   STATUS        RESTARTS   AGE
       ruby-ex-1-build   1/1     Running       0          25s
       ruby-ex-1-build   1/1     Terminating   0          31s
4. Check the openshift-controller-manager pod; there is no eviction info:
       $ oc logs -f pod/controller-manager-kpsxl -n openshift-controller-manager --loglevel=5 | grep -i evict

---------------------------------------------------------------------

I also tried another method using drain, but the drain failed.

Steps:
1. Create the app.
2. While the build pod on node1 is in Running status, drain the node:
       $ oc adm drain ip-172-xxxxx.internal --ignore-daemonsets=true
       node/ip-172xx-xxx.internal cordoned
       error: unable to drain node "ip-172xxxxxxnternal", aborting command...
       There are pending nodes to be drained: ip-172-xxxinternal
       error: pods with local storage (use --delete-local-data to override): alertmanager-main-1, alertmanager-main-2, ruby-ex-1-build
3. The build pod completes.

@Adam Kaplan Do you know how to make a build pod fail with Status.Reason = "Evicted"? I have tried many times and still cannot get that result; the pod just goes from Running to Terminating. Thanks.

@Wen the worker node needs to be under memory/disk pressure for the pod to be evicted. My idea to verify (a rough sketch of these steps follows this list):
1. Create a cluster that has a low-memory worker node. Add a unique label to the node.
2. Craft a Docker strategy build that does the following:
   a. Runs a Dockerfile that consumes a lot of RAM (e.g. computes the billionth Fibonacci number)
   b. Uses a NodeSelector to run on the low-memory worker node
3. Start the build - hopefully the pod gets evicted, but the build may fail with the OOMKilled reason :/
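To make that idea concrete, here is a rough shell sketch of it. None of this is taken from the bug itself: the node name, label, build name, base image, and the memory-hogging Dockerfile line are all illustrative, and the `oc` flags should be double-checked against your client version.

```
# Label the low-memory worker so the build can be pinned to it (node name is hypothetical).
oc label node ip-10-0-1-23.example.internal buildtest=lowmem

# Create a Docker-strategy build whose Dockerfile tries to allocate far more RAM
# than the worker has (base image and allocation size are illustrative).
oc new-build --name=evict-test \
  -D $'FROM registry.fedoraproject.org/fedora:latest\nRUN python3 -c "b = bytearray(8 * 1024**3); print(len(b))"'

# Pin the build pods to the labeled node via the BuildConfig nodeSelector.
oc patch bc/evict-test --type=merge -p '{"spec":{"nodeSelector":{"buildtest":"lowmem"}}}'

# Start the build and watch for eviction (it may instead fail as OOMKilled, as noted above).
oc start-build evict-test
oc get builds -w
```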
What I did: a 3-worker OCP 4 cluster with the default installation config (m4.large worker instances).

In the test, one of the workers (ip-10-0-161-25.us-east-2.compute.internal Ready,SchedulingDisabled) is `SchedulingDisabled` all the time. The ${standby_build_node} is cordoned/uncordoned to make sure the build pod runs only on the ${running_build_node}.

Trigger the test with the following script: https://github.com/hongkailiu/svt-case-doc/blob/master/scripts/simple_1689061.sh

We can modify the size of the file in the dd command (in the above script), or ssh to the ${running_build_node} and run `dd` directly. When Kubernetes detected that the ${running_build_node} was running low on disk, it evicted the build pod. The message is different from the one in the description:

    message: 'The node was low on resource: ephemeral-storage. Container sti-build was using 20Ki, which exceeds its request of 0. '

    $ oc get build
    NAME           TYPE     FROM          STATUS                     STARTED             DURATION
    django-ex-16   Source   Git@0905223   Complete                   2 hours ago         4m34s
    django-ex-17   Source   Git@0905223   Complete                   About an hour ago   1m32s
    django-ex-18   Source   Git@0905223   Complete                   About an hour ago   11m29s
    django-ex-19   Source   Git@0905223   Complete                   About an hour ago   1m47s
    django-ex-21   Source   Git@0905223   Complete                   About an hour ago   1m30s
    django-ex-22   Source   Git@0905223   Failed (BuildPodEvicted)   45 minutes ago      2m27s
    django-ex-24   Source   Git@0905223   Failed (BuildPodEvicted)   27 minutes ago      4m26s

This is expected, as described in comment 5. The tricky part is:
* Before starting the build, the disk has to have enough free space; otherwise the build stays `Pending`.
* We have to write files to the disk after the build is `Running`, so the build has to run long enough to give Kubernetes the chance to detect the low disk and evict the build pod.

Moving this bug to `Verified`.

@Hongkai Liu Thanks a lot for it.

*** Bug 1705128 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
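For reference, on a cluster that includes this fix, the new reason can be read straight off the build object. A minimal check, assuming a build whose pod was evicted; the build name is borrowed from the verification output above, and the printed value is only an approximation of the expected output:

```
# Print the build's phase and failure reason (build name is illustrative).
$ oc get build django-ex-22 -o jsonpath='{.status.phase}/{.status.reason}{"\n"}'
Failed/BuildPodEvicted
```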