Bug 1689061
| Summary: | Evicted builds don't have a specific status reason, instead are GenericBuildFailure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Build | Assignee: | Adam Kaplan <adam.kaplan> |
| Status: | CLOSED ERRATA | QA Contact: | wewang <wewang> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | CC: | aos-bugs, hongkliu, skuznets, sponnaga, wzheng |
| Version: | 4.1.0 | Target Milestone: | --- |
| Target Release: | 4.1.0 | Hardware: | Unspecified |
| OS: | Unspecified | Whiteboard: | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | Clone Of: | |
| | 1690066 (view as bug list) | Environment: | |
| Last Closed: | 2019-06-04 10:45:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | Bug Depends On: | |
| Bug Blocks: | 1690066 | | |

Doc Text:
Cause: if a build pod was evicted, the build reported a GenericBuildFailure.
Consequence: cluster administrators could not determine why builds failed when the node was under resource pressure.
Fix: a new failure reason, `BuildPodEvicted`, was added.
Result: builds that fail due to pod eviction report `BuildPodEvicted` in their status reason.
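Once the fix is in, a quick way to confirm the reason a failed build reports is to read its status fields directly. A minimal sketch: the field paths come from the build object quoted in the description at the bottom of this bug, and django-ex-22 is one of the evicted builds from the verification run below.

$ oc get build django-ex-22 -o jsonpath='{.status.phase} ({.status.reason}){"\n"}'
Failed (BuildPodEvicted)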
Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.

API PR: https://github.com/openshift/api/pull/255
Origin PR: https://github.com/openshift/origin/pull/22344

@Adam Kaplan, I tried to run `oc adm drain workname --ignore-daemonsets=true` to get the build pod into the Evicted status, but I did not see any eviction info when checking the controller-manager pod logs in the openshift-controller-manager project.

@wewang if the build pod was indeed evicted, the build object should report its status as Failed with the BuildPodEvicted reason code.

@Adam Kaplan I cannot get the pod into the Evicted status with the following steps; could you help correct me?

Steps:
1. Create the app:
   $ oc new-app openshift/ruby~https://github.com/sclorg/ruby-ex.git
2. While the build pod on node1 is in Running status, add a NoExecute taint to node1:
   $ oc adm taint nodes node1 dedicated=special-user:NoExecute
3. The build pod then goes from Running directly to Terminating (never Evicted) and is finally deleted:
   $ oc get pods --watch
   NAME              READY   STATUS        RESTARTS   AGE
   ruby-ex-1-build   1/1     Running       0          25s
   ruby-ex-1-build   1/1     Terminating   0          31s
4. Check the pod of openshift-controller-manager; there is no eviction info:
   $ oc logs -f pod/controller-manager-kpsxl -n openshift-controller-manager --loglevel=5 | grep -i evict

---------------------------------------------------------------------

I also tried another method using drain, but the drain failed.

Steps:
1. Create the app.
2. While the build pod on node1 is in Running status, drain the node:
   $ oc adm drain ip-172-xxxxx.internal --ignore-daemonsets=true
   node/ip-172xx-xxx.internal cordoned
   error: unable to drain node "ip-172xxxxxxnternal", aborting command...
   There are pending nodes to be drained:
   ip-172-xxxinternal
   error: pods with local storage (use --delete-local-data to override): alertmanager-main-1, alertmanager-main-2, ruby-ex-1-build
3. The build pod completes.

@Adam Kaplan Do you know how to make a build pod fail with Status.Reason = "Evicted"? I have tried many times and still cannot get that result; the pod just goes from Running to Terminating. Thanks.

@Wen the worker node needs to be under memory/disk pressure for the pod to be evicted. My idea to verify:
1. Create a cluster that has a low-memory worker node. Add a unique label to the node.
2. Craft a Docker strategy build that does the following:
   a. Runs a Dockerfile that consumes a lot of RAM (ex: compute the billionth Fibonacci number)
   b. Uses a NodeSelector to run on the low-memory worker node
3. Start the build - hopefully the pod gets evicted, but the build may fail with the OOMKilled reason :/

What I did:
A 3-worker OCP 4 cluster. Default installation config: m4.large worker instances.
In the test, one of the workers (ip-10-0-161-25.us-east-2.compute.internal, Ready,SchedulingDisabled) stays `SchedulingDisabled` the whole time.
The ${standby_build_node} is cordoned/uncordoned to make sure the build pod runs only on the ${running_build_node}.
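For reference, a minimal sketch of that cordon/uncordon step (the ${...} node names are the same placeholders used above):

$ oc adm cordon ${standby_build_node}     # keep the build pod off the standby node
$ oc adm uncordon ${standby_build_node}   # put the node back when the test is done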
Trigger the test with the following script:
https://github.com/hongkailiu/svt-case-doc/blob/master/scripts/simple_1689061.sh
We can modify the size of the file in the dd command (in the above script), or ssh to the ${running_build_node} and run `dd` directly.
When k8s detected that the ${running_build_node} was running low on disk, it evicted the build pod.
The message is different from the one in the description:
message: 'The node was low on resource: ephemeral-storage. Container sti-build was
using 20Ki, which exceeds its request of 0. '
$ oc get build
NAME TYPE FROM STATUS STARTED DURATION
django-ex-16 Source Git@0905223 Complete 2 hours ago 4m34s
django-ex-17 Source Git@0905223 Complete About an hour ago 1m32s
django-ex-18 Source Git@0905223 Complete About an hour ago 11m29s
django-ex-19 Source Git@0905223 Complete About an hour ago 1m47s
django-ex-21 Source Git@0905223 Complete About an hour ago 1m30s
django-ex-22 Source Git@0905223 Failed (BuildPodEvicted) 45 minutes ago 2m27s
django-ex-24 Source Git@0905223 Failed (BuildPodEvicted) 27 minutes ago 4m26s
This is expected, as described in comment 5.
The tricky parts are:
* Before starting the build, the disk has to have enough free space; otherwise, the build stays `Pending`.
* We have to write files to the disk after the build is `Running`, so the build has to run long enough to give k8s the chance to detect the low disk and then evict the build pod (a rough sketch of the manual `dd` variant follows below).
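A rough sketch of that manual `dd` variant, assuming direct ssh access to the node; the user, target path, and file size are arbitrary examples and are not taken from the test script:

$ ssh core@${running_build_node}            # 'core' is an assumption; adjust for the node image
$ dd if=/dev/zero of=/var/tmp/fill.img bs=1M count=20000   # fill the disk while the build is Running
$ exit
$ oc get build -w                           # watch for Failed (BuildPodEvicted)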
Moving this bug to `Verified`.
@Hongkai Liu Thanks a lot for it.

*** Bug 1705128 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
Builds should get a failure reason for evicted pods and not be GenericBuildFailure. On 3.11 api.ci we see evictions frequently and they are hard to debug. Builds should report a reason for eviction.

Request backport to origin 3.11 so we can get this on api.ci.

---

apiVersion: build.openshift.io/v1
kind: Build
metadata:
  annotations:
    ci.openshift.io/job-spec: '{"type":"postsubmit","job":"branch-ci-openshift-release-controller-master-images","buildid":"53","prowjobid":"bc04d182-46d5-11e9-b760-0a58ac10b13f","refs":{"org":"openshift","repo":"release-controller","base_ref":"master","base_sha":"253573b4cccb254de4bdd621499bdc30c2769c29","base_link":"https://github.com/openshift/release-controller/compare/4375b4ac2e8d...253573b4cccb"}}'
    openshift.io/build.pod-name: release-controller-build
  creationTimestamp: 2019-03-15T03:53:29Z
  labels:
    build-id: "53"
    created-by-ci: "true"
    creates: release-controller
    job: branch-ci-openshift-release-controller-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: bc04d182-46d5-11e9-b760-0a58ac10b13f
  name: release-controller
  namespace: ci-op-7g4vf063
  ownerReferences:
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: d435bb30-46d5-11e9-9b95-42010a8e0003
  resourceVersion: "90888834"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-7g4vf063/builds/release-controller
  uid: e5043c52-46d5-11e9-9b95-42010a8e0003
spec:
  nodeSelector: null
  output:
    imageLabels:
    - name: vcs-type
      value: git
    - name: vcs-url
      value: https://github.com/openshift/release-controller
    - name: io.openshift.build.name
    - name: io.openshift.build.namespace
    - name: io.openshift.build.commit.ref
      value: master
    - name: io.openshift.build.source-location
      value: https://github.com/openshift/release-controller
    - name: vcs-ref
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.id
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.message
    - name: io.openshift.build.commit.author
    - name: io.openshift.build.commit.date
    - name: io.openshift.build.source-context-dir
    pushSecret:
      name: builder-dockercfg-k4g2k
    to:
      kind: ImageStreamTag
      name: pipeline:release-controller
      namespace: ci-op-7g4vf063
  postCommit: {}
  resources:
    limits:
      memory: 6Gi
    requests:
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
  source:
    images:
    - as:
      - "0"
      from:
        kind: ImageStreamTag
        name: pipeline:root
      paths: null
    - as: null
      from:
        kind: ImageStreamTag
        name: pipeline:src
      paths:
      - destinationDir: .
        sourcePath: /go/src/github.com/openshift/release-controller///.
    type: Image
  strategy:
    dockerStrategy:
      forcePull: true
      from:
        kind: ImageStreamTag
        name: pipeline:os
        namespace: ci-op-7g4vf063
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: 2019-03-15T03:53:53Z
  message: Generic Build failure - check logs for details.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-7g4vf063/pipeline:release-controller
  phase: Failed
  reason: GenericBuildFailed
  startTimestamp: 2019-03-15T03:53:53Z

---

status:
  message: 'Pod The node was low on resource: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: 2019-03-15T03:53:53Z
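For reference, the pod-level status quoted above can be read straight from the build pod while it still exists. This is illustrative only; the pod name comes from the openshift.io/build.pod-name annotation and the namespace from the build's metadata above.

$ oc get pod release-controller-build -n ci-op-7g4vf063 -o jsonpath='{.status.reason}: {.status.message}{"\n"}'
Evicted: Pod The node was low on resource: [DiskPressure].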