Builds should get a failure reason for evicted pods and not be GenericBuildFailure. On 3.11 api.ci we see evictions frequently, and they are hard to debug. Builds should report a reason for eviction. Requesting a backport to origin 3.11 so we can get this on api.ci.

---
apiVersion: build.openshift.io/v1
kind: Build
metadata:
  annotations:
    ci.openshift.io/job-spec: '{"type":"postsubmit","job":"branch-ci-openshift-release-controller-master-images","buildid":"53","prowjobid":"bc04d182-46d5-11e9-b760-0a58ac10b13f","refs":{"org":"openshift","repo":"release-controller","base_ref":"master","base_sha":"253573b4cccb254de4bdd621499bdc30c2769c29","base_link":"https://github.com/openshift/release-controller/compare/4375b4ac2e8d...253573b4cccb"}}'
    openshift.io/build.pod-name: release-controller-build
  creationTimestamp: 2019-03-15T03:53:29Z
  labels:
    build-id: "53"
    created-by-ci: "true"
    creates: release-controller
    job: branch-ci-openshift-release-controller-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: bc04d182-46d5-11e9-b760-0a58ac10b13f
  name: release-controller
  namespace: ci-op-7g4vf063
  ownerReferences:
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: d435bb30-46d5-11e9-9b95-42010a8e0003
  resourceVersion: "90888834"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-7g4vf063/builds/release-controller
  uid: e5043c52-46d5-11e9-9b95-42010a8e0003
spec:
  nodeSelector: null
  output:
    imageLabels:
    - name: vcs-type
      value: git
    - name: vcs-url
      value: https://github.com/openshift/release-controller
    - name: io.openshift.build.name
    - name: io.openshift.build.namespace
    - name: io.openshift.build.commit.ref
      value: master
    - name: io.openshift.build.source-location
      value: https://github.com/openshift/release-controller
    - name: vcs-ref
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.id
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.message
    - name: io.openshift.build.commit.author
    - name: io.openshift.build.commit.date
    - name: io.openshift.build.source-context-dir
    pushSecret:
      name: builder-dockercfg-k4g2k
    to:
      kind: ImageStreamTag
      name: pipeline:release-controller
      namespace: ci-op-7g4vf063
  postCommit: {}
  resources:
    limits:
      memory: 6Gi
    requests:
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
  source:
    images:
    - as:
      - "0"
      from:
        kind: ImageStreamTag
        name: pipeline:root
      paths: null
    - as: null
      from:
        kind: ImageStreamTag
        name: pipeline:src
      paths:
      - destinationDir: .
        sourcePath: /go/src/github.com/openshift/release-controller///.
    type: Image
  strategy:
    dockerStrategy:
      forcePull: true
      from:
        kind: ImageStreamTag
        name: pipeline:os
        namespace: ci-op-7g4vf063
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: 2019-03-15T03:53:53Z
  message: Generic Build failure - check logs for details.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-7g4vf063/pipeline:release-controller
  phase: Failed
  reason: GenericBuildFailed
  startTimestamp: 2019-03-15T03:53:53Z
---
status:
  message: 'Pod The node was low on resource: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: 2019-03-15T03:53:53Z
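For quick triage, the reason can be read straight off the Build object's status. A minimal sketch (the saved-status file and awk one-liner are our own illustration; on a live cluster the equivalent would be `oc get build release-controller -n ci-op-7g4vf063 -o jsonpath='{.status.reason}'`):

```shell
# Save the Build status block above locally, then pull out the reason field.
cat > /tmp/build-status.yaml <<'EOF'
status:
  phase: Failed
  reason: GenericBuildFailed
EOF
# Prints GenericBuildFailed - the unhelpful catch-all reason this bug is about.
awk -F': ' '$1 ~ /^ *reason$/ {print $2}' /tmp/build-status.yaml
```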
Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.
API PR: https://github.com/openshift/api/pull/255 Origin PR: https://github.com/openshift/origin/pull/22344
@Adam Kaplan, I tried to execute the command: oc adm drain workname --ignore-daemonsets=true to put the build pod into an evicted status, but I did not see any eviction info when checking the controller-manager pod logs in the openshift-controller-manager project. What am I missing?
@wewang if the build pod was indeed evicted, the build object should report its status as Failed with the BuildPodEvicted reason code.
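The intended mapping can be sketched as a tiny shell function (the function name and the fallback branch are our own illustration, not OpenShift code; only the Evicted -> BuildPodEvicted mapping is what this fix adds):

```shell
# Hypothetical sketch of the controller's decision after the fix: a pod whose
# status.reason is Evicted should surface on the Build as BuildPodEvicted;
# other failures still fall back to GenericBuildFailed.
build_reason_for_pod() {
  case "$1" in
    Evicted) echo BuildPodEvicted ;;
    *)       echo GenericBuildFailed ;;
  esac
}

build_reason_for_pod Evicted      # BuildPodEvicted
build_reason_for_pod OOMKilled    # GenericBuildFailed
```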
@Adam Kaplan I cannot get the pod into an evicted status when following these steps; could you correct me?

Steps:
1. Create the app:
   $ oc new-app openshift/ruby~https://github.com/sclorg/ruby-ex.git
2. When the build pod on node1 is in Running status, add a NoExecute taint to node1:
   $ oc adm taint nodes node1 dedicated=special-user:NoExecute
3. The build pod then goes straight from Running to Terminating status, never shows an Evicted status, and is finally deleted:
   $ oc get pods --watch
   NAME              READY   STATUS        RESTARTS   AGE
   ruby-ex-1-build   1/1     Running       0          25s
   ruby-ex-1-build   1/1     Terminating   0          31s
4. Check the openshift-controller-manager pod; there is no eviction info:
   $ oc logs -f pod/controller-manager-kpsxl -n openshift-controller-manager --loglevel=5 | grep -i evict

---------------------------------------------------------------------
I also tried another method using drain, but the drain failed. Steps:
1. Create the app.
2. When the build pod on node1 is in Running status, drain the node:
   $ oc adm drain ip-172-xxxxx.internal --ignore-daemonsets=true
   node/ip-172xx-xxx.internal cordoned
   error: unable to drain node "ip-172xxxxxxnternal", aborting command...
   There are pending nodes to be drained: ip-172-xxxinternal
   error: pods with local storage (use --delete-local-data to override): alertmanager-main-1, alertmanager-main-2, ruby-ex-1-build
3. The build pod completes.
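One likely explanation for the Terminating-only behavior above (our reading, not something stated in the thread): NoExecute taints and oc adm drain remove the pod through the API server, while the Evicted status reason is set only by the kubelet's node-pressure eviction. A toy sketch of that distinction:

```shell
# Toy model (names are ours): only kubelet node-pressure eviction leaves
# status.reason=Evicted on the pod; API-initiated removal (taint/drain)
# just terminates the pod, so there is no Evicted reason to observe.
pod_status_reason() {
  case "$1" in
    node-pressure)  echo Evicted ;;   # kubelet eviction sets the reason
    taint|drain)    echo "" ;;        # API delete: pod only goes Terminating
  esac
}

pod_status_reason node-pressure   # Evicted
pod_status_reason taint           # (empty)
```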
@Adam Kaplan Do you know how to make a build pod fail with Status.Reason = "Evicted"? I have tried many times and still cannot get that result; the pod just goes from Running to Terminating. Thanks.
@Wen the worker node needs to be under memory/disk pressure for the pod to be evicted. My idea to verify:
1. Create a cluster that has a low-memory worker node. Add a unique label to the node.
2. Craft a Docker-strategy build that does the following:
   a. Runs a Dockerfile that consumes a lot of RAM (ex: compute the billionth Fibonacci number)
   b. Uses a nodeSelector to run on the low-memory worker node
3. Start the build - hopefully the pod gets evicted, but the build may fail with the OOMKilled reason :/
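Step 2a above could look something like this (the Dockerfile content and base image are our own guess at a memory-hungry build step, not taken from the bug report):

```shell
# Write a Dockerfile whose RUN step allocates ~2 GiB during the build,
# which should pressure a low-memory worker whose capacity is below that.
cat > /tmp/Dockerfile.oom <<'EOF'
FROM python:3
RUN python3 -c "buf = bytearray(2 * 1024**3); print(len(buf))"
EOF
wc -l < /tmp/Dockerfile.oom
```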
What I did: a 3-worker OCP 4 cluster, default installation config with m4.large worker instances.

In the test, one of the workers (ip-10-0-161-25.us-east-2.compute.internal Ready,SchedulingDisabled) is SchedulingDisabled all the time. The ${standby_build_node} is cordoned/uncordoned to make sure the build pod runs only on the ${running_build_node}.

Trigger the test with the following script: https://github.com/hongkailiu/svt-case-doc/blob/master/scripts/simple_1689061.sh

We can modify the size of the file in the dd command (in the above script), or ssh to the ${running_build_node} and run dd directly.

When k8s detected that the ${running_build_node} was running low on disk, it evicted the build pod. The message is different from the one in the description:

message: 'The node was low on resource: ephemeral-storage. Container sti-build was using 20Ki, which exceeds its request of 0. '

$ oc get build
NAME           TYPE     FROM          STATUS                     STARTED             DURATION
django-ex-16   Source   Git@0905223   Complete                   2 hours ago         4m34s
django-ex-17   Source   Git@0905223   Complete                   About an hour ago   1m32s
django-ex-18   Source   Git@0905223   Complete                   About an hour ago   11m29s
django-ex-19   Source   Git@0905223   Complete                   About an hour ago   1m47s
django-ex-21   Source   Git@0905223   Complete                   About an hour ago   1m30s
django-ex-22   Source   Git@0905223   Failed (BuildPodEvicted)   45 minutes ago      2m27s
django-ex-24   Source   Git@0905223   Failed (BuildPodEvicted)   27 minutes ago      4m26s

This is expected, as described in comment 5. The tricky parts:
* Before starting the build, the disk has to have enough space; otherwise the build stays Pending.
* We have to write files to the disk after the build is Running, so the build has to run long enough to give k8s the chance to detect the low disk and then evict the build pod.

Moving this bug to Verified.
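As a side note, the evicted builds in a listing like the one above can be picked out mechanically. A throwaway sketch over a saved copy of that output (the file path and subset of rows are ours):

```shell
# Count builds that failed due to pod eviction in a saved `oc get build` listing.
cat > /tmp/builds.txt <<'EOF'
django-ex-16   Source   Git@0905223   Complete                   2 hours ago      4m34s
django-ex-22   Source   Git@0905223   Failed (BuildPodEvicted)   45 minutes ago   2m27s
django-ex-24   Source   Git@0905223   Failed (BuildPodEvicted)   27 minutes ago   4m26s
EOF
grep -c 'BuildPodEvicted' /tmp/builds.txt   # 2
```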
@Hongkai Liu Thanks a lot for it
*** Bug 1705128 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758