Bug 1689061

Summary: Evicted builds don't have a specific status reason, instead are GenericBuildFailure
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Build
Assignee: Adam Kaplan <adam.kaplan>
Status: CLOSED ERRATA
QA Contact: wewang <wewang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: aos-bugs, hongkliu, skuznets, sponnaga, wzheng
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a build pod was evicted, the build reported a generic failure (`GenericBuildFailed`).
Consequence: Cluster administrators could not determine why builds failed when the node was under resource pressure.
Fix: A new failure reason, `BuildPodEvicted`, was added.
Result: Builds that fail due to pod eviction report `BuildPodEvicted` in their status reason.
Story Points: ---
Clone Of:
Clones: 1690066
Environment:
Last Closed: 2019-06-04 10:45:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1690066

Description Clayton Coleman 2019-03-15 04:16:59 UTC
Builds whose pods are evicted should get a specific failure reason instead of GenericBuildFailed. On 3.11 api.ci we see evictions frequently, and they are hard to debug.

Requesting a backport to origin 3.11 so we can get this on api.ci.

---

apiVersion: build.openshift.io/v1
kind: Build
metadata:
  annotations:
    ci.openshift.io/job-spec: '{"type":"postsubmit","job":"branch-ci-openshift-release-controller-master-images","buildid":"53","prowjobid":"bc04d182-46d5-11e9-b760-0a58ac10b13f","refs":{"org":"openshift","repo":"release-controller","base_ref":"master","base_sha":"253573b4cccb254de4bdd621499bdc30c2769c29","base_link":"https://github.com/openshift/release-controller/compare/4375b4ac2e8d...253573b4cccb"}}'
    openshift.io/build.pod-name: release-controller-build
  creationTimestamp: 2019-03-15T03:53:29Z
  labels:
    build-id: "53"
    created-by-ci: "true"
    creates: release-controller
    job: branch-ci-openshift-release-controller-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: bc04d182-46d5-11e9-b760-0a58ac10b13f
  name: release-controller
  namespace: ci-op-7g4vf063
  ownerReferences:
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: d435bb30-46d5-11e9-9b95-42010a8e0003
  resourceVersion: "90888834"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-7g4vf063/builds/release-controller
  uid: e5043c52-46d5-11e9-9b95-42010a8e0003
spec:
  nodeSelector: null
  output:
    imageLabels:
    - name: vcs-type
      value: git
    - name: vcs-url
      value: https://github.com/openshift/release-controller
    - name: io.openshift.build.name
    - name: io.openshift.build.namespace
    - name: io.openshift.build.commit.ref
      value: master
    - name: io.openshift.build.source-location
      value: https://github.com/openshift/release-controller
    - name: vcs-ref
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.id
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.message
    - name: io.openshift.build.commit.author
    - name: io.openshift.build.commit.date
    - name: io.openshift.build.source-context-dir
    pushSecret:
      name: builder-dockercfg-k4g2k
    to:
      kind: ImageStreamTag
      name: pipeline:release-controller
      namespace: ci-op-7g4vf063
  postCommit: {}
  resources:
    limits:
      memory: 6Gi
    requests:
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
  source:
    images:
    - as:
      - "0"
      from:
        kind: ImageStreamTag
        name: pipeline:root
      paths: null
    - as: null
      from:
        kind: ImageStreamTag
        name: pipeline:src
      paths:
      - destinationDir: .
        sourcePath: /go/src/github.com/openshift/release-controller///.
    type: Image
  strategy:
    dockerStrategy:
      forcePull: true
      from:
        kind: ImageStreamTag
        name: pipeline:os
        namespace: ci-op-7g4vf063
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: 2019-03-15T03:53:53Z
  message: Generic Build failure - check logs for details.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-7g4vf063/pipeline:release-controller
  phase: Failed
  reason: GenericBuildFailed
  startTimestamp: 2019-03-15T03:53:53Z


---

status:
  message: 'Pod The node was low on resource: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: 2019-03-15T03:53:53Z

Comment 1 Clayton Coleman 2019-03-15 04:17:58 UTC
Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.

Comment 4 wewang 2019-04-01 09:22:00 UTC
@Adam Kaplan, I ran `oc adm drain workname --ignore-daemonsets=true` to get the build pod into the Evicted status, but I did not see any eviction messages when I checked the controller-manager pod logs in the openshift-controller-manager project. Is that expected?

Comment 5 Adam Kaplan 2019-04-01 12:57:32 UTC
@wewang if the build pod was indeed evicted, the build object should report its status as Failed with the BuildPodEvicted reason code.
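
The expected mapping from pod status to build status can be sketched roughly as below. This is an illustration only, not the build controller's actual code: the function name `build_failure_for_pod` and the plain-dict layout are assumptions made for the example, and passing the pod's message through to the build is a choice made for this sketch. Only the `Evicted` pod reason, the `BuildPodEvicted` and `GenericBuildFailed` build reasons, and the generic message text come from this report. The sketch reads nothing but `phase`, `reason`, and `message`, because an evicted pod may carry no other status (comment 1).

```python
from typing import Optional, Tuple

GENERIC_REASON = "GenericBuildFailed"
GENERIC_MESSAGE = "Generic Build failure - check logs for details."
EVICTED_REASON = "BuildPodEvicted"


def build_failure_for_pod(pod_status: dict) -> Optional[Tuple[str, str]]:
    """Return (reason, message) for a failed build pod, or None if the pod has not failed.

    Hypothetical helper for illustration; pod_status is the pod's `status` as a dict.
    """
    if pod_status.get("phase") != "Failed":
        return None
    if pod_status.get("reason") == "Evicted":
        # Assumption of this sketch: reuse the pod's own message so the build
        # records which resource the node was short on.
        return EVICTED_REASON, pod_status.get("message", "").strip()
    # Any other failure keeps the generic reason.
    return GENERIC_REASON, GENERIC_MESSAGE


# Pod status copied from the description of this bug.
evicted_pod = {
    "message": "Pod The node was low on resource: [DiskPressure]. ",
    "phase": "Failed",
    "reason": "Evicted",
    "startTime": "2019-03-15T03:53:53Z",
}
print(build_failure_for_pod(evicted_pod))
```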

Comment 6 wewang 2019-04-02 10:42:05 UTC
@Adam Kaplan I cannot get the pod into the Evicted status with the following steps. Could you tell me what I am doing wrong?

steps:
1. Create the app
 $ oc new-app openshift/ruby~https://github.com/sclorg/ruby-ex.git
2. While the build pod on node1 is in the Running status, add a NoExecute taint to node1
 $ oc adm taint nodes node1 dedicated=special-user:NoExecute

3. The build pod then goes directly from Running to Terminating and is finally deleted; it never reaches the Evicted status
$ oc get pods --watch
NAME              READY   STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1     Running   0          25s
ruby-ex-1-build   1/1   Terminating   0     31s

4. Check the logs of the openshift-controller-manager pod: there are no eviction messages
  $ oc logs -f pod/controller-manager-kpsxl -n openshift-controller-manager --loglevel=5 | grep -i evict

---------------------------------------------------------------------
I also tried another method using drain, but the drain failed.
steps:
1. Create the app

2. While the build pod on node1 is in the Running status, drain the node
$ oc adm drain ip-172-xxxxx.internal   --ignore-daemonsets=true 
node/ip-172xx-xxx.internal cordoned
error: unable to drain node "ip-172xxxxxxnternal", aborting command...

There are pending nodes to be drained:
 ip-172-xxxinternal
error: pods with local storage (use --delete-local-data to override): alertmanager-main-1, alertmanager-main-2, ruby-ex-1-build
3. The build pod completes

Comment 7 wewang 2019-04-04 10:07:32 UTC
@Adam Kaplan Do you know how to make a build pod fail with Status.Reason = "Evicted"? I have tried many times and still cannot get that result; the pod just goes from Running to Terminating. Thanks.

Comment 8 Adam Kaplan 2019-04-08 15:10:13 UTC
@Wen the worker node needs to be under memory/disk pressure for the pod to be evicted. My idea to verify:
1. Create a cluster that has a low-memory worker node. Add a unique label to the node.
2. Craft a Docker strategy build that does the following:
  a. Runs a Dockerfile that consumes a lot of RAM (ex: compute the billionth Fibonacci number)
  b. Uses a NodeSelector to run on the low-memory worker node
3. Start the build - hopefully the pod gets evicted, but the build may fail with the OOMKilled reason :/
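
One way to put step 2 together is sketched below: a small Python helper that emits a Build manifest with an inline Dockerfile and a node selector. Everything specific here is an assumption for illustration: the helper name `memory_hungry_build`, the build name, the `eviction-test` label, the base image, and the memory hog itself. The sketch swaps the Fibonacci idea for a simpler trick (`tail` buffering an endless line of zeros), which is expected to grow memory use but has not been verified against a real cluster.

```python
import json


def memory_hungry_build(name: str, label_key: str, label_value: str, base_image: str) -> dict:
    """Return a Build manifest (as a dict) pinned to nodes carrying the given label.

    Hypothetical helper; all argument values are chosen by the caller.
    """
    dockerfile = "\n".join([
        f"FROM {base_image}",
        # The input has no newlines, so tail is expected to buffer all of it in memory.
        "RUN head -c 8G /dev/zero | tail",
    ])
    return {
        "apiVersion": "build.openshift.io/v1",
        "kind": "Build",
        "metadata": {"name": name},
        "spec": {
            # Step 2b: run only on the low-memory worker node.
            "nodeSelector": {label_key: label_value},
            # Step 2a: inline Dockerfile that consumes a lot of RAM.
            "source": {"type": "Dockerfile", "dockerfile": dockerfile},
            "strategy": {"type": "Docker", "dockerStrategy": {"noCache": True}},
        },
    }


# Example values only; none of them come from this bug report.
manifest = memory_hungry_build("evict-me", "eviction-test", "true", "centos:7")
print(json.dumps(manifest, indent=2))
```

The printed JSON could then be piped to `oc create -f -` in the test project. As noted above, the outcome may still be OOMKilled rather than an eviction.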

Comment 9 Hongkai Liu 2019-04-09 21:15:41 UTC
What I did:
OCP 4 cluster with 3 workers, using the default installation config (m4.large worker instances).
During the test, one of the workers (ip-10-0-161-25.us-east-2.compute.internal    Ready,SchedulingDisabled) is `SchedulingDisabled` the whole time.
The ${standby_build_node} is cordoned/uncordoned to make sure the build pod runs only on the ${running_build_node}.

Trigger the test with the following script:
https://github.com/hongkailiu/svt-case-doc/blob/master/scripts/simple_1689061.sh
We can change the size of the file in the `dd` command in the above script, or ssh to the ${running_build_node} and run `dd` directly.

When k8s detected that the ${running_build_node} was running low on disk space, it evicted the build pod.

The message differs from the one in the description:
  message: 'The node was low on resource: ephemeral-storage. Container sti-build was
    using 20Ki, which exceeds its request of 0. '

$ oc get build
NAME           TYPE      FROM          STATUS                     STARTED             DURATION
django-ex-16   Source    Git@0905223   Complete                   2 hours ago         4m34s
django-ex-17   Source    Git@0905223   Complete                   About an hour ago   1m32s
django-ex-18   Source    Git@0905223   Complete                   About an hour ago   11m29s
django-ex-19   Source    Git@0905223   Complete                   About an hour ago   1m47s
django-ex-21   Source    Git@0905223   Complete                   About an hour ago   1m30s
django-ex-22   Source    Git@0905223   Failed (BuildPodEvicted)   45 minutes ago      2m27s
django-ex-24   Source    Git@0905223   Failed (BuildPodEvicted)   27 minutes ago      4m26s

This is the expected result, as described in comment 5.
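
For repeated runs, the check on the `oc get build` listing can be scripted. The snippet below is a rough sketch that assumes the default table output shown in this comment, with columns separated by at least two spaces; the function name `evicted_builds` is made up for the example.

```python
import re
from typing import List


def evicted_builds(oc_get_build_output: str) -> List[str]:
    """Return the names of builds whose STATUS column is 'Failed (BuildPodEvicted)'."""
    names = []
    for line in oc_get_build_output.splitlines()[1:]:  # skip the header row
        # Columns are padded with several spaces; values such as
        # "About an hour ago" only contain single spaces.
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) >= 4 and columns[3] == "Failed (BuildPodEvicted)":
            names.append(columns[0])
    return names


# A few rows copied from the listing in this comment.
sample = """\
NAME           TYPE      FROM          STATUS                     STARTED             DURATION
django-ex-21   Source    Git@0905223   Complete                   About an hour ago   1m30s
django-ex-22   Source    Git@0905223   Failed (BuildPodEvicted)   45 minutes ago      2m27s
django-ex-24   Source    Git@0905223   Failed (BuildPodEvicted)   27 minutes ago      4m26s
"""
print(evicted_builds(sample))  # ['django-ex-22', 'django-ex-24']
```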

The tricky parts are:
* Before the build starts, the disk must have enough free space. Otherwise, the build stays `Pending`.
* We have to write files to the disk after the build is `Running`, so the build has to run long enough to give k8s the chance to detect the low disk space and evict the build pod.
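
The disk-filling step can also be done without `dd`. The following is a minimal sketch of an equivalent in Python, not the contents of the linked script; the helper name `fill_file` and the sizes are assumptions for illustration. On a real node the target path and size would have to be chosen so that free space drops below the eviction threshold only after the build is `Running`.

```python
import os
import tempfile


def fill_file(path: str, total_bytes: int, chunk_bytes: int = 1024 * 1024) -> int:
    """Write total_bytes of zeros to path in chunks and return the number of bytes written."""
    written = 0
    chunk = b"\0" * chunk_bytes
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk before returning
    return written


# Harmless demonstration: write 3 MiB into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "filler.bin")
    size = fill_file(target, 3 * 1024 * 1024)
    print(size, os.path.getsize(target))
```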

Moving this bug to `Verified`.

Comment 11 wewang 2019-04-10 01:44:33 UTC
@Hongkai Liu Thanks a lot for it

Comment 12 Steve Kuznetsov 2019-05-14 15:51:35 UTC
*** Bug 1705128 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2019-06-04 10:45:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758