Bug 1689061

Summary:	Evicted builds don't have a specific status reason, instead are GenericBuildFailure
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Build	Assignee:	Adam Kaplan <adam.kaplan>
Status:	CLOSED ERRATA	QA Contact:	wewang <wewang>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.1.0	CC:	aos-bugs, hongkliu, skuznets, sponnaga, wzheng
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: If a build pod was evicted, the build reported a GenericBuildFailure. Consequence: Cluster administrators could not determine why builds failed if the node was under resource pressure. Fix: A new failure reason, `BuildPodEvicted`, was added. Result: Builds that fail due to pod eviction report `BuildPodEvicted` in their status reason.	Story Points:	---
Clone Of:
Clones:	1690066 (view as bug list)		Environment:
Last Closed:	2019-06-04 10:45:52 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1690066

Description Clayton Coleman 2019-03-15 04:16:59 UTC

Builds should get a failure reason for evicted pods and not be GenericBuildFailure.  On 3.11 api.ci we see evictions frequently and they are hard to debug. Builds should report a reason for eviction.

Request backport to origin 3.11 so we can get this on api.ci.

---

apiVersion: build.openshift.io/v1
kind: Build
metadata:
  annotations:
    ci.openshift.io/job-spec: '{"type":"postsubmit","job":"branch-ci-openshift-release-controller-master-images","buildid":"53","prowjobid":"bc04d182-46d5-11e9-b760-0a58ac10b13f","refs":{"org":"openshift","repo":"release-controller","base_ref":"master","base_sha":"253573b4cccb254de4bdd621499bdc30c2769c29","base_link":"https://github.com/openshift/release-controller/compare/4375b4ac2e8d...253573b4cccb"}}'
    openshift.io/build.pod-name: release-controller-build
  creationTimestamp: 2019-03-15T03:53:29Z
  labels:
    build-id: "53"
    created-by-ci: "true"
    creates: release-controller
    job: branch-ci-openshift-release-controller-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: bc04d182-46d5-11e9-b760-0a58ac10b13f
  name: release-controller
  namespace: ci-op-7g4vf063
  ownerReferences:
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: d435bb30-46d5-11e9-9b95-42010a8e0003
  resourceVersion: "90888834"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-7g4vf063/builds/release-controller
  uid: e5043c52-46d5-11e9-9b95-42010a8e0003
spec:
  nodeSelector: null
  output:
    imageLabels:
    - name: vcs-type
      value: git
    - name: vcs-url
      value: https://github.com/openshift/release-controller
    - name: io.openshift.build.name
    - name: io.openshift.build.namespace
    - name: io.openshift.build.commit.ref
      value: master
    - name: io.openshift.build.source-location
      value: https://github.com/openshift/release-controller
    - name: vcs-ref
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.id
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.message
    - name: io.openshift.build.commit.author
    - name: io.openshift.build.commit.date
    - name: io.openshift.build.source-context-dir
    pushSecret:
      name: builder-dockercfg-k4g2k
    to:
      kind: ImageStreamTag
      name: pipeline:release-controller
      namespace: ci-op-7g4vf063
  postCommit: {}
  resources:
    limits:
      memory: 6Gi
    requests:
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
  source:
    images:
    - as:
      - "0"
      from:
        kind: ImageStreamTag
        name: pipeline:root
      paths: null
    - as: null
      from:
        kind: ImageStreamTag
        name: pipeline:src
      paths:
      - destinationDir: .
        sourcePath: /go/src/github.com/openshift/release-controller///.
    type: Image
  strategy:
    dockerStrategy:
      forcePull: true
      from:
        kind: ImageStreamTag
        name: pipeline:os
        namespace: ci-op-7g4vf063
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: 2019-03-15T03:53:53Z
  message: Generic Build failure - check logs for details.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-7g4vf063/pipeline:release-controller
  phase: Failed
  reason: GenericBuildFailed
  startTimestamp: 2019-03-15T03:53:53Z


---

status:
  message: 'Pod The node was low on resource: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: 2019-03-15T03:53:53Z

Comment 1 Clayton Coleman 2019-03-15 04:17:58 UTC

Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.

Comment 2 Adam Kaplan 2019-03-18 17:27:59 UTC

API PR: https://github.com/openshift/api/pull/255
Origin PR: https://github.com/openshift/origin/pull/22344

Comment 4 wewang 2019-04-01 09:22:00 UTC

@Adam Kaplan, I ran the command: oc adm drain workname --ignore-daemonsets=true to try to get the build pod into the evicted status, but I did not see any eviction info when I checked the controller-manager pod logs in the openshift-controller-manager project.

Comment 5 Adam Kaplan 2019-04-01 12:57:32 UTC

@wewang if the build pod was indeed evicted, the build object should report its status as Failed with the BuildPodEvicted reason code.
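
The expected behavior can be illustrated with a minimal Python sketch. It is not the build controller's actual code (the real change is in the PRs linked in comment 2); the function name `build_failure_reason` and the dict-based pod status are assumptions made for illustration. Only the reason strings and the sample pod status come from this report.

```python
from typing import Optional

# Reason strings as they appear in this report. The mapping logic below is an
# illustrative assumption, not the controller's real implementation.
POD_REASON_EVICTED = "Evicted"
BUILD_REASON_EVICTED = "BuildPodEvicted"
BUILD_REASON_GENERIC = "GenericBuildFailed"


def build_failure_reason(pod_status: dict) -> Optional[str]:
    """Return a build status reason for a failed pod, or None if the pod has not failed."""
    if pod_status.get("phase") != "Failed":
        return None
    # An evicted pod may carry only phase, reason, message and startTime
    # (see comment 1), so nothing here relies on containerStatuses.
    if pod_status.get("reason") == POD_REASON_EVICTED:
        return BUILD_REASON_EVICTED
    return BUILD_REASON_GENERIC


# Pod status copied from the description of this bug.
evicted_pod = {
    "phase": "Failed",
    "reason": "Evicted",
    "message": "Pod The node was low on resource: [DiskPressure]. ",
    "startTime": "2019-03-15T03:53:53Z",
}
print(build_failure_reason(evicted_pod))  # BuildPodEvicted
```

The sketch checks the pod-level reason before falling back to the generic failure, which is the distinction this bug asks for.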

Comment 6 wewang 2019-04-02 10:42:05 UTC

@Adam Kaplan I cannot get the pod into the evicted status with the following steps. Could you help correct me?

steps:
1. Create the app:
 $ oc new-app openshift/ruby~https://github.com/sclorg/ruby-ex.git
2. While the build pod on node1 is in Running status, add a NoExecute taint to node1:
 $ oc adm taint nodes node1 dedicated=special-user:NoExecute

3. The build pod then goes directly from Running to Terminating status and is finally deleted; it never shows an evicted status:
$ oc get pods --watch
NAME              READY   STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1     Running   0          25s
ruby-ex-1-build   1/1   Terminating   0     31s

4. Check the logs of the pod in openshift-controller-manager; there is no eviction info:
  $ oc logs -f pod/controller-manager-kpsxl -n openshift-controller-manager --loglevel=5 | grep -i evict

---------------------------------------------------------------------
I also tried another method using drain, but the drain failed.
Steps:
1. Create the app.

2. While the build pod on node1 is in Running status, drain the node:
$ oc adm drain ip-172-xxxxx.internal   --ignore-daemonsets=true 
node/ip-172xx-xxx.internal cordoned
error: unable to drain node "ip-172xxxxxxnternal", aborting command...

There are pending nodes to be drained:
 ip-172-xxxinternal
error: pods with local storage (use --delete-local-data to override): alertmanager-main-1, alertmanager-main-2, ruby-ex-1-build
3. The build pod completes.

Comment 7 wewang 2019-04-04 10:07:32 UTC

@Adam Kaplan Do you know how to make the build pod fail with Status.Reason = "Evicted"? I have tried many times but still cannot get that result; the pod just goes from Running to Terminating. Thanks.

Comment 8 Adam Kaplan 2019-04-08 15:10:13 UTC

@Wen the worker node needs to be under memory/disk pressure for the pod to be evicted. My idea to verify:
1. Create a cluster that has a low-memory worker node. Add a unique label to the node.
2. Craft a Docker strategy build that does the following:
  a. Runs a Dockerfile that consumes a lot of RAM (e.g. compute the billionth Fibonacci number)
  b. Uses a NodeSelector to run on the low-memory worker node
3. Start the build - hopefully the pod gets evicted, but the build may fail with the OOMKilled reason :/

Comment 9 Hongkai Liu 2019-04-09 21:15:41 UTC

What I did:
OCP 4 cluster with 3 workers. Default installation config: m4.large worker instances.
In the test, one of the workers (ip-10-0-161-25.us-east-2.compute.internal Ready,SchedulingDisabled) is `SchedulingDisabled` all the time.
The ${standby_build_node} is cordoned/uncordoned to make sure the build pod runs only on the ${running_build_node}.

Trigger the test with the following script:
https://github.com/hongkailiu/svt-case-doc/blob/master/scripts/simple_1689061.sh
We can modify the size of the file in the dd command (in the above script), or ssh to the ${running_build_node} and run `dd` directly.

When k8s detected that the ${running_build_node} was running low on disk, it evicted the build pod.

The message is different from the one in the description:
message: 'The node was low on resource: ephemeral-storage. Container sti-build was
using 20Ki, which exceeds its request of 0. '
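
Since the two eviction messages seen in this bug differ in shape, anything that wants the resource name has to tolerate both. The following is a minimal Python sketch under that assumption; the helper name `evicted_resource` and the regular expression are hypothetical, and only these two message formats have been considered.

```python
import re
from typing import Optional

# Matches "low on resource: [DiskPressure]" as well as
# "low on resource: ephemeral-storage"; the brackets are optional.
_LOW_RESOURCE = re.compile(r"low on resource:\s*\[?([A-Za-z-]+)\]?")


def evicted_resource(message: str) -> Optional[str]:
    """Extract the resource named in an eviction message, or None if absent."""
    match = _LOW_RESOURCE.search(message)
    return match.group(1) if match else None


# Message from the description (3.11 api.ci).
print(evicted_resource("Pod The node was low on resource: [DiskPressure]. "))
# Message observed in this test (OCP 4).
print(evicted_resource(
    "The node was low on resource: ephemeral-storage. Container sti-build was "
    "using 20Ki, which exceeds its request of 0. "
))
```

Other kubelet versions may word the message differently, so the function returns None rather than guessing.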

$ oc get build
NAME TYPE FROM STATUS STARTED DURATION
django-ex-16 Source Git@0905223 Complete 2 hours ago 4m34s
django-ex-17 Source Git@0905223 Complete About an hour ago 1m32s
django-ex-18 Source Git@0905223 Complete About an hour ago 11m29s
django-ex-19 Source Git@0905223 Complete About an hour ago 1m47s
django-ex-21 Source Git@0905223 Complete About an hour ago 1m30s
django-ex-22 Source Git@0905223 Failed (BuildPodEvicted) 45 minutes ago 2m27s
django-ex-24 Source Git@0905223 Failed (BuildPodEvicted) 27 minutes ago 4m26s

This is the expected result, as described in comment 5.
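
For repeated verification runs, the listing above can be checked mechanically. The Python sketch below takes the text output of `oc get build` as a string and returns the builds that failed with a given reason. The function name `failed_builds` is hypothetical, and the parsing assumes the `Failed (Reason)` status format shown above.

```python
import re
from typing import List

# Captures the build name (first column) and the reason inside "Failed (...)".
_FAILED_WITH_REASON = re.compile(r"^(\S+)\s.*\bFailed \((\w+)\)")


def failed_builds(listing: str, reason: str) -> List[str]:
    """Return names of builds whose status is 'Failed (<reason>)'."""
    names = []
    for line in listing.splitlines():
        match = _FAILED_WITH_REASON.match(line)
        if match and match.group(2) == reason:
            names.append(match.group(1))
    return names


# Excerpt of the `oc get build` output from this comment.
listing = """NAME TYPE FROM STATUS STARTED DURATION
django-ex-21 Source Git@0905223 Complete About an hour ago 1m30s
django-ex-22 Source Git@0905223 Failed (BuildPodEvicted) 45 minutes ago 2m27s
django-ex-24 Source Git@0905223 Failed (BuildPodEvicted) 27 minutes ago 4m26s
"""
print(failed_builds(listing, "BuildPodEvicted"))  # ['django-ex-22', 'django-ex-24']
```

A regular expression is used instead of splitting on whitespace because the STARTED column contains spaces ("About an hour ago").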

The tricky parts are:
* Before starting the build, the disk has to have enough space. Otherwise, the build stays `Pending`.
* We have to write files to the disk after the build is `Running`, so the build has to run long enough to give k8s the chance to detect low disk and then evict the build pod.

Moving this bug to `Verified`.

Comment 11 wewang 2019-04-10 01:44:33 UTC

@Hongkai Liu Thanks a lot for this.

Comment 12 Steve Kuznetsov 2019-05-14 15:51:35 UTC

*** Bug 1705128 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2019-06-04 10:45:52 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758