Bug 1690066

Summary: [3.11] Evicted builds don't have a specific status reason, instead are GenericBuildFailure
Product: OpenShift Container Platform
Reporter: Adam Kaplan <adam.kaplan>
Component: Build
Assignee: Adam Kaplan <adam.kaplan>
Status: CLOSED ERRATA
QA Contact: Hongkai Liu <hongkliu>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, ccoleman, hongkliu, sponnaga, wewang, wzheng
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a build pod was evicted, the build reported a GenericBuildFailure.
Consequence: Cluster administrators could not determine why builds failed when the node was under resource pressure.
Fix: A new failure reason, `BuildPodEvicted`, was added.
Result: Builds that fail due to pod eviction report `BuildPodEvicted` in their status reason.
Story Points: ---
Clone Of: 1689061
Environment:
Last Closed: 2019-06-26 09:07:55 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1689061    
Bug Blocks:    

Description Adam Kaplan 2019-03-18 17:30:10 UTC
+++ This bug was initially created as a clone of Bug #1689061 +++

Builds whose pods are evicted should report a specific failure reason instead of GenericBuildFailure. On 3.11 api.ci we see evictions frequently, and they are hard to debug.

Request backport to origin 3.11 so we can get this on api.ci.

---

apiVersion: build.openshift.io/v1
kind: Build
metadata:
  annotations:
    ci.openshift.io/job-spec: '{"type":"postsubmit","job":"branch-ci-openshift-release-controller-master-images","buildid":"53","prowjobid":"bc04d182-46d5-11e9-b760-0a58ac10b13f","refs":{"org":"openshift","repo":"release-controller","base_ref":"master","base_sha":"253573b4cccb254de4bdd621499bdc30c2769c29","base_link":"https://github.com/openshift/release-controller/compare/4375b4ac2e8d...253573b4cccb"}}'
    openshift.io/build.pod-name: release-controller-build
  creationTimestamp: 2019-03-15T03:53:29Z
  labels:
    build-id: "53"
    created-by-ci: "true"
    creates: release-controller
    job: branch-ci-openshift-release-controller-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: bc04d182-46d5-11e9-b760-0a58ac10b13f
  name: release-controller
  namespace: ci-op-7g4vf063
  ownerReferences:
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: d435bb30-46d5-11e9-9b95-42010a8e0003
  resourceVersion: "90888834"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-7g4vf063/builds/release-controller
  uid: e5043c52-46d5-11e9-9b95-42010a8e0003
spec:
  nodeSelector: null
  output:
    imageLabels:
    - name: vcs-type
      value: git
    - name: vcs-url
      value: https://github.com/openshift/release-controller
    - name: io.openshift.build.name
    - name: io.openshift.build.namespace
    - name: io.openshift.build.commit.ref
      value: master
    - name: io.openshift.build.source-location
      value: https://github.com/openshift/release-controller
    - name: vcs-ref
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.id
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.message
    - name: io.openshift.build.commit.author
    - name: io.openshift.build.commit.date
    - name: io.openshift.build.source-context-dir
    pushSecret:
      name: builder-dockercfg-k4g2k
    to:
      kind: ImageStreamTag
      name: pipeline:release-controller
      namespace: ci-op-7g4vf063
  postCommit: {}
  resources:
    limits:
      memory: 6Gi
    requests:
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
  source:
    images:
    - as:
      - "0"
      from:
        kind: ImageStreamTag
        name: pipeline:root
      paths: null
    - as: null
      from:
        kind: ImageStreamTag
        name: pipeline:src
      paths:
      - destinationDir: .
        sourcePath: /go/src/github.com/openshift/release-controller///.
    type: Image
  strategy:
    dockerStrategy:
      forcePull: true
      from:
        kind: ImageStreamTag
        name: pipeline:os
        namespace: ci-op-7g4vf063
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
status:
  completionTimestamp: 2019-03-15T03:53:53Z
  message: Generic Build failure - check logs for details.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-7g4vf063/pipeline:release-controller
  phase: Failed
  reason: GenericBuildFailed
  startTimestamp: 2019-03-15T03:53:53Z


---

status:
  message: 'Pod The node was low on resource: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: 2019-03-15T03:53:53Z
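
The two status blocks above suggest the mapping this bug asks for. The following is a minimal Python sketch of that mapping, not the build controller's actual code. The function name `build_failure_reason` and the dict shape are assumptions made for illustration; only the strings `Evicted`, `GenericBuildFailed`, and `BuildPodEvicted` come from this report.

```python
def build_failure_reason(pod_status: dict) -> str:
    """Pick a build status reason from a failed build pod's status (illustrative only)."""
    # An evicted pod reports phase "Failed" with reason "Evicted", as in the
    # pod status quoted above.
    if pod_status.get("phase") == "Failed" and pod_status.get("reason") == "Evicted":
        return "BuildPodEvicted"
    # Anything else falls back to the generic reason.
    return "GenericBuildFailed"


evicted = {
    "phase": "Failed",
    "reason": "Evicted",
    "message": "Pod The node was low on resource: [DiskPressure]. ",
}
print(build_failure_reason(evicted))  # prints "BuildPodEvicted"
```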

--- Additional comment from Clayton Coleman on 2019-03-15 04:17:58 UTC ---

Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.

Comment 1 Adam Kaplan 2019-03-18 18:07:23 UTC
API PR: https://github.com/openshift/api/pull/256

Comment 2 Adam Kaplan 2019-04-02 15:00:38 UTC
Origin PR: https://github.com/openshift/origin/pull/22346

Comment 5 Hongkai Liu 2019-04-11 12:40:10 UTC
Let me give it a shot tomorrow.

Comment 6 Hongkai Liu 2019-04-12 13:28:41 UTC
$ git tag  --contains 29cde93
[origin]$ git log --oneline 29cde93..HEAD
9b1e77773a (HEAD -> release-3.11, origin/release-3.11) Merge pull request #22443 from danwinship/sync-inuse-vnids-on-restart-3.11
c137ed0d25 Merge pull request #22397 from jcantrill/1676720
6f59b4eb4c Fix reinitialization of NetworkPolicy state on restart
a2aa67a169 Initialize NetworkPolicy which-namespaces-are-in-use properly on restart
a8f6aec707 Clean up NetworkPolicies on NetNamespace deletion
03b5b9e76a bug 1676720. Check clusterlogging curator for cronjob instead of DC

No 3.11 puddle contains the fix yet.

Comment 7 Hongkai Liu 2019-04-12 13:41:12 UTC
Sorry, my bad. Checking the ose repo now.

Comment 8 Hongkai Liu 2019-04-12 13:42:10 UTC
[hongkliu@MiWiFi-R1CM-srv ose]$ git tag  --contains 29cde93
v3.11.104-1
v3.11.105-1

Comment 9 Hongkai Liu 2019-04-12 16:02:41 UTC
Still saw `GenericBuildFailed`

Every 6.0s: oc get build -n testproject    Fri Apr 12 16:01:47 2019

NAME           TYPE      FROM          STATUS                        STARTED             DURATION
django-ex-7    Source    Git@0905223   Complete                      About an hour ago   1m16s
django-ex-8    Source    Git@0905223   Complete                      About an hour ago   1m12s
django-ex-9    Source    Git@0905223   Complete                      44 minutes ago      1m38s
django-ex-10   Source    Git@0905223   Failed (GenericBuildFailed)   41 minutes ago      2m6s
django-ex-12   Source    Git@0905223   Complete                      31 minutes ago      1m16s
django-ex-14   Source    Git           Failed (GenericBuildFailed)   22 minutes ago      40s
django-ex-15   Source    Git@0905223   Complete                      19 minutes ago      1m1s
django-ex-16   Source    Git@0905223   Failed (GenericBuildFailed)   18 minutes ago      53s

Comment 11 Hongkai Liu 2019-04-12 16:12:02 UTC
Only django-ex-10 and django-ex-16 are relevant to disk pressure.
django-ex-14 is something else.

Comment 12 Clayton Coleman 2019-04-15 14:23:56 UTC
Not all evictions are reported to the pod (which is what the build controller uses). When reproducing eviction-related issues, always include the pod YAML of the build pod.
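
To make the pod status easy to attach to a report, one option is to condense the output of `oc get pod <build-pod> -o json` to the fields that matter here. This is a rough Python sketch under that assumption; the helper name `summarize_pod_status` and the sample document are hypothetical, and the sample only mirrors the pod status quoted in the description.

```python
import json


def summarize_pod_status(pod_json: str) -> dict:
    """Extract the pod status fields relevant to eviction debugging."""
    status = json.loads(pod_json).get("status", {})
    return {
        "phase": status.get("phase"),
        "reason": status.get("reason"),
        "message": status.get("message"),
        # The evicted pod quoted in the description had no container statuses.
        "has_container_statuses": bool(status.get("containerStatuses")),
    }


# Hypothetical sample shaped like the evicted pod in the description.
sample = json.dumps({
    "kind": "Pod",
    "status": {
        "phase": "Failed",
        "reason": "Evicted",
        "message": "Pod The node was low on resource: [DiskPressure]. ",
    },
})
print(summarize_pod_status(sample))
```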

Comment 13 Hongkai Liu 2019-04-15 18:37:31 UTC
Sorry, I did not know the pod YAML was required.

A. If it is for the pod definition, then the build is triggered by the BuildConfig (bc) created by `oc new-app centos/python-35-centos7~https://github.com/sclorg/django-ex`.
B. If it is for the pod status, then I have to redo the test.

@Clayton, let me know if it is Case B above. Thanks.

Comment 15 Adam Kaplan 2019-04-23 18:14:43 UTC
@Hongkai we need case B - fetch the status of the pod. Can you please re-run the test and report your findings?

Comment 16 Hongkai Liu 2019-04-23 19:49:56 UTC
Sure. I will rerun it tomorrow.

Comment 17 Hongkai Liu 2019-04-24 14:58:58 UTC
Using the latest version for the moment:

# yum list installed | grep openshift
atomic-openshift.x86_64         3.11.109-1.git.0.8f0b752.el7

Every 3.0s: oc get build -n testproject    Wed Apr 24 14:47:36 2019

NAME           TYPE      FROM          STATUS                     STARTED          DURATION
django-ex-5    Source    Git@0905223   Complete                   44 minutes ago   1m1s
django-ex-6    Source    Git@0905223   Complete                   43 minutes ago   1m3s
django-ex-7    Source    Git@0905223   Complete                   37 minutes ago   4m39s
django-ex-8    Source    Git@0905223   Complete                   32 minutes ago   1m22s
django-ex-9    Source    Git@0905223   Failed                     24 minutes ago   2m10s
django-ex-10   Source    Git@0905223   Complete                   17 minutes ago   1m0s
django-ex-11   Source    Git           Failed (BuildPodEvicted)   14 minutes ago   21s

http://file.rdu.redhat.com/~hongkliu/test_result/bz1690066/20190424/Screenshot%20from%202019-04-24%2010-44-00.png

This is different from the result in Comment 9. I think it is what we expect for the fix.
Only one nitpick:

Status of builds:
django-ex-9: Failed
django-ex-11: Failed (BuildPodEvicted)

For a moment, I saw `(BuildPodEvicted)` for ex-9, but it vanished shortly afterwards.
For ex-11, `(BuildPodEvicted)` is stable.
From what I did, both of them failed due to the same issue: low disk space.

pod status files: http://file.rdu.redhat.com/~hongkliu/test_result/bz1690066/20190424/

I think the important thing for this bug is that we should no longer see `GenericBuildFailed` as the build status, which IMO has been achieved.
Please reopen if I missed the point.
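
The check described above (no build should still show `GenericBuildFailed`) could be scripted against the `oc get build` table. A small Python sketch, assuming the column layout shown in this comment and that columns are separated by at least two spaces; `builds_with_reason` is a hypothetical helper, not part of `oc`.

```python
import re


def builds_with_reason(table: str, reason: str) -> list:
    """Return names of builds whose STATUS column mentions the given reason."""
    names = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces.
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) >= 4 and f"({reason})" in columns[3]:
            names.append(columns[0])
    return names


# Rows copied from the table in this comment.
sample = """\
NAME           TYPE      FROM          STATUS                     STARTED          DURATION
django-ex-9    Source    Git@0905223   Failed                     24 minutes ago   2m10s
django-ex-10   Source    Git@0905223   Complete                   17 minutes ago   1m0s
django-ex-11   Source    Git           Failed (BuildPodEvicted)   14 minutes ago   21s
"""
print(builds_with_reason(sample, "GenericBuildFailed"))  # prints []
print(builds_with_reason(sample, "BuildPodEvicted"))     # prints ['django-ex-11']
```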

Comment 18 Hongkai Liu 2019-04-24 15:13:33 UTC
http://file.rdu.redhat.com/~hongkliu/test_result/bz1690066/20190424/Screenshot%20from%202019-04-24%2011-10-44.png
Tested more; the unstable `(BuildPodEvicted)` status seen on ex-9 did not show up again.

Comment 20 errata-xmlrpc 2019-06-26 09:07:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1605