Bug 1690066 - [3.11] Evicted builds don't have a specific status reason, instead are GenericBuildFailure
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Adam Kaplan
QA Contact: Hongkai Liu
Depends On: 1689061
Reported: 2019-03-18 17:30 UTC by Adam Kaplan
Modified: 2019-06-26 09:08 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: If a build pod was evicted, the build reported a GenericBuildFailure.
Consequence: Cluster administrators could not determine why builds failed if the node was under resource pressure.
Fix: A new failure reason, `BuildPodEvicted`, was added.
Result: Builds that fail due to pod eviction report `BuildPodEvicted` in their status reason.
Clone Of: 1689061
Last Closed: 2019-06-26 09:07:55 UTC
Target Upstream Version:

System                  ID              Private  Priority  Status  Summary  Last Updated
Red Hat Product Errata  RHBA-2019:1605  0        None      None    None     2019-06-26 09:08:04 UTC

Description Adam Kaplan 2019-03-18 17:30:10 UTC
+++ This bug was initially created as a clone of Bug #1689061 +++

Builds should get a failure reason for evicted pods and not be GenericBuildFailure.  On 3.11 api.ci we see evictions frequently and they are hard to debug. Builds should report a reason for eviction.

Request backport to origin 3.11 so we can get this on api.ci.


apiVersion: build.openshift.io/v1
kind: Build
    ci.openshift.io/job-spec: '{"type":"postsubmit","job":"branch-ci-openshift-release-controller-master-images","buildid":"53","prowjobid":"bc04d182-46d5-11e9-b760-0a58ac10b13f","refs":{"org":"openshift","repo":"release-controller","base_ref":"master","base_sha":"253573b4cccb254de4bdd621499bdc30c2769c29","base_link":"https://github.com/openshift/release-controller/compare/4375b4ac2e8d...253573b4cccb"}}'
    openshift.io/build.pod-name: release-controller-build
  creationTimestamp: 2019-03-15T03:53:29Z
    build-id: "53"
    created-by-ci: "true"
    creates: release-controller
    job: branch-ci-openshift-release-controller-master-images
    persists-between-builds: "false"
    prow.k8s.io/id: bc04d182-46d5-11e9-b760-0a58ac10b13f
  name: release-controller
  namespace: ci-op-7g4vf063
  - apiVersion: image.openshift.io/v1
    controller: true
    kind: ImageStream
    name: pipeline
    uid: d435bb30-46d5-11e9-9b95-42010a8e0003
  resourceVersion: "90888834"
  selfLink: /apis/build.openshift.io/v1/namespaces/ci-op-7g4vf063/builds/release-controller
  uid: e5043c52-46d5-11e9-9b95-42010a8e0003
  nodeSelector: null
    - name: vcs-type
      value: git
    - name: vcs-url
      value: https://github.com/openshift/release-controller
    - name: io.openshift.build.name
    - name: io.openshift.build.namespace
    - name: io.openshift.build.commit.ref
      value: master
    - name: io.openshift.build.source-location
      value: https://github.com/openshift/release-controller
    - name: vcs-ref
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.id
      value: 253573b4cccb254de4bdd621499bdc30c2769c29
    - name: io.openshift.build.commit.message
    - name: io.openshift.build.commit.author
    - name: io.openshift.build.commit.date
    - name: io.openshift.build.source-context-dir
      name: builder-dockercfg-k4g2k
      kind: ImageStreamTag
      name: pipeline:release-controller
      namespace: ci-op-7g4vf063
  postCommit: {}
      memory: 6Gi
      cpu: 100m
      memory: 200Mi
  serviceAccount: builder
    - as:
      - "0"
        kind: ImageStreamTag
        name: pipeline:root
      paths: null
    - as: null
        kind: ImageStreamTag
        name: pipeline:src
      - destinationDir: .
        sourcePath: /go/src/github.com/openshift/release-controller///.
    type: Image
      forcePull: true
        kind: ImageStreamTag
        name: pipeline:os
        namespace: ci-op-7g4vf063
      imageOptimizationPolicy: SkipLayers
      noCache: true
    type: Docker
  triggeredBy: null
  completionTimestamp: 2019-03-15T03:53:53Z
  message: Generic Build failure - check logs for details.
  output: {}
  outputDockerImageReference: docker-registry.default.svc:5000/ci-op-7g4vf063/pipeline:release-controller
  phase: Failed
  reason: GenericBuildFailed
  startTimestamp: 2019-03-15T03:53:53Z


  message: 'Pod The node was low on resource: [DiskPressure]. '
  phase: Failed
  reason: Evicted
  startTime: 2019-03-15T03:53:53Z
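
The pod status above already carries enough information to pick a more specific build reason. As a rough sketch only (this is not the actual build controller code; the function name `build_failure_reason` and the dict-based input are assumptions made for illustration), the mapping being requested could look like this:

```python
# Hypothetical helper for illustration; not the OpenShift build controller.
GENERIC_BUILD_FAILED = "GenericBuildFailed"
BUILD_POD_EVICTED = "BuildPodEvicted"


def build_failure_reason(pod_status):
    """Pick a build status reason from a failed build pod's status dict."""
    # An evicted pod reports phase "Failed" together with reason "Evicted".
    if pod_status.get("phase") == "Failed" and pod_status.get("reason") == "Evicted":
        return BUILD_POD_EVICTED
    # Anything else falls back to the generic reason.
    return GENERIC_BUILD_FAILED


# The pod status quoted above, written as a plain dict.
evicted_pod_status = {
    "message": "Pod The node was low on resource: [DiskPressure]. ",
    "phase": "Failed",
    "reason": "Evicted",
    "startTime": "2019-03-15T03:53:53Z",
}
print(build_failure_reason(evicted_pod_status))
```

The sketch keys on `phase` and `reason` only, since those are the fields the evicted pod actually reports.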

--- Additional comment from Clayton Coleman on 2019-03-15 04:17:58 UTC ---

Also note that's *ALL* the status the pod has, so that may be causing other failures in the build controller.
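
To illustrate the concern: a status with only `message`, `phase`, `reason` and `startTime` has no `containerStatuses` at all, so any code that reads container state has to tolerate the field being absent. A minimal sketch, assuming the pod status is available as a dict (the helper name `terminated_exit_codes` is made up for this example and does not come from the controller):

```python
# Hypothetical helper for illustration only.
def terminated_exit_codes(pod_status):
    """Return exit codes of terminated containers, tolerating absent fields."""
    codes = []
    # containerStatuses can be missing entirely, as in the evicted pod status.
    for container in pod_status.get("containerStatuses") or []:
        terminated = (container.get("state") or {}).get("terminated")
        if terminated is not None:
            codes.append(terminated.get("exitCode"))
    return codes


# Evicted pod: nothing but the four top-level status fields.
minimal_status = {"phase": "Failed", "reason": "Evicted"}
print(terminated_exit_codes(minimal_status))

# A pod whose container actually ran and exited non-zero.
ran_status = {
    "phase": "Failed",
    "containerStatuses": [{"state": {"terminated": {"exitCode": 1}}}],
}
print(terminated_exit_codes(ran_status))
```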

Comment 1 Adam Kaplan 2019-03-18 18:07:23 UTC
API PR: https://github.com/openshift/api/pull/256

Comment 2 Adam Kaplan 2019-04-02 15:00:38 UTC
Origin PR: https://github.com/openshift/origin/pull/22346

Comment 5 Hongkai Liu 2019-04-11 12:40:10 UTC
Let me give it a shot tomorrow.

Comment 6 Hongkai Liu 2019-04-12 13:28:41 UTC
$ git tag  --contains 29cde93
[origin]$ git log --oneline 29cde93..HEAD
9b1e77773a (HEAD -> release-3.11, origin/release-3.11) Merge pull request #22443 from danwinship/sync-inuse-vnids-on-restart-3.11
c137ed0d25 Merge pull request #22397 from jcantrill/1676720
6f59b4eb4c Fix reinitialization of NetworkPolicy state on restart
a2aa67a169 Initialize NetworkPolicy which-namespaces-are-in-use properly on restart
a8f6aec707 Clean up NetworkPolicies on NetNamespace deletion
03b5b9e76a bug 1676720. Check clusterlogging curator for cronjob instead of DC

No 3.11 puddle contains the fix yet.

Comment 7 Hongkai Liu 2019-04-12 13:41:12 UTC
Sorry my bad ... checking ose repo now

Comment 8 Hongkai Liu 2019-04-12 13:42:10 UTC
[hongkliu@MiWiFi-R1CM-srv ose]$ git tag  --contains 29cde93

Comment 9 Hongkai Liu 2019-04-12 16:02:41 UTC
Still saw `GenericBuildFailed`

Every 6.0s: oc get build -n testproject                                                                                                                                             Fri Apr 12 16:01:47 2019

NAME           TYPE      FROM          STATUS                        STARTED             DURATION
django-ex-7    Source    Git@0905223   Complete                      About an hour ago   1m16s
django-ex-8    Source    Git@0905223   Complete                      About an hour ago   1m12s
django-ex-9    Source    Git@0905223   Complete                      44 minutes ago      1m38s
django-ex-10   Source    Git@0905223   Failed (GenericBuildFailed)   41 minutes ago      2m6s
django-ex-12   Source    Git@0905223   Complete                      31 minutes ago      1m16s
django-ex-14   Source    Git           Failed (GenericBuildFailed)   22 minutes ago      40s
django-ex-15   Source    Git@0905223   Complete                      19 minutes ago      1m1s
django-ex-16   Source    Git@0905223   Failed (GenericBuildFailed)   18 minutes ago      53s

Comment 11 Hongkai Liu 2019-04-12 16:12:02 UTC
Only django-ex-10 and django-ex-16 are relevant to disk pressure.
django-ex-14 is something else.
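
For repeated runs, the failed builds and their reasons can be pulled out of `oc get build` output mechanically. The following is only a sketch: it assumes the column layout shown in Comment 9 (NAME, TYPE, FROM as single tokens, then STATUS), and the function name `failed_builds` is made up for this example.

```python
import re

# Matches rows whose STATUS column starts with "Failed", with an optional
# "(Reason)" suffix. Assumes NAME, TYPE and FROM contain no spaces.
FAILED_ROW = re.compile(r"^(\S+)\s+\S+\s+\S+\s+Failed(?:\s\((\w+)\))?\s")


def failed_builds(table_text):
    """Map build name to status reason (or None) for every failed build."""
    results = {}
    for line in table_text.splitlines():
        match = FAILED_ROW.match(line)
        if match:
            results[match.group(1)] = match.group(2)
    return results


# A few rows copied from the output in Comment 9.
sample = """\
NAME           TYPE      FROM          STATUS                        STARTED             DURATION
django-ex-9    Source    Git@0905223   Complete                      44 minutes ago      1m38s
django-ex-10   Source    Git@0905223   Failed (GenericBuildFailed)   41 minutes ago      2m6s
django-ex-14   Source    Git           Failed (GenericBuildFailed)   22 minutes ago      40s
django-ex-16   Source    Git@0905223   Failed (GenericBuildFailed)   18 minutes ago      53s
"""
for name, reason in failed_builds(sample).items():
    print(name, reason)
```

A build that shows plain `Failed` with no reason in parentheses is returned with a reason of `None`.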

Comment 12 Clayton Coleman 2019-04-15 14:23:56 UTC
Not all evictions are reported to the pod (which is what the build controller uses). When reproducing eviction-related issues, always include the pod yaml of the build pod.

Comment 13 Hongkai Liu 2019-04-15 18:37:31 UTC
Sorry ... did not know the requirement of pod yaml.

A. If it is for the pod definition, then the build is triggered by the bc created by `oc new-app centos/python-35-centos7~https://github.com/sclorg/django-ex`.
B. If it is for the pod status, then I have to redo the test.

@Clayton, Let me know if it is Case B above. Thanks.

Comment 15 Adam Kaplan 2019-04-23 18:14:43 UTC
@Hongkai we need case B - fetch the status of the pod. Can you please re-run the test and report your findings?
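
One way to collect this, sketched under assumptions: the build object in the description carries the build pod name in the `openshift.io/build.pod-name` annotation, so the pod can be looked up from the build and its status extracted from `oc get pod ... -o json`. The helper names below are made up for this example, and the sketch assumes the annotation sits under `metadata.annotations` of the build.

```python
import json

POD_NAME_ANNOTATION = "openshift.io/build.pod-name"


def build_pod_name(build):
    """Read the build pod name from the build's annotations."""
    return build["metadata"]["annotations"][POD_NAME_ANNOTATION]


def pod_status_command(build):
    """Assemble the oc invocation that dumps the build pod as JSON."""
    namespace = build["metadata"]["namespace"]
    return ["oc", "get", "pod", build_pod_name(build), "-n", namespace, "-o", "json"]


def pod_status_from_json(pod_json):
    """Extract the status section from the JSON printed by the command."""
    return json.loads(pod_json).get("status", {})


# Values taken from the build quoted in the description.
example_build = {
    "metadata": {
        "name": "release-controller",
        "namespace": "ci-op-7g4vf063",
        "annotations": {POD_NAME_ANNOTATION: "release-controller-build"},
    }
}
print(" ".join(pod_status_command(example_build)))
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` and the resulting stdout handed to `pod_status_from_json`.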

Comment 16 Hongkai Liu 2019-04-23 19:49:56 UTC
Sure. I will rerun it tomorrow.

Comment 17 Hongkai Liu 2019-04-24 14:58:58 UTC
Use the latest for the moment:

# yum list installed | grep openshift
atomic-openshift.x86_64         3.11.109-1.git.0.8f0b752.el7

Every 3.0s: oc get build -n testproject                                                                                                                                             Wed Apr 24 14:47:36 2019

NAME           TYPE      FROM          STATUS                     STARTED          DURATION
django-ex-5    Source    Git@0905223   Complete                   44 minutes ago   1m1s
django-ex-6    Source    Git@0905223   Complete                   43 minutes ago   1m3s
django-ex-7    Source    Git@0905223   Complete                   37 minutes ago   4m39s
django-ex-8    Source    Git@0905223   Complete                   32 minutes ago   1m22s
django-ex-9    Source    Git@0905223   Failed                     24 minutes ago   2m10s
django-ex-10   Source    Git@0905223   Complete                   17 minutes ago   1m0s
django-ex-11   Source    Git           Failed (BuildPodEvicted)   14 minutes ago   21s


This is different from the result in Comment 9. I think it is what we expect for the fix.
Only one nit:

Status of builds:
django-ex-9: Failed
django-ex-11: Failed (BuildPodEvicted)

For a moment, I saw `(BuildPodEvicted)` for ex-9, but it vanished quickly after.
For ex-11, `(BuildPodEvicted)` is stable.
From what I did, both of them failed due to the same issue - `low disk space`.

pod status files: http://file.rdu.redhat.com/~hongkliu/test_result/bz1690066/20190424/

I think the important thing for this bug is that we should not see `GenericBuildFailed` as the build status, which IMO has been achieved.
Please reopen if I missed the point.

Comment 18 Hongkai Liu 2019-04-24 15:13:33 UTC
Tested more; the unstable `(BuildPodEvicted)` seen with ex-9 did not show up again.

Comment 20 errata-xmlrpc 2019-06-26 09:07:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

