Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1695507

Summary: CI operator was using a cached failed build and CI got borked on it
Product: OpenShift Container Platform
Reporter: Tomáš Nožička <tnozicka>
Component: Test Infrastructure
Assignee: Steve Kuznetsov <skuznets>
Status: CLOSED NOTABUG
QA Contact:
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: adahiya, adam.kaplan, deads, mifiedle, obulatov
Target Milestone: ---
Keywords: Reopened
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-04-07 19:42:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
oc get all -o yaml (flags: none)

Description Tomáš Nožička 2019-04-03 09:04:47 UTC
Created attachment 1551291 [details]
oc get all -o yaml

Description of problem:

The PR in question (https://github.com/openshift/origin/pull/22440 - changes not related to the failure) was getting consistent errors:

could not wait for build: the build hyperkube failed with reason DockerBuildFailed: Docker build strategy has failed.

--> RUN make build WHAT=vendor/k8s.io/kubernetes/cmd/hyperkube
hack/build-go.sh vendor/k8s.io/kubernetes/cmd/hyperkube 
++ Building go targets for linux/amd64: vendor/k8s.io/kubernetes/cmd/hyperkube
[INFO] hack/build-go.sh exited with code 0 after 00h 00m 33s
error: build error: no such image

Running /retest did not help.


Seen in:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6210/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6208/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6207/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6205/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6202/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6201/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6200/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6196/


After manually deleting the CI namespace it started working:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22440/pull-ci-openshift-origin-master-e2e-aws/6212


Note: a rebase landed during that time, but a few other PRs didn't seem to be hitting this build bug.

Comment 1 Steve Kuznetsov 2019-04-03 15:16:02 UTC
As far as CI-operator is concerned, this was a bona fide build failure:


  status:
    message: Docker build strategy has failed.
    phase: Failed
    reason: DockerBuildFailed


This is WAI (working as intended).

Comment 2 Tomáš Nožička 2019-04-03 16:09:14 UTC
It should have been retried on /retest, but wasn't.

Comment 3 Steve Kuznetsov 2019-04-03 18:03:56 UTC
Build failures are not retried on /retest; this is working as intended and will not be changed. If we can tell from the API that the build failed due to something that could be an infrastructure error, we will retry. This presented as a build-time failure, which we do not retry.
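The retry heuristic described above can be sketched as follows. This is an illustration, not ci-operator's actual implementation; the set of infrastructure-related reason strings is hypothetical, while the `DockerBuildFailed` status comes from this bug.

```python
# Illustrative sketch of the retry heuristic: retry only when the build
# API reports a failure reason that clearly points at an infrastructure
# problem. The INFRA_REASONS set is hypothetical, not ci-operator's
# actual list.
INFRA_REASONS = {
    "CannotCreateBuildPod",
    "FetchSourceFailed",
    "PullBuilderImageFailed",
}

def should_retry(build_status):
    """Return True if a failed build looks like an infrastructure flake."""
    if build_status.get("phase") != "Failed":
        return False
    return build_status.get("reason") in INFRA_REASONS

# The failure in this bug presents as a plain build failure, so under
# this heuristic it is not retried even though the real cause was a flake:
status = {"phase": "Failed", "reason": "DockerBuildFailed",
          "message": "Docker build strategy has failed."}
```

The weakness Tomáš points out later is exactly this: anything the API misreports as an ordinary build failure falls through the heuristic.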

Comment 4 Tomáš Nožička 2019-04-04 06:39:37 UTC
I disagree. The obvious issue here is the caching you do at some point. **This wasn't an issue with the PR at all, yet it was cycling on retest without any way to get unstuck.**

The only "workaround" was to manually mess with the CI infra and delete the namespace. This shows that the caching choices you think are WAI are not correct.
The fact that deleting the namespace gets things unstuck clearly proves you have a caching bug. I don't see a more compelling proof than that.

> If we can tell from the API that the build failed due to something that could be an infrastructure error, we will retry. This presented as a build time failure, which we do not retry.

I don't think there is a way to reliably determine the cause of the failure, so how about not caching failures at all? 

I'm not sure whether this was a networking issue; it was probably an infra flake. What's the point of caching those, other than staying permanently broken?

Comment 5 Steve Kuznetsov 2019-04-04 14:30:44 UTC
When we _can_ tell that something has failed due to an infra flake, we do not cache it. This is a case where the build API did not let us determine that. The cost of manually cleaning up these (very rare) cases is far lower than the cost of re-trying every failed build and not caching at all.

Comment 9 Tomáš Nožička 2019-04-05 07:50:19 UTC
> When we _can_ tell that something has failed due to an infra flake, we do not cache it. 

This is not how caching works. You can cache if you know the inputs didn't change, not the other way around.

Comment 10 Tomáš Nožička 2019-04-05 08:05:17 UTC
Caching build failures means **flakes become permanent**; this effectively cripples CI. There are now dozens of instances like this. Yes, builds flake, but the ci-operator multiplies those failures and blocks the CI.

Comment 11 Steve Kuznetsov 2019-04-07 19:42:02 UTC
We retry infrastructure failures where we can, and we determine inputs as much as possible. This never crippled CI; it affected some 10% of builds, and that number is inflated because if one job for a PR hit this, all of them would, and on a re-test they all would again. CI Operator will continue to retry builds that we can tell failed for infrastructure reasons, and will continue to cache builds (even failed ones) if no inputs we control changed. The underlying issue in the cluster is being addressed, and this specific symptom will go away. The benefit we get from caching builds far outweighs the incidental and rare case where infrastructure failures are not reported as such by the build ecosystem.

Comment 12 David Eads 2019-06-25 11:48:21 UTC
I've just been bitten. In terms of determining this value for https://github.com/openshift/origin/pull/23265#issuecomment-505405910, you can follow the same steps I did: from the failure notification to the namespace, to the builds, to the failed build pod, and then to its status and message:

  - containerID: docker://26c8b19c28f6964cc038f3692c52fd46baa2d96a7f31c51ef7403d77aab5cb4f
    image: docker.io/openshift/origin-docker-builder:v3.11.0
    imageID: docker-pullable://docker.io/openshift/origin-docker-builder@sha256:d660cab1442fa1d103725099a0cef93102265a3e4c22686e5329f34de1ca5be8
    lastState: {}
    name: manage-dockerfile
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: docker://26c8b19c28f6964cc038f3692c52fd46baa2d96a7f31c51ef7403d77aab5cb4f
        exitCode: 128
        finishedAt: "2019-06-25T06:44:33Z"
        message: |
          nsenter: could not ensure we are a cloned binary: Device or resource busy
          container_linux.go:247: starting container process caused "process_linux.go:245: running exec setns process for init caused \"exit status 17\""
        reason: ContainerCannotRun
        startedAt: "2019-06-25T06:44:33Z"
  phase: Failed

Even a whitelist of "don't cache if X is found" could improve usage.
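That whitelist idea could be as simple as matching known infrastructure-error signatures in the failure status before deciding to cache. The patterns below are drawn from the container status shown in this comment; this is a sketch of the proposal, not ci-operator's implementation, and a real list would be maintained over time:

```python
import re

# Sketch of the "don't cache if X is found" whitelist: if the failure
# status matches a known infrastructure-error signature, skip caching so
# the next /retest gets a fresh build. Patterns come from the container
# status above; a real list would grow as new flake signatures appear.
DONT_CACHE_PATTERNS = [
    re.compile(r"could not ensure we are a cloned binary"),
    re.compile(r"ContainerCannotRun"),
    re.compile(r"running exec setns process for init caused"),
]

def cacheable_failure(reason, message):
    """Return False when the failure looks infrastructure-related."""
    text = "{}\n{}".format(reason, message)
    return not any(p.search(text) for p in DONT_CACHE_PATTERNS)
```

Under this check, the `ContainerCannotRun` / `nsenter` failure above would not be cached, while a genuine compile error still would be.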