Bug 1695507
| Summary: | CI operator was using a cached failed build and CI got borked on it | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Tomáš Nožička <tnozicka> |
| Component: | Test Infrastructure | Assignee: | Steve Kuznetsov <skuznets> |
| Status: | CLOSED NOTABUG | Severity: | urgent |
| Priority: | unspecified | Version: | 4.1.0 |
| Target Milestone: | --- | Target Release: | 4.1.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Keywords: | Reopened | CC: | adahiya, adam.kaplan, deads, mifiedle, obulatov |
| Last Closed: | 2019-04-07 19:42:02 UTC | Type: | Bug |
**Description** Tomáš Nožička 2019-04-03 09:04:47 UTC
As far as CI-operator is concerned, this was a bona fide build failure:
```yaml
status:
  message: Docker build strategy has failed.
  phase: Failed
  reason: DockerBuildFailed
```

This is WAI (working as intended).
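The distinction argued over below (retry on infrastructure errors, cache genuine build failures) can be sketched as a check on the `reason` reported in the build's status stanza. This is a minimal illustration, not ci-operator's actual code; only `DockerBuildFailed` appears in this bug, and the infra reason names are hypothetical:

```python
# Sketch of a retry decision keyed on a build's reported failure reason.
# The infra reason names below are illustrative, not taken from this bug.
INFRA_REASONS = {"BuildPodDeleted", "BuildPodEvicted"}  # hypothetical names

def should_retry(status: dict) -> bool:
    """Retry only when the API attributes the failure to infrastructure."""
    if status.get("phase") != "Failed":
        return False
    return status.get("reason") in INFRA_REASONS

status = {"phase": "Failed", "reason": "DockerBuildFailed",
          "message": "Docker build strategy has failed."}
print(should_retry(status))  # prints False: treated as a real build failure
```

Under this policy the `DockerBuildFailed` status above is never retried, which is exactly the behavior being disputed.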
It should have been retried on /retest but wasn't.

Build failures are not retried on /retest; this is working as intended and will not be changed. If we can tell from the API that the build failed due to something that could be an infrastructure error, we will retry. This presented as a build-time failure, which we do not retry.

I disagree. The obvious issue here is the caching you do at some point. **This wasn't an issue with the PR at all, yet it was cycling on /retest without any way to get unstuck.**
The only "workaround" was to manually mess with the CI infra and delete the namespace. This proves that the caching choices you think are WAI are not correct.

The fact that deleting the namespace gets things unstuck clearly shows you have a caching bug. I don't see a more compelling proof than that.
> If we can tell from the API that the build failed due to something that could be an infrastructure error, we will retry. This presented as a build time failure, which we do not retry.
I don't think there is a way to reliably determine the cause of the failure, so how about not caching failures at all?
I'm not sure this was a networking issue; it was probably an infra flake. What's the point of caching those, other than staying permanently broken?
When we _can_ tell that something has failed due to an infra flake, we do not cache it. This is a case where the build API did not let us determine that. The cost of re-trying every failed build and not caching at all exceeds the cost of manually cleaning up these (very rare) cases.

Seeing similar errors on origin:

- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22450/pull-ci-openshift-origin-master-e2e-aws/6305
- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22463/pull-ci-openshift-origin-master-e2e-aws/6304

This error happened again on a non-origin repo...

- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/599/pull-ci-openshift-machine-config-operator-master-e2e-aws/2932

More of these are happening in the origin repo:

- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/21244/pull-ci-openshift-origin-master-e2e-aws/6354
- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22478/pull-ci-openshift-origin-master-e2e-aws/6352
- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22477/pull-ci-openshift-origin-master-e2e-aws/6348
- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22477/pull-ci-openshift-origin-master-e2e-aws/6347
- https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22477/pull-ci-openshift-origin-master-e2e-aws/6344

> When we _can_ tell that something has failed due to an infra flake, we do not cache it.
This is not how caching works. You can cache if you know the inputs didn't change, not the other way around.
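The input-keyed policy being argued for here can be sketched as a cache whose key is a hash of the build inputs and which simply declines to store failed results. This is a hypothetical sketch, not ci-operator's implementation; the `record`/`lookup` names and the policy flag are illustrative, and the `Complete`/`Failed` phases mirror the status dumps in this bug:

```python
import hashlib
import json

# Hypothetical input-keyed build cache: cache only when inputs are known,
# and (per the policy argued for here) never cache a failed result, so a
# flaky failure cannot become permanent for unchanged inputs.
cache: dict[str, dict] = {}

def input_key(inputs: dict) -> str:
    """Key the cache purely on the declared build inputs."""
    blob = json.dumps(inputs, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def record(inputs: dict, result: dict, cache_failures: bool = False) -> None:
    if result.get("phase") == "Complete" or cache_failures:
        cache[input_key(inputs)] = result

def lookup(inputs: dict):
    return cache.get(input_key(inputs))
```

With `cache_failures=False`, a `DockerBuildFailed` result is never replayed on /retest; re-running the same inputs simply misses the cache and rebuilds.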
Caching build failures means **flakes become permanent**; this is effectively crippling CI. There are now dozens of instances like this. Yes, builds flake, but ci-operator multiplies those failures and blocks the CI.

We retry infrastructure failures where we can. We determine inputs as much as possible. This never crippled CI: it affected some 10% of builds, and that number is inflated, as if one job for a PR hit this, all would, and if a re-test happened they all would as well. CI Operator will continue to retry builds we can tell are infrastructure-related and will continue to cache builds (even failed ones) if no inputs we control changed. The underlying issue in the cluster is being addressed, and this specific symptom will go away. The benefit we get from caching builds is much higher than the incidental and rare case where infrastructure failures are not reported as such by the build ecosystem.

I've just been bit. In terms of determining this value for https://github.com/openshift/origin/pull/23265#issuecomment-505405910, you can follow the same steps I did.
Failure notification, to namespace, to builds, to failed build pod, to status, to message:

```yaml
- containerID: docker://26c8b19c28f6964cc038f3692c52fd46baa2d96a7f31c51ef7403d77aab5cb4f
  image: docker.io/openshift/origin-docker-builder:v3.11.0
  imageID: docker-pullable://docker.io/openshift/origin-docker-builder@sha256:d660cab1442fa1d103725099a0cef93102265a3e4c22686e5329f34de1ca5be8
  lastState: {}
  name: manage-dockerfile
  ready: false
  restartCount: 0
  state:
    terminated:
      containerID: docker://26c8b19c28f6964cc038f3692c52fd46baa2d96a7f31c51ef7403d77aab5cb4f
      exitCode: 128
      finishedAt: "2019-06-25T06:44:33Z"
      message: |
        nsenter: could not ensure we are a cloned binary: Device or resource busy
        container_linux.go:247: starting container process caused "process_linux.go:245: running exec setns process for init caused \"exit status 17\""
      reason: ContainerCannotRun
      startedAt: "2019-06-25T06:44:33Z"
phase: Failed
```

Even a whitelist of "don't cache if X is found" could improve usage.
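The suggested whitelist of "don't cache if X is found" could be sketched as a scan over the pod's container statuses for known infra-flake signatures. A minimal sketch under that assumption; only `ContainerCannotRun` is taken from the dump above, the other entry is illustrative:

```python
# Sketch: refuse to cache a failure whose container status carries a known
# infra-flake reason. Only ContainerCannotRun appears in this bug; "Evicted"
# is an illustrative addition.
DONT_CACHE_REASONS = {"ContainerCannotRun", "Evicted"}

def cacheable_failure(pod_status: dict) -> bool:
    """Return False when any container terminated for a listed infra reason."""
    for cs in pod_status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated", {})
        if terminated.get("reason") in DONT_CACHE_REASONS:
            return False
    return True

pod_status = {
    "phase": "Failed",
    "containerStatuses": [{
        "name": "manage-dockerfile",
        "state": {"terminated": {"exitCode": 128,
                                 "reason": "ContainerCannotRun"}},
    }],
}
print(cacheable_failure(pod_status))  # prints False: this failure is not cached
```

Applied to the status above, the `ContainerCannotRun` / `nsenter` failure would be retried instead of being replayed from the cache on every /retest.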