Bug 1901737
| Summary: | [buildcop] ipi-deprovision step exceeds 4h ProwJob timeout | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lokesh Mandvekar <lsm5> |
| Component: | Test Infrastructure | Assignee: | Steve Kuznetsov <skuznets> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.7 | CC: | aos-bugs, bparees, eparis, jokerman, wking |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2021-01-11 15:16:33 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
|
Description
Lokesh Mandvekar
2020-11-25 21:43:42 UTC
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/build-log.txt | grep 'Executing pod' 2020/11/20 08:24:14 Executing pod "release-latest" 2020/11/20 08:25:57 Executing pod "e2e-aws-operator-ipi-conf" 2020/11/20 08:26:02 Executing pod "e2e-aws-operator-ipi-conf-aws" 2020/11/20 08:26:07 Executing pod "e2e-aws-operator-ipi-install-rbac" 2020/11/20 08:26:13 Executing pod "e2e-aws-operator-ipi-install-install" 2020/11/20 08:57:49 Executing pod "e2e-aws-operator-test" 2020/11/20 09:45:44 Executing pod "e2e-aws-operator-gather-aws-console" 2020/11/20 09:46:00 Executing pod "e2e-aws-operator-gather-must-gather" 2020/11/20 09:48:46 Executing pod "e2e-aws-operator-gather-extra" 2020/11/20 09:54:20 Executing pod "e2e-aws-operator-gather-audit-logs" 2020/11/20 09:54:54 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision" $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/prowjob.json | jq -r .metadata.creationTimestamp 2020-11-20T08:24:04Z Looks like e2e-aws-operator-ipi-deprovision-deprovision needs a timeout. I've filed a PR setting a timeout on that step. That should give us nicer failure handling, but will probably not address the underlying issue, which may be an installer (or provider) teardown bug or the cluster-API PR creating some kind of blocking AWS resource that the installer doesn't know how to delete. The only thing being created by the new tests in the PR (or rather restored tests) are machines. So perhaps the installer is unable to tear down the environment because it can't tear down the machines? IIRC installer uses the instance tags to determine what to remove, if we aren't adding those in these tests for some reason, that could be the problem With that said, the test suite is actually waiting for the Machines to go away before the test succeeds, so it should be cleaning up everything it creates @wking This is still happening (I realise nothing has been fixed yet) and it is happening across many repositories and many tests. I don't think this is a problem with the repos, my suspicion, based on timing, is that this started with the big switch to multi-stage jobs and could be a side effect of that migration? If you look at this search [1], you can see that there are many tests from many repos that have "x completed successfully" and then several hours later the timeout Eg: - GCP Console [2] - AWS MAPI Provider [3] - MAO Azure (this one actually failed but still has a long delay between finishing and then a timeout) [4] [1] https://search.ci.openshift.org/?search=Process+did+not+finish+before+4h0m0s+timeout&maxAge=48h&context=1&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job [2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/7343/pull-ci-openshift-console-master-e2e-gcp-console/1333945808952758272 [3] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/332/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/1333724390407278592 [4] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/767/pull-ci-openshift-machine-api-operator-master-e2e-azure-operator/1333860563813404672 Is it possible we could look into this further? This ticket is not about the 4h ProwJob timeout. Looking at the three jobs you linked [1]:
2020/12/02 05:45:37 Pod e2e-gcp-console-ipi-deprovision-deprovision succeeded after 4m29s
so in that run, some previous step was slow, but the ipi-deprovision step (what this ticket is about) was fine. Then [2]:
2020/12/01 12:50:16 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
2020/12/01 12:50:21 Container cp-secret-wrapper in pod e2e-aws-operator-ipi-deprovision-deprovision completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-12-01T14:47:33Z"}
So in that case the ipi-deprovision step took almost 2h, and would have been helped by my in-flight PR associated with this bug (still no review from anyone on the PR). Then [3]:
2020/12/01 22:45:27 Executing pod "e2e-azure-operator-ipi-deprovision-deprovision"
2020/12/01 22:45:32 Container cp-secret-wrapper in pod e2e-azure-operator-ipi-deprovision-deprovision completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-12-01T23:48:44Z"}
So another case where ipi-deprovision was slow (1h), and would have been helped by my PR. In both cases "helped" is "terminated more graciously, so we would collect artifacts from the slow teardown, and could hopefully figure out why it stuck". We'd still be leaking the cluster to the ipi-deprovision job [4] with or without my step-timeout PR.
But nothing I can do about this. If someone else in the OpenShift org likes the PR I've floated, they can drop a /lgtm in it to land the PR.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/7343/pull-ci-openshift-console-master-e2e-gcp-console/1333945808952758272#1:build-log.txt%3A77
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/332/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/1333724390407278592#1:build-log.txt%3A104
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/767/pull-ci-openshift-machine-api-operator-master-e2e-azure-operator/1333860563813404672#1:build-log.txt%3A9253
[4]: https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ipi-deprovision
Trevor's timeout PR has merged so the job artifacts should be available to investigate this further (thanks Trevor!) |