Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1901737

Summary:	[buildcop] ipi-deprovision step exceeds 4h ProwJob timeout
Product:	OpenShift Container Platform	Reporter:	Lokesh Mandvekar <lsm5>
Component:	Test Infrastructure	Assignee:	Steve Kuznetsov <skuznets>
Status:	CLOSED CURRENTRELEASE	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	aos-bugs, bparees, eparis, jokerman, wking
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-01-11 15:16:33 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Lokesh Mandvekar 2020-11-25 21:43:42 UTC

Description of problem:

This was sent to me by Joel Speed, I couldn't seem to find the issue on any of the latest issues, so filing under Uknown.

See the timeout issue at: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512 

 2020/11/20 09:54:51 Container test in pod e2e-aws-operator-gather-audit-logs completed successfully
2020/11/20 09:54:51 Pod e2e-aws-operator-gather-audit-logs succeeded after 30s
2020/11/20 09:54:54 Copied 42.89MB of artifacts from e2e-aws-operator-gather-audit-logs to /logs/artifacts/e2e-aws-operator/gather-audit-logs
2020/11/20 09:54:54 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
2020/11/20 09:54:57 Container cp-secret-wrapper in pod e2e-aws-operator-ipi-deprovision-deprovision completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-20T12:24:09Z"}
time="2020-11-20T12:24:09Z" level=info msg="Received signal." signal=interrupt
2020/11/20 12:24:09 error: Process interrupted with signal interrupt, cancelling execution...
2020/11/20 12:24:09 cleanup: Deleting release pod release-latest
{"component":"entrypoint","file":"prow/entrypoint/run.go:250","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process did not exit before 30m0s grace period","severity":"error","time":"2020-11-20T12:54:09Z"}

Comment 1 W. Trevor King 2020-11-25 23:38:12 UTC

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/build-log.txt | grep 'Executing pod'
2020/11/20 08:24:14 Executing pod "release-latest"
2020/11/20 08:25:57 Executing pod "e2e-aws-operator-ipi-conf"
2020/11/20 08:26:02 Executing pod "e2e-aws-operator-ipi-conf-aws"
2020/11/20 08:26:07 Executing pod "e2e-aws-operator-ipi-install-rbac"
2020/11/20 08:26:13 Executing pod "e2e-aws-operator-ipi-install-install"
2020/11/20 08:57:49 Executing pod "e2e-aws-operator-test"
2020/11/20 09:45:44 Executing pod "e2e-aws-operator-gather-aws-console"
2020/11/20 09:46:00 Executing pod "e2e-aws-operator-gather-must-gather"
2020/11/20 09:48:46 Executing pod "e2e-aws-operator-gather-extra"
2020/11/20 09:54:20 Executing pod "e2e-aws-operator-gather-audit-logs"
2020/11/20 09:54:54 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-api-actuator-pkg/190/pull-ci-openshift-cluster-api-actuator-pkg-master-e2e-aws-operator/1329702034437312512/prowjob.json | jq -r .metadata.creationTimestamp
2020-11-20T08:24:04Z

Looks like e2e-aws-operator-ipi-deprovision-deprovision needs a timeout.

Comment 2 W. Trevor King 2020-11-25 23:50:13 UTC

I've filed a PR setting a timeout on that step.  That should give us nicer failure handling, but will probably not address the underlying issue, which may be an installer (or provider) teardown bug or the cluster-API PR creating some kind of blocking AWS resource that the installer doesn't know how to delete.

Comment 3 Joel Speed 2020-11-26 10:15:44 UTC

The only thing being created by the new tests in the PR (or rather restored tests) are machines. So perhaps the installer is unable to tear down the environment because it can't tear down the machines? IIRC installer uses the instance tags to determine what to remove, if we aren't adding those in these tests for some reason, that could be the problem

Comment 4 Joel Speed 2020-11-26 10:18:47 UTC

With that said, the test suite is actually waiting for the Machines to go away before the test succeeds, so it should be cleaning up everything it creates

Comment 5 Joel Speed 2020-12-03 11:01:22 UTC

@wking This is still happening (I realise nothing has been fixed yet) and it is happening across many repositories and many tests. I don't think this is a problem with the repos, my suspicion, based on timing, is that this started with the big switch to multi-stage jobs and could be a side effect of that migration? 

If you look at this search [1], you can see that there are many tests from many repos that have "x completed successfully" and then several hours later the timeout

Eg:
- GCP Console [2]
- AWS MAPI Provider [3]
- MAO Azure (this one actually failed but still has a long delay between finishing and then a timeout) [4]

[1] https://search.ci.openshift.org/?search=Process+did+not+finish+before+4h0m0s+timeout&maxAge=48h&context=1&type=all&name=&maxMatches=5&maxBytes=20971520&groupBy=job

[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/7343/pull-ci-openshift-console-master-e2e-gcp-console/1333945808952758272
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/332/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/1333724390407278592
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/767/pull-ci-openshift-machine-api-operator-master-e2e-azure-operator/1333860563813404672

Is it possible we could look into this further?

Comment 6 W. Trevor King 2020-12-04 05:30:26 UTC

This ticket is not about the 4h ProwJob timeout.  Looking at the three jobs you linked [1]:

  2020/12/02 05:45:37 Pod e2e-gcp-console-ipi-deprovision-deprovision succeeded after 4m29s

so in that run, some previous step was slow, but the ipi-deprovision step (what this ticket is about) was fine.  Then [2]:

  2020/12/01 12:50:16 Executing pod "e2e-aws-operator-ipi-deprovision-deprovision"
  2020/12/01 12:50:21 Container cp-secret-wrapper in pod e2e-aws-operator-ipi-deprovision-deprovision completed successfully
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-12-01T14:47:33Z"}

So in that case the ipi-deprovision step took almost 2h, and would have been helped by my in-flight PR associated with this bug (still no review from anyone on the PR).  Then [3]:

  2020/12/01 22:45:27 Executing pod "e2e-azure-operator-ipi-deprovision-deprovision"
  2020/12/01 22:45:32 Container cp-secret-wrapper in pod e2e-azure-operator-ipi-deprovision-deprovision completed successfully
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-12-01T23:48:44Z"}

So another case where ipi-deprovision was slow (1h), and would have been helped by my PR.  In both cases "helped" is "terminated more graciously, so we would collect artifacts from the slow teardown, and could hopefully figure out why it stuck".  We'd still be leaking the cluster to the ipi-deprovision job [4] with or without my step-timeout PR.

But nothing I can do about this.  If someone else in the OpenShift org likes the PR I've floated, they can drop a /lgtm in it to land the PR.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_console/7343/pull-ci-openshift-console-master-e2e-gcp-console/1333945808952758272#1:build-log.txt%3A77
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-api-provider-aws/332/pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-operator/1333724390407278592#1:build-log.txt%3A104
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/767/pull-ci-openshift-machine-api-operator-master-e2e-azure-operator/1333860563813404672#1:build-log.txt%3A9253
[4]: https://prow.ci.openshift.org/job-history/origin-ci-test/logs/periodic-ipi-deprovision

Comment 7 Ben Parees 2020-12-21 17:09:49 UTC

Trevor's timeout PR has merged so the job artifacts should be available to investigate this further (thanks Trevor!)