Bug 1707210

Summary: no logs gathered for failed upgrade job
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Test Infrastructure
Assignee: Steve Kuznetsov <skuznets>
Status: CLOSED ERRATA
Severity: low
Priority: low
Version: 4.1.0
Target Release: 4.1.0
Target Milestone: ---
CC: mfojtik, nmoraiti, pmuller, sponnaga, vlaad
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:48:31 UTC
Type: Bug

Description Ben Parees 2019-05-07 04:13:31 UTC
The upgrade job failed (seems to have timed out?):
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633

2019/05/07 00:16:14 Running pod e2e-aws-upgrade
2019/05/07 00:43:34 Container setup in pod e2e-aws-upgrade completed successfully
{"component":"entrypoint","level":"error","msg":"Process did not finish before 2h0m0s timeout","time":"2019-05-07T01:54:11Z"}
2019/05/07 01:54:11 error: Process interrupted with signal interrupt, exiting in 10s ...
2019/05/07 01:54:11 cleanup: Deleting release pod release-initial
2019/05/07 01:54:11 cleanup: Deleting release pod release-latest
2019/05/07 01:54:11 cleanup: Deleting template e2e-aws-upgrade
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

But the e2e-aws-upgrade logs are not found in the collected artifacts:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633/

This makes it difficult to triage why the job failed.

Comment 3 Petr Muller 2019-05-07 13:31:31 UTC
Do successful runs of the same job (or at least runs that do not hit the timeout) save any useful artifact? If so, can you give me a pointer to this artifact?

I assume this is something that needs to be fixed in the appropriate test template: it looks like artifacts are collected even on timeout, but the useful one is not placed in the collected location...
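
For illustration, a minimal sketch of what placing that output in the collected location could look like in a template step; the artifact path and test script name here are assumptions, not taken from the actual template:

    # Sketch only: ARTIFACT_DIR is assumed to be the directory the
    # artifact-gathering step uploads; run-upgrade-tests.sh is hypothetical.
    ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/artifacts}"
    mkdir -p "${ARTIFACT_DIR}"
    ./run-upgrade-tests.sh 2>&1 | tee "${ARTIFACT_DIR}/e2e-aws-upgrade.log"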

Comment 5 Nikolaos Leandros Moraitis 2019-05-07 14:39:39 UTC
https://github.com/openshift/ci-operator-prowgen/pull/157
https://github.com/openshift/release/pull/3706

These PRs update the jobs with the latest Prow bump.

Comment 7 Sudha Ponnaganti 2019-05-07 23:02:45 UTC
Please merge it ASAP tonight before we make the final build. I am lowering the priority so it will not block code freeze.

Comment 9 Nikolaos Leandros Moraitis 2019-05-08 12:55:19 UTC
The issue should be fixed by now.
We will continue monitoring the jobs to make sure we don't hit that error again.

Comment 10 Steve Kuznetsov 2019-05-08 15:27:08 UTC
What was the actual issue? At every step, Prow gives the job a (configurable) grace period when it is asked to terminate. It is up to the job to trap that signal and do something with it.
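
As a rough sketch, trapping that signal in a test step could look like the following; the paths and the test command are assumptions, not the actual template contents:

    # Sketch only: run the tests in the background so the shell can still
    # react to the termination signal Prow sends at the start of the
    # grace period, and copy whatever logs exist into the artifacts dir.
    copy_logs_on_exit() {
      cp -r /tmp/e2e-logs "${ARTIFACT_DIR:-/tmp/artifacts}/" || true
    }
    trap copy_logs_on_exit TERM INT
    ./run-upgrade-tests.sh &    # hypothetical test command
    wait "$!"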

Comment 11 Steve Kuznetsov 2019-05-08 16:01:56 UTC
The issue was an internal default overriding the configured default for the timeout and grace period. Not sure this one makes sense to send to QA.
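
For context, a rough sketch of where those values live in a Prow job definition; the field names are assumed from the upstream decoration config, and the values are illustrative rather than the ones from the linked PRs:

    # Illustrative only: the job name matches the failing job, but the
    # timeout and grace period values do not come from the actual fix.
    presubmits:
      openshift/origin:
      - name: pull-ci-openshift-origin-master-e2e-aws-upgrade
        decorate: true
        decoration_config:
          timeout: 4h0m0s      # how long the test process may run before it is interrupted
          grace_period: 15m0s  # window the process gets to trap the signal and save artifacts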

Comment 13 errata-xmlrpc 2019-06-04 10:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758