Bug 1707210

Summary: no logs gathered for failed upgrade job
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Test Infrastructure
Assignee: Steve Kuznetsov <skuznets>
Status: CLOSED ERRATA
Severity: low
Priority: low
Version: 4.1.0
Target Release: 4.1.0
Target Milestone: ---
CC: mfojtik, nmoraiti, pmuller, sponnaga, vlaad
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:48:31 UTC
Type: Bug

Description Ben Parees 2019-05-07 04:13:31 UTC
The upgrade job failed (seems to have timed out?):
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633

2019/05/07 00:16:14 Running pod e2e-aws-upgrade
2019/05/07 00:43:34 Container setup in pod e2e-aws-upgrade completed successfully
{"component":"entrypoint","level":"error","msg":"Process did not finish before 2h0m0s timeout","time":"2019-05-07T01:54:11Z"}
2019/05/07 01:54:11 error: Process interrupted with signal interrupt, exiting in 10s ...
2019/05/07 01:54:11 cleanup: Deleting release pod release-initial
2019/05/07 01:54:11 cleanup: Deleting release pod release-latest
2019/05/07 01:54:11 cleanup: Deleting template e2e-aws-upgrade
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

But the e2e-aws-upgrade logs are not found in the collected artifacts:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633/

This makes it difficult to triage why the job failed.

Comment 3 Petr Muller 2019-05-07 13:31:31 UTC
Do successful runs of the same job (or at least runs that do not hit the timeout) save any useful artifact? If so, can you give me a pointer to this artifact?

I assume this is something that needs to be fixed in the appropriate test template: it looks like artifacts are collected even on timeout, but the useful one is not placed in the collected location...
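
For illustration, a minimal sketch of what placing that output in the collected location could look like in a template step; the artifact path and test script name here are assumptions, not taken from the actual template:

    # Sketch only: ARTIFACT_DIR is assumed to be the directory the
    # artifact-gathering step uploads; run-upgrade-tests.sh is hypothetical.
    ARTIFACT_DIR="${ARTIFACT_DIR:-/tmp/artifacts}"
    mkdir -p "${ARTIFACT_DIR}"
    ./run-upgrade-tests.sh 2>&1 | tee "${ARTIFACT_DIR}/e2e-aws-upgrade.log"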

Comment 5 Nikolaos Leandros Moraitis 2019-05-07 14:39:39 UTC
https://github.com/openshift/ci-operator-prowgen/pull/157
https://github.com/openshift/release/pull/3706

These PRs update the jobs with the latest Prow bump.

Comment 7 Sudha Ponnaganti 2019-05-07 23:02:45 UTC
Please merge it ASAP tonight before we make the final build. I am lowering the priority so it will not block code freeze.

Comment 9 Nikolaos Leandros Moraitis 2019-05-08 12:55:19 UTC
The issue should be fixed by now.
We will continue monitoring the jobs to make sure we don't hit that error again.

Comment 10 Steve Kuznetsov 2019-05-08 15:27:08 UTC
What was the actual issue? At every step, Prow gives the job a (configurable) grace period when it is asked to terminate. It is up to the job to trap that signal and do something with it.
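
As a rough sketch, trapping that signal in a test step could look like the following; the paths and the test command are assumptions, not the actual template contents:

    # Sketch only: run the tests in the background so the shell can still
    # react to the termination signal Prow sends at the start of the
    # grace period, and copy whatever logs exist into the artifacts dir.
    copy_logs_on_exit() {
      cp -r /tmp/e2e-logs "${ARTIFACT_DIR:-/tmp/artifacts}/" || true
    }
    trap copy_logs_on_exit TERM INT
    ./run-upgrade-tests.sh &    # hypothetical test command
    wait "$!"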

Comment 11 Steve Kuznetsov 2019-05-08 16:01:56 UTC
The issue was an internal default overriding the configured default for the timeout and grace period. Not sure this one makes sense to send to QA.
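
For context, a rough sketch of where those values live in a Prow job definition; the field names are assumed from the upstream decoration config, and the values are illustrative rather than the ones from the linked PRs:

    # Illustrative only: the job name matches the failing job, but the
    # timeout and grace period values do not come from the actual fix.
    presubmits:
      openshift/origin:
      - name: pull-ci-openshift-origin-master-e2e-aws-upgrade
        decorate: true
        decoration_config:
          timeout: 4h0m0s      # how long the test process may run before it is interrupted
          grace_period: 15m0s  # window the process gets to trap the signal and save artifacts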

Comment 13 errata-xmlrpc 2019-06-04 10:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758