Bug 1707210 - no logs gathered for failed upgrade job
Summary: no logs gathered for failed upgrade job
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Infrastructure
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.1.0
Assignee: Steve Kuznetsov
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-05-07 04:13 UTC by Ben Parees
Modified: 2019-06-04 10:48 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:31 UTC
Target Upstream Version:



Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:48:42 UTC

Description Ben Parees 2019-05-07 04:13:31 UTC
The upgrade job failed (seems to have timed out?):
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633

2019/05/07 00:16:14 Running pod e2e-aws-upgrade
2019/05/07 00:43:34 Container setup in pod e2e-aws-upgrade completed successfully
{"component":"entrypoint","level":"error","msg":"Process did not finish before 2h0m0s timeout","time":"2019-05-07T01:54:11Z"}
2019/05/07 01:54:11 error: Process interrupted with signal interrupt, exiting in 10s ...
2019/05/07 01:54:11 cleanup: Deleting release pod release-initial
2019/05/07 01:54:11 cleanup: Deleting release pod release-latest
2019/05/07 01:54:11 cleanup: Deleting template e2e-aws-upgrade
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

But the e2e-aws-upgrade logs are not found in the collected artifacts:

https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633/

This makes it difficult to triage why the job failed.

Comment 3 Petr Muller 2019-05-07 13:31:31 UTC
Do successful runs of the same job (or at least runs that do not hit the timeout) save any useful artifact? If so, can you give me a pointer to this artifact?

I assume this is something that needs to be fixed in the appropriate test template. It looks like artifacts are collected even on timeout, but the useful one is not placed in the collected location...

Comment 5 Nikolaos Leandros Moraitis 2019-05-07 14:39:39 UTC
https://github.com/openshift/ci-operator-prowgen/pull/157
https://github.com/openshift/release/pull/3706

Updating the jobs after the latest Prow bump.

Comment 7 Sudha Ponnaganti 2019-05-07 23:02:45 UTC
Please merge it ASAP tonight, before we make the final build. I am lowering the priority so that it will not block code freeze.

Comment 9 Nikolaos Leandros Moraitis 2019-05-08 12:55:19 UTC
The issue should be fixed by now.
We will continue monitoring the jobs to make sure that we do not hit that error again.

Comment 10 Steve Kuznetsov 2019-05-08 15:27:08 UTC
What was the actual issue? At every step Prow will give a (configurable) grace period for jobs when they are being asked to terminate. It is up to the job to trap that and do something.
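
For illustration, "trapping" the termination request could look roughly like the sketch below. It is not taken from the actual test template: the helper names `collect_artifacts` and `install_handler` and the source and destination directories are hypothetical. The only assumption is that the job receives SIGINT or SIGTERM and has until the grace period expires to copy its logs somewhere the artifact uploader will look.

```python
import shutil
import signal
import sys
from pathlib import Path
from typing import List


def collect_artifacts(src: Path, dest: Path) -> List[str]:
    """Copy every file under src into dest; return the copied relative paths."""
    copied = []
    for path in sorted(src.rglob("*")):
        if path.is_file():
            relative = path.relative_to(src)
            target = dest / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(str(relative))
    return copied


def install_handler(src: Path, dest: Path) -> None:
    """On SIGINT/SIGTERM, save the logs first and only then exit.

    src and dest are hypothetical locations chosen by the job author.
    """
    def handler(signum, frame):
        collect_artifacts(src, dest)
        # Conventional "terminated by signal" exit status.
        sys.exit(128 + signum)

    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, handler)
```

The copy has to finish within the grace period, so a handler like this should do as little work as possible.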

Comment 11 Steve Kuznetsov 2019-05-08 16:01:56 UTC
The issue was an internal default overriding the configured default for timeout and grace period. Not sure this one makes sense to send to QA.
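
As a rough sketch of the intended precedence (not the actual Prow code), a configured value should win and the internal default should apply only when nothing is configured. The constant names, the `effective` helper, and the durations below are hypothetical placeholders.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical built-in fallbacks; the real values live in the tooling.
INTERNAL_TIMEOUT = timedelta(hours=2)
INTERNAL_GRACE_PERIOD = timedelta(seconds=15)


def effective(configured: Optional[timedelta], internal: timedelta) -> timedelta:
    """Prefer the configured value; use the internal default only when unset."""
    return configured if configured is not None else internal


# A configured timeout is honoured, an unset grace period falls back.
timeout = effective(timedelta(hours=4), INTERNAL_TIMEOUT)
grace_period = effective(None, INTERNAL_GRACE_PERIOD)
```

The bug described here corresponds to the reverse order, where the internal value is applied even though a configured one exists.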

Comment 13 errata-xmlrpc 2019-06-04 10:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

