The upgrade job failed (it appears to have timed out): https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633

2019/05/07 00:16:14 Running pod e2e-aws-upgrade
2019/05/07 00:43:34 Container setup in pod e2e-aws-upgrade completed successfully
{"component":"entrypoint","level":"error","msg":"Process did not finish before 2h0m0s timeout","time":"2019-05-07T01:54:11Z"}
2019/05/07 01:54:11 error: Process interrupted with signal interrupt, exiting in 10s ...
2019/05/07 01:54:11 cleanup: Deleting release pod release-initial
2019/05/07 01:54:11 cleanup: Deleting release pod release-latest
2019/05/07 01:54:11 cleanup: Deleting template e2e-aws-upgrade
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

However, the e2e-aws-upgrade logs are not present in the collected artifacts: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-upgrade/633/

This makes it difficult to triage why the job failed.
Similar issues in e2e-aws-serial: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22774/pull-ci-openshift-origin-master-e2e-aws-serial/5875
Seen here: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-controller-manager-operator/250/pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-upgrade/48/artifacts/

Raising severity to urgent.
Do successful runs of the same job (or at least runs that do not hit the timeout) save any useful artifacts? If so, can you point me to one? I assume this is something that needs to be fixed in the appropriate test template - it looks like artifacts are collected even on a timeout, but the useful one is not placed in the collected location...
Yes, successful runs do appear to have their logs gathered:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-serial/5900
https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/batch/pull-ci-openshift-origin-master-e2e-aws-serial/5900/artifacts/e2e-aws-serial/
https://github.com/openshift/ci-operator-prowgen/pull/157
https://github.com/openshift/release/pull/3706

These update the jobs to pick up the latest Prow bump.
Please merge it ASAP tonight before we cut the final build. I am lowering the priority so it will not block code freeze.
The issue should be fixed now. We will continue monitoring the jobs to make sure we do not hit this error again.
What was the actual issue?

At every step, Prow gives jobs a (configurable) grace period when they are asked to terminate. It is up to the job to trap that signal and do something with it.
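For context on what a job can do with that grace period, here is a minimal sketch (not the actual Prow entrypoint; the command and the grace-period value are made up) of a wrapper that forwards the termination signal and gives the test process a window to dump its artifacts before being killed:

package main

import (
	"os"
	"os/exec"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// Hypothetical value; in Prow both the timeout and the grace period are configurable.
	gracePeriod := 15 * time.Second

	// Stand-in for the actual test process.
	cmd := exec.Command("/bin/sh", "-c", "sleep 600")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case <-sigs:
		// Forward the signal so the job can trap it and gather its artifacts.
		cmd.Process.Signal(syscall.SIGINT)
		select {
		case <-done:
			// The job exited on its own within the grace period.
		case <-time.After(gracePeriod):
			// Grace period expired; force-kill the job.
			cmd.Process.Kill()
			<-done
		}
	case <-done:
		// The job finished before any signal arrived.
	}
}

The point of the forwarded signal is that if the test step traps it and copies its logs into the artifacts directory within the grace period, the artifacts survive a timeout instead of going missing as they did here.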
The issue was an internal default overriding the configured default for the timeout and grace period. Not sure this one makes sense to send to QA.
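To make that concrete, here is an illustrative sketch (hypothetical, not the actual Prow code; the names are made up) of the bug pattern: an internal default applied after the configuration is loaded, silently clobbering the configured timeout and grace period:

package main

import (
	"fmt"
	"time"
)

// Options mirrors the idea of a timeout/grace-period configuration.
type Options struct {
	Timeout     time.Duration
	GracePeriod time.Duration
}

const (
	defaultTimeout     = 2 * time.Hour
	defaultGracePeriod = 15 * time.Second
)

// complete is intended to fill in only unset fields, but as written it
// unconditionally overwrites whatever the configuration asked for.
func (o *Options) complete() {
	o.Timeout = defaultTimeout         // bug: should be guarded by `if o.Timeout == 0`
	o.GracePeriod = defaultGracePeriod // bug: same here
}

func main() {
	// The values the job configuration requested.
	o := Options{Timeout: 4 * time.Hour, GracePeriod: time.Minute}
	o.complete()
	// Prints the internal defaults (2h0m0s 15s), not the configured values.
	fmt.Println(o.Timeout, o.GracePeriod)
}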
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758