Description of problem:

The release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7 job is not producing results: after the initial e2e-aws-upgrade container setup pod completes, the job hits a 4-hour timeout and is aborted.

Example job and relevant output line:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1321915308318920704

  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-10-30T00:42:43Z"}

There are no pods/nodes artifacts to help with debugging.

This also appears to be a problem with 4.6, but not really with 4.5:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.5-to-4.6
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5

Periodically (maybe 5 in 60 runs) the timeout is not hit. Here is an example for 4.7:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1319377461326123008

Version-Release number of selected component (if applicable):

4.6 and 4.7

How reproducible:

90% with the prow jobs
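For anyone else poking at runs like this, a quick way to see what (little) an aborted run actually uploaded is to list its GCS prefix directly. This is only a sketch: it assumes the origin-ci-test bucket allows anonymous listing via the GCS JSON API and that curl and jq are available locally.

  # List whatever the aborted run managed to upload under artifacts/
  # (assumes anonymous reads are allowed on the public origin-ci-test bucket).
  curl -s "https://storage.googleapis.com/storage/v1/b/origin-ci-test/o?prefix=logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1321915308318920704/artifacts/" \
    | jq -r '.items[]?.name'

An empty result here is consistent with the "no pods/nodes artifacts" observation above.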
This is a test hitting a timeout in test code; there is no infrastructure issue here.
I was going to look for some clues by running the job locally, but now, all of a sudden, the job is passing: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7
This is still a problem. It seems DPTP is converting these template jobs to multi-step, and once that happens we may end up with a job that gives us artifacts we can use to debug this. Re-opening for now with the Unknown component; once we have the ability to debug with logs, I can assign it to the appropriate component.
I think because of https://github.com/openshift/ci-tools/pull/1436 we are now collecting the junit file [0], so we have a little more info to go by, but not much. It seems that the test cases are all passing, but we are still hitting the 4h prow timeout. In the build log [1], you can see this:

  2020/11/30 01:22:59 error: could not wait for pod 'e2e-aws-upgrade': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)

I doubt this is enough info to debug any further.

[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/artifacts/junit_operator.xml
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/build-log.txt
It's good to link the job detail page in case folks want to look at other artifacts; for the job discussed in comment 5, that is [1]. Excerpting more of the build log [2]:

  2020/11/30 01:06:24 Container teardown in pod e2e-aws-upgrade completed successfully
  {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-30T01:07:54Z"}
  time="2020-11-30T01:07:54Z" level=info msg="Received signal." signal=interrupt
  2020/11/30 01:07:54 error: Process interrupted with signal interrupt, cancelling execution...
  ...
  2020/11/30 01:22:59 error: could not wait for pod 'e2e-aws-upgrade': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
  ...
  time="2020-11-30T01:22:59Z" level=error msg="Waiting for container artifacts in pod e2e-aws-upgrade in namespace ci-op-sktzjm8c" error="pods \"e2e-aws-upgrade\" not found"
  ...
  time="2020-11-30T01:27:59Z" level=error msg="Waiting for container artifacts in pod e2e-aws-upgrade in namespace ci-op-sktzjm8c" error="pods \"e2e-aws-upgrade\" not found"
  2020/11/30 01:27:59 error: unable to signal to artifacts container to terminate in pod e2e-aws-upgrade, could not run remote command: pods "e2e-aws-upgrade" is forbidden: pods "e2e-aws-upgrade" not found
  2020/11/30 01:27:59 error: artifacts container for pod e2e-aws-upgrade unready: timed out waiting for the condition
  2020/11/30 01:27:59 error: could not wait for pod 'e2e-aws-upgrade': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
  2020/11/30 01:27:59 Releasing leases for "e2e-aws-upgrade"
  2020/11/30 01:27:59 Releasing lease "b6968552-9b27-4cf6-b325-478162a4604a" for "aws-quota-slice"
  time="2020-11-30T01:27:59Z" level=error msg="Waiting for container artifacts in pod e2e-aws-upgrade in namespace ci-op-sktzjm8c" error="pods \"e2e-aws-upgrade\" not found"
  2020/11/30 01:27:59 No custom metadata found and prow metadata already exists. Not updating the metadata.
  2020/11/30 01:27:59 Ran for 4h19m46s
  error: some steps failed:
    * could not run steps: execution cancelled
    * could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: pod was deleted while ci-operator step was waiting for it
  {"component":"entrypoint","file":"prow/entrypoint/run.go:247","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process gracefully exited before 30m0s grace period","severity":"error","time":"2020-11-30T01:27:59Z"}

So it is still not clear to me where that artifacts container went (and figuring that out is probably still a DPTP problem). The job is also still not multi-step; transitioning to multi-step might help reduce the impact of a single artifacts container being lost (or maybe not, if all the artifacts are batched up and pushed after the job completes, instead of being pushed up as each step completes). I don't know whether there are existing DPTP tickets around the lost artifacts container, but they definitely have a ticket about getting jobs ported from the outgoing template framework to the newer multi-step framework. I don't think the updates team is going to sink much work into debugging the rollbacks until we get them transitioned to multi-step or artifact collection is otherwise recovered; adding UpcomingSprint.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/build-log.txt
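If anyone wants to reproduce the excerpt above from the raw build log (or compare against another run), something like the following should work against the [2] link; it just assumes curl and grep are available.

  # Pull the raw build log and pick out the timeout and lost-pod lines;
  # swap in another run's build id in the URL to compare runs.
  curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/build-log.txt \
    | grep -E '4h0m0s timeout|no longer present|artifacts container'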
> So still not clear to me (and probably still a DPTP problem) to figure out where that artifacts container went. And the job is still not multi-step;
> transitioning to multi-step might help reduce the impact of a single artifacts container being lost (or maybe not, if all the artifacts are batched
> up and pushed after the job completes, instead of being pushed up as each step completes). I dunno if there are existing DPTP tickets around the lost
> artifacts container or not. They definitely have a ticket about getting jobs ported from the outgoing template framework to the newer multi-step
> framework. I don't think the updates-team is going to sink much work into debugging the rollbacks until we get them transitioned to multi-step or the
> artifact collection is otherwise recovered; adding UpcomingSprint.

I'm working on this here: https://github.com/openshift/release/pull/13957

I had to wait for https://github.com/openshift/ci-tools/pull/1430 to be merged so that the GitHub tests would run the pj-rehearse job and kick off the upgrade-rollback job. But now I'm still stuck on a problem there. I'll keep after it, and we can see whether using multi-step helps us any more or not.
We are still not sure what caused this, hence adding blocker-. We need to review the CI again to see whether this is a blocker.
I was able to move this job to multi-step, and I have a failing run here which does collect more artifacts:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096

It looks like 4.6 was installed successfully:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/ipi-install-install-stableinitial/.openshift_install.log

I do see a failure in the test.log for that container step, but I'm not sure it's relevant:

  2020/12/11 14:00:22 Failed to upload $KUBECONFIG: timed out waiting for the condition: stat /tmp/secret/kubeconfig: no such file or directory

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/ipi-install-install-stableinitial/container-logs/test.log

I will update more as I dig, but at this point I think the job doesn't even get to a happy 4.6 state, much less try to upgrade to 4.7 and then roll back.
With help from @wking, I see now that this *is* failing on the rollback from 4.7 to 4.6. In the ClusterVersion log, you can see this:

  "history": [
    {
      "completionTime": null,
      "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
      "startedTime": "2020-12-11T15:06:56Z",
      "state": "Partial",
      "verified": false,
      "version": "4.6.8"
    },
    {
      "completionTime": "2020-12-11T15:06:56Z",
      "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:91bd93a846989e40440c7ac93f566c4cb9bdd13878efbad829424dae34b091bd",
      "startedTime": "2020-12-11T14:06:22Z",
      "state": "Partial",
      "verified": false,
      "version": "4.7.0-0.ci-2020-12-10-144849"
    },
    {
      "completionTime": "2020-12-11T14:00:22Z",
      "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
      "startedTime": "2020-12-11T13:29:28Z",
      "state": "Completed",
      "verified": false,
      "version": "4.6.8"
    }
  ],

Log file: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json
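For anyone following along, the history above can be pulled straight out of the gathered clusterversion.json with jq. This is just a sketch: it assumes the gather-extra artifact is the usual 'oc get clusterversion -o json' output (either a bare ClusterVersion object or a v1 List wrapper), and it uses the direct GCS form of the gcsweb link above.

  # Dump the update history from the gathered clusterversion.json;
  # (.items[0] // .) handles both a v1 List wrapper and a bare object.
  curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json \
    | jq '(.items[0] // .).status.history'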
And in the same ClusterVersion log file from comment #11, there is this:

  {
    "lastTransitionTime": "2020-12-11T16:17:03Z",
    "message": "Could not update prometheusrule \"openshift-cluster-version/cluster-version-operator\" (9 of 617): the server is reporting an internal error",
    "reason": "UpdatePayloadClusterError",
    "status": "True",
    "type": "Failing"
  },
  {
    "lastTransitionTime": "2020-12-11T14:06:22Z",
    "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
    "reason": "UpdatePayloadClusterError",
    "status": "True",
    "type": "Progressing"
  }

Quoting https://coreos.slack.com/archives/C01CQA76KMX/p1607715026278500?thread_ts=1607713778.276800&cid=C01CQA76KMX :

> presumably someone added a prometheusrule manifest for the CVO to push in 4.7, but when we move back to 4.6, the CVO tries to push it into the rolled-back-to-4.6-already API server, and the API-server rejects us with "what is this?"
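The conditions above came out of the same gathered file; for completeness, here is the matching jq filter, under the same assumptions as the sketch in comment #11.

  # Show the Failing/Progressing conditions that report the rejected
  # prometheusrule push after the rollback.
  curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json \
    | jq '(.items[0] // .).status.conditions[] | select(.type == "Failing" or .type == "Progressing")'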
Closing this, as it was really tracking the fact that our upgrade-rollback jobs were not collecting any artifacts and the job was timing out with no data. That was fixed by the following PRs, which moved the job to multi-step:

https://github.com/openshift/release/pull/14184
https://github.com/openshift/release/pull/14065
https://github.com/openshift/release/pull/13957

A new bug has been filed to track the problem with updating the prometheusrule when rolling back from 4.7 to 4.6:

https://bugzilla.redhat.com/show_bug.cgi?id=1906936