Bug 1893348 - upgrade-rollback job timing out after initial e2e-aws-upgrade container setup pod completes
Summary: upgrade-rollback job timing out after initial e2e-aws-upgrade container setup...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Over the Air Updates
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-30 19:25 UTC by jamo luhrsen
Modified: 2022-05-06 12:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-11 20:46:20 UTC
Target Upstream Version:
Embargoed:



Description jamo luhrsen 2020-10-30 19:25:00 UTC
Description of problem:

The release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7 job is not producing
results: after the initial e2e-aws-upgrade container setup pod completes, a 4-hour
timeout is hit and the job is aborted.

Example job and relevant output line:
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1321915308318920704

{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-10-30T00:42:43Z"}

There are no pods/nodes artifacts to help with debugging.

This also appears to be a problem for the 4.5-to-4.6 job, but not really for the 4.4-to-4.5 job:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.5-to-4.6
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.5-informing#release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.4-to-4.5

Occasionally (maybe 5 in 60 runs) the timeout is not hit. Here is an example for 4.6-to-4.7:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1319377461326123008

Version-Release number of selected component (if applicable):

4.6 and 4.7


How reproducible:

90% with the prow jobs

Comment 1 Steve Kuznetsov 2020-11-02 16:30:04 UTC
This is a test hitting a timeout in test code; there is no infrastructure issue here.

Comment 2 jamo luhrsen 2020-11-02 22:11:31 UTC
I was going to look for some clues by running the job locally, but now the job is suddenly passing:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7

Comment 3 jamo luhrsen 2020-11-04 18:26:41 UTC
This is still a problem. It seems that DPTP is converting these template jobs to multi-step, and once that happens we may end up with
a job that gives us artifacts we can use to debug this. Re-opening for now with the Unknown component; once we have the ability
to debug with logs, I can assign it to the appropriate component.

Comment 5 jamo luhrsen 2020-11-30 23:10:20 UTC
I think that because of https://github.com/openshift/ci-tools/pull/1436 we are now collecting the junit file [0], so
we have a little more info to go by, but not much.

It seems that the test cases are all passing but we are still hitting the 4h prow timeout.

In the build log [1], you can see this:
  2020/11/30 01:22:59 error: could not wait for pod 'e2e-aws-upgrade': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)

I doubt this is enough info to debug any further.



[0] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/artifacts/junit_operator.xml
[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/build-log.txt

Comment 6 W. Trevor King 2020-12-01 22:34:44 UTC
It's good to link the job detail page in case folks want to look at other artifacts.  For the job discussed in comment 5, that is [1].  Excerpting more of the build log [2]:

2020/11/30 01:06:24 Container teardown in pod e2e-aws-upgrade completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-11-30T01:07:54Z"}
time="2020-11-30T01:07:54Z" level=info msg="Received signal." signal=interrupt
2020/11/30 01:07:54 error: Process interrupted with signal interrupt, cancelling execution...
...
2020/11/30 01:22:59 error: could not wait for pod 'e2e-aws-upgrade': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
...
time="2020-11-30T01:22:59Z" level=error msg="Waiting for container artifacts in pod e2e-aws-upgrade in namespace ci-op-sktzjm8c" error="pods \"e2e-aws-upgrade\" not found"
...
time="2020-11-30T01:27:59Z" level=error msg="Waiting for container artifacts in pod e2e-aws-upgrade in namespace ci-op-sktzjm8c" error="pods \"e2e-aws-upgrade\" not found"
2020/11/30 01:27:59 error: unable to signal to artifacts container to terminate in pod e2e-aws-upgrade, could not run remote command: pods "e2e-aws-upgrade" is forbidden: pods "e2e-aws-upgrade" not found
2020/11/30 01:27:59 error: artifacts container for pod e2e-aws-upgrade unready: timed out waiting for the condition
2020/11/30 01:27:59 error: could not wait for pod 'e2e-aws-upgrade': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
2020/11/30 01:27:59 Releasing leases for "e2e-aws-upgrade"
2020/11/30 01:27:59 Releasing lease "b6968552-9b27-4cf6-b325-478162a4604a" for "aws-quota-slice"
time="2020-11-30T01:27:59Z" level=error msg="Waiting for container artifacts in pod e2e-aws-upgrade in namespace ci-op-sktzjm8c" error="pods \"e2e-aws-upgrade\" not found"
2020/11/30 01:27:59 No custom metadata found and prow metadata already exists. Not updating the metadata.
2020/11/30 01:27:59 Ran for 4h19m46s
error: some steps failed:
  * could not run steps: execution cancelled
  * could not run steps: step e2e-aws-upgrade failed: template pod "e2e-aws-upgrade" failed: pod was deleted while ci-operator step was waiting for it
{"component":"entrypoint","file":"prow/entrypoint/run.go:247","func":"k8s.io/test-infra/prow/entrypoint.gracefullyTerminate","level":"error","msg":"Process gracefully exited before 30m0s grace period","severity":"error","time":"2020-11-30T01:27:59Z"}

So it's still not clear to me (and probably still a DPTP problem to figure out) where that artifacts container went.  And the job is still not multi-step; transitioning to multi-step might help reduce the impact of a single artifacts container being lost (or maybe not, if all the artifacts are batched up and pushed after the job completes, instead of being pushed up as each step completes).  I don't know if there are existing DPTP tickets around the lost artifacts container or not.  They definitely have a ticket about getting jobs ported from the outgoing template framework to the newer multi-step framework.  I don't think the updates team is going to sink much work into debugging the rollbacks until we get them transitioned to multi-step or the artifact collection is otherwise recovered; adding UpcomingSprint.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800
[2]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1333155690608332800/build-log.txt

Comment 8 jamo luhrsen 2020-12-01 22:42:29 UTC
> So it's still not clear to me (and probably still a DPTP problem to figure out) where that artifacts container went.  And the job is still not multi-step;
> transitioning to multi-step might help reduce the impact of a single artifacts container being lost (or maybe not, if all the artifacts are batched
> up and pushed after the job completes, instead of being pushed up as each step completes).  I don't know if there are existing DPTP tickets around the lost
> artifacts container or not.  They definitely have a ticket about getting jobs ported from the outgoing template framework to the newer multi-step
> framework.  I don't think the updates team is going to sink much work into debugging the rollbacks until we get them transitioned to multi-step or the
> artifact collection is otherwise recovered; adding UpcomingSprint.

I'm working on this here:
  https://github.com/openshift/release/pull/13957

I had to wait for https://github.com/openshift/ci-tools/pull/1430 to be merged so I could get the GitHub tests to run the pj-rehearse job and kick off the
upgrade-rollback job. But now I'm still stuck with a problem there. I'll keep after it, and we can see whether using multi-step helps us any more or
not.
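
For anyone following along, a multi-step job is declared through ci-operator's step registry instead of the old e2e template. Roughly, the test stanza in the ci-operator config ends up looking like the sketch below; the workflow name here is hypothetical, and the real stanza is in the PR linked above.

  # Sketch of a registry-based (multi-step) ci-operator test; the workflow
  # name is a placeholder, see the linked openshift/release PR for the real one.
  tests:
  - as: e2e-aws-upgrade-rollback
    steps:
      cluster_profile: aws
      workflow: openshift-upgrade-aws-rollback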

Comment 9 Lalatendu Mohanty 2020-12-11 15:27:36 UTC
We are still not sure what caused this, hence adding blocker-. We need to review the CI again to see if this is a blocker.

Comment 10 jamo luhrsen 2020-12-11 19:06:54 UTC
I was able to move this job to multi-step and have a failing run here which does collect more artifacts:
  https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096

It looks like 4.6 installed successfully:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/ipi-install-install-stableinitial/.openshift_install.log

I do see a failure in the test.log for that container step (2020/12/11 14:00:22 Failed to upload $KUBECONFIG: timed out waiting for the condition: stat /tmp/secret/kubeconfig: no such file or directory), but I'm not sure whether it's relevant:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/ipi-install-install-stableinitial/container-logs/test.log

I will update more as I dig, but at this point I think the job doesn't even get to a happy 4.6 state, much less try to upgrade to 4.7 and then roll back.

Comment 11 jamo luhrsen 2020-12-11 19:26:08 UTC
With help from @wking, I see now that this *is* failing on the rollback from 4.7 to 4.6. In the cluster version log, you can see this:

      "history": [
                    {
                        "completionTime": null,
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T15:06:56Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.6.8"
                    },
                    {
                        "completionTime": "2020-12-11T15:06:56Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:91bd93a846989e40440c7ac93f566c4cb9bdd13878efbad829424dae34b091bd",
                        "startedTime": "2020-12-11T14:06:22Z",
                        "state": "Partial",
                        "verified": false,
                        "version": "4.7.0-0.ci-2020-12-10-144849"
                    },
                    {
                        "completionTime": "2020-12-11T14:00:22Z",
                        "image": "registry.svc.ci.openshift.org/ci-op-fj8z0c6x/release@sha256:6ddbf56b7f9776c0498f23a54b65a06b3b846c1012200c5609c4bb716b6bdcdf",
                        "startedTime": "2020-12-11T13:29:28Z",
                        "state": "Completed",
                        "verified": false,
                        "version": "4.6.8"
                    }
                ],

log file:
  https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.6-to-4.7/1337381727240196096/artifacts/e2e-aws-upgrade/gather-extra/clusterversion.json

Comment 12 jamo luhrsen 2020-12-11 19:34:12 UTC
In the same cluster version log file from comment #11, there is also this:

                    {
                        "lastTransitionTime": "2020-12-11T16:17:03Z",
                        "message": "Could not update prometheusrule \"openshift-cluster-version/cluster-version-operator\" (9 of 617): the server is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2020-12-11T14:06:22Z",
                        "message": "Unable to apply 4.6.8: the control plane is reporting an internal error",
                        "reason": "UpdatePayloadClusterError",
                        "status": "True",
                        "type": "Progressing"
                    }

quote from https://coreos.slack.com/archives/C01CQA76KMX/p1607715026278500?thread_ts=1607713778.276800&cid=C01CQA76KMX
  presumably someone added a prometheusrule manifest for the CVO to push in 4.7, but when we move back to 4.6, the CVO tries to push it into the rolled-back-to-4.6-already API server,
  and the API-server rejects us with "what is this?"
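
For context, the prometheusrule in that message is a monitoring.coreos.com custom resource that the CVO applies like any other manifest in the release payload (hence the "9 of 617"). A minimal sketch of the general shape of such a manifest (the alert below is a placeholder, not the actual content of the 4.7 payload manifest):

  # Illustrative shape of a PrometheusRule manifest like
  # "openshift-cluster-version/cluster-version-operator"; the rule content is
  # a placeholder, not taken from the 4.7 release payload.
  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: cluster-version-operator
    namespace: openshift-cluster-version
  spec:
    groups:
    - name: cluster-version
      rules:
      - alert: ExamplePlaceholderAlert
        expr: vector(1)
        labels:
          severity: info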

Comment 13 jamo luhrsen 2020-12-11 20:46:20 UTC
Closing this, as it was really tracking the fact that our upgrade-rollback jobs were not collecting any artifacts
and were timing out with no data. That was fixed by the following PRs, which moved the job to multi-step:

  https://github.com/openshift/release/pull/14184
  https://github.com/openshift/release/pull/14065
  https://github.com/openshift/release/pull/13957

A new bug has been filed to track the problem with updating the prometheusrule when rolling back from 4.7 to 4.6:
  https://bugzilla.redhat.com/show_bug.cgi?id=1906936

