Bug 2008313 - Upgrade rollback jobs are regularly timing out
Summary: Upgrade rollback jobs are regularly timing out
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Framework
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Devan Goodwin
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-09-27 21:41 UTC by Stephen Benjamin
Modified: 2022-03-03 05:32 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback=all job=periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback=all job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-rollback=all job=periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback=all
Last Closed: 2022-03-03 05:32:46 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift release pull 26629 | 0 | None | Merged | ci-operator/config/openshift/release: Drop failing minor rollback tests | 2022-03-03 05:32:46 UTC

Description Stephen Benjamin 2021-09-27 21:41:42 UTC
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-rollback
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback

The 4.8 -> 4.9 rollback job has never passed the entire cycle. The 4.9 -> 4.10 job is now also permafailing. Someone needs to dig into these jobs; they always exceed 4 hours and time out.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1437862153771028480

Comment 2 Scott Dodson 2021-11-02 13:35:56 UTC
https://issues.redhat.com/browse/OTA-455 is planned for 4.10 so that we can support z-stream rollbacks in 4.10, though it may miss 4.10.

The purpose of these jobs is mostly to ensure that rollbacks don't get too far out of reach by ensuring that teams resolve bugs filed as a result of the job failures. We've definitely had bugs filed and resolved as a result of these jobs in the past across both z-stream and y-stream rollbacks. If we need to disable some jobs, let's leave in the 4.10 z-stream rollback jobs, as that's specifically identified feature work for 4.10. If it's possible to just remove the noise for TRT, then let's leave the rest in as infrequent (72 hours sounds good) periodics so that there's some historical data for teams to reference if and when they debug any issues.

Comment 3 W. Trevor King 2021-11-03 04:44:43 UTC
I've just rerolled https://github.com/openshift/release/pull/22289 to pull in Vadim's new-ish config knob, and that should raise the timeouts for the 4.9 -> 4.10 -> 4.9 rollback step from 3h to 4h (also raises the test step's timeout for all the other jobs that use the test step, as discussed in the PR).

Comment 5 Vadim Rutkovsky 2021-12-06 15:13:39 UTC
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-upgrade-rollback is looking pretty good - 2 failures out of the last 5 runs (etcdHighNumberOfLeaderChanges and pathological event repeat).

Last run analysis:
* periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback:
  * https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1466857653371146240:
    * disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress using reused connections
    * disruption_tests: [sig-network-edge] Console remains available via cluster frontend ingress using new connections
    * disruption_tests: [sig-network-edge] Console remains available via cluster frontend ingress using reused connections
  * https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1467582628675719168:
    * disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress using new connections
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress using reused connections
  * https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1466132822581317632:
    * disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged
      (this has been resolved by https://github.com/openshift/origin/pull/26649)

Perhaps we should be running rollback tests without disruption testing?
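
The run analysis above can be tallied to see which kinds of tests dominate the failures and which ones recur across runs. The following is only a minimal Python sketch: the run IDs and test names are copied from the list above, while the `FAILURES` dictionary and the helper names `tally_by_tag` and `recurring` are illustrative assumptions, not part of any CI tooling.

```python
from collections import Counter

# Failing tests per run, copied from the analysis above.
# Keys are the Prow run IDs from the listed URLs.
FAILURES = {
    "1466857653371146240": [
        "[sig-arch] Check if alerts are firing during or after upgrade",
        "[sig-network-edge] OAuth remains available via cluster frontend ingress",
        "[sig-network-edge] OAuth remains available via cluster frontend ingress using reused connections",
        "[sig-network-edge] Console remains available via cluster frontend ingress using new connections",
        "[sig-network-edge] Console remains available via cluster frontend ingress using reused connections",
    ],
    "1467582628675719168": [
        "[sig-arch] Check if alerts are firing during or after upgrade success",
        "[sig-network-edge] OAuth remains available via cluster frontend ingress using new connections",
        "[sig-network-edge] OAuth remains available via cluster frontend ingress using reused connections",
    ],
    "1466132822581317632": [
        "[bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged",
    ],
}


def tally_by_tag(failures):
    """Count failures per leading [tag] across all runs (hypothetical helper)."""
    counts = Counter()
    for tests in failures.values():
        for name in tests:
            tag = name[: name.index("]") + 1]
            counts[tag] += 1
    return counts


def recurring(failures):
    """Return test names that failed in more than one run (hypothetical helper)."""
    counts = Counter(name for tests in failures.values() for name in set(tests))
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    for tag, n in tally_by_tag(FAILURES).most_common():
        print(n, tag)
    print("recurring:", recurring(FAILURES))
```

In this sample, six of the nine listed failures carry the [sig-network-edge] tag, and only one test name repeats verbatim across runs.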

periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback has all kinds of failures that seem related to kube-api. Let's file a separate bug for this?

Comment 6 W. Trevor King 2022-02-28 21:25:09 UTC
I've floated [1] dropping all the 4.(y-1) -> 4.y -> 4.(y-1) rollback jobs for 4.10 and earlier, because we don't support rollbacks at all [2], and fixing these minor rollbacks seems to be outpacing our CI-investigation budget.  Still waiting to see what other folks think; maybe they will decide that this is worth a more detailed investigation.  And we can always restore the jobs from Git history if the decision is "not worth investigating" now, but pivots to "actually, it is worth investigating" later.

[1]: https://github.com/openshift/release/pull/26629
[2]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162

Comment 8 W. Trevor King 2022-03-03 05:32:46 UTC
[1] landed, and now the failing jobs are gone.  We still have 4.10 -> 4.11 -> 4.10, because that's still passing, and we can always pull the older minor-rollback jobs back out of Git history if we change priorities.

[1]: https://github.com/openshift/release/pull/26629

