periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-rollback
periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-ovn-upgrade-rollback

The 4.8 -> 4.9 rollback job has never passed this entire cycle, and the 4.9 -> 4.10 job is now also permafailing. Someone needs to dig into these jobs, but they're consistently exceeding 4 hours and timing out. Example: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1437862153771028480
https://issues.redhat.com/browse/OTA-455 tracks our support for z-stream rollbacks in 4.10, though it may miss 4.10. The purpose of these jobs is mostly to keep rollbacks from drifting too far out of reach, by ensuring that teams resolve bugs filed as a result of job failures. We've definitely had bugs filed and resolved as a result of these jobs in the past, across both z-stream and y-stream rollbacks. If we need to disable some jobs, let's keep the 4.10 z-stream rollback jobs, since that's specifically identified feature work for 4.10. If it's possible to just remove the noise for TRT, then let's leave the rest in as infrequent periodics (72h sounds good), so that there's some historic data for teams to reference if and when they debug any issues.
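If we do demote the remaining jobs to infrequent periodics, the change is small. A hedged sketch of what that could look like in an openshift/release Prow periodics file (only the job name and the 72h interval come from this thread; the file location and surrounding fields are illustrative):

```yaml
# Hypothetical excerpt from a Prow periodics config in openshift/release.
# The job name and 72h interval are from this thread; everything else is
# illustrative of the periodic-job shape, not a verbatim copy of the repo.
periodics:
- name: periodic-ci-openshift-release-master-ci-4.9-upgrade-from-stable-4.8-e2e-aws-upgrade-rollback
  interval: 72h  # run every 3 days: keeps historic data without daily noise for TRT
```

Prow accepts either `interval` or `cron` on a periodic; `interval` is the simpler fit for "roughly every 72 hours" since we don't care about the exact start time.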
I've just rerolled https://github.com/openshift/release/pull/22289 to pull in Vadim's new-ish config knob, and that should raise the timeouts for the 4.9 -> 4.10 -> 4.9 rollback step from 3h to 4h (also raises the test step's timeout for all the other jobs that use the test step, as discussed in the PR).
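For context on where that knob lives: ci-operator registry steps accept a `timeout` duration, so raising it is a one-line change on the shared test step. A hedged sketch of the shape (the step name, previous value, and surrounding fields are assumptions based on this thread, not a verbatim copy of PR 22289):

```yaml
# Illustrative ci-operator step-registry ref raising a test step's timeout.
# Field names follow the registry's ref schema; values here are assumptions.
ref:
  as: openshift-e2e-test
  from: tests
  commands: openshift-e2e-test-commands.sh
  timeout: 4h0m0s  # was 3h; note this applies to every job sharing this step
  resources:
    requests:
      cpu: "3"
      memory: 600Mi
```

This is also why the PR discussion flagged the blast radius: a registry-level timeout bump affects all consumers of the step, not just the rollback jobs.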
Looks like we're in the same state as before, basically a 0% pass rate: https://sippy.ci.openshift.org/sippy-ng/jobs/4.10?filters=%7B%22items%22%3A%5B%7B%22id%22%3A1%2C%22columnField%22%3A%22%22%2C%22operatorValue%22%3A%22%22%2C%22value%22%3A%22%22%7D%2C%7B%22id%22%3A99%2C%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22contains%22%2C%22value%22%3A%22rollback%22%7D%5D%7D Vadim, are you still able to look into this? If so, we'll update the assignee.
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-upgrade-rollback is looking pretty good: 2 failures out of the last 5 runs (etcdHighNumberOfLeaderChanges and a pathological event repeat).

Last run analysis:

* periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback:
  * https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1466857653371146240:
    * disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress using reused connections
    * disruption_tests: [sig-network-edge] Console remains available via cluster frontend ingress using new connections
    * disruption_tests: [sig-network-edge] Console remains available via cluster frontend ingress using reused connections
  * https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1467582628675719168:
    * disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress using new connections
    * disruption_tests: [sig-network-edge] OAuth remains available via cluster frontend ingress using reused connections
  * https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback/1466132822581317632:
    * disruption_tests: [bz-Cluster Version Operator] Verify presence of admin ack gate blocks upgrade until acknowledged (this has been resolved by https://github.com/openshift/origin/pull/26649)

Perhaps we should be running rollback tests without disruption testing?
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback has all kinds of failures, which seem related to kube-api. Let's file a separate bug for this?
I've floated [1] dropping all the 4.(y-1) -> 4.y -> 4.(y-1) rollback jobs for 4.10 and earlier, because we don't support minor-version rollbacks at all [2], and the cost of fixing these jobs seems to be outpacing our CI-investigation budget. Still waiting to see what other folks think; maybe they will decide that this is worth a more detailed investigation. And we can always restore the jobs from Git history if the decision is "not worth investigating" now but pivots to "actually, it is worth investigating" later.

[1]: https://github.com/openshift/release/pull/26629
[2]: https://github.com/openshift/openshift-docs/blame/d4762f0f626a4dddb9d7330e63a3bb6cb73f5bb5/modules/update-upgrading-cli.adoc#L160-L162
[1] landed, and now the failing jobs are gone. We still have 4.10 -> 4.11 -> 4.10, because that's still passing, and we can always pull the older minor-rollback jobs back out of Git history if we change priorities.

[1]: https://github.com/openshift/release/pull/26629
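Since the fallback plan leans on restoring deleted configs from Git history, here's a minimal, self-contained sketch of that workflow (the throwaway repo and `rollback-job.yaml` file are invented for the demo; in openshift/release the real files live under the ci-operator directories):

```shell
set -e
# Build a throwaway repo that adds and then deletes a job config,
# mimicking what the cleanup PR did to the rollback jobs.
cd "$(mktemp -d)"
git init -q .
echo "job: rollback" > rollback-job.yaml
git add rollback-job.yaml
git -c user.email=ci@example.com -c user.name=ci commit -q -m "add rollback job"
git rm -q rollback-job.yaml
git -c user.email=ci@example.com -c user.name=ci commit -q -m "drop rollback job"

# Find the commit that deleted the file...
del=$(git log --diff-filter=D --format=%H -- rollback-job.yaml | head -n1)
# ...and restore the file from that commit's parent into the working tree.
git checkout -q "$del"^ -- rollback-job.yaml
cat rollback-job.yaml
```

`git log --diff-filter=D -- <path>` is the quickest way to locate the deleting commit when you don't remember the PR; from there, `git checkout <commit>^ -- <path>` stages the old content for a restore PR.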