Description of problem:

Cluster upgrades suffer from two types of problems starting from 4.9 (4.9 to 4.10, 4.10 to 4.11):

1. Cluster upgrades are observed to take a longer time (more than 75 mins).
2. Cluster instability and API timeouts of primary operators are observed, which fail the upgrade tests:
   - Kubernetes API timeout
   - OpenShift API timeout
   - OAuth API timeout
   - Console becomes unavailable
   - Image-registry becomes unavailable

Version-Release number of selected component (if applicable): 4.10

How reproducible: Upgrading an existing OCP cluster to a recent 4.10 version

Steps to Reproduce: -

Actual results: API timeouts occur and the cluster upgrade takes a long time; the cluster becomes unstable during the upgrade.

Expected results: The cluster upgrade should complete smoothly within the stipulated time.

Additional info: Observed initially in the CI environment. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1488588046591856640
Moving this bug back to the Multi-Arch component as Axel is investigating this bug. Hi Axel, since this bug is categorized as "High", we need to determine if it is a blocker for the 4.10 release. Do you think this bug is a blocker for the current release, or are we okay with fixing it post 4.10 GA?
Still investigating. I'll post an update as soon as I have some results.
I did some tests on z/VM in a non-overcommitment scenario:
- Update from 4.9.5 to 4.9.17, which took about 75 min
- Update from 4.9.17 to 4.10.0-fc.2, which took about 70 min

During both updates and afterwards the cluster remained stable. I will do another test on zKVM to double-check the results.
Thank you for doing the additional testing, Axel. Since this bug is a "High" severity, if you could provide the additional result soon, it would be very helpful for us as a team to determine whether this bug is a release blocker.
I did some additional tests, starting with z/VM again:
- Update from 4.9.5 to 4.10.0-fc.1 (without going through the latest 4.9.x) took 63 minutes

On zKVM:
- Update from 4.9.19 to 4.10.0-rc.1 took 56 minutes

I could not observe any instability. Masters and workers stayed online (except during the one-by-one update process), so I did not observe any problems with the update process. However, errors/problems could still occur under strong overcommitment, since the update process consumes comparatively many resources; my tests ran in a CPU-dedicated environment.
*** Bug 2049749 has been marked as a duplicate of this bug. ***
Setting "Blocker-" for this bug. Reasons:
1. Per Axel's test results in Comment 6, the instability/slowness is not observed under every condition.
2. Similar 4.10 upgrade performance bugs (e.g. BZ 2034367 and BZ 2047828) do not seem to block 4.10 GA.
Re-assigning this to Tori for additional investigation. Hi Tori - feel free to add any information here.
Setting the priority to "high" based on an existing conversation with Tori, as she is currently conducting additional investigation. Feel free to change the priority if that assumption is incorrect.
I did some tests on x86/AWS:
- Update from 4.9.15 -> 4.10.0-0 took 71 minutes
- Update from 4.9.15 -> 4.10.0-0 took 76 minutes with instance type m4.2xlarge

I also did not observe any instability. However, I will check x86 on libvirt next week.
A question raised by the instability of the CI test suite in this context: in CI, the cluster upgrade time limit was set to 90 minutes in older versions (4.7 to 4.8, 4.8 to 4.9). However, in newer versions (4.9 to 4.10, 4.10 to 4.11), it was reduced to 75 minutes, even though it is accepted that higher versions of the cluster are heavier by nature. How was this time limit value derived for the newer versions of the cluster on different platforms?

Up to 4.9 upgrades:
- [sig-cluster-lifecycle] cluster upgrade should complete in 90.00 minutes
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x/1491864610557399040

4.9 to 4.10 and 4.10 to 4.11 upgrades:
- [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1491124897424871424
[1] is the 4.11 test-case logic for determining that durationToSoftFailure cutoff. As you can see from the blame and history, the cutoff evolves as folks grant additional leeway for some slow cases while trying to keep the cutoff from growing across the board. Raising the cutoff for a given situation is easy and useful for short-term work; cranking the cap back down can be hard, but is better for customers who want faster updates.

[1]: https://github.com/openshift/origin/blame/093a24044d093241ac5416713170dff552479f07/test/e2e/upgrade/upgrade.go#L283-L326
Hi Tori, do you think you will continue to evaluate this bug after the end of the current sprint (Feb. 19th)? If so, I'd like to set the "reviewed-in-sprint" flag to indicate that this bug will continue to be evaluated in the next sprint.
Hi Dan, I think I'll be done by then.
Adding "reviewed-in-sprint", as Deep is OOTO and it is unlikely that this bug will be resolved before the end of the current sprint. Also, this bug is not a blocker bug.
Since Deep is OOTO this week and this bug is "Blocker-", I am keeping the "reviewed-in-sprint+" flag, as it is unlikely that this bug will get resolved this week.
Hi Deep, do you think this bug will be resolved before the end of the current sprint? If investigation of this bug will continue, can we set "reviewed-in-sprint"?
*** Bug 2047833 has been marked as a duplicate of this bug. ***
Given current infrastructure and network limitations, the upgrades take more time. We have increased the upgrade time limit from 75 min to 100 min, which should in theory reduce the failing tests. Verified from the QE end: the upgrade does finish within the expected time, and the above issue is only seen in CI.
We might revisit this bug in the future.