Bug 2049750
| Summary: | Cluster upgrades to 4.10 exhibit slowness and instability on s390x | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lakshmi Ravichandran <lakshmi.ravichandran1> |
| Component: | Multi-Arch | Assignee: | Deep Mistry <dmistry> |
| Multi-Arch sub component: | IBM P / Z | QA Contact: | Douglas Slavens <dslavens> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | abusch, aos-bugs, danili, dorzel, eparis, psundara, wking |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-04 13:06:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2055197 | | |
| Bug Blocks: | 2047833 | | |
Description
Lakshmi Ravichandran
2022-02-02 16:23:38 UTC
Moving this bug back to the Multi-Arch component as Axel is investigating this bug. Hi Axel, since this bug is categorized as "High", we need to determine if it is a blocker for the 4.10 release. Do you think that this bug is a blocker for the current release, or are we okay with fixing it post 4.10 GA?

Still investigating. I'll post an update as soon as I have some results.

I did some tests on z/VM in a non-over-commitment scenario:
- Update from 4.9.5 to 4.9.17, which took about 75 min
- Update from 4.9.17 to 4.10.0-fc.2, which took about 70 min

During both updates and afterwards the cluster remained stable. I will do another test on zKVM to double-check the results.

Thank you for doing the additional testing, Axel. Since this bug is a "High" severity, if you could provide the additional results soon, it would be very helpful for us as a team to determine whether this bug is a release blocker.

I did some additional tests, starting with z/VM again:
- Update from 4.9.5 to 4.10.0-fc.1 (without going through the latest 4.9.x) took 63 minutes

On zKVM:
- Update from 4.9.19 to 4.10.0-rc.1 took 56 minutes

I could not observe any instability. Masters and workers stayed online (except for the one-by-one update process), so I did not observe any problems with the update process. However, errors/problems could still occur under strong over-commitment, since the update process occupies comparatively many resources; my tests ran in a CPU-dedicated environment.

*** Bug 2049749 has been marked as a duplicate of this bug. ***

Setting "Blocker-" for this bug, the reasons being:
1. Per Axel's test results in Comment 6, the instability/slowness is not observed under every condition.
2. Similar 4.10 upgrade performance bugs (e.g. BZ 2034367 and BZ 2047828) do not seem to block 4.10 GA.

Re-assigning this to Tori for additional investigation. Hi Tori - feel free to add any information here.

Setting the priority to "high" based on the existing conversation with Tori (as she is currently doing additional investigation). Feel free to change the priority if that assumption is incorrect.

I did some tests on x86/AWS:
- Update from 4.9.15 -> 4.10.0-0 took 71 minutes
- Update from 4.9.15 -> 4.10.0-0 took 76 minutes with instance type m4.2xlarge

I also did not observe any instability. However, I will check x86 on libvirt next week.

A question that arises regarding the instability of the CI test suite in this context: in CI, the cluster upgrade time limit was set to 90 minutes in older versions (4.7 to 4.8 / 4.8 to 4.9), but in newer versions (4.9 to 4.10 / 4.10 to 4.11) it is reduced to 75 minutes, even though it is accepted that higher cluster versions are seemingly heavier by nature. How was this time limit derived for newer cluster versions on different platforms?

Up to 4.9 upgrades:
- [sig-cluster-lifecycle] cluster upgrade should complete in 90.00 minutes
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x/1491864610557399040

4.9 to 4.10 and 4.10 to 4.11 upgrades:
- [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1491124897424871424

[1] is the 4.11 test-case logic for determining that durationToSoftFailure cutoff.
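For illustration, the idea behind such a soft-failure cutoff can be sketched as follows. This is a minimal sketch, not the actual openshift/origin test logic linked in [1]: only the durationToSoftFailure name and the 75-minute figure come from this thread, while the function name, the hard limit, and its value are assumptions made for the example.

```go
// Minimal sketch of a soft-failure duration cutoff for a cluster-upgrade
// test. Illustrative only; not the openshift/origin implementation.
package main

import (
	"fmt"
	"time"
)

// evaluateUpgradeDuration (hypothetical) compares the measured upgrade time
// against a soft cutoff and a larger hard cutoff. Exceeding the soft cutoff
// only flags the run as too slow; exceeding the hard cutoff fails it outright.
func evaluateUpgradeDuration(elapsed time.Duration) (softFail, hardFail bool) {
	durationToSoftFailure := 75 * time.Minute  // cutoff discussed above for 4.9->4.10 / 4.10->4.11 jobs
	durationToHardFailure := 120 * time.Minute // assumed hard limit, purely for the example

	return elapsed > durationToSoftFailure, elapsed > durationToHardFailure
}

func main() {
	for _, elapsed := range []time.Duration{70 * time.Minute, 85 * time.Minute, 130 * time.Minute} {
		soft, hard := evaluateUpgradeDuration(elapsed)
		fmt.Printf("upgrade took %v: softFail=%v hardFail=%v\n", elapsed, soft, hard)
	}
}
```

The practical consequence is that raising the soft cutoff (as was eventually done below, to 100 minutes) only changes what CI reports as a slow or failing upgrade; it does not make the upgrade itself any faster.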
As you can see from the blame and history at [1], the cutoff evolves as folks grant additional leeway for some slow cases while trying to keep the cutoff from growing across the board. Raising the cutoff for a given situation is easy and useful for short-term work; cranking the cap back down can be hard, but is better for customers who want faster updates.

[1]: https://github.com/openshift/origin/blame/093a24044d093241ac5416713170dff552479f07/test/e2e/upgrade/upgrade.go#L283-L326

Hi Tori, do you think you will continue to evaluate this bug after the end of the current sprint (Feb. 19th)? If we plan to, then I'd like to set the "reviewed-in-sprint" flag to indicate that this bug will continue to be evaluated in the next sprint.

Hi Dan, I think I'll be done by then.

Adding "reviewed-in-sprint", as Deep is OOTO and it is unlikely that this bug will be resolved before the end of the current sprint. Also, this bug is not a blocker bug.

Since Deep is OOTO this week and this bug is a "Blocker-", I am keeping the "reviewed-in-sprint+" flag, as it is unlikely that this bug will get resolved this week.

Hi Deep, do you think this bug will be resolved before the end of the current sprint? If this bug will continue to be investigated, can we set "reviewed-in-sprint"?

*** Bug 2047833 has been marked as a duplicate of this bug. ***

Based on current infrastructure and network limitations, the upgrades take more time. We have increased the upgrade time limit from 75 min to 100 min, which should in theory reduce the failing tests. Verified from the QE end: the upgrade does seem to finish within the expected time, and the above issue is only seen in CI. We might revisit this bug in the future.
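As a side note on how the upgrade durations quoted in this thread can be measured: each update is recorded in the ClusterVersion object's status.history with started and completion timestamps (also visible via `oc get clusterversion version -o yaml`). A minimal sketch using the OpenShift config client, assuming standard kubeconfig loading and with error handling kept trivial, might look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	configv1client "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig the same way oc/kubectl do.
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		panic(err)
	}

	client, err := configv1client.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The cluster-wide version object is always named "version".
	cv, err := client.ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// status.history lists updates newest-first; print how long each completed one took.
	for _, h := range cv.Status.History {
		if h.CompletionTime == nil {
			fmt.Printf("%s: update still in progress\n", h.Version)
			continue
		}
		fmt.Printf("%s: completed in %v\n", h.Version,
			h.CompletionTime.Sub(h.StartedTime.Time).Round(time.Minute))
	}
}
```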