Bug 2049750
| Summary: | Cluster upgrades to 4.10 exhibit slowness and instability on s390x | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lakshmi Ravichandran <lakshmi.ravichandran1> |
| Component: | Multi-Arch | Assignee: | Deep Mistry <dmistry> |
| Multi-Arch sub component: | IBM P / Z | QA Contact: | Douglas Slavens <dslavens> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | abusch, aos-bugs, danili, dorzel, eparis, psundara, wking |
| Version: | 4.10 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | s390x | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-04 13:06:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2055197 | | |
| Bug Blocks: | 2047833 | | |
Description
Lakshmi Ravichandran
2022-02-02 16:23:38 UTC
Moving this bug back to the Multi-Arch component as Axel is investigating this bug. Hi Axel, since this bug is categorized as "High", we need to determine if it is a blocker for the 4.10 release. Do you think that this bug is a blocker for the current release, or are we okay with fixing it post 4.10 GA?

Still investigating. I'll post an update as soon as I have some results.

I did some tests on z/VM in a non-over-commitment scenario:
- Update from 4.9.5 to 4.9.17, which took about 75 min
- Update from 4.9.17 to 4.10.0-fc.2, which took about 70 min

During both updates and afterwards the cluster remained stable. I will do another test on zKVM to double-check the results.

Thank you for doing the additional testing, Axel. Since this bug is a "High" severity, if you could provide the additional results soon, it would be very helpful for us as a team to determine whether this bug is a release blocker.

I did some additional tests, starting with z/VM again:
- Update from 4.9.5 to 4.10.0-fc.1 (without going through the latest 4.9.x) took 63 minutes

On zKVM:
- Update from 4.9.19 to 4.10.0-rc.1 took 56 minutes

I could not observe any instability. Masters and workers stayed online (except for the one-by-one update process), so I did not observe any problems with the update process. However, errors/problems could still occur under strong over-commitment, since the update process occupies comparatively many resources; my tests ran in a CPU-dedicated environment.

*** Bug 2049749 has been marked as a duplicate of this bug. ***

Setting "Blocker-" for this bug, the reasons being:
1. Per Axel's test results in Comment 6, the instability/slowness is not observed under every condition.
2. Similar 4.10 upgrade performance bugs (e.g. BZ 2034367 and BZ 2047828) do not seem to block 4.10 GA.

Re-assigning this to Tori for additional investigation. Hi Tori - feel free to add any information here.

Setting the priority to "high" based on the existing conversation with Tori (as she is currently doing additional investigation). Feel free to change the priority if that assumption is incorrect.

I did some tests on x86/AWS:
- Update from 4.9.15 -> 4.10.0-0 took 71 minutes
- Update from 4.9.15 -> 4.10.0-0 took 76 minutes with instance type m4.2xlarge

I also did not observe any instability. However, I will check x86 on libvirt next week.

A question that arises regarding the instability of the CI test suite in this context: in CI, the cluster upgrade time limit was set to 90 minutes in older versions (4.7 to 4.8 / 4.8 to 4.9), but in newer versions (4.9 to 4.10 / 4.10 to 4.11) it is reduced to 75 minutes, even though it is accepted that higher cluster versions are seemingly heavier by nature. How was this time limit derived for newer cluster versions on different platforms?

Up to 4.9 upgrades:
- [sig-cluster-lifecycle] cluster upgrade should complete in 90.00 minutes
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x/1491864610557399040

4.9 to 4.10 and 4.10 to 4.11 upgrades:
- [sig-cluster-lifecycle] cluster upgrade should complete in 75.00 minutes
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.10-upgrade-from-nightly-4.9-ocp-remote-libvirt-s390x/1491124897424871424

[1] is the 4.11 test-case logic for determining that durationToSoftFailure cutoff.
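For illustration, the idea behind such a soft-failure cutoff can be sketched as follows. This is a minimal sketch, not the actual openshift/origin test logic linked in [1]: only the durationToSoftFailure name and the 75-minute figure come from this thread, while the function name, the hard limit, and its value are assumptions made for the example.

```go
// Minimal sketch of a soft-failure duration cutoff for a cluster-upgrade
// test. Illustrative only; not the openshift/origin implementation.
package main

import (
	"fmt"
	"time"
)

// evaluateUpgradeDuration (hypothetical) compares the measured upgrade time
// against a soft cutoff and a larger hard cutoff. Exceeding the soft cutoff
// only flags the run as too slow; exceeding the hard cutoff fails it outright.
func evaluateUpgradeDuration(elapsed time.Duration) (softFail, hardFail bool) {
	durationToSoftFailure := 75 * time.Minute  // cutoff discussed above for 4.9->4.10 / 4.10->4.11 jobs
	durationToHardFailure := 120 * time.Minute // assumed hard limit, purely for the example

	return elapsed > durationToSoftFailure, elapsed > durationToHardFailure
}

func main() {
	for _, elapsed := range []time.Duration{70 * time.Minute, 85 * time.Minute, 130 * time.Minute} {
		soft, hard := evaluateUpgradeDuration(elapsed)
		fmt.Printf("upgrade took %v: softFail=%v hardFail=%v\n", elapsed, soft, hard)
	}
}
```

The practical consequence is that raising the soft cutoff (as was eventually done below, to 100 minutes) only changes what CI reports as a slow or failing upgrade; it does not make the upgrade itself any faster.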
As you can see from the blame and history at [1], the cutoff evolves as folks grant additional leeway for some slow cases while trying to keep the cutoff from growing across the board. Raising the cutoff for a given situation is easy and useful for short-term work; cranking the cap back down can be hard, but is better for customers who want faster updates.

[1]: https://github.com/openshift/origin/blame/093a24044d093241ac5416713170dff552479f07/test/e2e/upgrade/upgrade.go#L283-L326

Hi Tori, do you think you will continue to evaluate this bug after the end of the current sprint (Feb. 19th)? If we plan to, then I'd like to set the "reviewed-in-sprint" flag to indicate that this bug will continue to be evaluated in the next sprint.

Hi Dan, I think I'll be done by then.

Adding "reviewed-in-sprint", as Deep is OOTO and it is unlikely that this bug will be resolved before the end of the current sprint. Also, this bug is not a blocker bug.

Since Deep is OOTO this week and this bug is a "Blocker-", I am keeping the "reviewed-in-sprint+" flag, as it is unlikely that this bug will get resolved this week.

Hi Deep, do you think this bug will be resolved before the end of the current sprint? If this bug will continue to be investigated, can we set "reviewed-in-sprint"?

*** Bug 2047833 has been marked as a duplicate of this bug. ***

Based on current infrastructure and network limitations, the upgrades take more time. We have increased the upgrade time limit from 75 min to 100 min, which should in theory reduce the failing tests. Verified from the QE end: the upgrade does seem to finish within the expected time, and the above issue is only seen in CI. We might revisit this bug in the future.
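As a side note on how the upgrade durations quoted in this thread can be measured: each update is recorded in the ClusterVersion object's status.history with started and completion timestamps (also visible via `oc get clusterversion version -o yaml`). A minimal sketch using the OpenShift config client, assuming standard kubeconfig loading and with error handling kept trivial, might look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	configv1client "github.com/openshift/client-go/config/clientset/versioned/typed/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig the same way oc/kubectl do.
	cfg, err := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		clientcmd.NewDefaultClientConfigLoadingRules(),
		&clientcmd.ConfigOverrides{},
	).ClientConfig()
	if err != nil {
		panic(err)
	}

	client, err := configv1client.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The cluster-wide version object is always named "version".
	cv, err := client.ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// status.history lists updates newest-first; print how long each completed one took.
	for _, h := range cv.Status.History {
		if h.CompletionTime == nil {
			fmt.Printf("%s: update still in progress\n", h.Version)
			continue
		}
		fmt.Printf("%s: completed in %v\n", h.Version,
			h.CompletionTime.Sub(h.StartedTime.Time).Round(time.Minute))
	}
}
```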