Created attachment 1697905 [details]
upgrade-test-log

Description of problem:
Upgrade test suite fails on the ppc64le environment - frontends were down beyond the toleration level.

Version-Release number of selected component (if applicable):
4.3.z

How reproducible:
Consistently

Steps to Reproduce:
1. Install 4.3.23
2. Run openshift-tests run-upgrade all --to-image=<OCP4.4_image>

Actual results:
Jun 5 04:04:58.996: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-k8s-sig-apps-job-upgrade-617" for this suite.
Jun 5 04:04:59.005: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun 5 04:04:58.773: Frontends were unreachable during disruption for at least 16m30s of 47m16s (35%):

Expected results:
No errors; the tests should pass.

Additional info:
Part of the upgrade tests, &frontends.AvailableTest{}, fails on the Power architecture with the libvirt IPI method.
The test fails at:
https://github.com/openshift/origin/blob/master/test/extended/util/disruption/frontends/frontends.go#L106
It fails because the sum of the durations of all disruption events exceeds the toleration (0.20) set in the test code. When tested with a higher toleration of 0.40, the upgrade tests pass as expected on Power hardware.
[root@osp115 upgrade]# git diff ../../extended/util/disruption/frontends/frontends.go
diff --git a/test/extended/util/disruption/frontends/frontends.go b/test/extended/util/disruption/frontends/frontends.go
index a5e673ac07..6195d0cf45 100644
--- a/test/extended/util/disruption/frontends/frontends.go
+++ b/test/extended/util/disruption/frontends/frontends.go
@@ -100,7 +100,7 @@ func (t *AvailableTest) Test(f *framework.Framework, done <-chan struct{}, upgra
 	cancel()
 	end := time.Now()
-	disruption.ExpectNoDisruption(f, 0.20, end.Sub(start), m.Events(time.Time{}, time.Time{}), "Frontends were unreachable during disruption")
+	disruption.ExpectNoDisruption(f, 0.40, end.Sub(start), m.Events(time.Time{}, time.Time{}), "Frontends were unreachable during disruption")
 }

 // Teardown cleans up any remaining resources.
I'm wondering if we have any tool or guide to debug what is going on, especially why the frontend services are down for longer than the configured threshold?
Frontends in upgrade tests usually refer to the ingress controller, so moving this to the Network Edge team.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
Upgrade from 4.2.36 to 4.3 hit the same problem: Frontends were unreachable during disruption for at least 14m51s of 48m26s (31%) Logs and artefacts are available at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1302866183778734080
Dan Li, this report is assigned to the "Routing" component, which is the responsibility of the Network Edge team. However, I see that you've been changing the assignment of this report among several people who are not on the Network Edge team. Are you expecting the Network Edge team to take action on this report, or is this issue being handled by the multi-arch folks?
Hi Miciah, at the moment this bug is assigned to Rafael Dos Santos, our Multi-Arch CI engineer, hence I believe it should be handled by the multi-arch team (since this bug was reported by our IBM partner engineer).
Setting to assigned re: comment #10.
Adding "UpcomingSprint" as team will not have bandwidth to look at this bug during this sprint
We were able to reproduce this with a 4.7 nightly image on ppc64le but not on s390x. The difference between the two architectures was how the cluster was configured: in the s390x case, a load balancer is configured, whereas there is none for ppc64le. So what happens is that the "frontend" traffic is pinned to a specific worker, and when that worker is being upgraded, the frontends are unavailable beyond the 20% threshold.
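The arithmetic behind this explanation can be sketched as follows. This is a rough model, not a measurement: it assumes that without a load balancer all frontend traffic goes to one worker, so the frontend is unreachable for that worker's entire upgrade window, while with a load balancer another backend keeps serving. The 990s / 2836s figures are the 16m30s / 47m16s from the original report.

```go
package main

import "fmt"

// disruptionPercent returns unreachable time as a percentage of the
// whole upgrade run. Helper name is illustrative, not an origin API.
func disruptionPercent(downSeconds, totalSeconds float64) float64 {
	return 100 * downSeconds / totalSeconds
}

func main() {
	// Single backend (no load balancer, the ppc64le libvirt case):
	// the frontend is down for the pinned worker's full upgrade window.
	fmt.Printf("single-backend disruption: %.0f%%\n", disruptionPercent(990, 2836))
	// With a load balancer across several workers (the s390x case),
	// some backend stays up during a rolling upgrade, so the observed
	// disruption stays well below the 20% threshold.
}
```

At roughly 35%, the single-backend case lands well above the 20% toleration, which matches the reported failure.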
Basheer, can you confirm if that's the case in your setup?
Closing. Re-open in case the solution from the last comment doesn't work.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days