Created attachment 1697905 [details]
upgrade-test-log

Description of problem:
Upgrade test suite fails on the ppc64le environment - frontends were down beyond the toleration level.

Version-Release number of selected component (if applicable):
4.3.z

How reproducible:
Consistently

Steps to Reproduce:
1. Install 4.3.23
2. Run openshift-tests run-upgrade all --to-image=<OCP4.4_image>

Actual results:
Jun 5 04:04:58.996: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-k8s-sig-apps-job-upgrade-617" for this suite.
Jun 5 04:04:59.005: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: Jun 5 04:04:58.773: Frontends were unreachable during disruption for at least 16m30s of 47m16s (35%):

Expected results:
No errors; the tests should pass.

Additional info:
Part of the upgrade tests, &frontends.AvailableTest{}, fails on the Power architecture with the libvirt IPI method.
The test fails at:
https://github.com/openshift/origin/blob/master/test/extended/util/disruption/frontends/frontends.go#L106
It fails because the sum of the durations of all disruption events exceeds the toleration (0.20) set in the test code. When tested with a higher toleration of 0.40, the upgrade tests pass as expected on Power hardware.
[root@osp115 upgrade]# git diff ../../extended/util/disruption/frontends/frontends.go
diff --git a/test/extended/util/disruption/frontends/frontends.go b/test/extended/util/disruption/frontends/frontends.go
index a5e673ac07..6195d0cf45 100644
--- a/test/extended/util/disruption/frontends/frontends.go
+++ b/test/extended/util/disruption/frontends/frontends.go
@@ -100,7 +100,7 @@ func (t *AvailableTest) Test(f *framework.Framework, done <-chan struct{}, upgra
 	cancel()
 	end := time.Now()
-	disruption.ExpectNoDisruption(f, 0.20, end.Sub(start), m.Events(time.Time{}, time.Time{}), "Frontends were unreachable during disruption")
+	disruption.ExpectNoDisruption(f, 0.40, end.Sub(start), m.Events(time.Time{}, time.Time{}), "Frontends were unreachable during disruption")
 }

 // Teardown cleans up any remaining resources.
I'm wondering if we have any tool or guide to debug what is going on, especially why the frontend services are down for longer than the configured threshold?
Frontends in upgrade tests usually refer to the ingress controller, so moving this to the Network Edge team.
Target reset to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
Upgrade from 4.2.36 to 4.3 hit the same problem: Frontends were unreachable during disruption for at least 14m51s of 48m26s (31%) Logs and artefacts are available at https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1302866183778734080
Dan Li, this report is assigned to the "Routing" component, which is the responsibility of the Network Edge team. However, I see that you've been changing the assignment of this report among several people who are not on the Network Edge team. Are you expecting the Network Edge team to take action on this report, or is this issue being handled by the multi-arch folks?
Hi Miciah, at the moment this bug is assigned to Rafael Dos Santos, our Multi-Arch CI engineer, hence I believe it should be handled by the multi-arch team (since this bug was reported by our IBM partner engineer).
Setting to assigned re: comment #10.
Adding "UpcomingSprint" as team will not have bandwidth to look at this bug during this sprint
We were able to reproduce this with a 4.7 nightly image on ppc64le but not on s390x. The difference between the two architectures was how the cluster was configured: in the s390x case, a load balancer is configured, whereas there is none for ppc64le. So what happens is that the "frontend" traffic is pinned to a specific worker, and when that worker is being upgraded, the frontends are unavailable beyond the 20% threshold.
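The arithmetic behind this explanation can be sketched as follows. This is a rough model, not a measurement: it assumes that without a load balancer all frontend traffic goes to one worker, so the frontend is unreachable for that worker's entire upgrade window, while with a load balancer another backend keeps serving. The 990s / 2836s figures are the 16m30s / 47m16s from the original report.

```go
package main

import "fmt"

// disruptionPercent returns unreachable time as a percentage of the
// whole upgrade run. Helper name is illustrative, not an origin API.
func disruptionPercent(downSeconds, totalSeconds float64) float64 {
	return 100 * downSeconds / totalSeconds
}

func main() {
	// Single backend (no load balancer, the ppc64le libvirt case):
	// the frontend is down for the pinned worker's full upgrade window.
	fmt.Printf("single-backend disruption: %.0f%%\n", disruptionPercent(990, 2836))
	// With a load balancer across several workers (the s390x case),
	// some backend stays up during a rolling upgrade, so the observed
	// disruption stays well below the 20% threshold.
}
```

At roughly 35%, the single-backend case lands well above the 20% toleration, which matches the reported failure.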
Basheer, can you confirm if that's the case in your setup?
Closing. Re-open in case the solution from the last comment doesn't work.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days