Bug 1843183

Summary: Overly broad test-case label: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Cluster Version Operator
Assignee: W. Trevor King <wking>
Status: CLOSED WORKSFORME
QA Contact: liujia <jiajliu>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.5
CC: aos-bugs, deads, jokerman, lmohanty, mfojtik, wking, xxia
Target Milestone: ---
Keywords: Reopened, Upgrades
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Last Closed: 2020-12-22 14:53:29 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ben Parees 2020-06-02 18:42:05 UTC
test:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]

is failing frequently in CI, see search results:
https://search.apps.build01.ci.devcluster.openshift.com/?maxAge=168h&context=1&type=junit&maxMatches=5&maxBytes=20971520&groupBy=job&name=upgrade.*4.5&search=%5C%5Bsig-arch%5C%5D%5C%5BFeature%3AClusterUpgrade%5C%5D+Cluster+should+remain+functional+during+upgrade+%5C%5BDisruptive%5C%5D+%5C%5BSerial%5C%5D


I understand additional CI features may be needed to diagnose this.  As this is a significant blocker for 4.5, this blocker bug should be sufficient reason to expedite tactical solutions in CI to gather the debug data needed.

Comment 1 W. Trevor King 2020-06-03 03:46:32 UTC
Many CI jobs fail on this very, very broad test-case.  Picking a recent one to ground the discussion [1] (maybe the Sippy template should do this too, or include a FIXME placeholder to remind the user to find a job on their own).  For [1], this test-case failed with:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  2 22:09:13.197: API was unreachable during disruption for at least 8m40s of 56m39s (15%):

so I expect that it's a dup of bug 1828861 and that we might want to look into the e2e suite to drop this particular check in favor of the parallel (and more specific) 'Kubernetes APIs remain available' test-case.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/61
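
(Side note: a quick sanity check of the quoted 15% figure, assuming the durations are exactly as reported, 8m40s unreachable out of 56m39s total:)

$ # assumes the durations above are exact; just converts to seconds and divides
$ awk 'BEGIN { printf "%.1f%%\n", 100 * (8*60 + 40) / (56*60 + 39) }'
15.3%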

Comment 2 W. Trevor King 2020-06-03 04:14:24 UTC
Trying to hunt down the "Cluster should remain functional during upgrade" <testcase>, I don't see it in:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/61/artifacts/e2e-gcp-upgrade/junit/junit_upgrade_1591136066.xml | sed 's/></>\n</g' | grep '<testcase '
<testcase name="upgrade" classname="upgrade" time="3270.048553909">
<testcase name="Kubernetes APIs remain available" classname="disruption_tests" time="3400.180762845">
<testcase name="OpenShift APIs remain available" classname="disruption_tests" time="3400.180724694">
<testcase name="Check if critical alerts are firing after upgrade success" classname="disruption_tests" time="3713.923365053">
<testcase name="Cluster frontend ingress remain available" classname="disruption_tests" time="3400.179978014">
<testcase name="Application behind service load balancer with PDB is not disrupted" classname="disruption_tests" time="3460.234515131">
<testcase name="[sig-storage] [sig-api-machinery] secret-upgrade" classname="disruption_tests" time="3393.343614871">
<testcase name="[sig-apps] replicaset-upgrade" classname="disruption_tests" time="3391.201718854">
<testcase name="[sig-apps] statefulset-upgrade" classname="disruption_tests" time="0.001423418">
<testcase name="[sig-apps] deployment-upgrade" classname="disruption_tests" time="3389.220372167">
<testcase name="[sig-apps] job-upgrade" classname="disruption_tests" time="3385.185061561">
<testcase name="[sig-storage] [sig-api-machinery] configmap-upgrade" classname="disruption_tests" time="3395.346080185">
<testcase name="[sig-apps] daemonset-upgrade" classname="disruption_tests" time="3385.196608289">
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/61/artifacts/e2e-gcp-upgrade/junit/junit_e2e_20200602-221427.xml | sed 's/></>\n</g' | grep '<testcase '
<testcase name="[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" time="3716">
<testcase name="Monitor cluster while tests execute" time="3716">

I'll poke around and try to figure out why we aren't making "Cluster should remain functional during upgrade" a suite or something.  And why we are splitting this into two JUnit files.

Comment 3 W. Trevor King 2020-06-03 04:56:31 UTC
So junit_upgrade_1591136066.xml is getting written by [1], while junit_e2e_20200602-221427.xml is getting written by [2].  Still not clear to me why we aren't using a single file, or why both "Cluster should remain functional during upgrade" and "Kubernetes APIs remain available" are failing with the same error message.

[1]: https://github.com/openshift/origin/blob/a6374911c68081c3e87a01fb67a40f56ca2403ec/test/extended/util/disruption/disruption.go#L106
[2]: https://github.com/openshift/origin/blob/a6374911c68081c3e87a01fb67a40f56ca2403ec/pkg/test/ginkgo/junit.go#L177
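
A rough way to eyeball whether both files carry the same failure text, reusing the sed trick from comment 2 (just a sketch; the real failure bodies are long, so expect noisy output):

$ base=https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/61/artifacts/e2e-gcp-upgrade/junit
$ for f in junit_upgrade_1591136066.xml junit_e2e_20200602-221427.xml; do
>   echo "== ${f}"
>   curl -s "${base}/${f}" | sed 's/></>\n</g' | grep '<failure'
> done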

Comment 4 W. Trevor King 2020-06-03 05:42:25 UTC
Ah, because these disruption things are our own special snowflake that we run alongside Ginkgo.  I'm going to steer this bug to be about improving the test suite, and folks can break out the actual product issue in a clone if it's not already covered by bug 1828861.

Comment 5 W. Trevor King 2020-06-21 14:14:05 UTC
Clayton poked some holes in my initial PR, and I haven't figured out if/how I can patch them up yet; adding UpcomingSprint

Comment 7 Ben Parees 2020-06-23 12:55:06 UTC
The direction this specific bug has been taken in is not going to fix any tests; it's just being used to break the existing monolithic test up into more specific tests that will fail independently, instead of one test that can fail for a lot of unrelated reasons.

So while this bug is associated w/ that top failing test, it's not going to fix it.

Comment 8 W. Trevor King 2020-06-24 04:40:48 UTC
Right.  Back in comment 1, I suspected bug 1828861 as one of the popular underlying issues.  That's since been closed as a dup of bug 1845411, which remains open.  If folks see "Cluster should remain functional during upgrade" failing where the error message does not suggest Kubernetes/OpenShift API connectivity issues, we should spin off new bugs that talk about those alternative failure modes.  This bug is about somehow getting us to a point where that cause splitting and rate aggregation happens automatically in Sippy and other JUnit consumers, instead of us having to work it up manually.

Comment 9 Lalatendu Mohanty 2020-06-25 10:44:56 UTC
FYI, as per https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TestImpactingBugs, this bug caused 267 test failures. This shows the severity of the failures tracked under this bug.

Comment 10 W. Trevor King 2020-06-26 05:16:14 UTC
This bug is currently the massively large bucket of "something bad happened during your update".  It's not one thing that needs fixing.  This ticket is about somehow breaking things up so that it's easier to distinguish the separate failure modes.  Bug 1845411 is about fixing a large class of the underlying errors.  If we see more types of underlying errors, they should get their own bugs instead of piling into this one, or we'll end up with a massive, multi-cause bug to match the massive, multi-failure-cause test ;).

Comment 11 W. Trevor King 2020-07-10 21:35:49 UTC
Still not clear on how to get Ginkgo to avoid the single big-bucket test-case reporting.  Adding UpcomingSprint.

Comment 12 W. Trevor King 2020-08-01 05:35:44 UTC
We still want this, and we keep getting complaints about the difficulty it causes in distinguishing between update-CI failure modes.  But I'm still not clear on how to get more granular test-cases...

Comment 13 W. Trevor King 2020-08-11 23:42:37 UTC
Possibly addressed by [1].  Now that that's landed, we'll see what the next failed update job reports...

[1]: https://github.com/openshift/origin/pull/25399

Comment 14 W. Trevor King 2020-08-21 17:03:53 UTC
Recent 4.6-nightly -> 4.6-nightly failures are dying on unrelated things, including some fallout from [1].  Will circle back and check on recent update reporting next sprint.

[1]: https://github.com/openshift/ci-tools/pull/1131

Comment 15 David Eads 2020-08-31 14:47:16 UTC
We got more granular failures in 4.6. This is now causing noise.

Comment 16 W. Trevor King 2020-09-30 18:07:12 UTC
There is no code linked from this bug, so moving to WORKSFORME, because CURRENTRELEASE shows up in Docs' query for bugs that need doc text.

Comment 18 W. Trevor King 2020-12-22 14:53:29 UTC
We got re-opened when [1] rotted out.  Re-closing.

[1]: https://github.com/openshift/origin/pull/25056#issuecomment-749479102