[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
is failing frequently in CI, see search results:
I understand additional CI features may be needed to diagnose this. As this is a significant blocker for 4.5, this blocker bug should be sufficient reason to expedite tactical solutions in CI to gather the debug data needed.
Many CI jobs fail on this very, very broad test-case. Picking a recent one to ground discussion (maybe the Sippy template should do this too, or include a FIXME placeholder to remind the user to find a job on their own). For , this test-case failed with:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 2 22:09:13.197: API was unreachable during disruption for at least 8m40s of 56m39s (15%):
so I expect that it's a dup of bug 1828861, and that we might want to look into the e2e suite to drop this particular check in favor of the parallel (and more specific) 'Kubernetes APIs remain available' test-case.
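As a quick sanity check on the figure in that failure message (my arithmetic, not part of the original report): 8m40s is 520 seconds and 56m39s is 3399 seconds, which works out to roughly 15.3%, consistent with the rounded 15% reported.

```python
# Sanity-check the disruption percentage from the failure message:
# "API was unreachable during disruption for at least 8m40s of 56m39s (15%)"
disrupted = 8 * 60 + 40   # 8m40s = 520 seconds
total = 56 * 60 + 39      # 56m39s = 3399 seconds
pct = 100 * disrupted / total
print(f"{pct:.1f}%")      # about 15.3%, reported rounded as 15%
```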
Trying to hunt down the "Cluster should remain functional during upgrade" <testcase>, I don't see it in:
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/61/artifacts/e2e-gcp-upgrade/junit/junit_upgrade_1591136066.xml | sed 's/></>\n</g' | grep '<testcase '
<testcase name="upgrade" classname="upgrade" time="3270.048553909">
<testcase name="Kubernetes APIs remain available" classname="disruption_tests" time="3400.180762845">
<testcase name="OpenShift APIs remain available" classname="disruption_tests" time="3400.180724694">
<testcase name="Check if critical alerts are firing after upgrade success" classname="disruption_tests" time="3713.923365053">
<testcase name="Cluster frontend ingress remain available" classname="disruption_tests" time="3400.179978014">
<testcase name="Application behind service load balancer with PDB is not disrupted" classname="disruption_tests" time="3460.234515131">
<testcase name="[sig-storage] [sig-api-machinery] secret-upgrade" classname="disruption_tests" time="3393.343614871">
<testcase name="[sig-apps] replicaset-upgrade" classname="disruption_tests" time="3391.201718854">
<testcase name="[sig-apps] statefulset-upgrade" classname="disruption_tests" time="0.001423418">
<testcase name="[sig-apps] deployment-upgrade" classname="disruption_tests" time="3389.220372167">
<testcase name="[sig-apps] job-upgrade" classname="disruption_tests" time="3385.185061561">
<testcase name="[sig-storage] [sig-api-machinery] configmap-upgrade" classname="disruption_tests" time="3395.346080185">
<testcase name="[sig-apps] daemonset-upgrade" classname="disruption_tests" time="3385.196608289">
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.4-stable-to-4.5-ci/61/artifacts/e2e-gcp-upgrade/junit/junit_e2e_20200602-221427.xml | sed 's/></>\n</g' | grep '<testcase '
<testcase name="[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" time="3716">
<testcase name="Monitor cluster while tests execute" time="3716">
I'll poke around and try to figure out why we aren't making "Cluster should remain functional during upgrade" a suite or something, and why we are splitting this into two JUnit files.
So junit_upgrade_1591136066.xml is getting written by , while junit_e2e_20200602-221427.xml is getting written by . Still not clear to me why we aren't using a single file, or why both "Cluster should remain functional during upgrade" and "Kubernetes APIs remain available" are failing with the same error message.
Ah, because these disruption things are our own special snowflake we run alongside Ginkgo. I'm going to steer this bug to be about improving the test suite, and folks can break out the actual product issue in a clone if it's not already covered by bug 1828861.
Clayton poked some holes in my initial PR, and I haven't figured out if/how I can patch them up yet; adding UpcomingSprint
The direction this specific bug has taken is not going to fix any tests; it's just being used to break the existing test up into more specific tests that will fail independently, instead of a monolithic test that can fail for a lot of unrelated reasons.
So while this bug is associated w/ that top failing test, it's not going to fix it.
Right. Back in comment 1, I suspected bug 1828861 as one of the popular underlying issues. That's since been closed as a dup of bug 1845411, which remains open. If folks see "Cluster should remain functional during upgrade" where the error message does not suggest Kubernetes/OpenShift API connectivity issues, we should spin off new bugs that talk about those alternative failure modes. This bug is about somehow getting us to a point where that cause splitting and rate aggregation happens automatically in Sippy and other JUnit consumers, instead of us having to work it up manually.
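To make "cause splitting" concrete, here is a hypothetical sketch of what automatic bucketing of failure messages could look like. The bucket names and regex patterns are mine, purely illustrative, and not how Sippy actually classifies failures:

```python
import re

# Hypothetical failure-mode buckets; patterns are illustrative only,
# not taken from Sippy or the origin test suite.
BUCKETS = [
    ("api-disruption", re.compile(r"API was unreachable during disruption")),
    ("ingress-disruption", re.compile(r"frontend ingress.*unavailable", re.IGNORECASE)),
]

def classify(message: str) -> str:
    """Map a JUnit failure message to a coarse failure-mode bucket."""
    for bucket, pattern in BUCKETS:
        if pattern.search(message):
            return bucket
    return "unclassified"

print(classify("API was unreachable during disruption for at least 8m40s of 56m39s (15%)"))
```

With something like this, a JUnit consumer could aggregate failure rates per bucket rather than per monolithic test-case, which is the manual work-up this comment is complaining about.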
FYI, as per https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TestImpactingBugs, this bug caused 267 test failures. This shows the severity of the failures coming under this bug.
This bug is currently the massively large bucket of "something bad happened during your update". It's not one thing that needs fixing. This ticket is about somehow breaking things up so that it's easier to distinguish the separate failure modes. Bug 1845411 is about fixing a large class of the underlying errors. If we see more types of underlying errors, they should get their own bugs instead of piling into this one, or we'll end up with a massive, multi-cause bug to match the massive, multi-failure-cause test ;).
Still not clear on how to get Ginkgo to avoid the single big-bucket test-case reporting. Adding UpcomingSprint.
We still want this, and we still get complaints about the difficulty it causes in distinguishing between update-CI failure modes. But I'm still not clear on how to get more granular test-cases...
Possibly addressed by . Now that that's landed, we'll see what a failed update job reports once the next update job fails...
Recent 4.6-nightly -> 4.6-nightly failures are dying on unrelated things, including some fallout from . Will circle back and check on recent update reporting next sprint.
We got more granular failures in 4.6. This is now causing noise.
There is no code linked from this PR, so moving to WORKFORME, because CURRENTRELEASE shows up in Docs' query for bugs that need doc text.