Description of problem:

The release-openshift-origin-installer-e2e-aws-upgrade job is quite flaky and frequently fails on:

[Disruptive] Cluster upgrade [Top Level] [Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Serial] [Suite:openshift]

due to:

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 11 17:07:56.230: API was unreachable during upgrade for at least 2m3s:

The logs contain many entries like:

Feb 11 16:34:41.128 I openshift-apiserver OpenShift API started responding to GET requests
Feb 11 16:34:57.100 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-4t5lqi0g-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 11 16:34:58.100 - 20s E openshift-apiserver OpenShift API is not responding to GET requests

Example failed test: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17130

Interestingly, even passing runs such as https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17131 show a lot of those log entries, so this may be pointing to an unstable API.

How reproducible: intermittent
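To make the "API was unreachable during upgrade for at least 2m3s" message concrete: the disruption test repeatedly issues GET requests against the API during the upgrade and accumulates the time during which those requests fail or time out, failing the test when the total exceeds an allowed budget. The sketch below only illustrates that general pattern; it is not the actual openshift/origin test code, and the endpoint, polling interval, monitoring window, and downtime budget are all assumed values.

```go
// Hypothetical availability poller illustrating the disruption-test pattern;
// not the actual openshift/origin code. Endpoint, interval, window, and
// budget are assumptions for the sake of the example.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	const (
		endpoint = "https://api.example.com:6443/healthz" // assumed endpoint
		interval = 1 * time.Second                        // how often to poll
		window   = 5 * time.Minute                        // how long to monitor
		budget   = 1 * time.Minute                        // allowed cumulative downtime
	)

	// Mirrors the ?timeout=15s seen in the failure logs: a request that hangs
	// longer than this is reported as a client timeout.
	client := &http.Client{Timeout: 15 * time.Second}

	var unavailable time.Duration
	deadline := time.Now().Add(window)

	for time.Now().Before(deadline) {
		start := time.Now()
		resp, err := client.Get(endpoint)
		if err != nil || resp.StatusCode >= 500 {
			// Count the time this request spent failing as downtime
			// (a simplification of the real bookkeeping).
			unavailable += time.Since(start)
		}
		if resp != nil {
			resp.Body.Close()
		}
		time.Sleep(interval)
	}

	if unavailable > budget {
		fmt.Printf("API was unreachable for at least %s of %s\n", unavailable, window)
	}
}
```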
Another related job with this error: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.3-nightly-to-4.4-nightly/14
13 CI release-gating jobs (~34% of failures) failed on this in the past 24h [1]. This has been happening for at least 6 days; the oldest run I've found is [2], started 2020-02-06 09:20Z, going from 4.4.0-0.ci-2020-02-06-023150 to 4.4.0-0.ci-2020-02-06-091545. There might be older instances that just aren't available to the current CI search.

[1]: https://search.svc.ci.openshift.org/chart?name=^release-openshift-origin-installer-e2e-.*-upgrade$&search=fail.*test/extended/util/disruption/controlplane.*API%20was%20unreachable%20during%20upgrade%20for%20at%20least
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16555
Another example: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly/102

160 hits in the last 24 hours: https://search.svc.ci.openshift.org/?search=API+was+unreachable+during+upgrade+for+at+least&maxAge=24h&context=2&type=all
Similar symptoms to bug 1802246. Might be a dup?
Closing - the failure mode in question does not seem to be appearing anymore.
Looks like [1] shuffled the error string a bit. [2] still shows 100% of the past day's release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly, release-openshift-origin-installer-e2e-azure-upgrade-4.3, and release-openshift-origin-installer-e2e-azure-upgrade-4.4 runs failing this test (although those periodics aren't particularly frequent). Example jobs include [3]:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May 6 03:51:38.064: API was unreachable during disruption for at least 10m39s of 1h11m7s (15%):

and [4]:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May 5 19:16:14.997: API was unreachable during disruption for at least 16m45s of 56m8s (30%):

[1]: https://github.com/openshift/origin/pull/24600
[2]: https://search.apps.build01.ci.devcluster.openshift.com/?name=upgrade&search=fail.*API+was+unreachable+during+disruption&maxAge=24h&type=build-log&groupBy=job
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/85
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4/228
This bug is actively being worked on. There is a known issue where rebooting nodes impacts openshift-apiserver/system availability, for example https://bugzilla.redhat.com/show_bug.cgi?id=1809031. Since the upgrade jobs run upgrades that reboot machines, the failing e2e tests might be related.

See: https://github.com/openshift/origin/pull/24966 and https://github.com/openshift/enhancements/pull/284
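As context for why a node reboot can cause brief API unavailability: a server process that is killed abruptly drops its in-flight connections, while one that shuts down gracefully stops accepting new connections and drains outstanding requests first, which narrows the window in which clients see errors. The sketch below is a generic Go illustration of graceful shutdown on SIGTERM; it is not the kube/openshift-apiserver implementation, and all names and timeouts are assumed for the example.

```go
// Generic graceful-shutdown sketch, not the apiserver's actual implementation.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for SIGTERM (what a drain/reboot typically sends), then stop
	// accepting new connections and give in-flight requests time to finish.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```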
This does not seem to be a regression. We have a fix in preparation, but it needs soak time, and since this is not a regression it does not seem to be a blocker for 4.5.0.
I question whether there is no regression here; our upgrades are failing more frequently in 4.5 than they were in 4.4, specifically with the "OpenShift API is not responding to GET requests" error.

4.5: https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job
Across 33 runs and 6 jobs (75.76% failed), matched 80.00% of failing runs and 66.67% of jobs

4.4: https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.4.*&maxMatches=5&maxBytes=20971520&groupBy=job
Across 19 runs and 7 jobs (63.16% failed), matched 58.33% of failing runs and 57.14% of jobs

Moving this back to 4.5 for reassessment. If you can point to a different bug that explains why our upgrade test pass rate has gone from 60% in 4.4 to 42% in 4.5, then I can understand deferring this, but something has regressed.
To disambiguate https://bugzilla.redhat.com/show_bug.cgi?id=1801885 and https://bugzilla.redhat.com/show_bug.cgi?id=1791162, both of which are failures of the same "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" test, they have distinct failure modes/messages.

https://bugzilla.redhat.com/show_bug.cgi?id=1801885 is for failures reported as:

fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun 2 04:14:53.680: API was unreachable during disruption for at least 8m13s of 54m30s (15%!)(MISSING):

https://bugzilla.redhat.com/show_bug.cgi?id=1791162 is for:

Jun 02 04:16:51.466 - 194s E openshift-apiserver OpenShift API is not responding to GET requests
recent example for this bug: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/58
(To be clear, I understand we expect to have some disruption in 4.5 and will be addressing that in 4.6, but this test is failing because we are exceeding the allowed amount of disruption.)
It looks like the same upgrade job is in much better condition on AWS: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci?buildId=

For GCP we opened https://github.com/openshift/machine-config-operator/pull/1780, as we think that might be the issue.
With https://bugzilla.redhat.com/show_bug.cgi?id=1844387 for the oVirt, OpenStack, vSphere, and bare-metal platforms, and the following:

- Azure IPI: https://bugzilla.redhat.com/show_bug.cgi?id=1828382
- Azure UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836016
- AWS UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836018
- vSphere UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836017

we have a number of platform-specific BZs open which address exactly these problems. This BZ, on the other hand, is not actionable. Please be precise about the conditions under which the issues appear, and analyze the data beforehand in order to make these BZs actionable. We have platforms which are perfectly fine, so the chance is very high that the upgrade issues have root causes in the different deployments of the different platforms.
> Please be precise about the conditions under which the issues appear...

I agree that it's nice when the reporter collects statistics, but I don't think it's wrong to expect the assigned team to collect statistics either. In this case, comment 0 links to an explicit failed job, which should be the minimum required for an actionable report.

> ... analyze the data beforehand in order to make these BZs actionable.

This certainly seems like an assigned-component responsibility to me. I don't expect all bug reporters to have more subject-matter expertise in the component's space than the assigned-component team, and closing or deferring on "you didn't get close enough for us to pick it up" seems to make that assumption. Obviously anyone can chip in here, though, and the closer we get to root-causing the issue, the easier it becomes to get it fixed.
Created an umbrella bug per platform, all linked from the top-level bug https://bugzilla.redhat.com/show_bug.cgi?id=1845411.
*** This bug has been marked as a duplicate of bug 1845412 ***