Bug 1801885 - e2e-aws-upgrade often fails on: API was unreachable during disruption for at least...
Keywords:
Status: CLOSED DUPLICATE of bug 1845412
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Lukasz Szaszkiewicz
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On: 1845412 1943804
Blocks:
 
Reported: 2020-02-11 20:19 UTC by Yu Qi Zhang
Modified: 2021-06-08 14:56 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-06-18 10:23:22 UTC
Target Upstream Version:
Embargoed:




Links
System: Github openshift machine-config-operator pull 1780
Status: closed
Summary: [release-4.4] Bug 1843928: gcp-routes: move to MCO, implement downfile, tweak timing
Last Updated: 2020-06-23 12:50:32 UTC

Description Yu Qi Zhang 2020-02-11 20:19:17 UTC
Description of problem:

The release-openshift-origin-installer-e2e-aws-upgrade job is quite flaky and very often fails on

[Disruptive] Cluster upgrade [Top Level] [Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Serial] [Suite:openshift]

Due to: 

fail [github.com/openshift/origin/test/extended/util/disruption/controlplane/controlplane.go:56]: Feb 11 17:07:56.230: API was unreachable during upgrade for at least 2m3s:

The logs have a lot of:

Feb 11 16:34:41.128 I openshift-apiserver OpenShift API started responding to GET requests
Feb 11 16:34:57.100 I openshift-apiserver OpenShift API stopped responding to GET requests: Get https://api.ci-op-4t5lqi0g-77109.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing?timeout=15s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Feb 11 16:34:58.100 - 20s   E openshift-apiserver OpenShift API is not responding to GET requests

Example failed test: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17130

Interestingly, even on passing runs such as https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/17131 I see a lot of those log messages too. I wonder if this points to an unstable API?
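To make the failure mode concrete, here is a minimal, hypothetical poller in the spirit of what the disruption check reports; it is NOT the actual origin test code, and the endpoint, polling interval, and observation window are placeholders. It GETs a URL with a 15s client timeout once per second and accumulates the time the endpoint was unreachable:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder endpoint; the real monitor polls the cluster API server,
	// e.g. .../apis/image.openshift.io/v1/namespaces/openshift-apiserver/imagestreams/missing.
	endpoint := "https://api.example.com:6443/healthz"
	client := &http.Client{Timeout: 15 * time.Second}

	var unreachable time.Duration // total accumulated downtime
	var downSince time.Time       // start of the current outage, zero while healthy
	start := time.Now()

	for time.Since(start) < 5*time.Minute { // illustrative observation window
		resp, err := client.Get(endpoint)
		healthy := err == nil && resp.StatusCode < 500
		if err == nil {
			resp.Body.Close()
		}
		switch {
		case !healthy && downSince.IsZero():
			downSince = time.Now() // outage begins
		case healthy && !downSince.IsZero():
			unreachable += time.Since(downSince) // outage ends
			downSince = time.Time{}
		}
		time.Sleep(time.Second)
	}
	if !downSince.IsZero() {
		unreachable += time.Since(downSince) // still down at end of window
	}

	total := time.Since(start)
	fmt.Printf("API was unreachable during disruption for at least %s of %s (%.0f%%)\n",
		unreachable.Round(time.Second), total.Round(time.Second),
		100*unreachable.Seconds()/total.Seconds())
}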


How reproducible:
intermittent

Comment 2 W. Trevor King 2020-02-13 00:33:54 UTC
13 CI release-gating jobs (~34% of failures) failed on this in the past 24h [1].  This has been happening for at least 6d, with the oldest run I've found being [2], started 2020-02-06 09:20Z, going from 4.4.0-0.ci-2020-02-06-023150 to 4.4.0-0.ci-2020-02-06-091545.  There might be older instances that just aren't available to the current CI search.

[1]: https://search.svc.ci.openshift.org/chart?name=^release-openshift-origin-installer-e2e-.*-upgrade$&search=fail.*test/extended/util/disruption/controlplane.*API%20was%20unreachable%20during%20upgrade%20for%20at%20least
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/16555

Comment 4 W. Trevor King 2020-02-20 18:48:33 UTC
Similar symptoms to bug 1802246.  Might be a dup?

Comment 6 Maru Newby 2020-05-06 04:03:11 UTC
Closing: the failure mode in question no longer seems to be appearing.

Comment 7 W. Trevor King 2020-05-06 04:29:18 UTC
Looks like [1] shuffled the error string a bit.  [2] still shows 100% of the past day's release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly, release-openshift-origin-installer-e2e-azure-upgrade-4.3, and release-openshift-origin-installer-e2e-azure-upgrade-4.4 runs failing this test (although those periodics aren't particularly frequent).  Example jobs include [3]:

still shows 15% of release-openshift-origin-installer-e2e-aws-upgrade jobs matching this issue and 33% of release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 6 failures matching.  For example:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May  6 03:51:38.064: API was unreachable during disruption for at least 10m39s of 1h11m7s (15%):

and [4]:

  fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:226]: May  5 19:16:14.997: API was unreachable during disruption for at least 16m45s of 56m8s (30%):

[1]: https://github.com/openshift/origin/pull/24600
[2]: https://search.apps.build01.ci.devcluster.openshift.com/?name=upgrade&search=fail.*API+was+unreachable+during+disruption&maxAge=24h&type=build-log&groupBy=job
[3]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-nightly/85
[4]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4/228
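
For reference, and on my reading of the failure message, the percentage is simply the unreachable time divided by the total observed window, which matches both examples above:

  10m39s / 1h11m7s = 639s / 4267s = ~15%
  16m45s / 56m8s   = 1005s / 3368s = ~30%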

Comment 8 W. Trevor King 2020-05-06 04:31:20 UTC
(In reply to W. Trevor King from comment #7)
> still shows 15% of release-openshift-origin-installer-e2e-aws-upgrade jobs matching this issue and 33% of release-openshift-origin-installer-e2e-azure-upgrade-4.5 - 6 failures matching.  For example:

Please ignore this one line; it was left over from an earlier draft and I forgot to proof-read before clicking "Save Changes" :p.

Comment 9 Lukasz Szaszkiewicz 2020-05-20 09:14:48 UTC
This bug is being actively worked on.

There is a known issue where rebooting nodes has an impact on openshift-apiserver/system availability; see, for example, https://bugzilla.redhat.com/show_bug.cgi?id=1809031

Since the upgrade jobs run upgrades that reboot machines, the failing e2e tests might be related.

See: https://github.com/openshift/origin/pull/24966 and https://github.com/openshift/enhancements/pull/284
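
For illustration only, a minimal sketch of the general mitigation being discussed (draining in-flight requests when the process receives the termination signal a reboot delivers), assuming a plain net/http server; this is not the openshift-apiserver implementation nor the content of the linked PRs:

package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	srv := &http.Server{Addr: ":8080", Handler: mux}

	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Wait for the termination signal, then stop accepting new connections
	// and give in-flight requests up to 60s to finish instead of dropping
	// them, which would otherwise surface to clients as request timeouts.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown did not complete cleanly: %v", err)
	}
}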

Comment 10 Stefan Schimanski 2020-05-27 09:45:03 UTC
This does not seem to be a regression. We have a fix in preparation, but it needs soak time, and since this is not a regression it does not seem to be a blocker for 4.5.0.

Comment 13 Ben Parees 2020-06-03 17:45:17 UTC
I question whether there is no regression here: our upgrades are failing more frequently in 4.5 than they were in 4.4, specifically with the "OpenShift API is not responding to GET requests" error:

4.5:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.5.*&maxMatches=5&maxBytes=20971520&groupBy=job

Across 33 runs and 6 jobs (75.76% failed), matched 80.00% of failing runs and 66.67% of jobs 

4.4:
https://search.apps.build01.ci.devcluster.openshift.com/?search=OpenShift+API+is+not+responding+to+GET+requests&maxAge=48h&context=1&type=bug%2Bjunit&name=.*to-4.4.*&maxMatches=5&maxBytes=20971520&groupBy=job

Across 19 runs and 7 jobs (63.16% failed), matched 58.33% of failing runs and 57.14% of jobs 

Moving this back to 4.5 for reassessment.  If you can point to a different bug that explains why our upgrade test pass rate has gone from 60% in 4.4 to 42% in 4.5, then I can understand deferring this, but something has regressed.

Comment 14 Ben Parees 2020-06-03 18:00:19 UTC
To disambiguate https://bugzilla.redhat.com/show_bug.cgi?id=1801885 and https://bugzilla.redhat.com/show_bug.cgi?id=1791162, both of which are failures in the same "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] [Suite:openshift]" test: they have distinct failure modes/messages.


https://bugzilla.redhat.com/show_bug.cgi?id=1801885 is for failures reported as:
fail [github.com/openshift/origin/test/extended/util/disruption/disruption.go:237]: Jun  2 04:14:53.680: API was unreachable during disruption for at least 8m13s of 54m30s (15%!)(MISSING):


https://bugzilla.redhat.com/show_bug.cgi?id=1791162 is for:
Jun 02 04:16:51.466 - 194s  E openshift-apiserver OpenShift API is not responding to GET requests

Comment 16 Ben Parees 2020-06-03 18:13:26 UTC
(To be clear, I understand we expect to have some disruption in 4.5 and will be addressing that in 4.6, but this test is failing because we are exceeding the allowed amount of disruption.)

Comment 17 Lukasz Szaszkiewicz 2020-06-04 13:25:41 UTC
It looks like the same upgrade job is in much better shape on AWS: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci?buildId=
For GCP we opened https://github.com/openshift/machine-config-operator/pull/1780, as we think that might be the issue.

Comment 18 Stefan Schimanski 2020-06-05 14:45:53 UTC
With https://bugzilla.redhat.com/show_bug.cgi?id=1844387 for the oVirt, OpenStack, vSphere, and bare-metal platforms, and the following:

- Azure IPI: https://bugzilla.redhat.com/show_bug.cgi?id=1828382
- Azure UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836016
- AWS UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836018
- vSphere UPI: https://bugzilla.redhat.com/show_bug.cgi?id=1836017

we have a number of platform-specific BZs open which address exactly these problems.  This BZ, on the other hand, is not actionable.  Please be precise about the conditions under which the issues appear, and analyze the data beforehand in order to make these BZs actionable.  We have platforms which are perfectly fine, so the chance is very high that the upgrade issues have root causes in the different deployments on different platforms.

Comment 19 W. Trevor King 2020-06-05 19:07:17 UTC
> Please be precise in which conditions the issues appear...

I agree that it's nice when the reporter collects statistics, but I don't think it's wrong to expect the assigned team to collect statistics either.  In this case, comment 0 links to an explicit failed job, which should be the minimum required for an actionable report.

> ... do analysis of the data beforehand in order to make these BZs actionable.

This certainly seems like an assigned-component responsibility to me.  I don't expect all bug reporters to have more subject-matter expertise in the component's space than the assigned-component team, and closing/deferring on "you didn't get close enough for us to pick it up" seems to make that assumption.  Obviously anyone can chip in here, though, and the closer we get to root-causing the issue, the easier it becomes to get it fixed.

Comment 20 Stefan Schimanski 2020-06-09 07:49:51 UTC
Created an umbrella bug per platform, all linked from the top-level bug https://bugzilla.redhat.com/show_bug.cgi?id=1845411.

Comment 21 Lukasz Szaszkiewicz 2020-06-18 10:23:22 UTC

*** This bug has been marked as a duplicate of bug 1845412 ***

