Bug 2092961 - Disruption tests fire when CI cluster itself experiences network disruption
Summary: Disruption tests fire when CI cluster itself experiences network disruption
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Framework
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: OpenShift Release Oversight
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-02 15:45 UTC by Devan Goodwin
Modified: 2022-11-21 19:36 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-21 19:36:05 UTC
Target Upstream Version:
Embargoed:

Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 27179 0 None open Bug 2092961: Add disruption test to an external service, not cluster under test. 2022-06-02 15:45:42 UTC
Github openshift origin pull 27210 0 None open Bug 2092961: Flake almost all ci cluster to external svc disruption. 2022-06-03 14:20:41 UTC
Github openshift origin pull 27217 0 None open Bug 2092961: Use http for network endpoint test. 2022-06-07 13:43:33 UTC

Description Devan Goodwin 2022-06-02 15:45:08 UTC
TRT has identified scenarios where the CI cluster running the tests (as opposed to the cluster under test) sometimes loses networking, which causes disruption tests to fire. We noticed this when several aggregated job runs across different clouds and platforms all logged disruption at exactly the same time.

To help diagnose this, we should add a new disruption backend that hits an external service. If that backend goes down at the same time as the backends in the cluster under test, we know the disruption is not real.
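
A minimal sketch of the correlation idea, not the code from the linked pull requests: it assumes each backend's outages are recorded as time windows and treats any cluster-under-test outage that overlaps an external-service outage as suspect. The names Interval, overlaps, and split_disruption are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """Half-open window [start, end), in seconds, during which a backend was unreachable."""
    start: float
    end: float


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if the two windows share any instant."""
    return a.start < b.end and b.start < a.end


def split_disruption(
    cluster: List[Interval], external: List[Interval]
) -> Tuple[List[Interval], List[Interval]]:
    """Split cluster-under-test outages into (real, suspect).

    An outage is 'suspect' when the external service was unreachable at the
    same time, which points at the CI cluster's own network rather than the
    cluster under test. This overlap rule is an assumption of the sketch.
    """
    real: List[Interval] = []
    suspect: List[Interval] = []
    for outage in cluster:
        if any(overlaps(outage, ext) for ext in external):
            suspect.append(outage)
        else:
            real.append(outage)
    return real, suspect
```

For example, with cluster outages at 10-20s and 100-105s and an external outage at 12-18s, only the 100-105s window would be kept as real disruption.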

Comment 1 Devan Goodwin 2022-06-03 14:19:26 UTC
This is already failing too often, and the failures do not even correspond to observed disruption in the cluster under test. We're not sure what's going on, so we're going to allow up to 10 minutes of disruption before failing the test. It will flake if we see any disruption, so we can gather data and compare.
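
The relaxed policy described in this comment could be sketched roughly as follows. This is an illustration only, not the actual test code: the function name, the result strings, and the choice to treat exactly 10 minutes as still a flake are all assumptions.

```python
from datetime import timedelta

# Threshold taken from the comment above; the exact boundary handling is assumed.
FAIL_THRESHOLD = timedelta(minutes=10)


def evaluate_external_disruption(total: timedelta) -> str:
    """Map total observed CI-cluster-to-external-service disruption to a result.

    Hypothetical helper: 'fail' only above the threshold, 'flake' for any
    non-zero disruption so the data is still recorded, 'pass' otherwise.
    """
    if total > FAIL_THRESHOLD:
        return "fail"
    if total > timedelta(0):
        return "flake"
    return "pass"
```

Flaking rather than failing keeps the job green while still surfacing every occurrence for later comparison against disruption in the cluster under test.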
