2092961 – Disruption tests fire when CI cluster itself experiences network disruption

Summary: Disruption tests fire when CI cluster itself experiences network disruption

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Test Framework
Sub Component:
Version:	4.11
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	low
Target Milestone:	---
Target Release:	4.11.0
Assignee:	OpenShift Release Oversight
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-06-02 15:45 UTC by Devan Goodwin
Modified:	2022-11-21 19:36 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-11-21 19:36:05 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift origin pull 27179	None	open	Bug 2092961: Add disruption test to an external service, not cluster under test.	2022-06-02 15:45:42 UTC
Github	openshift origin pull 27210	None	open	Bug 2092961: Flake almost all ci cluster to external svc disruption.	2022-06-03 14:20:41 UTC
Github	openshift origin pull 27217	None	open	Bug 2092961: Use http for network endpoint test.	2022-06-07 13:43:33 UTC

Description Devan Goodwin 2022-06-02 15:45:08 UTC

TRT has identified scenarios where the CI cluster running the tests (as opposed to the cluster under test) sometimes loses networking, which causes disruption tests to fire. We noticed this when several aggregated job runs across clouds and platforms all logged disruption at exactly the same time.

To help, we should add a new disruption backend that hits an external service. If that service goes down at the same time as the cluster under test, we know the disruption is not real.
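The idea can be sketched roughly as follows, in Python for brevity rather than the Go used by the test framework. This is only an illustration of the overlap check, not the code from the linked pull requests; the names `Interval`, `classify`, and the two labels are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Interval:
    """A window during which a backend was unreachable (hypothetical type)."""
    start: datetime
    end: datetime

    def overlaps(self, other: "Interval") -> bool:
        # Two half-open windows overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


def classify(cluster_outages, external_outages):
    """Label each outage seen against the cluster under test.

    An outage that overlaps an outage of the external service is treated as
    suspect, because the CI cluster itself probably lost networking.
    """
    labelled = []
    for outage in cluster_outages:
        suspect = any(outage.overlaps(ext) for ext in external_outages)
        label = "ci-network-suspect" if suspect else "cluster-disruption"
        labelled.append((outage, label))
    return labelled
```

Outages of the cluster under test that coincide with an outage of the external service get the suspect label; all others are kept as real disruption.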

Comment 1 Devan Goodwin 2022-06-03 14:19:26 UTC

This is already failing too often, and the failures do not even correspond to observed disruption in the cluster under test. We're not sure what's going on, so we're going to allow up to 10 minutes of disruption before failing the test. It will flake if we see any disruption at all, so we can gather data and compare.
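A minimal sketch of that thresholding, in Python for brevity. The function name `evaluate`, the constant `FAIL_THRESHOLD`, and the result strings are hypothetical, and treating exactly 10 minutes as a flake rather than a failure is an assumption based on the "up to 10 minutes" wording above.

```python
from datetime import timedelta

# Assumed limit taken from the comment above: disruption beyond this fails the test.
FAIL_THRESHOLD = timedelta(minutes=10)


def evaluate(total_disruption: timedelta) -> str:
    """Map total observed disruption to a test outcome (hypothetical helper)."""
    if total_disruption > FAIL_THRESHOLD:
        return "fail"
    if total_disruption > timedelta(0):
        # Any disruption at or below the limit is recorded as a flake so data can be gathered.
        return "flake"
    return "pass"
```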