Bug 2022113 - [sig-arch] Check if alerts are firing during or after upgrade success --- alert TargetDown fired for x seconds
Summary: [sig-arch] Check if alerts are firing during or after upgrade success --- alert TargetDown fired for x seconds
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: jamo luhrsen
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-11-10 18:55 UTC by Devan Goodwin
Modified: 2022-11-18 19:10 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
[sig-arch] Check if alerts are firing during or after upgrade success
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
Last Closed: 2022-11-18 19:10:36 UTC
Target Upstream Version:
Embargoed:



Description Devan Goodwin 2021-11-10 18:55:48 UTC
[sig-arch] Check if alerts are firing during or after upgrade success

is failing frequently in CI, specifically for Azure upgrades; see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-arch%5D%20Check%20if%20alerts%20are%20firing%20during%20or%20after%20upgrade%20success

Azure CI is at 13% success right now, so we're chasing several issues, and this one in particular has a major impact.

Using this link:

https://search.ci.openshift.org/?search=alert+TargetDown+fired+for&maxAge=48h&context=1&type=bug%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

You will see that over the past 2 days, for periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade: 142 runs, 98% failed, 43% of failures match = 42% impact
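
For reference, the 42% "impact" number is just the failure rate multiplied by the match rate. A quick back-of-the-envelope check (my own sketch in Python, not how search.ci actually computes it):

  # Rough sanity check of the impact figure quoted above (assumed formula:
  # impact = failed_runs * fraction_of_failures_matching / total_runs).
  runs = 142
  failed = round(runs * 0.98)        # ~139 failed runs
  matching = round(failed * 0.43)    # ~60 of those hit this TargetDown failure
  print(f"impact ~= {matching / runs:.0%}")  # prints ~42%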

Fixing this will be a major improvement. The following are some Prow URLs that hit the failure, with details on exactly which alert was firing:



https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393563376128000

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success   1h9m29s
Nov 10 13:10:47.064: Unexpected alerts fired or pending during the upgrade:

alert TargetDown fired for 10 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}



https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393564173045760

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success   1h11m59s
Nov 10 13:21:18.362: Unexpected alerts fired or pending during the upgrade:

alert TargetDown fired for 120 seconds with labels: {job="sdn", namespace="openshift-sdn", service="sdn", severity="warning"}



https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393565854961664

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success   1h12m34s
Nov 10 13:19:39.118: Unexpected alerts fired or pending during the upgrade:

alert TargetDown fired for 60 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}



https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393565431336960

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success   1h12m57s
Nov 10 13:18:32.677: Unexpected alerts fired or pending during the upgrade:

alert TargetDown fired for 360 seconds with labels: {job="sdn", namespace="openshift-sdn", service="sdn", severity="warning"}

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]   1h12m58s
fail [github.com/openshift/origin/test/extended/util/disruption/frontends/frontends.go:206]: Nov 10 13:17:41.002: Frontend was unreachable during disruption for at least 10s of 1h12m2s (0%):

Nov 10 12:26:43.704 E ns/openshift-console route/console connection/reused ns/openshift-console route/console connection/reused stopped responding to GET requests over reused connections: Get "https://console-openshift-console.apps.ci-op-cs1kl8b6-253f3.ci2.azure.devcluster.openshift.com/?timeout=10s": context deadline exceeded
Nov 10 12:26:43.704 - 10s   E ns/openshift-console route/console connection/reused ns/openshift-console route/console connection/reused is not  responding to GET requests over reused connections: <nil>
Nov 10 12:26:53.704 I ns/openshift-console route/console connection/reused ns/openshift-console route/console connection/reused started responding to GET requests over reused connections




The test failure is frequently accompanied by ": [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]" failing as well, usually with the same alert message, but not always, as shown in the last Prow job above.
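
For whoever picks this up: TargetDown is the generic cluster-monitoring alert that fires when Prometheus fails to scrape a fraction of a service's endpoints (roughly count(up == 0) over count(up) per job/namespace/service crossing a threshold; check the rule in openshift-monitoring for the exact 4.10 expression). One way to see which endpoints were actually unreachable during the upgrade window is to query 'up == 0' against the cluster's Thanos querier. A minimal sketch, assuming a placeholder route host and token you would substitute for the affected cluster:

  # Sketch: list scrape targets that were down during the upgrade window.
  # ROUTE and TOKEN are placeholders; use the thanos-querier route in
  # openshift-monitoring and a token with cluster-monitoring-view access.
  import requests

  ROUTE = "https://thanos-querier-openshift-monitoring.apps.example.com"  # placeholder
  TOKEN = "<serviceaccount token>"                                        # placeholder

  resp = requests.get(
      f"{ROUTE}/api/v1/query_range",
      headers={"Authorization": f"Bearer {TOKEN}"},
      params={
          "query": 'up{namespace=~"openshift-sdn|openshift-monitoring"} == 0',
          "start": "2021-11-10T12:00:00Z",  # upgrade start, from the job above
          "end": "2021-11-10T13:30:00Z",    # upgrade end
          "step": "30s",
      },
      verify=False,  # CI clusters use self-signed certs
  )
  for series in resp.json()["data"]["result"]:
      m = series["metric"]
      print(m.get("namespace"), m.get("job"), m.get("instance"))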


Given the 42% impact on a job that is effectively failing 100% of the time, TRT is rating this severity high.

Networking seems like the best first guess for a component.

Comment 1 jamo luhrsen 2022-11-02 17:07:34 UTC
@dgoodwin, it looks like this slipped through and I never looked at it. These alerts still
show up, but at a much lower rate than when you initially filed this:

  https://search.ci.openshift.org/?search=alert+TargetDown+fired+for&maxAge=48h&context=1&type=junit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Is this still something specific to look into at this point? Sorry for missing it originally.

Comment 2 Devan Goodwin 2022-11-02 17:59:22 UTC
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 79 runs, 58% failed, 43% of failures match = 25% impact

There are a lot of these that need attention, but the above still looks quite substantial to me; this one is predominantly sdn.

Other jobs show a decent number of ovnkube-node.

Personally I think this is still worth pursuing.

Comment 3 jamo luhrsen 2022-11-18 19:10:36 UTC
Closing this in order to use https://issues.redhat.com/browse/OCPBUGSM-37073 to track this instead.
I will be working on it actively at this point.

