[sig-arch] Check if alerts are firing during or after upgrade success is failing frequently in CI, specifically for Azure upgrades, see: https://sippy.ci.openshift.org/sippy-ng/tests/4.10/analysis?test=%5Bsig-arch%5D%20Check%20if%20alerts%20are%20firing%20during%20or%20after%20upgrade%20success

Azure CI is at 13% success right now, so we're chasing several issues, and this one in particular has a major impact.

Using this link: https://search.ci.openshift.org/?search=alert+TargetDown+fired+for&maxAge=48h&context=1&type=bug%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

you will see that over the past 2 days, for periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade: 142 runs, 98% failed, 43% of failures match = 42% impact. Fixing this will be a major improvement.

The following are some prow URLs that hit the failure, with relevant details on exactly which alert was firing:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393563376128000

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success (1h9m29s)
Nov 10 13:10:47.064: Unexpected alerts fired or pending during the upgrade:
alert TargetDown fired for 10 seconds with labels: {job="node-exporter", namespace="openshift-monitoring", service="node-exporter", severity="warning"}

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393564173045760

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success (1h11m59s)
Nov 10 13:21:18.362: Unexpected alerts fired or pending during the upgrade:
alert TargetDown fired for 120 seconds with labels: {job="sdn", namespace="openshift-sdn", service="sdn", severity="warning"}

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393565854961664

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success (1h12m34s)
Nov 10 13:19:39.118: Unexpected alerts fired or pending during the upgrade:
alert TargetDown fired for 60 seconds with labels: {job="machine-config-daemon", namespace="openshift-machine-config-operator", service="machine-config-daemon", severity="warning"}

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458393565431336960

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success (1h12m57s)
Nov 10 13:18:32.677: Unexpected alerts fired or pending during the upgrade:
alert TargetDown fired for 360 seconds with labels: {job="sdn", namespace="openshift-sdn", service="sdn", severity="warning"}

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] (1h12m58s)
fail [github.com/openshift/origin/test/extended/util/disruption/frontends/frontends.go:206]: Nov 10 13:17:41.002: Frontend was unreachable during disruption for at least 10s of 1h12m2s (0%):
Nov 10 12:26:43.704 E ns/openshift-console route/console connection/reused ns/openshift-console route/console connection/reused stopped responding to GET requests over reused connections: Get "https://console-openshift-console.apps.ci-op-cs1kl8b6-253f3.ci2.azure.devcluster.openshift.com/?timeout=10s": context deadline exceeded
Nov 10 12:26:43.704 - 10s E ns/openshift-console route/console connection/reused ns/openshift-console route/console connection/reused is not responding to GET requests over reused connections: <nil>
Nov 10 12:26:53.704 I ns/openshift-console route/console connection/reused ns/openshift-console route/console connection/reused started responding to GET requests over reused connections

The test failure is frequently accompanied by "[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]" failing as well, usually with the same alert message, but not always, as shown in the last prow job above.

Given the 42% impact on a job that is effectively failing 100% of the time, TRT is rating this severity high. Networking seems like the best first attempt at a component.
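For anyone digging into one of these runs, here is a rough sketch (not the actual origin test code) of pulling the same alert data straight from the cluster's Prometheus; the Prometheus address, the 2h lookback window, and the excluded alert names are assumptions for illustration only:

// Hypothetical sketch: list alerts that were firing during roughly the upgrade
// window by querying the openshift-monitoring Prometheus, similar in spirit to
// what the "[sig-arch] Check if alerts are firing during or after upgrade" test reports.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Auth/token handling is elided; the address is a placeholder, not a real route.
	client, err := api.NewClient(api.Config{
		Address: "https://prometheus-k8s-openshift-monitoring.example.com",
	})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Count how many evaluation samples each alert spent firing over the last 2h,
	// skipping alerts that are always allowed (e.g. Watchdog). The 2h window and
	// the exclusion list are assumptions for this sketch.
	query := `sort_desc(count_over_time(ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}[2h])) > 0`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Each returned series carries the alert labels (job, namespace, service, severity)
	// that show up in the failure messages above.
	fmt.Println(result)
}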
@dgoodwin, it looks like this slipped through and I never looked at it. These alerts still show up, but seemingly at a much lower rate than when you initially filed it: https://search.ci.openshift.org/?search=alert+TargetDown+fired+for&maxAge=48h&context=1&type=junit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job Is this something specific to still look into at this point? Sorry for missing it originally.
periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade (all) - 79 runs, 58% failed, 43% of failures match = 25% impact

There are a lot of these that need attention, but the above still looks quite substantial to me, and this one is predominantly sdn. Other jobs show a decent number of ovnkube-node hits. Personally, I think this is still worth pursuing.
Closing this in order to use https://issues.redhat.com/browse/OCPBUGSM-37073 to track it; we will be working on it actively at this point.