TRT has noticed periodic release failures due to [bz-monitoring][invariant] alert/Watchdog must have no gaps or changes failures and began tracking the issue in https://issues.redhat.com/browse/TRT-384. The trend we have identified is that the pass rate for Azure upgrades from 4.10 to 4.11 is significantly lower (66%) than it was from 4.9 to 4.10 (89%). We are highlighting this as a potential regression and requesting that the Monitoring team take a look into the failures.
Adding wking & sdodson for awareness per deads2k
This one looks a bit like bug 2079679: the e2e test queries the Thanos API endpoint while prometheus-k8s-0 isn't yet available at the end of the upgrade. [1] should help mitigate the issue. But looking at a specific job failure [2], we probably want to investigate why it takes almost 8 minutes for Kubernetes to restart the prometheus-k8s-0 pod: from the logs [3], it stops at 2022-07-19T02:15:47.921Z and restarts only at 2022-07-19T02:24:15.036Z. The kube-controller-manager logs [4] show some failures related to the PVC which eventually resolve:

I0719 02:16:08.965838 1 actual_state_of_world.go:357] Volume "kubernetes.io/csi/disk.csi.azure.com^/subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-n45js6yv-fde6e-q4bfr-rg/providers/Microsoft.Compute/disks/ci-op-n45js6yv-fde6e-q4bfr-dynamic-pvc-1799fd80-e2ea-4afb-b11e-b1f48a3431b0" is already added to attachedVolume list to node "ci-op-n45js6yv-fde6e-q4bfr-worker-centralus2-2kvlm", update device path ""
E0719 02:16:08.968448 1 attach_detach_controller.go:440] Error creating spec for volume "prometheus-data", pod "openshift-monitoring"/"prometheus-k8s-0": error performing CSI migration checks and translation for PVC "openshift-monitoring"/"prometheus-data-prometheus-k8s-0": nodeName is empty
I0719 02:16:09.060292 1 reconciler.go:250] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/azure-disk//subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-n45js6yv-fde6e-q4bfr-rg/providers/Microsoft.Compute/disks/ci-op-n45js6yv-fde6e-q4bfr-dynamic-pvc-b6d11122-c73b-49b8-9486-ee4a350ef174 VolumeSpec:<nil> NodeName:ci-op-n45js6yv-fde6e-q4bfr-worker-centralus3-zz467 PluginIsAttachable:true DevicePath:0 DeviceMountPath: PluginName:} MountedByNode:false DetachRequestedTime:0001-01-01 00:00:00 +0000 UTC}
...
I0719 02:24:08.284739 1 event.go:294] "Event occurred" object="openshift-monitoring/prometheus-k8s-0" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="SuccessfulAttachVolume" message="AttachVolume.Attach succeeded for volume \"pvc-b6d11122-c73b-49b8-9486-ee4a350ef174\" "

Since the failures look more frequent on Azure, I suggest reassigning this bugzilla to the relevant team (storage?). @Forrest WDYT?

[1] https://github.com/openshift/origin/pull/27224
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1549183926205943808/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-n45js6yv-fde6e-q4bfr-master-1_kube-controller-manager.log
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221658095200000%22,%221658267999000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade%2F1549183926205943808%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D~%5C%22prometheus-k8s-0%5C%22%20%7Ccontainer_name%3D%5C%22prometheus%5C%22%7C~%5C%22Starting%20Prometheus%7CTSDB%7CStart%20listening%7CSee%20you%5C%22%22,%22refId%22:%22A%22%7D%5D
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1549183926205943808/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-n45js6yv-fde6e-q4bfr-master-1_kube-controller-manager.log
Based on Simon's comments, reassigning to Storage to investigate the increased failures on Azure around reattaching the PVC.
These two jobs failed on this today, blocking a payload:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561117536161792
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561119339712512
Moving back to Monitoring for now. Researching the delay in the new pod starting up and the PVC getting attached indicates it is due to the node the pod is assigned to being updated / restarted. I updated https://issues.redhat.com/browse/TRT-384 with more details on that.
Not sure why I missed it, but both jobs mentioned by Benjamin show the same log message ([1], [2]):

#### alertErr=client rate limiter Wait returned an error: context canceled

Both build logs also contain a line saying "Interrupted, terminating tests". IIUC the main context is canceled in a goroutine [3] because the program received SIGINT. The same context is also passed to monitor.FetchEventIntervalsForAllAlerts() [4], which fails since the context is already canceled. Moving back to the Test Framework component since Monitoring doesn't really own that part of the code.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561119339712512/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561117536161792/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt
[3] https://github.com/openshift/origin/blob/2ca9c0cc4191f4296b1120a4ac5ac53c4241b7ef/pkg/test/ginkgo/cmd_runsuite.go#L258
[4] https://github.com/openshift/origin/blob/2ca9c0cc4191f4296b1120a4ac5ac53c4241b7ef/pkg/test/ginkgo/cmd_runsuite.go#L445
Note that we didn't see the "alertErr=client rate limiter Wait returned an error: context canceled" failures in all runs; the only ones we have documented so far are the two runs noted above. When doing search.ci with type=build-log we don't get any results:
https://search.ci.openshift.org/?search=.*alertErr%3Dclient+rate+limiter+Wait+returned+an+error.*&maxAge=336h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
And with type=all we only get this bz:
https://search.ci.openshift.org/?search=.*alertErr%3Dclient+rate+limiter+Wait+returned+an+error.*&maxAge=336h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=none
If we don't have a way to detect / quantify this issue, we might have to start by creating a junit failure for this case so we can track it better. I will close the original story (https://issues.redhat.com/browse/TRT-384) researching the Monitoring issue, and we can use this BZ to track work on collecting more information on the canceled context / resolving that issue.
It is also worth noting that we have observed 100% pass rates in 4.12 with the fix Simon provided earlier via https://bugzilla.redhat.com/show_bug.cgi?id=2079679 and https://bugzilla.redhat.com/show_bug.cgi?id=2108978 has been opened to manage the backport to 4.11.
Closing as outdated until this pops up again.