Bug 2108192 - [bz-monitoring][invariant] alert/Watchdog must have no gaps or changes: Regression in upgrades for Azure
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Framework
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: OpenShift Release Oversight
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-07-18 14:46 UTC by Forrest Babcock
Modified: 2022-11-21 19:29 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-21 19:29:18 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Issue Tracker TRT-384 (last updated 2022-07-18 14:46:45 UTC)

Description Forrest Babcock 2022-07-18 14:46:46 UTC
TRT has noticed periodic release failures caused by "[bz-monitoring][invariant] alert/Watchdog must have no gaps or changes" test failures and began tracking the issue in https://issues.redhat.com/browse/TRT-384.

The trend we have identified is that the pass rate for Azure upgrades from 4.10 to 4.11 is significantly lower (66%) than it was from 4.9 to 4.10 (89%). We are highlighting this as a potential regression and requesting that the Monitoring team take a look into the failures.

Comment 2 Forrest Babcock 2022-07-18 16:04:09 UTC
Adding wking & sdodson for awareness per deads2k

Comment 3 Simon Pasquier 2022-07-19 11:56:51 UTC
This one looks a bit like bug 2079679: the e2e test queries the Thanos API endpoint while prometheus-k8s-0 isn't yet available at the end of the upgrade. [1] should help mitigate the issue.
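
For illustration only, a minimal sketch of the kind of retry/tolerance that a mitigation like [1] is aiming for, assuming a plain Prometheus-compatible query endpoint; the package name, URL path, query, and timeouts below are made up for this sketch and are not the actual change in the PR:

package alertcheck

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// queryWatchdogWithRetry polls the query endpoint until it answers, instead of
// failing on the first transient error while prometheus-k8s-0 is still coming back.
func queryWatchdogWithRetry(ctx context.Context, thanosURL string) ([]byte, error) {
	query := url.Values{"query": {`ALERTS{alertname="Watchdog",alertstate="firing"}`}}
	var lastErr error
	// Poll for up to 5 minutes; the restarting pod should be back well within that window.
	deadline := time.Now().Add(5 * time.Minute)
	for time.Now().Before(deadline) {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet,
			thanosURL+"/api/v1/query?"+query.Encode(), nil)
		if err != nil {
			return nil, err
		}
		resp, err := http.DefaultClient.Do(req)
		if err == nil && resp.StatusCode == http.StatusOK {
			defer resp.Body.Close()
			return io.ReadAll(resp.Body)
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		time.Sleep(10 * time.Second)
	}
	return nil, fmt.Errorf("querying Watchdog alert: %w", lastErr)
}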

But looking at a specific job failure [2], we probably want to investigate why it takes almost 8 minutes for Kubernetes to restart the prometheus-k8s-0 pod: from the logs [3], it stops at 2022-07-19T02:15:47.921Z and doesn't restart until 2022-07-19T02:24:15.036Z.

The kube-controller-manager logs [4] show some failures related to the PVC which eventually resolve:

I0719 02:16:08.965838       1 actual_state_of_world.go:357] Volume "kubernetes.io/csi/disk.csi.azure.com^/subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-n45js6yv-fde6e-q4bfr-rg/providers/Microsoft.Compute/disks/ci-op-n45js6yv-fde6e-q4bfr-dynamic-pvc-1799fd80-e2ea-4afb-b11e-b1f48a3431b0" is already added to attachedVolume list to node "ci-op-n45js6yv-fde6e-q4bfr-worker-centralus2-2kvlm", update device path ""
E0719 02:16:08.968448       1 attach_detach_controller.go:440] Error creating spec for volume "prometheus-data", pod "openshift-monitoring"/"prometheus-k8s-0": error performing CSI migration checks and translation for PVC "openshift-monitoring"/"prometheus-data-prometheus-k8s-0": nodeName is empty
I0719 02:16:09.060292       1 reconciler.go:250] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/azure-disk//subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-n45js6yv-fde6e-q4bfr-rg/providers/Microsoft.Compute/disks/ci-op-n45js6yv-fde6e-q4bfr-dynamic-pvc-b6d11122-c73b-49b8-9486-ee4a350ef174 VolumeSpec:<nil> NodeName:ci-op-n45js6yv-fde6e-q4bfr-worker-centralus3-zz467 PluginIsAttachable:true DevicePath:0 DeviceMountPath: PluginName:} MountedByNode:false DetachRequestedTime:0001-01-01 00:00:00 +0000 UTC}
...
I0719 02:24:08.284739       1 event.go:294] "Event occurred" object="openshift-monitoring/prometheus-k8s-0" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="SuccessfulAttachVolume" message="AttachVolume.Attach succeeded for volume \"pvc-b6d11122-c73b-49b8-9486-ee4a350ef174\" "
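
As a side note, a hypothetical client-go helper (not part of the CI tooling) that lists the pod's events to bracket how long the attach actually took; the kubeconfig handling and output format are assumptions for this sketch, while the namespace and pod name come from the job above:

package inspect

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func printPrometheusPodEvents(ctx context.Context, kubeconfig string) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}
	// Only events that reference the prometheus-k8s-0 pod in openshift-monitoring.
	events, err := client.CoreV1().Events("openshift-monitoring").List(ctx, metav1.ListOptions{
		FieldSelector: "involvedObject.name=prometheus-k8s-0,involvedObject.kind=Pod",
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		// Reasons like Scheduled, FailedAttachVolume and SuccessfulAttachVolume
		// bracket the window in which the pod was waiting on its PVC.
		fmt.Printf("%s  %-25s %s\n", e.LastTimestamp.Format("15:04:05"), e.Reason, e.Message)
	}
	return nil
}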

Since the failures look more frequent on Azure, I suggest reassigning this bugzilla to the relevant team (storage?). @Forrest WDYT?

[1] https://github.com/openshift/origin/pull/27224
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1549183926205943808/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-n45js6yv-fde6e-q4bfr-master-1_kube-controller-manager.log
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221658095200000%22,%221658267999000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade%2F1549183926205943808%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D~%5C%22prometheus-k8s-0%5C%22%20%7Ccontainer_name%3D%5C%22prometheus%5C%22%7C~%5C%22Starting%20Prometheus%7CTSDB%7CStart%20listening%7CSee%20you%5C%22%22,%22refId%22:%22A%22%7D%5D
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1549183926205943808/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-n45js6yv-fde6e-q4bfr-master-1_kube-controller-manager.log

Comment 4 Forrest Babcock 2022-07-19 16:57:22 UTC
Based on Simon's comments, reassigning to Storage to investigate the increased failures on Azure related to reattaching the PVC.

Comment 6 Forrest Babcock 2022-07-22 16:40:06 UTC
Moving back to Monitoring for now. Researching the delay in the new pod starting up and the PVC getting attached indicates it is due to the node the pod is assigned to getting updated / restarted. I updated https://issues.redhat.com/browse/TRT-384 with more details on that.

Comment 7 Simon Pasquier 2022-07-25 15:46:00 UTC
Not sure why I missed it, but both jobs mentioned by Benjamin show the same log message ([1], [2]):

#### alertErr=client rate limiter Wait returned an error: context canceled

Both build logs also contain a line saying "Interrupted, terminating tests".

IIUC the main context is canceled in a goroutine [3] because the program received SIGINT. The same context is also passed to monitor.FetchEventIntervalsForAllAlerts() [4], which fails since the context is already canceled. Moving back to the Test Framework component since monitoring doesn't really own that part of the code.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561119339712512/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561117536161792/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt
[3] https://github.com/openshift/origin/blob/2ca9c0cc4191f4296b1120a4ac5ac53c4241b7ef/pkg/test/ginkgo/cmd_runsuite.go#L258
[4] https://github.com/openshift/origin/blob/2ca9c0cc4191f4296b1120a4ac5ac53c4241b7ef/pkg/test/ginkgo/cmd_runsuite.go#L445
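
To make the failure mode concrete, a minimal standalone sketch of the pattern described above; it is not the actual openshift/origin code, and fetchAlertIntervals merely stands in for monitor.FetchEventIntervalsForAllAlerts:

package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// fetchAlertIntervals stands in for monitor.FetchEventIntervalsForAllAlerts.
func fetchAlertIntervals(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		// A client built on this context surfaces errors like
		// "client rate limiter Wait returned an error: context canceled".
		return fmt.Errorf("client rate limiter Wait returned an error: %w", err)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Same idea as in cmd_runsuite.go: a goroutine cancels the shared context
	// when the process is interrupted.
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigCh
		fmt.Println("Interrupted, terminating tests")
		cancel()
	}()

	// ... tests run here ...
	time.Sleep(100 * time.Millisecond)

	// Post-run alert collection reuses the (possibly already canceled) context.
	if err := fetchAlertIntervals(ctx); err != nil && errors.Is(err, context.Canceled) {
		fmt.Printf("#### alertErr=%v\n", err)
	}
}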

Comment 8 Forrest Babcock 2022-07-27 14:59:41 UTC
Note that we didn't see the

#### alertErr=client rate limiter Wait returned an error: context canceled

and "Interrupted, terminating tests" failures in all runs. The only runs we have documented them in so far are the two noted above.

When searching search.ci with type=build-log we don't get any results: https://search.ci.openshift.org/?search=.*alertErr%3Dclient+rate+limiter+Wait+returned+an+error.*&maxAge=336h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

And with type=all we only get this BZ: https://search.ci.openshift.org/?search=.*alertErr%3Dclient+rate+limiter+Wait+returned+an+error.*&maxAge=336h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=none


If we don't have a way to detect / quantify this issue, we might have to start by creating a JUnit failure for this case so we can track it better.
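
A hypothetical sketch of what such a synthetic JUnit failure could look like; the types, file path handling, and test-case name below are illustrative and are not the junit helpers that openshift/origin actually uses:

package junitsketch

import (
	"encoding/xml"
	"os"
)

type testSuite struct {
	XMLName   xml.Name   `xml:"testsuite"`
	Name      string     `xml:"name,attr"`
	Tests     int        `xml:"tests,attr"`
	Failures  int        `xml:"failures,attr"`
	TestCases []testCase `xml:"testcase"`
}

type testCase struct {
	Name    string   `xml:"name,attr"`
	Failure *failure `xml:"failure,omitempty"`
}

type failure struct {
	Message string `xml:",chardata"`
}

// writeAlertCollectionFailure records the otherwise-invisible
// "alertErr=... context canceled" condition as a test failure so it shows
// up in search.ci and job aggregation.
func writeAlertCollectionFailure(path string, alertErr error) error {
	suite := testSuite{
		Name:     "alert collection",
		Tests:    1,
		Failures: 1,
		TestCases: []testCase{{
			Name:    "[sig-trt] alert data should be collected after the test run",
			Failure: &failure{Message: alertErr.Error()},
		}},
	}
	out, err := xml.MarshalIndent(suite, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, append(out, '\n'), 0o644)
}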

I will close the original story (https://issues.redhat.com/browse/TRT-384) that was researching the Monitoring issue, and we can use this BZ to track the work on collecting more information about the canceled context and resolving that issue.

Comment 9 Forrest Babcock 2022-07-27 15:05:01 UTC
It is also worth noting that we have observed 100% pass rates in 4.12 with the fix Simon provided earlier via https://bugzilla.redhat.com/show_bug.cgi?id=2079679, and https://bugzilla.redhat.com/show_bug.cgi?id=2108978 has been opened to manage the backport to 4.11.

Comment 12 Devan Goodwin 2022-11-21 19:29:18 UTC
Closing as outdated until this pops up again.

