TRT has noticed periodic release failures due to [bz-monitoring][invariant] alert/Watchdog must have no gaps or changes failures and began tracking the issue in https://issues.redhat.com/browse/TRT-384. The trend we have identified is that the pass rate for Azure upgrades from 4.10 to 4.11 is significantly lower (66%) than it was from 4.9 to 4.10 (89%). We are highlighting this as a potential regression and requesting that the Monitoring team take a look into the failures.
Adding wking & sdodson for awareness per deads2k
This one looks a bit like bug 2079679: the e2e test queries the Thanos API endpoint while prometheus-k8s-0 isn't yet available at the end of the upgrade. [1] should help mitigate the issue. But looking at a specific job failure [2], we probably want to investigate why it takes almost 8 minutes for Kubernetes to restart the prometheus-k8s-0 pod: from the logs [3], it stops at 2022-07-19T02:15:47.921Z and restarts only at 2022-07-19T02:24:15.036Z. The kube-controller-manager logs [4] show some failures related to the PVC which eventually resolve:

I0719 02:16:08.965838 1 actual_state_of_world.go:357] Volume "kubernetes.io/csi/disk.csi.azure.com^/subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-n45js6yv-fde6e-q4bfr-rg/providers/Microsoft.Compute/disks/ci-op-n45js6yv-fde6e-q4bfr-dynamic-pvc-1799fd80-e2ea-4afb-b11e-b1f48a3431b0" is already added to attachedVolume list to node "ci-op-n45js6yv-fde6e-q4bfr-worker-centralus2-2kvlm", update device path ""
E0719 02:16:08.968448 1 attach_detach_controller.go:440] Error creating spec for volume "prometheus-data", pod "openshift-monitoring"/"prometheus-k8s-0": error performing CSI migration checks and translation for PVC "openshift-monitoring"/"prometheus-data-prometheus-k8s-0": nodeName is empty
I0719 02:16:09.060292 1 reconciler.go:250] "attacherDetacher.DetachVolume started" volume={AttachedVolume:{VolumeName:kubernetes.io/azure-disk//subscriptions/72e3a972-58b0-4afc-bd4f-da89b39ccebd/resourceGroups/ci-op-n45js6yv-fde6e-q4bfr-rg/providers/Microsoft.Compute/disks/ci-op-n45js6yv-fde6e-q4bfr-dynamic-pvc-b6d11122-c73b-49b8-9486-ee4a350ef174 VolumeSpec:<nil> NodeName:ci-op-n45js6yv-fde6e-q4bfr-worker-centralus3-zz467 PluginIsAttachable:true DevicePath:0 DeviceMountPath: PluginName:} MountedByNode:false DetachRequestedTime:0001-01-01 00:00:00 +0000 UTC}
...
I0719 02:24:08.284739 1 event.go:294] "Event occurred" object="openshift-monitoring/prometheus-k8s-0" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="SuccessfulAttachVolume" message="AttachVolume.Attach succeeded for volume \"pvc-b6d11122-c73b-49b8-9486-ee4a350ef174\" "

Since the failures look more frequent on Azure, I suggest reassigning this bugzilla to the relevant team (storage?). @Forrest WDYT?

[1] https://github.com/openshift/origin/pull/27224
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1549183926205943808/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-n45js6yv-fde6e-q4bfr-master-1_kube-controller-manager.log
[3] https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%5B%221658095200000%22,%221658267999000%22,%22Grafana%20Cloud%22,%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fperiodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade%2F1549183926205943808%5C%22%7D%20%7C%20unpack%20%7C%20pod_name%3D~%5C%22prometheus-k8s-0%5C%22%20%7Ccontainer_name%3D%5C%22prometheus%5C%22%7C~%5C%22Starting%20Prometheus%7CTSDB%7CStart%20listening%7CSee%20you%5C%22%22,%22refId%22:%22A%22%7D%5D
[4] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1549183926205943808/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-kube-controller-manager_kube-controller-manager-ci-op-n45js6yv-fde6e-q4bfr-master-1_kube-controller-manager.log
Based on Simon's comments, reassigning to Storage to investigate the increased failures on Azure around reattaching the PVC.
These two jobs failed on this today, blocking a payload:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561117536161792
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561119339712512
Moving back to Monitoring for now. Researching the delay in the new pod starting up and the PVC getting attached indicates it is due to the node the pod is assigned to being updated / restarted. I updated https://issues.redhat.com/browse/TRT-384 with more details on that.
Not sure why I missed it, but both jobs mentioned by Benjamin show the same log message ([1], [2]):

#### alertErr=client rate limiter Wait returned an error: context canceled

Both build logs also contain a line saying "Interrupted, terminating tests". IIUC the main context is canceled in a goroutine [3] because the program received SIGINT. The same context is also passed to monitor.FetchEventIntervalsForAllAlerts() [4], which fails since the context is already canceled. Moving back to the Test Framework component since Monitoring doesn't really own that part of the code.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561119339712512/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-upgrade/1549561117536161792/artifacts/e2e-azure-upgrade/openshift-e2e-test/build-log.txt
[3] https://github.com/openshift/origin/blob/2ca9c0cc4191f4296b1120a4ac5ac53c4241b7ef/pkg/test/ginkgo/cmd_runsuite.go#L258
[4] https://github.com/openshift/origin/blob/2ca9c0cc4191f4296b1120a4ac5ac53c4241b7ef/pkg/test/ginkgo/cmd_runsuite.go#L445
Note that we didn't see the "alertErr=client rate limiter Wait returned an error: context canceled" failures in all runs; the only ones we have documented so far are the two runs noted above. When doing search.ci with type=build-log we don't get any results:
https://search.ci.openshift.org/?search=.*alertErr%3Dclient+rate+limiter+Wait+returned+an+error.*&maxAge=336h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
And with type=all we only get this bz:
https://search.ci.openshift.org/?search=.*alertErr%3Dclient+rate+limiter+Wait+returned+an+error.*&maxAge=336h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=none
If we don't have a way to detect / quantify this issue, we might have to start by creating a junit failure for this case so we can track it better. I will close the original story (https://issues.redhat.com/browse/TRT-384) researching the Monitoring issue, and we can use this BZ to track work on collecting more information on the canceled context / resolving that issue.
It is also worth noting that we have observed 100% pass rates in 4.12 with the fix Simon provided earlier via https://bugzilla.redhat.com/show_bug.cgi?id=2079679 and https://bugzilla.redhat.com/show_bug.cgi?id=2108978 has been opened to manage the backport to 4.11.
Closing as outdated until this pops up again.