Bug 2093016 - [azure disk] add metric and alert to help identify cascading test failures
Summary: [azure disk] add metric and alert to help identify cascading test failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.11
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Jan Safranek
QA Contact: Rohit Patil
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-02 18:15 UTC by Jonathan Dobson
Modified: 2023-01-17 19:49 UTC
CC: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:49:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-storage-operator pull 324 0 None open Bug 2093016: Add alert about attach / mount failing 2022-10-18 11:08:01 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:49:50 UTC

Description Jonathan Dobson 2022-06-02 18:15:04 UTC
This job has a failure cascade in it.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432

Under "events should not repeat pathologically" see "Unable to attach or mount volumes: unmounted volumes=[prometheus-data]".

It seems specific to one worker node: ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432/artifacts/e2e-azure-upgrade/gather-extra/artifacts/oc_cmds/pods

prometheus-k8s-0 failed to come up on that node, but prometheus-k8s-1 is running in eastus21:

openshift-monitoring                                 prometheus-k8s-0                                                     0/6     Init:0/1            0              102m    <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
openshift-monitoring                                 prometheus-k8s-1                                                     6/6     Running             0              116m    10.131.0.24    ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus21-9jgd9   <none>           <none>

The inline volume tester pods are stuck in ContainerCreating on that node as well:

e2e-ephemeral-1425                                   inline-volume-tester-8wjcj                                           0/1     ContainerCreating   0              51m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
e2e-ephemeral-1835                                   inline-volume-tester-5b8l4                                           0/1     ContainerCreating   0              34m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
e2e-ephemeral-3595                                   inline-volume-tester-dkpnm                                           0/1     ContainerCreating   0              57m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>

This is the disk csi driver pod running on that node:

openshift-cluster-csi-drivers                        azure-disk-csi-driver-node-m4j2j                                     3/3     Running             3              130m    10.0.128.5     ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_azure-disk-csi-driver-node-m4j2j_csi-driver.log

Lots of GRPC timeout errors in that log while trying to find disks:
E0602 13:23:00.349065       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 7. timed out waiting for the condition
E0602 13:24:22.390001       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 6. timed out waiting for the condition
E0602 13:24:44.866653       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 3. timed out waiting for the condition

That error is coming from this line in the driver:
https://github.com/openshift/azure-disk-csi-driver/blob/0fe424e846435a1695920c2b05fcf25b42d9f76d/pkg/azuredisk/nodeserver.go#L99-L102

Which... looks like it's just timing out waiting for the lun to appear after rescanning scsi devices:
https://github.com/openshift/azure-disk-csi-driver/blob/0fe424e846435a1695920c2b05fcf25b42d9f76d/pkg/azuredisk/nodeserver.go#L642
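The driver's wait loop can be sketched as follows. This is a minimal illustration, not the real azure-disk-csi-driver code (which is Go): the function name, parameters, and the injectable `exists` check are hypothetical, but the shape matches the linked lines, where the driver polls for the LUN device after a SCSI rescan and eventually gives up with the "failed to find disk on lun N" error seen in the log.

```python
import time

def find_disk_on_lun(lun, exists, timeout=5.0, interval=0.1):
    """Sketch of the driver's behavior (hypothetical names, not the real
    azure-disk-csi-driver API): after triggering a SCSI rescan, poll until
    the device for the given LUN appears, or fail with the same error text
    the CSI driver logs."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if exists(lun):
            # The Azure udev rules normally create this symlink; with the
            # rules missing (the RHCOS 8.6 bug), it never appears.
            return f"/dev/disk/azure/scsi1/lun{lun}"
        time.sleep(interval)
    raise TimeoutError(f"failed to find disk on lun {lun}. "
                       "timed out waiting for the condition")

# With the udev symlink never appearing, every call times out:
try:
    find_disk_on_lun(7, exists=lambda lun: False, timeout=0.3)
except TimeoutError as e:
    print(e)  # failed to find disk on lun 7. timed out waiting for the condition
```

This makes clear why the failure is node-local and persistent: no amount of retrying by kubelet helps until the udev rules exist on that node.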

David Eads requested:
Can you author something in openshift-tests to help identify this error mode? This particular failure mode looks ripe for a metric and an alert.

Comment 1 Jan Safranek 2022-06-07 14:36:29 UTC
We need to check:

* Why the volume is not mounted (and check if it was attached correctly).
* If we can get this info in a synthetic CI test, https://github.com/openshift/origin/tree/master/pkg/synthetictests.
* If we can detect it in the CSI driver and emit a metric (+ alert) for it, with some useful info how to fix it.

Comment 2 Jan Safranek 2022-06-10 16:14:27 UTC
The root cause of this issue was missing udev rules in RHCOS 8.6. This has been fixed in https://github.com/openshift/os/pull/836

Now, how can we report such errors better?

Comment 3 Jan Safranek 2022-10-18 11:18:38 UTC
I added an alert to OCP that reports when all volume mounts or volume attachments for a volume plugin are failing on a node for 5 minutes. A single success clears the alert.
To verify: create a PV pointing to a non-existent iSCSI volume and run a Pod with it. The alert should be Pending after 5 minutes and Firing after 10 minutes. Then create a second Pod with a working iSCSI volume and run it on the same node as the first pod; the alert should clear soon(ish), because the node now sees a mix of iSCSI mount successes and failures, and even a single success is enough to keep the alert from firing.
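The exact rule shipped via the linked cluster-storage-operator PR isn't quoted here; as a sketch, assuming kubelet's storage_operation_duration_seconds_count metric (with operation_name, volume_plugin, and status labels), an alert with this shape would implement the "all failing for 5 minutes, one success clears it" behavior described above. All names below are illustrative.

```yaml
# Hypothetical PrometheusRule sketch -- metric, labels, alert name, and
# severity are assumptions; the real rule in the PR may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-operations-failing   # illustrative name
spec:
  groups:
  - name: storage-operations
    rules:
    - alert: PodStartupStorageOperationsFailing   # illustrative name
      expr: |
        # Some attach/mount operations failed on the node in the last 5m...
        sum by (node, volume_plugin) (increase(storage_operation_duration_seconds_count{status!="success",operation_name=~"volume_attach|volume_mount"}[5m])) > 0
          and
        # ...and none succeeded, so a single success clears the alert.
        sum by (node, volume_plugin) (increase(storage_operation_duration_seconds_count{status="success",operation_name=~"volume_attach|volume_mount"}[5m])) == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "All volume attach/mount operations for {{ $labels.volume_plugin }} are failing on {{ $labels.node }}"
```

Summing by node and volume_plugin is what makes the `and` match: both sides collapse to the same label set, so the alert fires per node and plugin rather than per individual volume.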

Comment 10 errata-xmlrpc 2023-01-17 19:49:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
