Bug 2093016 - [azure disk] add metric and alert to help identify cascading test failures
Summary: [azure disk] add metric and alert to help identify cascading test failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.11
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.12.0
Assignee: Jan Safranek
QA Contact: Rohit Patil
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-06-02 18:15 UTC by Jonathan Dobson
Modified: 2023-01-17 19:49 UTC
CC: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 19:49:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-storage-operator pull 324 0 None open Bug 2093016: Add alert about attach / mount failing 2022-10-18 11:08:01 UTC
Red Hat Product Errata RHSA-2022:7399 0 None None None 2023-01-17 19:49:50 UTC

Description Jonathan Dobson 2022-06-02 18:15:04 UTC
This job has a failure cascade in it.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432

Under "events should not repeat pathologically" see "Unable to attach or mount volumes: unmounted volumes=[prometheus-data]".

It seems specific to one worker node: ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432/artifacts/e2e-azure-upgrade/gather-extra/artifacts/oc_cmds/pods

prometheus-k8s-0 failed to come up on that node, but prometheus-k8s-1 is running in eastus21:

openshift-monitoring                                 prometheus-k8s-0                                                     0/6     Init:0/1            0              102m    <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
openshift-monitoring                                 prometheus-k8s-1                                                     6/6     Running             0              116m    10.131.0.24    ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus21-9jgd9   <none>           <none>

The inline volume tester pods are stuck in ContainerCreating on that node as well:

e2e-ephemeral-1425                                   inline-volume-tester-8wjcj                                           0/1     ContainerCreating   0              51m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
e2e-ephemeral-1835                                   inline-volume-tester-5b8l4                                           0/1     ContainerCreating   0              34m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
e2e-ephemeral-3595                                   inline-volume-tester-dkpnm                                           0/1     ContainerCreating   0              57m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>

This is the disk csi driver pod running on that node:

openshift-cluster-csi-drivers                        azure-disk-csi-driver-node-m4j2j                                     3/3     Running             3              130m    10.0.128.5     ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_azure-disk-csi-driver-node-m4j2j_csi-driver.log

Lots of GRPC timeout errors in that log while trying to find disks:
E0602 13:23:00.349065       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 7. timed out waiting for the condition
E0602 13:24:22.390001       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 6. timed out waiting for the condition
E0602 13:24:44.866653       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 3. timed out waiting for the condition

That error is coming from this line in the driver:
https://github.com/openshift/azure-disk-csi-driver/blob/0fe424e846435a1695920c2b05fcf25b42d9f76d/pkg/azuredisk/nodeserver.go#L99-L102

Which... looks like it's just timing out waiting for the lun to appear after rescanning scsi devices:
https://github.com/openshift/azure-disk-csi-driver/blob/0fe424e846435a1695920c2b05fcf25b42d9f76d/pkg/azuredisk/nodeserver.go#L642
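The driver's wait loop can be sketched as follows. This is a minimal illustration, not the real azure-disk-csi-driver code (which is Go): the function name, parameters, and the injectable `exists` check are hypothetical, but the shape matches the linked lines, where the driver polls for the LUN device after a SCSI rescan and eventually gives up with the "failed to find disk on lun N" error seen in the log.

```python
import time

def find_disk_on_lun(lun, exists, timeout=5.0, interval=0.1):
    """Sketch of the driver's behavior (hypothetical names, not the real
    azure-disk-csi-driver API): after triggering a SCSI rescan, poll until
    the device for the given LUN appears, or fail with the same error text
    the CSI driver logs."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if exists(lun):
            # The Azure udev rules normally create this symlink; with the
            # rules missing (the RHCOS 8.6 bug), it never appears.
            return f"/dev/disk/azure/scsi1/lun{lun}"
        time.sleep(interval)
    raise TimeoutError(f"failed to find disk on lun {lun}. "
                       "timed out waiting for the condition")

# With the udev symlink never appearing, every call times out:
try:
    find_disk_on_lun(7, exists=lambda lun: False, timeout=0.3)
except TimeoutError as e:
    print(e)  # failed to find disk on lun 7. timed out waiting for the condition
```

This makes clear why the failure is node-local and persistent: no amount of retrying by kubelet helps until the udev rules exist on that node.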

David Eads requested:
Can you author something in openshift-tests to help identify this error mode? This particular failure mode looks ripe for a metric and an alert.

Comment 1 Jan Safranek 2022-06-07 14:36:29 UTC
We need to check:

* Why the volume is not mounted (and check if it was attached correctly).
* If we can get this info in a synthetic CI test, https://github.com/openshift/origin/tree/master/pkg/synthetictests.
* If we can detect it in the CSI driver and emit a metric (+ alert) for it, with some useful info how to fix it.

Comment 2 Jan Safranek 2022-06-10 16:14:27 UTC
The root cause of this issue was missing udev rules in RHCOS 8.6. This has been fixed in https://github.com/openshift/os/pull/836

Now, how can we report such errors better?

Comment 3 Jan Safranek 2022-10-18 11:18:38 UTC
I added an alert to OCP that reports when all volume mounts or volume attachments for a volume plugin are failing on a node for 5 minutes. A single success clears the alert.
To verify: create a PV pointing to a non-existent iSCSI volume and run a Pod with it. The alert should be Pending after 5 minutes and Firing after 10 minutes. Then create a second Pod with a working iSCSI volume and run it on the same node as the first pod; the alert should clear soon(ish), because the node now sees a mix of iSCSI mount successes and failures, and even a single success is enough to keep the alert from firing.
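The exact rule shipped via the linked cluster-storage-operator PR isn't quoted here; as a sketch, assuming kubelet's storage_operation_duration_seconds_count metric (with operation_name, volume_plugin, and status labels), an alert with this shape would implement the "all failing for 5 minutes, one success clears it" behavior described above. All names below are illustrative.

```yaml
# Hypothetical PrometheusRule sketch -- metric, labels, alert name, and
# severity are assumptions; the real rule in the PR may differ.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-operations-failing   # illustrative name
spec:
  groups:
  - name: storage-operations
    rules:
    - alert: PodStartupStorageOperationsFailing   # illustrative name
      expr: |
        # Some attach/mount operations failed on the node in the last 5m...
        sum by (node, volume_plugin) (increase(storage_operation_duration_seconds_count{status!="success",operation_name=~"volume_attach|volume_mount"}[5m])) > 0
          and
        # ...and none succeeded, so a single success clears the alert.
        sum by (node, volume_plugin) (increase(storage_operation_duration_seconds_count{status="success",operation_name=~"volume_attach|volume_mount"}[5m])) == 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "All volume attach/mount operations for {{ $labels.volume_plugin }} are failing on {{ $labels.node }}"
```

Summing by node and volume_plugin is what makes the `and` match: both sides collapse to the same label set, so the alert fires per node and plugin rather than per individual volume.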

Comment 10 errata-xmlrpc 2023-01-17 19:49:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7399
