Bug 2093016

Summary: [azure disk] add metric and alert to help identify cascading test failures
Product: OpenShift Container Platform Reporter: Jonathan Dobson <jdobson>
Component: Storage Assignee: Jan Safranek <jsafrane>
Storage sub component: Storage QA Contact: Rohit Patil <ropatil>
Status: CLOSED ERRATA Docs Contact:
Severity: medium    
Priority: medium CC: jsafrane, ropatil
Version: 4.11   
Target Milestone: ---   
Target Release: 4.12.0   
Hardware: All   
OS: Linux   
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-17 19:49:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Jonathan Dobson 2022-06-02 18:15:04 UTC
This job has a failure cascade in it.


Under "events should not repeat pathologically" see "Unable to attach or mount volumes: unmounted volumes=[prometheus-data]".

It seems specific to one worker node: ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs

prometheus-k8s-0 failed to come up on that node, but prometheus-k8s-1 is running in eastus21:

openshift-monitoring                                 prometheus-k8s-0                                                     0/6     Init:0/1            0              102m    <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
openshift-monitoring                                 prometheus-k8s-1                                                     6/6     Running             0              116m    ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus21-9jgd9   <none>           <none>

The inline volume tester pods are stuck in ContainerCreating on that node as well:

e2e-ephemeral-1425                                   inline-volume-tester-8wjcj                                           0/1     ContainerCreating   0              51m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
e2e-ephemeral-1835                                   inline-volume-tester-5b8l4                                           0/1     ContainerCreating   0              34m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>
e2e-ephemeral-3595                                   inline-volume-tester-dkpnm                                           0/1     ContainerCreating   0              57m     <none>         ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>

This is the disk csi driver pod running on that node:

openshift-cluster-csi-drivers                        azure-disk-csi-driver-node-m4j2j                                     3/3     Running             3              130m     ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>           <none>


Lots of GRPC timeout errors in that log while trying to find disks:
E0602 13:23:00.349065       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 7. timed out waiting for the condition
E0602 13:24:22.390001       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 6. timed out waiting for the condition
E0602 13:24:44.866653       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 3. timed out waiting for the condition

That error comes from this line in the driver:

which looks like it is simply timing out waiting for the LUN to appear after rescanning SCSI devices:
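The failure mode above can be sketched from the node side. This is a hedged illustration (the paths follow Azure's udev rules for data disks, not code taken from the CSI driver): the driver resolves a data disk by its LUN through udev-created symlinks under /dev/disk/azure, and when the RHCOS udev rules are missing, the symlink never appears and the driver times out with "failed to find disk on lun N".

```shell
#!/bin/sh
# Hedged sketch: check whether the udev symlink for a given Azure data-disk
# LUN exists. On a node hit by this bug, the symlink is absent even though
# the disk is attached, so the driver's wait loop times out.
lun=7
dev_link="/dev/disk/azure/scsi1/lun${lun}"

if [ -e "$dev_link" ]; then
    echo "LUN ${lun} -> $(readlink -f "$dev_link")"
else
    # At this point the driver would trigger a SCSI bus rescan (writing
    # "- - -" to /sys/class/scsi_host/host*/scan) and poll again; with the
    # udev rules missing, polling never succeeds.
    echo "LUN ${lun}: ${dev_link} not present (udev rules missing?)"
fi
```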

David Eads requested:
Can you author something in openshift-tests to help identify this error mode? This particular failure mode looks ripe for a metric and an alert.

Comment 1 Jan Safranek 2022-06-07 14:36:29 UTC
We need to check:

* Why the volume is not mounted (and check if it was attached correctly).
* If we can get this info in a synthetic CI test, https://github.com/openshift/origin/tree/master/pkg/synthetictests.
* If we can detect it in the CSI driver and emit a metric (+ alert) for it, with some useful info how to fix it.
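An alert along the lines of the last bullet could take roughly this shape. This is an illustrative sketch only, not the rule that was eventually merged: the alert name is made up, and the expression assumes kubelet's storage_operation_duration_seconds_count metric with its status and operation_name labels; treat the exact PromQL as an assumption.

```yaml
# Hypothetical alert rule: fire when a volume plugin on a node has only
# failing mount/attach operations for 5 minutes, and clear on any success.
groups:
- name: storage-operations
  rules:
  - alert: PodStartupStorageOperationsFailing   # name is illustrative
    expr: |
      increase(storage_operation_duration_seconds_count{status != "success",
        operation_name =~ "volume_mount|verify_volumes_are_attached"}[5m]) > 0
      and on (node, volume_plugin)
      increase(storage_operation_duration_seconds_count{status = "success",
        operation_name =~ "volume_mount|verify_volumes_are_attached"}[5m]) == 0
    for: 5m
    labels:
      severity: info
```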

Comment 2 Jan Safranek 2022-06-10 16:14:27 UTC
The root cause of this issue was missing udev rules in RHCOS 8.6. This has been fixed in https://github.com/openshift/os/pull/836

Now, how can we report such errors better?

Comment 3 Jan Safranek 2022-10-18 11:18:38 UTC
I added an alert to OCP that will report when all volume mounts or volume attachments are failing for a volume plugin on a node for 5 minutes. A single success will make the alert go away.
To verify: create a PV pointing to a non-existent iSCSI volume and run a Pod with it. The alert should be Pending within 5 minutes and Firing within 10 minutes. Then create a second Pod with a working iSCSI volume and run it on the same node as the first Pod; the alert should go away soon(ish), because there will be a mix of iSCSI volume mount successes and failures on the node, and even a single success is enough to keep the alert from firing.
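The "PV pointing to a non-existent iSCSI volume" from the verification steps above could look like this hypothetical manifest (the portal uses the reserved TEST-NET address and the IQN is invented, so no target answers and every mount attempt fails):

```yaml
# Hypothetical reproducer PV; bind it with a PVC and run a Pod to make
# mount attempts fail continuously on one node.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: broken-iscsi-pv
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  iscsi:
    targetPortal: 192.0.2.1:3260              # TEST-NET-1, nothing listens here
    iqn: iqn.2003-01.example.invalid:missing  # made-up target
    lun: 0
```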

Comment 10 errata-xmlrpc 2023-01-17 19:49:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.