This job has a failure cascade in it.
Under "events should not repeat pathologically" see "Unable to attach or mount volumes: unmounted volumes=[prometheus-data]".
It seems specific to one worker node: ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs
prometheus-k8s-0 failed to come up on that node, but prometheus-k8s-1 is running in eastus21:
openshift-monitoring prometheus-k8s-0 0/6 Init:0/1 0 102m <none> ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs <none> <none>
openshift-monitoring prometheus-k8s-1 6/6 Running 0 116m 10.131.0.24 ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus21-9jgd9 <none> <none>
The inline volume tester pods are stuck in ContainerCreating on that node as well:
e2e-ephemeral-1425 inline-volume-tester-8wjcj 0/1 ContainerCreating 0 51m <none> ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs <none> <none>
e2e-ephemeral-1835 inline-volume-tester-5b8l4 0/1 ContainerCreating 0 34m <none> ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs <none> <none>
e2e-ephemeral-3595 inline-volume-tester-dkpnm 0/1 ContainerCreating 0 57m <none> ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs <none> <none>
This is the Azure Disk CSI driver pod running on that node:
openshift-cluster-csi-drivers azure-disk-csi-driver-node-m4j2j 3/3 Running 3 130m 10.0.128.5 ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs <none> <none>
Lots of GRPC timeout errors in that log while trying to find disks:
E0602 13:23:00.349065 1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 7. timed out waiting for the condition
E0602 13:24:22.390001 1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 6. timed out waiting for the condition
E0602 13:24:44.866653 1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 3. timed out waiting for the condition
That error comes from this line in the driver:
It looks like the driver is simply timing out while waiting for the LUN to appear after rescanning SCSI devices:
David Eads requested:
Can you author something in openshift-tests to help identify this error mode? This particular failure mode looks ripe for a metric and an alert.
We need to check:
* Why the volume is not mounted (and check if it was attached correctly).
* If we can get this info in a synthetic CI test, https://github.com/openshift/origin/tree/master/pkg/synthetictests.
* If we can detect it in the CSI driver and emit a metric (+ alert) for it, with some useful info how to fix it.
The root cause of this issue was missing udev rules in RHCOS 8.6. This has been fixed in https://github.com/openshift/os/pull/836.
Now, how can we report such errors better?
I added an alert to OCP that will report when all volume mounts or volume attachments are failing for a volume plugin on a node for 5 minutes. A single success will make the alert go away.
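For reference, such a rule can be expressed roughly as follows. This is only an illustrative sketch based on kubelet's storage operation metrics; the alert name, metric labels, and thresholds are assumptions, not the exact rule that shipped:

```yaml
# Sketch only -- names and thresholds are assumptions, not the shipped rule.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: storage-operations-failing-sketch
spec:
  groups:
  - name: storage-operations
    rules:
    - alert: StorageVolumeMountsFailing
      # Fires when mount operations for a volume plugin on a node have
      # failed, with zero successes, over the last 5 minutes.
      expr: |
        increase(storage_operation_duration_seconds_count{status != "success", operation_name = "volume_mount"}[5m]) > 0
        and
        increase(storage_operation_duration_seconds_count{status = "success", operation_name = "volume_mount"}[5m]) == 0
      for: 5m
      labels:
        severity: info
```

The `and` between the two selectors is what implements "a single success makes the alert go away": any successful mount in the window zeroes out the match.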
I.e. create a PV pointing to a non-existing iSCSI volume and run a Pod with it. The alert should be Pending after 5 minutes and Firing after 10 minutes. Then create a second Pod with a working iSCSI volume on the same node as the first one; the alert should clear soon(ish), because the node now sees a mix of iSCSI mount successes and failures, and even a single success is enough to keep the alert from firing.
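The broken half of that test can be sketched as a PV whose iSCSI target simply does not exist (all names and the portal address below are made up; any unreachable target works). Bind a PVC and a Pod to it to start generating mount failures on the node:

```yaml
# Negative-path sketch: every mount attempt against this PV will fail,
# because nothing listens at the target portal (192.0.2.0/24 is TEST-NET).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: broken-iscsi-pv
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  iscsi:
    targetPortal: 192.0.2.10:3260
    iqn: iqn.2003-01.example.com:nonexistent
    lun: 0
```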
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.