This job has a failure cascade in it:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432

Under "events should not repeat pathologically", see "Unable to attach or mount volumes: unmounted volumes=[prometheus-data]". It seems specific to one worker node: ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432/artifacts/e2e-azure-upgrade/gather-extra/artifacts/oc_cmds/pods

prometheus-k8s-0 failed to come up on that node, while prometheus-k8s-1 is running in eastus21:

  openshift-monitoring   prometheus-k8s-0   0/6   Init:0/1   0   102m   <none>        ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>   <none>
  openshift-monitoring   prometheus-k8s-1   6/6   Running    0   116m   10.131.0.24   ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus21-9jgd9   <none>   <none>

The inline volume tester pods are stuck in ContainerCreating on that node as well:

  e2e-ephemeral-1425   inline-volume-tester-8wjcj   0/1   ContainerCreating   0   51m   <none>   ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>   <none>
  e2e-ephemeral-1835   inline-volume-tester-5b8l4   0/1   ContainerCreating   0   34m   <none>   ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>   <none>
  e2e-ephemeral-3595   inline-volume-tester-dkpnm   0/1   ContainerCreating   0   57m   <none>   ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>   <none>

This is the disk CSI driver pod running on that node:

  openshift-cluster-csi-drivers   azure-disk-csi-driver-node-m4j2j   3/3   Running   3   130m   10.0.128.5   ci-op-b6y95fpf-fde6e-sp5tc-worker-eastus23-m2tjs   <none>   <none>

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade/1532295441427730432/artifacts/e2e-azure-upgrade/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_azure-disk-csi-driver-node-m4j2j_csi-driver.log

That log contains lots of gRPC timeout errors from attempts to find disks:

  E0602 13:23:00.349065       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 7. timed out waiting for the condition
  E0602 13:24:22.390001       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 6. timed out waiting for the condition
  E0602 13:24:44.866653       1 utils.go:82] GRPC error: rpc error: code = Internal desc = failed to find disk on lun 3. timed out waiting for the condition

That error comes from this line in the driver:

https://github.com/openshift/azure-disk-csi-driver/blob/0fe424e846435a1695920c2b05fcf25b42d9f76d/pkg/azuredisk/nodeserver.go#L99-L102

which looks like it is simply timing out while waiting for the LUN to appear after rescanning the SCSI devices:

https://github.com/openshift/azure-disk-csi-driver/blob/0fe424e846435a1695920c2b05fcf25b42d9f76d/pkg/azuredisk/nodeserver.go#L642

David Eads requested: Can you author something in openshift-tests to help identify this error mode? This particular failure mode looks ripe for a metric and an alert.
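For context, the code path behind that error follows roughly the pattern below. This is a simplified sketch, not the driver's actual implementation; the function names and the /dev/disk/azure/scsi1/lun<N> symlink path are illustrative. The idea: trigger a SCSI host rescan, then poll until a block device for the requested LUN appears, and give up with wait's "timed out waiting for the condition" error if it never does.

```go
// Simplified sketch of the "rescan, then wait for the LUN" pattern.
// Names and paths are illustrative, not the driver's exact code.
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// rescanSCSIHosts asks the kernel to rescan all SCSI hosts so that newly
// attached Azure data disks become visible as block devices.
func rescanSCSIHosts() {
	hosts, _ := filepath.Glob("/sys/class/scsi_host/host*/scan")
	for _, h := range hosts {
		// "- - -" means: scan all channels, targets and LUNs on this host.
		_ = os.WriteFile(h, []byte("- - -"), 0o200)
	}
}

// findDiskByLun looks for a udev-created symlink pointing at the block
// device for the given LUN (illustrative path).
func findDiskByLun(lun int) (string, error) {
	link := fmt.Sprintf("/dev/disk/azure/scsi1/lun%d", lun)
	return filepath.EvalSymlinks(link)
}

// getDevicePathWithLUN mirrors the failing code path: rescan, then poll until
// the device shows up. On timeout, wait.PollImmediate returns the familiar
// "timed out waiting for the condition" error seen in the CSI driver log.
func getDevicePathWithLUN(lun int) (string, error) {
	rescanSCSIHosts()

	devicePath := ""
	err := wait.PollImmediate(time.Second, 2*time.Minute, func() (bool, error) {
		dev, err := findDiskByLun(lun)
		if err != nil {
			return false, nil // not found yet, keep polling
		}
		devicePath = dev
		return true, nil
	})
	if err != nil {
		return "", fmt.Errorf("failed to find disk on lun %d. %v", lun, err)
	}
	return devicePath, nil
}
```

If the device node or symlink for the LUN never appears on the host, the poll can never succeed, and every mount on that node fails this way.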
We need to check:

* Why the volume is not mounted (and whether it was attached correctly).
* Whether we can catch this in a synthetic CI test, https://github.com/openshift/origin/tree/master/pkg/synthetictests (a rough sketch follows below).
* Whether we can detect it in the CSI driver and emit a metric (+ alert) for it, with some useful info on how to fix it.
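A rough sketch of what such a synthetic check could look like (the Event type, threshold, and function names are illustrative, not the actual pkg/synthetictests API):

```go
// Rough sketch of a synthetic-test style check for this failure mode. The
// real implementation would plug into openshift/origin's monitor and
// synthetictests machinery instead of this stand-in Event type.
package sketch

import (
	"fmt"
	"regexp"
)

// Event is a minimal stand-in for a gathered cluster event or log line.
type Event struct {
	Node    string
	Message string
}

// Matches the "disk attached but never shows up on the node" errors seen in
// the CSI driver log of this job.
var lunTimeoutRE = regexp.MustCompile(`failed to find disk on lun \d+.*timed out waiting for the condition`)

// findStuckMountNodes returns one failure message per node that repeatedly
// hit the LUN timeout, so the test flags the node rather than individual pods.
func findStuckMountNodes(events []Event, threshold int) []string {
	perNode := map[string]int{}
	for _, e := range events {
		if lunTimeoutRE.MatchString(e.Message) {
			perNode[e.Node]++
		}
	}

	var failures []string
	for node, count := range perNode {
		if count >= threshold {
			failures = append(failures, fmt.Sprintf(
				"node %s hit %q %d times: attached disks never appeared as block devices (check udev rules / SCSI rescan)",
				node, "failed to find disk on lun", count))
		}
	}
	return failures
}
```

Flagging per node rather than per pod matches what we saw above: one bad node takes down every pod that needs an Azure disk on it.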
The root cause of this issue was missing udev rules in RHCOS 8.6. This has been fixed in https://github.com/openshift/os/pull/836

Now, how can we report such errors better?
I added an alert to OCP that fires when all volume mounts or volume attachments for a volume plugin on a node have been failing for 5 minutes. A single success makes the alert go away.

To try it out: create a PV pointing to a non-existent iSCSI volume and run a Pod with it. The alert should be Pending after 5 minutes and Firing after 10 minutes. Then create a second Pod with a working iSCSI volume and run it on the same node as the first Pod; the alert should go away soon(ish), because there will now be a mix of iSCSI volume mount successes and failures on the node, and even a single success is enough to keep the alert from firing.
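The exact alert expression isn't quoted in this comment. For illustration only, a query of roughly this shape could be evaluated against kubelet's storage_operation_duration_seconds metric using the Prometheus Go client; the metric choice, label names, operation filter, and thresholds below are assumptions, not the shipped alert:

```go
// Illustrative check for the "all mounts failing on a node" condition using
// the Prometheus Go client. The query approximates what such an alert could
// evaluate; it is not the exact expression shipped in the product.
package sketch

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// nodesWithOnlyFailingMounts prints node/plugin pairs where mount operations
// failed in the last 5 minutes and none succeeded.
func nodesWithOnlyFailingMounts(ctx context.Context, promURL string) error {
	client, err := api.NewClient(api.Config{Address: promURL})
	if err != nil {
		return err
	}
	v1api := promv1.NewAPI(client)

	// Failures present AND no successes for the same node + volume plugin.
	query := `
	  sum by (node, volume_plugin) (
	    rate(storage_operation_duration_seconds_count{status != "success", operation_name = "volume_mount"}[5m])
	  ) > 0
	  unless
	  sum by (node, volume_plugin) (
	    rate(storage_operation_duration_seconds_count{status = "success", operation_name = "volume_mount"}[5m])
	  ) > 0
	`

	result, warnings, err := v1api.Query(ctx, query, time.Now())
	if err != nil {
		return err
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result) // each sample is a node/plugin with only failing mounts
	return nil
}
```

The `unless` clause is what makes a single success on the node clear the condition, matching the behaviour described above.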
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399