Description of problem (please be as detailed as possible and provide log snippets):

During tier4c runs on the vSphere platform, one test case failed during I/O and 4 test cases failed while creating app pods.

Test case:
tests/functional/pv/pv_services/test_daemon_kill_during_pvc_pod_creation_deletion_and_io.py::TestDaemonKillDuringMultipleCreateDeleteOperations::test_daemon_kill_during_pvc_pod_creation_deletion_and_io

Error details:
ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n namespace-test-ca5e9912d6294f2bb77fedd81 rsh pod-test-rbd-29a2024be616406e9035d7d739f fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-rbd-29a2024be616406e9035d7d739f_io --readwrite=randrw --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json.
Error is fio: pid=0, err=30/file:filesetup.c:253, func=fsync, error=Read-only file system
command terminated with exit code 1

This test case kills one daemon each of osd, mds, mgr and mon at the same time (a sketch of this disruption is included at the end of this section).

Must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14/jijoy-o14_20241014T034907/logs/failed_testcase_ocs_logs_1728897431/test_daemon_kill_during_pvc_pod_creation_deletion_and_io_ocs_logs/

After the above error, 4 test cases failed with the error given below, even before those test cases disrupted any Ceph pods.

Test case:
tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py::TestResourceDeletionDuringMultipleCreateDeleteOperations::test_resource_deletion_during_pvc_pod_creation_deletion_and_io

Error while creating pod:
Warning FailedMount 16s (x4 over 82s) kubelet MountVolume.MountDevice failed for volume "pvc-ef7f7397-b98c-465e-aea2-7b844296c169" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.40.190:3300,172.30.108.78:3300,172.30.252.93:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-7bed0289-4c5e-47ae-bc67-42ec980a2b80 --device-type krbd --options noudev --options read_from_replica=localize,crush_location=host:jijoy-o14-ctlk7-worker-0-lv4wb|rack:rack0], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown"

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14/jijoy-o14_20241014T034907/logs/failed_testcase_ocs_logs_1728897431/test_resource_deletion_during_pvc_pod_creation_deletion_and_io_ocs_logs/

3 other test cases also failed with the error "rbd: map failed: (108) Cannot send after transport endpoint shutdown" (a triage sketch for this kernel-client state follows at the end of this section).

Test report - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/43041/testReport/

Must-gather collected after each test case failure - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14/jijoy-o14_20241014T034907/logs/failed_testcase_ocs_logs_1728897431/
[Each directory name represents the failed test case name]

The issue reported in this bug is with ODF 4.17.0-120 on the vSphere platform. It was initially seen with 4.17.0-117 on the vSphere platform (reported in comment https://bugzilla.redhat.com/show_bug.cgi?id=2302073#c19). On the same platform these test cases passed with build 4.17.0-107, and they passed on the AWS platform with 4.17.0-120.
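For reference, a minimal sketch (not the ocs-ci implementation) of the kind of disruption the daemon-kill test performs: SIGKILL one Ceph daemon of each type at roughly the same time, without deleting the pods. The node names are placeholders, and the node-debug/pkill approach is an assumption; the real test kills exactly one daemon per type, whereas pkill -x kills every matching process on the chosen node.

    # Sketch only: SIGKILL one ceph daemon of each type at roughly the same time.
    # Node names are placeholder values; fill them in from
    # `oc -n openshift-storage get pods -o wide`.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    TARGETS = [  # (node, daemon process name)
        ("worker-0", "ceph-osd"),
        ("worker-1", "ceph-mds"),
        ("worker-1", "ceph-mgr"),
        ("worker-2", "ceph-mon"),
    ]

    def kill_daemon(node: str, proc: str) -> None:
        # `oc debug node/...` gives a host chroot; killing the daemon process
        # (rather than deleting its pod) forces an in-place daemon restart.
        subprocess.run(
            ["oc", "debug", f"node/{node}", "--",
             "chroot", "/host", "pkill", "-9", "-x", proc],
            check=False,
        )

    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda t: kill_daemon(*t), TARGETS))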
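Error 108 is ESHUTDOWN, which suggests stale kernel RBD/libceph client state on the worker node rather than a problem with the pod being created. A quick triage sketch, assuming the standard krbd sysfs interface and libceph kernel log messages (the node name is taken from the FailedMount event above):

    # Sketch: inspect kernel RBD client state on the node reporting ESHUTDOWN (108).
    import subprocess

    NODE = "jijoy-o14-ctlk7-worker-0-lv4wb"  # from the FailedMount event above

    for cmd in (
        "ls /sys/bus/rbd/devices",               # krbd mappings known to the kernel
        "dmesg | grep -i libceph | tail -n 40",  # kernel messenger/socket errors
    ):
        print(f"--- {cmd}")
        subprocess.run(["oc", "debug", f"node/{NODE}", "--", "bash", "-c", cmd])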
======================================

Version of all relevant components (if applicable):
ODF 4.17.0-117 and 4.17.0-120
OCP 4.17.0-0.nightly-2024-10-13-113132
Ceph 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)

=========================================

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, unable to create app pods

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Tried; reproduced once on the vSphere platform

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Yes. These test cases passed on the same platform (vSphere) with build 4.17.0-107

=================================================

Steps to Reproduce:
Run the following set of tier4c test cases (a representative run-ci invocation is sketched at the end of this report):
1. tests/functional/pv/pv_services/test_daemon_kill_during_pvc_pod_creation_deletion_and_io.py::TestDaemonKillDuringMultipleCreateDeleteOperations::test_daemon_kill_during_pvc_pod_creation_deletion_and_io
2. tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py::TestResourceDeletionDuringMultipleCreateDeleteOperations::test_resource_deletion_during_pvc_pod_creation_deletion_and_io
3. tests/functional/pv/pvc_clone/test_resource_deletion_during_pvc_clone.py::TestResourceDeletionDuringPvcClone::test_resource_deletion_during_pvc_clone
4. tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[mgr]
5. tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[osd]
6. tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]
7. tests/functional/pv/pvc_snapshot/test_resource_deletion_during_snapshot_restore.py::TestResourceDeletionDuringSnapshotRestore::test_resource_deletion_during_snapshot_restore

Actual results:
Test cases failed with the errors given above

Expected results:
Tests should pass

Additional info:
This was initially reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=2318528. That bug addresses the CephFS issue; this bug was created to address the RBD issue.
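For reproduction convenience, a representative invocation of the test modules listed under Steps to Reproduce, assuming a configured ocs-ci checkout with its run-ci entry point; the cluster path is a placeholder:

    # Sketch: re-run the failing test modules via ocs-ci (paths from this report).
    import subprocess

    CLUSTER_PATH = "/path/to/cluster-dir"  # placeholder: directory holding the cluster's auth/
    TESTS = [
        "tests/functional/pv/pv_services/test_daemon_kill_during_pvc_pod_creation_deletion_and_io.py",
        "tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py",
        "tests/functional/pv/pvc_clone/test_resource_deletion_during_pvc_clone.py",
        "tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py",
        "tests/functional/pv/pvc_snapshot/test_resource_deletion_during_snapshot_restore.py",
    ]
    subprocess.run(["run-ci", "--cluster-path", CLUSTER_PATH, *TESTS], check=True)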