Bug 2302073

Summary: [External mode] Fio failed on RBD volume followed by error while creating app pod with RBD PVC
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Jilju Joy <jijoy>
Component: csi-driver
Assignee: Praveen M <mpraveen>
Status: NEW
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.16
CC: idryomov, mpraveen, odf-bz-bot, rar
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jilju Joy 2024-07-31 15:46:36 UTC
Description of problem (please be as detailed as possible and provide log snippets):

The test cases listed below failed due to two different errors in an external mode cluster. Although these test cases are disruptive in nature, the errors occurred before any disruption was started in the cluster.

1. tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py::TestResourceDeletionDuringMultipleCreateDeleteOperations::test_resource_deletion_during_pvc_pod_creation_deletion_and_io

The test case failed while running fio on a pod backed by an RBD Block volume mode PVC.

Test case error:

ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n namespace-test-646e9cdc1ac041a3b822c5d9c rsh pod-test-rbd-ec7be2dff97f4a3fa6038ca1a18 fio --name=fio-rand-readwrite --filename=/dev/rbdblock --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=0 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json.
Error is fio: io_u error on file /dev/rbdblock: I/O error: read offset=686010368, buflen=4096
fio: io_u error on file /dev/rbdblock: I/O error: read offset=1549148160, buflen=4096
fio: io_u error on file /dev/rbdblock: I/O error: read offset=886792192, buflen=4096
fio: io_u error on file /dev/rbdblock: I/O error: read offset=1520287744, buflen=4096
command terminated with exit code 1
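
The raw-device read errors can be confirmed independently of fio. A minimal check (a sketch only; the namespace and pod names are the auto-generated ones from the failure above and will differ per run):

oc -n namespace-test-646e9cdc1ac041a3b822c5d9c rsh pod-test-rbd-ec7be2dff97f4a3fa6038ca1a18 ls -l /dev/rbdblock
oc -n namespace-test-646e9cdc1ac041a3b822c5d9c rsh pod-test-rbd-ec7be2dff97f4a3fa6038ca1a18 dd if=/dev/rbdblock of=/dev/null bs=4K count=100 iflag=direct

If the dd command also returns "Input/output error", the problem is at the krbd/RADOS layer rather than in the fio workload itself.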


2. After the fio error in the previous test case, the test cases that followed failed while creating an RBD PVC (either Block or Filesystem volume mode).
* tests/functional/pv/pvc_clone/test_resource_deletion_during_pvc_clone.py::TestResourceDeletionDuringPvcClone::test_resource_deletion_during_pvc_clone

* tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]

* tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin]

* tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin_provisioner]

Error from the test case test_resource_deletion_during_pvc_clone:

Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               76s               default-scheduler        Successfully assigned namespace-test-a3ed2aa514084ff6840faf7fb/pod-test-rbd-a59ee5fb88884539adb884322ef to compute-1
  Normal   SuccessfulAttachVolume  76s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-47995060-2825-439b-8fa1-a3e3af68833d"
  Warning  FailedMount             6s (x8 over 73s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-47995060-2825-439b-8fa1-a3e3af68833d" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 10.1.160.202:6789,10.1.160.201:6789,10.1.160.198:6789 --keyfile=***stripped*** map rbd/csi-vol-72a53628-18ba-4b5a-adfd-074add00b015 --device-type krbd --options noudev], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown


The pod creation failures occurred on node compute-1, which is also the node that hosted the pod where fio failed.
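
Exit status 108 ("Cannot send after transport endpoint shutdown", ESHUTDOWN) from the krbd map usually indicates that the kernel RBD client on the node has been blocklisted by the Ceph cluster, which would also explain the earlier I/O errors on the same node. A possible triage sketch (assumes admin access to the external RHCS cluster and to compute-1; these are generic ceph/oc commands, not output captured from this cluster):

# On the external Ceph cluster: check for blocklisted client addresses
ceph osd blocklist ls

# On compute-1: look for libceph/rbd kernel messages and currently mapped rbd devices
oc debug node/compute-1 -- chroot /host sh -c 'dmesg -T | grep -iE "libceph|rbd" | tail -n 60'
oc debug node/compute-1 -- chroot /host lsblk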


Test report with error details - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/40443/testReport/

Must-gather logs collected after individual test failure - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-ext/jijoy-ext_20240731T011605/logs/failed_testcase_ocs_logs_1722434106/

Must-gather collected at the end of all tests - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-ext/jijoy-ext_20240731T011605/logs/testcases_1722434106/jijoy-ext/

==============================================================================
Version of all relevant components (if applicable):
Cluster version: 4.16.0-0.nightly-2024-07-30-181230
ODF version: 4.16.1-6
Ceph version: 18.2.1-194.el9cp (04a992766839cd3207877e518a1238cdbac3787e) reef (stable)

===============================================================================

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. fio I/O failed on an RBD volume, and creating an app pod with an RBD PVC failed.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
These test cases passed in 4.16.0 and previous versions of ODF in external mode.

Steps to Reproduce:
(describes the automated test steps)
1. Create an external mode ODF cluster.
2. Create multiple RBD PVCs with Block and Filesystem volume modes, using the supported access modes. Create CephFS PVCs as well.
3. Attach the PVCs to pods; attach each RWX PVC to more than one pod.
4. Run fio on the pods.
(from the next test)
5. Create a new RBD PVC and attach it to an app pod, scheduling the pod on the node where the fio failure from step 4 occurred. (A minimal manual equivalent of the RBD Block flow is sketched after this list.)
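
A minimal manual equivalent of the RBD Block part of these steps (a sketch only: the namespace, pod name and image are illustrative, and the storage class below is the usual external-mode RBD class, which may be named differently on a given cluster):

oc create namespace test-rbd-io

cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-block-pvc
  namespace: test-rbd-io
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  resources:
    requests:
      storage: 5Gi
  storageClassName: ocs-external-storagecluster-ceph-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: rbd-block-pod
  namespace: test-rbd-io
spec:
  nodeName: compute-1        # pin to the node where the earlier fio failure happened
  containers:
  - name: workload
    image: <any image that ships fio>   # placeholder; the ocs-ci tests use their own workload image
    command: ["sleep", "infinity"]
    volumeDevices:
    - name: data
      devicePath: /dev/rbdblock
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: rbd-block-pvc
EOF

Once the pod is Running, run the same fio profile as the failing test:

oc -n test-rbd-io rsh rbd-block-pod fio --name=fio-rand-readwrite --filename=/dev/rbdblock --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --rwmixread=75 --ioengine=libaio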

To replicate the exact procedure, run the following set of test cases:
* tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py
* tests/functional/pv/pvc_clone/test_resource_deletion_during_pvc_clone.py::TestResourceDeletionDuringPvcClone::test_resource_deletion_during_pvc_clone

* tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]

* tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[cephfsplugin]

* tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin_provisioner]

==============================================================================
Actual results:
In step 4, fio failed on the RBD Block volume mode PVC.
In step 5, app pod creation failed.

Expected results:
fio should complete successfully, and the app pod creation should succeed.

Additional info: