Description of problem (please be as detailed as possible and provide log snippets):

During tier4c runs on the vSphere platform, one test case failed during I/O and 4 test cases failed while creating app pods.

Test case:
tests/functional/pv/pv_services/test_daemon_kill_during_pvc_pod_creation_deletion_and_io.py::TestDaemonKillDuringMultipleCreateDeleteOperations::test_daemon_kill_during_pvc_pod_creation_deletion_and_io

Error details:
ocs_ci.ocs.exceptions.CommandFailed: Error during execution of command: oc -n namespace-test-ca5e9912d6294f2bb77fedd81 rsh pod-test-rbd-29a2024be616406e9035d7d739f fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-rbd-29a2024be616406e9035d7d739f_io --readwrite=randrw --bs=4K --direct=0 --numjobs=1 --time_based=1 --runtime=30 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json.
Error is fio: pid=0, err=30/file:filesetup.c:253, func=fsync, error=Read-only file system
command terminated with exit code 1

This test case kills one daemon each of osd, mds, mgr and mon at the same time (a sketch of this disruption is included at the end of this section).

Must-gather logs - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14/jijoy-o14_20241014T034907/logs/failed_testcase_ocs_logs_1728897431/test_daemon_kill_during_pvc_pod_creation_deletion_and_io_ocs_logs/

After the above error, 4 test cases failed with the error given below, even before those test cases disrupted any Ceph pods.

Test case:
tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py::TestResourceDeletionDuringMultipleCreateDeleteOperations::test_resource_deletion_during_pvc_pod_creation_deletion_and_io

Error while creating pod:
Warning FailedMount 16s (x4 over 82s) kubelet MountVolume.MountDevice failed for volume "pvc-ef7f7397-b98c-465e-aea2-7b844296c169" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 108) occurred while running rbd args: [--id csi-rbd-node -m 172.30.40.190:3300,172.30.108.78:3300,172.30.252.93:3300 --keyfile=***stripped*** map ocs-storagecluster-cephblockpool/csi-vol-7bed0289-4c5e-47ae-bc67-42ec980a2b80 --device-type krbd --options noudev --options read_from_replica=localize,crush_location=host:jijoy-o14-ctlk7-worker-0-lv4wb|rack:rack0], rbd error output: rbd: sysfs write failed
rbd: map failed: (108) Cannot send after transport endpoint shutdown"

http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14/jijoy-o14_20241014T034907/logs/failed_testcase_ocs_logs_1728897431/test_resource_deletion_during_pvc_pod_creation_deletion_and_io_ocs_logs/

3 other test cases also failed with the error "rbd: map failed: (108) Cannot send after transport endpoint shutdown" (a triage sketch for this kernel-client state follows at the end of this section).

Test report - https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/43041/testReport/

Must-gather collected after each test case failure - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-o14/jijoy-o14_20241014T034907/logs/failed_testcase_ocs_logs_1728897431/
[Each directory name represents the failed test case name]

The issue reported in this bug is with ODF 4.17.0-120 on the vSphere platform. It was initially seen with 4.17.0-117 on the vSphere platform (reported in comment https://bugzilla.redhat.com/show_bug.cgi?id=2302073#c19). On the same platform these test cases passed with build 4.17.0-107, and they passed on the AWS platform with 4.17.0-120.
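For reference, a minimal sketch (not the ocs-ci implementation) of the kind of disruption the daemon-kill test performs: SIGKILL one Ceph daemon of each type at roughly the same time, without deleting the pods. The node names are placeholders, and the node-debug/pkill approach is an assumption; the real test kills exactly one daemon per type, whereas pkill -x kills every matching process on the chosen node.

    # Sketch only: SIGKILL one ceph daemon of each type at roughly the same time.
    # Node names are placeholder values; fill them in from
    # `oc -n openshift-storage get pods -o wide`.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    TARGETS = [  # (node, daemon process name)
        ("worker-0", "ceph-osd"),
        ("worker-1", "ceph-mds"),
        ("worker-1", "ceph-mgr"),
        ("worker-2", "ceph-mon"),
    ]

    def kill_daemon(node: str, proc: str) -> None:
        # `oc debug node/...` gives a host chroot; killing the daemon process
        # (rather than deleting its pod) forces an in-place daemon restart.
        subprocess.run(
            ["oc", "debug", f"node/{node}", "--",
             "chroot", "/host", "pkill", "-9", "-x", proc],
            check=False,
        )

    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda t: kill_daemon(*t), TARGETS))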
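Error 108 is ESHUTDOWN, which suggests stale kernel RBD/libceph client state on the worker node rather than a problem with the pod being created. A quick triage sketch, assuming the standard krbd sysfs interface and libceph kernel log messages (the node name is taken from the FailedMount event above):

    # Sketch: inspect kernel RBD client state on the node reporting ESHUTDOWN (108).
    import subprocess

    NODE = "jijoy-o14-ctlk7-worker-0-lv4wb"  # from the FailedMount event above

    for cmd in (
        "ls /sys/bus/rbd/devices",               # krbd mappings known to the kernel
        "dmesg | grep -i libceph | tail -n 40",  # kernel messenger/socket errors
    ):
        print(f"--- {cmd}")
        subprocess.run(["oc", "debug", f"node/{NODE}", "--", "bash", "-c", cmd])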
======================================

Version of all relevant components (if applicable):
ODF 4.17.0-117 and 4.17.0-120
OCP 4.17.0-0.nightly-2024-10-13-113132
Ceph 18.2.1-229.el9cp (ef652b206f2487adfc86613646a4cac946f6b4e0) reef (stable)

=========================================

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, unable to create app pods

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Tried; reproduced once on the vSphere platform

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
Yes. These test cases passed on the same platform (vSphere) with build 4.17.0-107

=================================================

Steps to Reproduce:
Run the following set of tier4c test cases (a representative run-ci invocation is sketched at the end of this report):
1. tests/functional/pv/pv_services/test_daemon_kill_during_pvc_pod_creation_deletion_and_io.py::TestDaemonKillDuringMultipleCreateDeleteOperations::test_daemon_kill_during_pvc_pod_creation_deletion_and_io
2. tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py::TestResourceDeletionDuringMultipleCreateDeleteOperations::test_resource_deletion_during_pvc_pod_creation_deletion_and_io
3. tests/functional/pv/pvc_clone/test_resource_deletion_during_pvc_clone.py::TestResourceDeletionDuringPvcClone::test_resource_deletion_during_pvc_clone
4. tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[mgr]
5. tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[osd]
6. tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py::TestResourceDeletionDuringPvcExpansion::test_resource_deletion_during_pvc_expansion[rbdplugin]
7. tests/functional/pv/pvc_snapshot/test_resource_deletion_during_snapshot_restore.py::TestResourceDeletionDuringSnapshotRestore::test_resource_deletion_during_snapshot_restore

Actual results:
Test cases failed with the errors given above

Expected results:
Tests should pass

Additional info:
This was initially reported in bug https://bugzilla.redhat.com/show_bug.cgi?id=2318528. That bug addresses the CephFS issue; this bug was created to address the RBD issue.
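For reproduction convenience, a representative invocation of the test modules listed under Steps to Reproduce, assuming a configured ocs-ci checkout with its run-ci entry point; the cluster path is a placeholder:

    # Sketch: re-run the failing test modules via ocs-ci (paths from this report).
    import subprocess

    CLUSTER_PATH = "/path/to/cluster-dir"  # placeholder: directory holding the cluster's auth/
    TESTS = [
        "tests/functional/pv/pv_services/test_daemon_kill_during_pvc_pod_creation_deletion_and_io.py",
        "tests/functional/pv/pv_services/test_resource_deletion_during_pvc_pod_creation_deletion_and_io.py",
        "tests/functional/pv/pvc_clone/test_resource_deletion_during_pvc_clone.py",
        "tests/functional/pv/pvc_resize/test_resource_deletion_during_pvc_expansion.py",
        "tests/functional/pv/pvc_snapshot/test_resource_deletion_during_snapshot_restore.py",
    ]
    subprocess.run(["run-ci", "--cluster-path", CLUSTER_PATH, *TESTS], check=True)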