Description of problem (please be as detailed as possible and provide log snippets):

If the csi-cephfsplugin pod is deleted after creating an app pod attached to a CephFS PVC, the fio command hangs forever on the app pod. The "df" and "ls <volume mount point>" commands were also tried and showed the same behavior. The deleted csi-cephfsplugin pod was residing on the same node as the app pod. An attempt to delete the app pod leaves it in the Terminating state.

Test case:
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin]

Logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/failed_testcase_ocs_logs_1623319643/test_delete_plugin_pod%5bCephFileSystem-cephfsplugin%5d_ocs_logs/

Name of the pod in which the issue occurred is "pod-test-cephfs-d26468f87cc34524b6b6455b"

===========================================================================

Version of all relevant components (if applicable):
OCP 4.8.0-0.nightly-2021-06-10-000903
OCS 4.8.0-413.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the application pod is not usable.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:
Will be updated.

Steps to Reproduce:
1. Create a CephFS PVC.
2. Attach the PVC to a pod running on the node 'node1'.
3. Delete the csi-cephfsplugin pod running on the node 'node1' and wait for the new csi-cephfsplugin pod to be created.
4. Run I/O on the app pod.

Or run the test case tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin] from PR https://github.com/red-hat-storage/ocs-ci/pull/4419

Actual results:
The fio run in step 4 hung forever.

Expected results:
Read/write operations should succeed.

Additional info:
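For reference, a script-level sketch of the reproduction steps above. The namespace, PVC, pod and node names and the storage class are illustrative assumptions (not taken from the failed run); any CephFS-backed PVC and an app pod scheduled on the same node as the csi-cephfsplugin pod should do.

# Hedged reproduction sketch; all names below are assumptions.
oc new-project repro-cephfs-hang

# 1. Create a CephFS PVC (storage class assumed to be the OCS default).
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
EOF

# 2. Attach the PVC to an app pod pinned to node1.
cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: app-pod
spec:
  nodeName: node1
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: cephfs-pvc
EOF

# 3. Delete the csi-cephfsplugin pod running on node1 and wait for its replacement.
oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep node1
oc -n openshift-storage delete pod <csi-cephfsplugin-pod-on-node1>
oc -n openshift-storage wait --for=condition=Ready pod -l app=csi-cephfsplugin --timeout=300s

# 4. Run I/O on the app pod; with the bug present this write hangs.
oc rsh app-pod dd if=/dev/zero of=/var/lib/www/html/probe bs=4K count=100 oflag=direct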
The issue is also seen with RBD after deleting the csi-rbdplugin pod.

Test case: tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

Test case error:
E subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-15bb6677abaa4b7e94ca3a8eb', 'rsh', 'pod-test-rbd-5d5c78511d94459aaef0e106fff', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=20', '--size=1G', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
/usr/lib64/python3.8/subprocess.py:1068: TimeoutExpired

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/failed_testcase_ocs_logs_1623324214/

ocs-ci logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/ocs-ci-logs-1623324214/by_outcome/failed/tests/

I tested this manually and found that reads are working. So unlike the initially reported CephFS issue, in this case the df, ls, and cat <file name> commands work.
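A quick manual check (a sketch, not part of the test) to separate a read hang from a write hang inside the affected pod, reusing the namespace, pod and file name from the failed run above; it assumes timeout, cat and dd are available in the app image:

$ oc -n namespace-test-15bb6677abaa4b7e94ca3a8eb rsh pod-test-rbd-5d5c78511d94459aaef0e106fff sh -c '
    timeout 30 cat /var/lib/www/html/fio-rand-readwrite > /dev/null && echo "read ok"
    timeout 30 dd if=/dev/zero of=/var/lib/www/html/write-probe bs=4K count=1 oflag=direct && echo "write ok"'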
In Rook 4.8 (1.6) the cephcsi pods were switched from host networking to pod networking, and that is why we are seeing this issue. For now, we can move back to host networking to fix this, but we need to debug why it happens with pod networking (in upstream). @Jilju, can we do the same testing with multus? With multus, the cephcsi plugin pods run with pod networking.
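For reference, a hedged sketch of flipping the CSI plugin pods back to host networking via the CSI_ENABLE_HOST_NETWORK setting (the same knob that shows up in the operator log later in this bug). The ConfigMap-based approach below is the upstream Rook mechanism; in a downstream OCS cluster the ocs-operator may own this setting and reconcile it differently, so treat it only as an illustration. The namespace shown is the OCS one; the upstream default namespace is rook-ceph.

$ oc -n openshift-storage patch configmap rook-ceph-operator-config \
      --type merge -p '{"data":{"CSI_ENABLE_HOST_NETWORK":"true"}}'
$ # After the operator recreates the plugin daemonset pods, confirm they use host networking:
$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o jsonpath='{.items[0].spec.hostNetwork}'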
Adding the regression keyword because the issue is seen only in OCS 4.8.

Tested and passed in version:
OCS operator v4.7.1-410.ci
OCP 4.7.0-0.nightly-2021-06-09-233032

ocs-ci logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-june10/jijoy-june10_20210610T152031/logs/
Verified in version:

ocs-operator.v4.8.0-416.ci
OCP 4.8.0-0.nightly-2021-06-13-101614
ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Verified manually and using the ocs-ci test cases
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin]
tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

The test cases were executed from PR https://github.com/red-hat-storage/ocs-ci/pull/4419.
Test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-lso-jun14/jijoy-lso-jun14_20210614T080510/logs/ocs-ci-logs-1623752666/

Manual test:

RBD:

$ oc -n project-1970352 get pod pod-test-pvcrbd-1970352 -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod-test-pvcrbd-1970352   1/1     Running   0          9m23s   10.131.0.127   compute-0   <none>           <none>

$ oc -n openshift-storage get pod -l app=csi-rbdplugin -o wide | grep compute-0
csi-rbdplugin-5trz8   3/3   Running   3   26h   10.1.160.185   compute-0   <none>   <none>

$ oc -n openshift-storage delete pod csi-rbdplugin-5trz8
pod "csi-rbdplugin-5trz8" deleted
$
$ oc -n openshift-storage get pod -l app=csi-rbdplugin -o wide | grep compute-0
csi-rbdplugin-8wbtn   3/3   Running   0   3m47s   10.1.160.185   compute-0   <none>   <none>

$ oc -n project-1970352 rsh pod-test-pvcrbd-1970352
# df | grep rbd
/dev/rbd0   3030800   9220   3005196   1% /var/lib/www/html
# cat /var/lib/www/html/f1.txt
testfilebeforepluginrespin
#
# echo testfileaterpluginrespin > /var/lib/www/html/f2.txt
# cat /var/lib/www/html/f2.txt
testfileaterpluginrespin
#

CephFS:

$ oc -n project-1970352 get pod pod-test-pvccephfs-1970352 -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod-test-pvccephfs-1970352   1/1     Running   0          20m   10.131.0.128   compute-0   <none>           <none>

$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep compute-0
csi-cephfsplugin-5559d   3/3   Running   3   26h   10.1.160.185   compute-0   <none>   <none>

$ oc -n openshift-storage delete pod csi-cephfsplugin-5559d
pod "csi-cephfsplugin-5559d" deleted
$
$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep compute-0
csi-cephfsplugin-qxcmw   3/3   Running   0   28s   10.1.160.185   compute-0   <none>   <none>

$ oc -n project-1970352 rsh pod-test-pvccephfs-1970352
# df | grep csi-vol
172.30.87.94:6789,172.30.156.155:6789,172.30.229.154:6789:/volumes/csi/csi-vol-fa182e7a-cdda-11eb-a130-0a580a810229/224366c4-1a21-4187-9dee-574751316f6a   3145728   0   3145728   0% /var/lib/www/html
# cat /var/lib/www/html/f1.txt
testfilebeforepluginrespin
#
# echo testfileafterpluginrespin > /var/lib/www/html/filetest.txt
# cat /var/lib/www/html/filetest.txt
testfileafterpluginrespin
#

Cluster configuration - VMware LSO
(In reply to Jilju Joy from comment #12)
> Verified in version:
>
> ocs-operator.v4.8.0-416.ci
> OCP 4.8.0-0.nightly-2021-06-13-101614
> ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)
> Cluster configuration - VMware LSO

$ oc logs rook-ceph-operator-78485bb655-x528h | egrep CSI_ENABLE_HOST_NETWORK
2021-06-15 09:41:15.323767 I | op-k8sutil: CSI_ENABLE_HOST_NETWORK="true" (default)
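As an additional hedged cross-check (not part of the original verification), the networking mode can also be read directly from the plugin pod specs; an empty value means the hostNetwork field is unset, i.e. pod networking:

$ oc -n openshift-storage get pod -l app=csi-rbdplugin \
      -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.hostNetwork}{"\n"}{end}'
$ oc -n openshift-storage get pod -l app=csi-cephfsplugin \
      -o jsonpath='{range .items[*]}{.metadata.name}{" hostNetwork="}{.spec.hostNetwork}{"\n"}{end}'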
A new bug will be opened if this issue is reproducible in a multus-enabled cluster.
Based on comment #12, marking this bug as verified.