Bug 1970352
| Summary: | fio command is hung on app pod after deleting the plugin pod | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Jilju Joy <jijoy> |
| Component: | rook | Assignee: | Rakshith <rar> |
| Status: | VERIFIED | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Priority: | unspecified |
| Version: | 4.8 | CC: | ebondare, jcall, muagarwa, nberry, owasserm, pdonnell, rar, tnielsen |
| Target Milestone: | --- | Keywords: | Automation, Regression |
| Target Release: | OCS 4.8.0 | Fixed In Version: | 4.8.0-416.ci |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Doc Type: | If docs needed, set a value |
Description
Jilju Joy
2021-06-10 10:39:50 UTC
The issue is also seen with RBD after deleting the csi-rbdplugin pod.

Test case: tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

Test case error:

```
E subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-15bb6677abaa4b7e94ca3a8eb', 'rsh', 'pod-test-rbd-5d5c78511d94459aaef0e106fff', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=20', '--size=1G', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
/usr/lib64/python3.8/subprocess.py:1068: TimeoutExpired
```

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/failed_testcase_ocs_logs_1623324214/

ocs-ci logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/ocs-ci-logs-1623324214/by_outcome/failed/tests/

I tested this manually and found that reads still work. So unlike the initially reported CephFS issue, in this case the df, ls, and cat <file name> commands work.

In Rook 4.8 (1.6) the cephcsi pods were switched from host networking to pod networking, which is why we are seeing this issue. For now, we can move back to host networking to fix it, but we need to debug why it happens with pod networking (upstream). @Jilju, can we do the same testing with multus? With multus, the cephcsi plugin pods also run with pod networking.

Adding the Regression keyword because the issue is seen only in OCS 4.8. Tested and passed in version:

* OCS operator v4.7.1-410.ci
* OCP 4.7.0-0.nightly-2021-06-09-233032

ocs-ci logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-june10/jijoy-june10_20210610T152031/logs/

Verified in version:

* ocs-operator.v4.8.0-416.ci
* OCP 4.8.0-0.nightly-2021-06-13-101614
* ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Verified manually and using the ocs-ci test cases:

* tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin]
* tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

The test cases were executed from PR https://github.com/red-hat-storage/ocs-ci/pull/4419.
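For reference, the timed-out command from the failure above can be re-run by hand against the app pod. This is a minimal sketch, assuming the namespace and pod name taken from the failure output and that fio is available in the app pod image:

```sh
# Re-run the same fio workload the test issues through 'oc rsh'
# (namespace and pod name are the ones from the timeout above; substitute your own).
oc -n namespace-test-15bb6677abaa4b7e94ca3a8eb rsh pod-test-rbd-5d5c78511d94459aaef0e106fff \
  fio --name=fio-rand-readwrite --filename=/var/lib/www/html/fio-rand-readwrite \
      --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=20 \
      --size=1G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 \
      --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

# On an affected cluster the workload never completes (the test times out after 600s),
# while read-only access such as 'df', 'ls' and 'cat <file>' still works.
```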
Test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-lso-jun14/jijoy-lso-jun14_20210614T080510/logs/ocs-ci-logs-1623752666/

Manual test:

RBD:

```
$ oc -n project-1970352 get pod pod-test-pvcrbd-1970352 -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod-test-pvcrbd-1970352   1/1     Running   0          9m23s   10.131.0.127   compute-0   <none>           <none>

$ oc -n openshift-storage get pod -l app=csi-rbdplugin -o wide | grep compute-0
csi-rbdplugin-5trz8   3/3   Running   3   26h   10.1.160.185   compute-0   <none>   <none>

$ oc -n openshift-storage delete pod csi-rbdplugin-5trz8
pod "csi-rbdplugin-5trz8" deleted

$ oc -n openshift-storage get pod -l app=csi-rbdplugin -o wide | grep compute-0
csi-rbdplugin-8wbtn   3/3   Running   0   3m47s   10.1.160.185   compute-0   <none>   <none>

$ oc -n project-1970352 rsh pod-test-pvcrbd-1970352
# df | grep rbd
/dev/rbd0   3030800   9220   3005196   1%   /var/lib/www/html
# cat /var/lib/www/html/f1.txt
testfilebeforepluginrespin
# echo testfileaterpluginrespin > /var/lib/www/html/f2.txt
# cat /var/lib/www/html/f2.txt
testfileaterpluginrespin
```

CephFS:

```
$ oc -n project-1970352 get pod pod-test-pvccephfs-1970352 -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod-test-pvccephfs-1970352   1/1     Running   0          20m   10.131.0.128   compute-0   <none>           <none>

$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep compute-0
csi-cephfsplugin-5559d   3/3   Running   3   26h   10.1.160.185   compute-0   <none>   <none>

$ oc -n openshift-storage delete pod csi-cephfsplugin-5559d
pod "csi-cephfsplugin-5559d" deleted

$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep compute-0
csi-cephfsplugin-qxcmw   3/3   Running   0   28s   10.1.160.185   compute-0   <none>   <none>

$ oc -n project-1970352 rsh pod-test-pvccephfs-1970352
# df | grep csi-vol
172.30.87.94:6789,172.30.156.155:6789,172.30.229.154:6789:/volumes/csi/csi-vol-fa182e7a-cdda-11eb-a130-0a580a810229/224366c4-1a21-4187-9dee-574751316f6a   3145728   0   3145728   0%   /var/lib/www/html
# cat /var/lib/www/html/f1.txt
testfilebeforepluginrespin
# echo testfileafterpluginrespin > /var/lib/www/html/filetest.txt
# cat /var/lib/www/html/filetest.txt
testfileafterpluginrespin
```

Cluster configuration: VMware LSO

(In reply to Jilju Joy from comment #12)
> Verified in version:
> ocs-operator.v4.8.0-416.ci
> OCP 4.8.0-0.nightly-2021-06-13-101614
> ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)
> Cluster configuration - VMware LSO

```
$ oc logs rook-ceph-operator-78485bb655-x528h | egrep CSI_ENABLE_HOST_NETWORK
2021-06-15 09:41:15.323767 I | op-k8sutil: CSI_ENABLE_HOST_NETWORK="true" (default)
```

A new bug will be opened if this issue is reproducible in a multus-enabled cluster.

Based on comment #12, marking this bug as verified.
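For context on the host-networking fix discussed in the description, the operator log line above shows the relevant Rook setting, `CSI_ENABLE_HOST_NETWORK`. The sketch below shows how it could be inspected and, in upstream Rook, overridden through the `rook-ceph-operator-config` ConfigMap; the deployment and ConfigMap names are assumptions based on a default openshift-storage install, and in OCS the value may be reconciled by the OCS operator, so treat it as illustrative only.

```sh
# Check how the Rook operator resolved the CSI networking setting
# (same information as the log line quoted above).
oc -n openshift-storage logs deploy/rook-ceph-operator | grep CSI_ENABLE_HOST_NETWORK

# Illustrative override: upstream Rook reads CSI settings from the
# rook-ceph-operator-config ConfigMap; "true" keeps the csi-rbdplugin and
# csi-cephfsplugin DaemonSet pods on host networking.
oc -n openshift-storage patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"CSI_ENABLE_HOST_NETWORK":"true"}}'
```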