Bug 1970352
| Summary: | fio command is hung on app pod after deleting the plugin pod | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Jilju Joy <jijoy> |
| Component: | rook | Assignee: | Rakshith <rar> |
| Status: | VERIFIED | QA Contact: | Jilju Joy <jijoy> |
| Severity: | high | Priority: | unspecified |
| Version: | 4.8 | CC: | ebondare, jcall, muagarwa, nberry, owasserm, pdonnell, rar, tnielsen |
| Target Milestone: | --- | Keywords: | Automation, Regression |
| Target Release: | OCS 4.8.0 | Fixed In Version: | 4.8.0-416.ci |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Doc Type: | If docs needed, set a value |
Description
Jilju Joy
2021-06-10 10:39:50 UTC
The issue is also seen with RBD after deleting the csi-rbdplugin pod.

Test case: tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

Test case error:

```
E subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-15bb6677abaa4b7e94ca3a8eb', 'rsh', 'pod-test-rbd-5d5c78511d94459aaef0e106fff', 'fio', '--name=fio-rand-readwrite', '--filename=/var/lib/www/html/fio-rand-readwrite', '--readwrite=randrw', '--bs=4K', '--direct=1', '--numjobs=1', '--time_based=1', '--runtime=20', '--size=1G', '--iodepth=4', '--invalidate=1', '--fsync_on_close=1', '--rwmixread=75', '--ioengine=libaio', '--rate=1m', '--rate_process=poisson', '--output-format=json']' timed out after 600 seconds
/usr/lib64/python3.8/subprocess.py:1068: TimeoutExpired
```

Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/failed_testcase_ocs_logs_1623324214/

ocs-ci logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun10/jijoy-jun10_20210610T055333/logs/ocs-ci-logs-1623324214/by_outcome/failed/tests/

I tested this manually and found that reads still work. So unlike the initially reported CephFS issue, in this case the df, ls, and cat <file name> commands work.

In Rook 4.8 (1.6) the cephcsi pods were switched from host networking to pod networking, which is why we are seeing this issue. For now, we can move back to host networking to fix it, but we need to debug why it happens with pod networking (upstream). @Jilju, can we do the same testing with multus? With multus, the cephcsi plugin pods also run with pod networking.

Adding the Regression keyword because the issue is seen only in OCS 4.8. Tested and passed in version:

* OCS operator v4.7.1-410.ci
* OCP 4.7.0-0.nightly-2021-06-09-233032

ocs-ci logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-june10/jijoy-june10_20210610T152031/logs/

Verified in version:

* ocs-operator.v4.8.0-416.ci
* OCP 4.8.0-0.nightly-2021-06-13-101614
* ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Verified manually and using the ocs-ci test cases:

* tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephFileSystem-cephfsplugin]
* tests/manage/pv_services/test_delete_plugin_pod.py::TestDeletePluginPod::test_delete_plugin_pod[CephBlockPool-rbdplugin]

The test cases were executed from PR https://github.com/red-hat-storage/ocs-ci/pull/4419.
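For reference, the timed-out command from the failure above can be re-run by hand against the app pod. This is a minimal sketch, assuming the namespace and pod name taken from the failure output and that fio is available in the app pod image:

```sh
# Re-run the same fio workload the test issues through 'oc rsh'
# (namespace and pod name are the ones from the timeout above; substitute your own).
oc -n namespace-test-15bb6677abaa4b7e94ca3a8eb rsh pod-test-rbd-5d5c78511d94459aaef0e106fff \
  fio --name=fio-rand-readwrite --filename=/var/lib/www/html/fio-rand-readwrite \
      --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=20 \
      --size=1G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 \
      --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

# On an affected cluster the workload never completes (the test times out after 600s),
# while read-only access such as 'df', 'ls' and 'cat <file>' still works.
```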
Test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-lso-jun14/jijoy-lso-jun14_20210614T080510/logs/ocs-ci-logs-1623752666/

Manual test:

RBD:

```
$ oc -n project-1970352 get pod pod-test-pvcrbd-1970352 -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE        NOMINATED NODE   READINESS GATES
pod-test-pvcrbd-1970352   1/1     Running   0          9m23s   10.131.0.127   compute-0   <none>           <none>

$ oc -n openshift-storage get pod -l app=csi-rbdplugin -o wide | grep compute-0
csi-rbdplugin-5trz8   3/3   Running   3   26h   10.1.160.185   compute-0   <none>   <none>

$ oc -n openshift-storage delete pod csi-rbdplugin-5trz8
pod "csi-rbdplugin-5trz8" deleted

$ oc -n openshift-storage get pod -l app=csi-rbdplugin -o wide | grep compute-0
csi-rbdplugin-8wbtn   3/3   Running   0   3m47s   10.1.160.185   compute-0   <none>   <none>

$ oc -n project-1970352 rsh pod-test-pvcrbd-1970352
# df | grep rbd
/dev/rbd0   3030800   9220   3005196   1%   /var/lib/www/html
# cat /var/lib/www/html/f1.txt
testfilebeforepluginrespin
# echo testfileaterpluginrespin > /var/lib/www/html/f2.txt
# cat /var/lib/www/html/f2.txt
testfileaterpluginrespin
```

CephFS:

```
$ oc -n project-1970352 get pod pod-test-pvccephfs-1970352 -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP             NODE        NOMINATED NODE   READINESS GATES
pod-test-pvccephfs-1970352   1/1     Running   0          20m   10.131.0.128   compute-0   <none>           <none>

$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep compute-0
csi-cephfsplugin-5559d   3/3   Running   3   26h   10.1.160.185   compute-0   <none>   <none>

$ oc -n openshift-storage delete pod csi-cephfsplugin-5559d
pod "csi-cephfsplugin-5559d" deleted

$ oc -n openshift-storage get pod -l app=csi-cephfsplugin -o wide | grep compute-0
csi-cephfsplugin-qxcmw   3/3   Running   0   28s   10.1.160.185   compute-0   <none>   <none>

$ oc -n project-1970352 rsh pod-test-pvccephfs-1970352
# df | grep csi-vol
172.30.87.94:6789,172.30.156.155:6789,172.30.229.154:6789:/volumes/csi/csi-vol-fa182e7a-cdda-11eb-a130-0a580a810229/224366c4-1a21-4187-9dee-574751316f6a   3145728   0   3145728   0%   /var/lib/www/html
# cat /var/lib/www/html/f1.txt
testfilebeforepluginrespin
# echo testfileafterpluginrespin > /var/lib/www/html/filetest.txt
# cat /var/lib/www/html/filetest.txt
testfileafterpluginrespin
```

Cluster configuration: VMware LSO

(In reply to Jilju Joy from comment #12)
> Verified in version:
> ocs-operator.v4.8.0-416.ci
> OCP 4.8.0-0.nightly-2021-06-13-101614
> ceph version 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)
> Cluster configuration - VMware LSO

```
$ oc logs rook-ceph-operator-78485bb655-x528h | egrep CSI_ENABLE_HOST_NETWORK
2021-06-15 09:41:15.323767 I | op-k8sutil: CSI_ENABLE_HOST_NETWORK="true" (default)
```

A new bug will be opened if this issue is reproducible in a multus-enabled cluster.

Based on comment #12, marking this bug as verified.
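For context on the host-networking fix discussed in the description, the operator log line above shows the relevant Rook setting, `CSI_ENABLE_HOST_NETWORK`. The sketch below shows how it could be inspected and, in upstream Rook, overridden through the `rook-ceph-operator-config` ConfigMap; the deployment and ConfigMap names are assumptions based on a default openshift-storage install, and in OCS the value may be reconciled by the OCS operator, so treat it as illustrative only.

```sh
# Check how the Rook operator resolved the CSI networking setting
# (same information as the log line quoted above).
oc -n openshift-storage logs deploy/rook-ceph-operator | grep CSI_ENABLE_HOST_NETWORK

# Illustrative override: upstream Rook reads CSI settings from the
# rook-ceph-operator-config ConfigMap; "true" keeps the csi-rbdplugin and
# csi-cephfsplugin DaemonSet pods on host networking.
oc -n openshift-storage patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"CSI_ENABLE_HOST_NETWORK":"true"}}'
```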