Description of problem (please be as detailed as possible and provide log snippets):

In an IBM Cloud 1AZ RHEL cluster, some ocs-ci tier4b tests in test_ceph_daemon_kill_during_resource_deletion.py [1] are failing due to a timeout while waiting for a pod to be deleted. When deleting multiple pods in bulk, pod deletion does not start within 60 seconds, and pods take more than ~180 seconds to be deleted. The following error message was also observed in the events of the affected pods, indicating a problem with killing the pods:

---
message: 'error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container ''stopped'' status 0953bf51c35d17ae518a865102afadc8b1adccdcdd7b77739bd484d24fe08c45: failed to get container stopped status: 30s timeout reached"'
---

Although this error occurs while killing the pods, the pods ultimately do get deleted before the teardown phase.

Raising this against the csi-driver component for initial analysis; feel free to change the component if required.

[1] https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py

Version of all relevant components (if applicable):
OCP: 4.7.12
OCS: 4.7.2-429.ci
Ceph: 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No, but pod deletion should work as expected without error messages.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Not sure

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Run the below test from test_ceph_daemon_kill_during_resource_deletion.py:
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]

OR

Manual steps (the above test includes the following steps):
1. Create 12 CephFS PVCs (6 RWO + 6 RWX mode).
2. Create one pod using each RWO PVC and two pods using each RWX PVC.
3. Start IO on each pod. Each RWX PVC will be used by two pods.
4. Start deleting the pods in bulk and wait for pod deletion to start. Observe that pod deletion does not start within 60 seconds and that there is a "KillContainerError" message in the pod events (a minimal bulk-deletion sketch follows at the end of this description).

Actual results:
Pods take a prolonged time to delete and there is a KillContainerError message in the pod events.

Expected results:
Pods should get deleted without any error within the expected time.
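For illustration only, here is a minimal sketch of the bulk-deletion step from "Steps to Reproduce", written against the upstream kubernetes Python client rather than the ocs-ci helpers the test actually uses. The namespace is a placeholder and the timeout is taken from the ~180-second figure quoted above.

# Minimal sketch of bulk pod deletion, assuming a kubeconfig with access to
# the test namespace. Namespace name is a placeholder, not a value from the run.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "namespace-test-example"  # placeholder; the test creates its own namespace

# Issue deletes without waiting, mirroring `oc delete Pod <name> --wait=false`
pods = v1.list_namespaced_pod(namespace).items
for pod in pods:
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace=namespace)

# Give deletion up to ~180 seconds (the rough upper bound mentioned above)
deadline = time.time() + 180
pending = {pod.metadata.name for pod in pods}
while pending and time.time() < deadline:
    existing = {p.metadata.name for p in v1.list_namespaced_pod(namespace).items}
    pending &= existing
    time.sleep(5)

if pending:
    print(f"Pods still present after timeout: {sorted(pending)}")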
Additional info:

The following tests are failing due to a timeout while waiting for a pod to be deleted:

tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mgr]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-osd]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-osd]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003icm1r3-t4b/j003icm1r3-t4b_20210625T110119/logs/failed_testcase_ocs_logs_1624627393/

test execution logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003icm1r3-t4b/j003icm1r3-t4b_20210625T110119/logs/ocs-ci-logs-1624627393/by_outcome/failed/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py/

Note: The test mentioned in "Steps to Reproduce" fails at a stage before any disruptive operation (daemon kill) was performed.
This looks expected to me.

>> io_size = self.pvc_size - 1
>> pod_obj.run_io(
>>     storage_type="fs", size=f"{io_size}G", fio_filename=f"{pod_obj.name}_io"
>> )

The run_io() function, which is used to run the IOs, has a parameter fio_sync that should be set to 1 so that a sync happens after fio finishes. If pod deletion is attempted just after the IO is run on the pod, it will wait for the data to flush to the disk before killing/stopping the container.

Can we enable the sync and then try to reproduce this? Also, can you please point me to the specific pods which we failed to delete?
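For clarity, a hypothetical version of the quoted snippet with the suggested parameter enabled. The fio_sync name is taken from the comment above; how run_io() maps it onto fio options, and the surrounding ocs-ci test context (self.pvc_size, pod_obj), are assumed.

# Hypothetical change to the quoted test snippet (ocs-ci test context assumed).
io_size = self.pvc_size - 1
pod_obj.run_io(
    storage_type="fs",
    size=f"{io_size}G",
    fio_filename=f"{pod_obj.name}_io",
    fio_sync=1,  # enable sync after fio finishes, as suggested above
)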
@Mudit, did you mean the parameter "fsync_on_close"? fsync_on_close=1 is already set in the fio command.

Consider this failure case in test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]:

>> Fio started with fsync_on_close = 1 on pod "pod-test-cephfs-1041dd086ca44f2b979eda89"
2021-06-26 16:05:16,624 - ThreadPoolExecutor-627_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 rsh pod-test-cephfs-1041dd086ca44f2b979eda89 fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-cephfs-1041dd086ca44f2b979eda89_io --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

>> Delete command
2021-06-26 16:05:30,876 - ThreadPoolExecutor-625_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 delete Pod pod-test-cephfs-1041dd086ca44f2b979eda89 --wait=false

>> Last check whether the pod got deleted --> the pod was still present
2021-06-26 16:10:09,961 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 get Pod pod-test-cephfs-1041dd086ca44f2b979eda89 -n namespace-test-2b6c084cb1204a9786183bfe3 -o yaml

>> Pod events
...
Normal   Started        36m    kubelet  Started container web-server
Normal   Killing        4m46s  kubelet  Stopping container web-server
Warning  FailedKillPod  10s    kubelet  error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container 'stopped' status 0932c6a0a5c161028a4899814c065df585e2489e5422b8d85b95b2c1bb5cf27d: failed to get container stopped status: 30s timeout reached"

>> In the teardown phase, the pod was not found; it had already been deleted by that time
2021-06-26 16:44:43,093 - MainThread - WARNING - ocs_ci.utility.utils.exec_cmd.505 - Command stderr: Error from server (NotFound): pods "pod-test-cephfs-1041dd086ca44f2b979eda89" not found
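As a side note, a minimal sketch for pulling the kill-related events of a pod with the kubernetes Python client, which is how the events quoted above can be collected programmatically. The namespace and pod name are copied from the log snippet above; adjust for other failing pods.

# Sketch: list the events of the affected pod, assuming kubeconfig access.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "namespace-test-2b6c084cb1204a9786183bfe3"
pod_name = "pod-test-cephfs-1041dd086ca44f2b979eda89"

events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod_name},involvedObject.kind=Pod",
)
for event in events.items:
    # e.g. "Warning FailedKillPod error killing pod: failed to \"KillContainer\" ..."
    print(event.type, event.reason, event.message)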
Humble, any further inputs on this bug? Is there anything we need to check on the ROKS platform as to why pods mounting CephFS PVCs take time to be deleted?
(In reply to Sahina Bose from comment #4)
> Humble, any further inputs on this bug? Is there anything we need to check
> on the ROKS platform as to why the pods mounting CephFS PVC take time to be
> deleted?

Sahina, as Mudit mentioned, we have seen delays in deleting the pods due to the in-flight IO sync/flush process taking time. Mudit, anything else to add on that front?

Apart from that, I recently looked into various locking issues we have in the driver to improve certain operations. One of the fixes introduced there improved the "delete" performance of pods to a good extent: the time to delete 1000 pods improved from 980 seconds to ~280 seconds with the fix in place.

https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124

However, if the same test suite (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]) does not exhibit issues on other platforms and this reported BZ is only on the "ROKS" platform, I am not sure why that is the case.

Meanwhile, I would also like to ask: do we see the same issue on OCS 4.8/4.9 builds while running on ROKS?
(In reply to Humble Chirammal from comment #5)
> Sahina, as Mudit mentioned we have seen delay in deleting the PODS due to
> the in flight IO sync/flush process taking time. Mudit, anything else to add
> on that front?

Since the tests are the same on all platforms, is this an issue with the underlying storage performance on ROKS (sync/flush taking time)?

> Apart from that, recently I was looking into various locking issues we have
> in the driver to improve certain operations. One of the fix introduced here
> improved the "delete" performance of PODs to a good extent, that said, there
> was an improvement of delete time ie 1000 pods delete time has been
> improved from 980seconds to ~280 seconds with the fix in place.
>
> https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124
>
> however if the same test suite
> (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::
> TestDaemonKillDuringPodPvcDeletion::
> test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-
> mgr]) does not exhibit issues on other platforms and this reported BZ is
> only on "ROKS" platform, I am not sure why thats the case.

These tests seem to fail only on the ROKS platform. @Sidhant, can you confirm?

> meanwhile, I would also like to ask, do we see the same issue on OCS 4.8/4.9
> builds while running on ROKS?

IBM ROKS is testing OCS 4.7.2. Humble, is there a change in OCS 4.8/4.9 that could help here?
(In reply to Sahina Bose from comment #6)
> These tests seem to fail only on ROKS platform. @Sidhant, can you confirm?

Yes, the tests seem to fail only on the ROKS platform.
Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9 for now; a fix (if any) can be backported to 4.8.z if required.
Sidhant/Sahina, can you please confirm whether we hit this issue in every test run on this platform, or only frequently?

(In reply to Mudit Agarwal from comment #9)
> Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9
> for now, fix(if any) can be backported to 4.8.z if required.

As discussed in the CSI team bug triage call, Yug will be taking a look at this bug; changing ownership.
We have seen these failures in a recent run of Tier4c:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

1) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
2) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
3) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]
4) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]
5) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-operator]
(In reply to ggeddam from comment #13)
> We have seen these failures in recent run of Tier4c,
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

Hey Gangadhara,

Does this have the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1980873#c12?
Hi Yug Gupta,

I ran this job on the 17th:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/

So I think the run included this fix.
(In reply to ggeddam from comment #15)
> Hi Yug Gupta
>
> I ran this job on the 17th:
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
> So I think the run included this fix.

On checking the job run, I see that the Ceph cluster is not healthy and we are seeing some network failures:

`ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1/3 mons down, quorum a,c`

See: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/tests.manage.z_cluster.nodes.test_worker_nodes_network_failures/TestWorkerNodesFailure/test_all_worker_nodes_short_network_failure_colocated_/

Can you try to reproduce this with the fix on a HEALTHY cluster? The unhealthy cluster might be causing the issue here. A quick way to confirm cluster health before re-running is sketched below.
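A minimal sketch for checking Ceph health before re-running, by reading the Rook CephCluster CR status with the kubernetes Python client. The namespace and CR name below are the usual OCS defaults and are assumptions for this sketch; adjust them for the actual cluster.

# Sketch: fail fast if the Ceph cluster reported by Rook is not HEALTH_OK.
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

cephcluster = custom_api.get_namespaced_custom_object(
    group="ceph.rook.io",
    version="v1",
    namespace="openshift-storage",            # assumed OCS namespace
    plural="cephclusters",
    name="ocs-storagecluster-cephcluster",    # assumed CephCluster CR name
)

health = cephcluster.get("status", {}).get("ceph", {}).get("health")
print(f"Ceph health: {health}")
if health != "HEALTH_OK":
    raise RuntimeError(f"Ceph cluster health is not OK: {health}")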
(In reply to Yug Gupta from comment #17)
> On checking the job run, I see that the Ceph Cluster is not Healthy, and we
> are seeing some network failures
> `ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK.
> Health: HEALTH_WARN 1/3 mons down, quorum a,c`
>
> Can you try to reproduce with the fix on a cluster that is HEALTHY? As that
> might be causing the issue here.

Yeah, cluster health and slow response times have been suspicious from the start, so it's good to have more attention on this track.
Please reopen if this is reproducible on a healthy cluster (with the CSI fix mentioned in the above comments in place).