Bug 1980873
| Summary: | [IBM Cloud] Pods using CephFS PVC taking longer time to delete due to KillContainerError | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Sidhant Agrawal <sagrawal> |
| Component: | csi-driver | Assignee: | Yug Gupta <ygupta> |
| Status: | CLOSED WORKSFORME | QA Contact: | Elad <ebenahar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.7 | CC: | ggeddam, hchiramm, madam, muagarwa, ocs-bugs, odf-bz-bot, sabose |
| Target Milestone: | --- | Flags: | ygupta: needinfo? (ggeddam) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-21 13:30:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Sidhant Agrawal, 2021-07-09 16:51:56 UTC)
This looks expected to me,
>> io_size = self.pvc_size - 1
>> pod_obj.run_io(
>> storage_type="fs", size=f"{io_size}G", fio_filename=f"{pod_obj.name}_io"
>> )
The run_io() function used to run IOs has a parameter fio_sync, which should be set to 1 so that a sync happens after fio finishes.
If the pod deletion is attempted just after the IO is run on the pod, it will wait for the data to flush to the disk before killing/stopping the container.
Can we enable the sync and then try to reproduce this?
Also, can you please point me to the specific pods which we failed to delete?
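
For reference, a minimal sketch of forcing the flush at the fio level before the delete is attempted. This is not the ocs-ci run_io() implementation; the namespace/pod names are placeholders, and only standard fio flags (--end_fsync, --fsync_on_close) and the plain oc CLI are assumed:

```python
import subprocess

# Placeholder names; substitute the namespace/pod created by the test.
NAMESPACE = "namespace-test-example"
POD = "pod-test-cephfs-example"

# fio's --end_fsync=1 issues an fsync() when the job finishes writing, and
# --fsync_on_close=1 syncs when the file is closed, so dirty data is flushed
# before the pod deletion is attempted.
fio_cmd = [
    "oc", "-n", NAMESPACE, "rsh", POD,
    "fio", "--name=fio-rand-readwrite",
    f"--filename=/var/lib/www/html/{POD}_io",
    "--readwrite=randrw", "--bs=4K", "--size=2G",
    "--runtime=60", "--time_based=1",
    "--end_fsync=1", "--fsync_on_close=1",
    "--output-format=json",
]
subprocess.run(fio_cmd, check=True)

# Delete the pod only after fio (and its final fsync) has returned.
subprocess.run(["oc", "-n", NAMESPACE, "delete", "pod", POD], check=True)
```
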
@Mudit Did you mean this parameter "fsync_on_close"? fsync_on_close=1 is already set in the fio command. Consider this failure case in test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]:

>> Fio started with fsync_on_close = 1 on pod "pod-test-cephfs-1041dd086ca44f2b979eda89"
2021-06-26 16:05:16,624 - ThreadPoolExecutor-627_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 rsh pod-test-cephfs-1041dd086ca44f2b979eda89 fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-cephfs-1041dd086ca44f2b979eda89_io --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

>> Delete command
2021-06-26 16:05:30,876 - ThreadPoolExecutor-625_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 delete Pod pod-test-cephfs-1041dd086ca44f2b979eda89 --wait=false

>> Last check if pod got deleted --> Pod was still present
2021-06-26 16:10:09,961 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 get Pod pod-test-cephfs-1041dd086ca44f2b979eda89 -n namespace-test-2b6c084cb1204a9786183bfe3 -o yaml

>> Pod events
...
Normal Started 36m kubelet Started container web-server
Normal Killing 4m46s kubelet Stopping container web-server
Warning FailedKillPod 10s kubelet error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container 'stopped' status 0932c6a0a5c161028a4899814c065df585e2489e5422b8d85b95b2c1bb5cf27d: failed to get container stopped status: 30s timeout reached"

>> In teardown phase, the pod was not found; it had already been deleted by that time
2021-06-26 16:44:43,093 - MainThread - WARNING - ocs_ci.utility.utils.exec_cmd.505 - Command stderr: Error from server (NotFound): pods "pod-test-cephfs-1041dd086ca44f2b979eda89" not found

Humble, any further inputs on this bug? Is there anything we need to check on the ROKS platform as to why the pods mounting CephFS PVC take time to be deleted?

(In reply to Sahina Bose from comment #4)
> Humble, any further inputs on this bug? Is there anything we need to check
> on the ROKS platform as to why the pods mounting CephFS PVC take time to be
> deleted?

Sahina, as Mudit mentioned, we have seen delays in deleting the pods due to the in-flight IO sync/flush process taking time. Mudit, anything else to add on that front?

Apart from that, I recently looked into various locking issues we have in the driver to improve certain operations. One of the fixes introduced there improved the "delete" performance of pods to a good extent: the delete time for 1000 pods improved from 980 seconds to ~280 seconds with the fix in place.

https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124

However, if the same test suite (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]) does not exhibit issues on other platforms and this reported BZ is only on the "ROKS" platform, I am not sure why that's the case.

Meanwhile, I would also like to ask: do we see the same issue on OCS 4.8/4.9 builds while running on ROKS?
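
Because `oc delete ... --wait=false` returns immediately, the slow teardown only shows up when the pod is re-checked later. A minimal sketch of polling until the pod is actually gone (placeholder names, plain oc CLI assumed):

```python
import subprocess
import time

# Placeholder names; substitute the namespace/pod from the failing test.
NAMESPACE = "namespace-test-example"
POD = "pod-test-cephfs-example"
TIMEOUT = 600  # seconds; the log above shows the pod lingering for several minutes

deadline = time.time() + TIMEOUT
while time.time() < deadline:
    result = subprocess.run(
        ["oc", "-n", NAMESPACE, "get", "pod", POD, "-o", "name"],
        capture_output=True, text=True,
    )
    # `oc get` exits non-zero with a NotFound error once the pod is gone.
    if result.returncode != 0 and "NotFound" in result.stderr:
        print(f"{POD} deleted")
        break
    time.sleep(10)
else:
    raise TimeoutError(f"{POD} still present after {TIMEOUT}s")
```
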
(In reply to Humble Chirammal from comment #5)
> (In reply to Sahina Bose from comment #4)
> > Humble, any further inputs on this bug? Is there anything we need to check
> > on the ROKS platform as to why the pods mounting CephFS PVC take time to be
> > deleted?
>
> Sahina, as Mudit mentioned we have seen delay in deleting the PODS due to
> the in flight IO sync/flush process taking time. Mudit, anything else to add
> on that front?

Since the tests are the same on all platforms, is this an issue with the underlying storage performance on ROKS (sync/flush taking time)?

> Apart from that, recently I was looking into various locking issues we have
> in the driver to improve certain operations. One of the fix introduced here
> improved the "delete" performance of PODs to a good extent, that said, there
> was an improvement of delete time ie 1000 pods delete time has been
> improved from 980seconds to ~280 seconds with the fix in place.
>
> https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124
>
> however if the same test suite
> (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr])
> does not exhibit issues on other platforms and this reported BZ is
> only on "ROKS" platform, I am not sure why thats the case.

These tests seem to fail only on the ROKS platform. @Sidhant, can you confirm?

> meanwhile, I would also like to ask, do we see the same issue on OCS 4.8/4.9
> builds while running on ROKS?

IBM ROKS is testing OCS 4.7.2. Humble, is there a change in OCS 4.8/4.9 that could help here?

(In reply to Sahina Bose from comment #6)
> These tests seem to fail only on ROKS platform. @Sidhant, can you confirm?

Yes, the tests seem to fail only on the ROKS platform.

Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9 for now; the fix (if any) can be backported to 4.8.z if required.

Sidhant/Sahina, can you please confirm whether we hit this issue in every test run on this platform, or just frequently?

(In reply to Mudit Agarwal from comment #9)
> Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9
> for now, fix(if any) can be backported to 4.8.z if required.

As discussed in the CSI team bug triage call, Yug will be taking a look at this bug; changing ownership.
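
One way to sanity-check the "sync/flush taking time" question is to time an fsync-backed write on the CephFS mount from inside the test pod. A minimal sketch (placeholder namespace/pod; the mount path is taken from the fio command in the log above; dd's conv=fsync is a standard GNU coreutils option):

```python
import subprocess
import time

# Placeholder names; the mount path matches the test pod's volume mount above.
NAMESPACE = "namespace-test-example"
POD = "pod-test-cephfs-example"
MOUNT_PATH = "/var/lib/www/html"

# Write 1 GiB and fsync before dd exits (conv=fsync), so the elapsed time
# includes the flush to the CephFS backend, not just the page cache.
cmd = [
    "oc", "-n", NAMESPACE, "rsh", POD,
    "dd", "if=/dev/zero", f"of={MOUNT_PATH}/flush_probe",
    "bs=4M", "count=256", "conv=fsync",
]
start = time.time()
subprocess.run(cmd, check=True)
print(f"write+flush of 1 GiB took {time.time() - start:.1f}s")
```
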
We have seen these failures in a recent run of Tier4c: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

1) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
2) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
3) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]
4) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]
5) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-operator]

(In reply to ggeddam from comment #13)
> We have seen these failures in recent run of Tier4c,
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

Hey Gangadhara, does this run have the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1980873#c12 ?

Hi Yug Gupta,

I ran this job on the 17th: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
So I think I ran with this fix.

(In reply to ggeddam from comment #15)
> HI Yug Gupta
>
> I have run this job on 17th,
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
> So I think I ran on this fix

On checking the job run, I see that the Ceph cluster is not healthy, and we are seeing some network failures:
`ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1/3 mons down, quorum a,c`

See: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/tests.manage.z_cluster.nodes.test_worker_nodes_network_failures/TestWorkerNodesFailure/test_all_worker_nodes_short_network_failure_colocated_/

Can you try to reproduce with the fix on a cluster that is HEALTHY? That might be causing the issue here.

(In reply to Yug Gupta from comment #17)
> (In reply to ggeddam from comment #15)
> > HI Yug Gupta
> >
> > I have run this job on 17th,
> > https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
> > So I think I ran on this fix
>
> On checking the job run, I see that the Ceph Cluster is not Healthy, and we
> are seeing some network failures
> `ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK.
> Health: HEALTH_WARN 1/3 mons down, quorum a,c`
>
> See:
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/tests.manage.z_cluster.nodes.test_worker_nodes_network_failures/TestWorkerNodesFailure/test_all_worker_nodes_short_network_failure_colocated_/
>
> Can you try to reproduce with the fix on a cluster that is HEALTHY? As that
> might be causing the issue here.

Yeah, cluster health and slow response times have been suspect since the start, so it is good to have more attention on this track. Please reopen if this is reproducible on a healthy cluster (plus the CSI fix mentioned in the above comments).
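
A quick way to gate a re-run on cluster health is to query Ceph through the rook-ceph-tools pod. A minimal sketch, assuming the toolbox deployment exists in the openshift-storage namespace (typical for an ODF/OCS install, but not guaranteed on every setup):

```python
import subprocess

# Assumes the rook-ceph-tools toolbox is deployed in openshift-storage.
result = subprocess.run(
    ["oc", "-n", "openshift-storage", "rsh", "deploy/rook-ceph-tools",
     "ceph", "health"],
    capture_output=True, text=True, check=True,
)
health = result.stdout.strip()
if health != "HEALTH_OK":
    # e.g. "HEALTH_WARN 1/3 mons down, quorum a,c" as seen in the failed run
    raise RuntimeError(f"Ceph cluster is not healthy: {health}; skipping re-run")
print("Ceph is HEALTH_OK, safe to re-run the tier4c job")
```
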