Description of problem (please be as detailed as possible and provide log snippets):

In an IBM Cloud 1AZ RHEL cluster, some ocs-ci tier4b tests in test_ceph_daemon_kill_during_resource_deletion.py [1] are failing due to a timeout while waiting for a pod to be deleted. When deleting multiple pods in bulk, pod deletion does not start within 60 seconds, and pods take more than ~180 seconds to be deleted. The following error message was also observed in the events of the affected pods, indicating a problem with killing the pods:

---
message: 'error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container ''stopped'' status 0953bf51c35d17ae518a865102afadc8b1adccdcdd7b77739bd484d24fe08c45: failed to get container stopped status: 30s timeout reached"'
---

Although this error occurs while killing the pods, the pods ultimately do get deleted before the teardown phase.

Raising this against the csi-driver component for initial analysis; feel free to change the component if required.

[1] https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py

Version of all relevant components (if applicable):
OCP: 4.7.12
OCS: 4.7.2-429.ci
Ceph: 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No, but pod deletion should work as expected without error messages.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?
Not sure

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Run the below test from test_ceph_daemon_kill_during_resource_deletion.py:
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]

OR

Manual steps (the above test includes the following steps):
1. Create 12 CephFS PVCs (6 RWO + 6 RWX mode).
2. Create one pod using each RWO PVC and two pods using each RWX PVC.
3. Start IO on each pod. Each RWX PVC will be used by two pods.
4. Start deleting the pods in bulk and wait for pod deletion to start. Observe that pod deletion does not start within 60 seconds and that there is a "KillContainerError" message in the pod events (a minimal bulk-deletion sketch follows at the end of this description).

Actual results:
Pods take a prolonged time to delete and there is a KillContainerError message in the pod events.

Expected results:
Pods should get deleted without any error within the expected time.
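For illustration only, here is a minimal sketch of the bulk-deletion step from "Steps to Reproduce", written against the upstream kubernetes Python client rather than the ocs-ci helpers the test actually uses. The namespace is a placeholder and the timeout is taken from the ~180-second figure quoted above.

# Minimal sketch of bulk pod deletion, assuming a kubeconfig with access to
# the test namespace. Namespace name is a placeholder, not a value from the run.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "namespace-test-example"  # placeholder; the test creates its own namespace

# Issue deletes without waiting, mirroring `oc delete Pod <name> --wait=false`
pods = v1.list_namespaced_pod(namespace).items
for pod in pods:
    v1.delete_namespaced_pod(name=pod.metadata.name, namespace=namespace)

# Give deletion up to ~180 seconds (the rough upper bound mentioned above)
deadline = time.time() + 180
pending = {pod.metadata.name for pod in pods}
while pending and time.time() < deadline:
    existing = {p.metadata.name for p in v1.list_namespaced_pod(namespace).items}
    pending &= existing
    time.sleep(5)

if pending:
    print(f"Pods still present after timeout: {sorted(pending)}")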
Additional info:

The following tests are failing due to a timeout while waiting for a pod to be deleted:

tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mgr]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-osd]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-osd]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003icm1r3-t4b/j003icm1r3-t4b_20210625T110119/logs/failed_testcase_ocs_logs_1624627393/

test execution logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003icm1r3-t4b/j003icm1r3-t4b_20210625T110119/logs/ocs-ci-logs-1624627393/by_outcome/failed/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py/

Note: The test mentioned in "Steps to Reproduce" fails at a stage before any disruptive operation (daemon kill) was performed.
This looks expected to me.

>> io_size = self.pvc_size - 1
>> pod_obj.run_io(
>>     storage_type="fs", size=f"{io_size}G", fio_filename=f"{pod_obj.name}_io"
>> )

The run_io() function, which is used to run the IOs, has a parameter fio_sync that should be set to 1 so that a sync happens after fio finishes. If pod deletion is attempted just after the IO is run on the pod, it will wait for the data to flush to the disk before killing/stopping the container.

Can we enable the sync and then try to reproduce this? Also, can you please point me to the specific pods which we failed to delete?
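For clarity, a hypothetical version of the quoted snippet with the suggested parameter enabled. The fio_sync name is taken from the comment above; how run_io() maps it onto fio options, and the surrounding ocs-ci test context (self.pvc_size, pod_obj), are assumed.

# Hypothetical change to the quoted test snippet (ocs-ci test context assumed).
io_size = self.pvc_size - 1
pod_obj.run_io(
    storage_type="fs",
    size=f"{io_size}G",
    fio_filename=f"{pod_obj.name}_io",
    fio_sync=1,  # enable sync after fio finishes, as suggested above
)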
@Mudit, did you mean the parameter "fsync_on_close"? fsync_on_close=1 is already set in the fio command.

Consider this failure case in test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]:

>> Fio started with fsync_on_close = 1 on pod "pod-test-cephfs-1041dd086ca44f2b979eda89"
2021-06-26 16:05:16,624 - ThreadPoolExecutor-627_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 rsh pod-test-cephfs-1041dd086ca44f2b979eda89 fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-cephfs-1041dd086ca44f2b979eda89_io --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json

>> Delete command
2021-06-26 16:05:30,876 - ThreadPoolExecutor-625_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 delete Pod pod-test-cephfs-1041dd086ca44f2b979eda89 --wait=false

>> Last check whether the pod got deleted --> the pod was still present
2021-06-26 16:10:09,961 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 get Pod pod-test-cephfs-1041dd086ca44f2b979eda89 -n namespace-test-2b6c084cb1204a9786183bfe3 -o yaml

>> Pod events
...
Normal   Started        36m    kubelet  Started container web-server
Normal   Killing        4m46s  kubelet  Stopping container web-server
Warning  FailedKillPod  10s    kubelet  error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container 'stopped' status 0932c6a0a5c161028a4899814c065df585e2489e5422b8d85b95b2c1bb5cf27d: failed to get container stopped status: 30s timeout reached"

>> In the teardown phase, the pod was not found; it had already been deleted by that time
2021-06-26 16:44:43,093 - MainThread - WARNING - ocs_ci.utility.utils.exec_cmd.505 - Command stderr: Error from server (NotFound): pods "pod-test-cephfs-1041dd086ca44f2b979eda89" not found
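As a side note, a minimal sketch for pulling the kill-related events of a pod with the kubernetes Python client, which is how the events quoted above can be collected programmatically. The namespace and pod name are copied from the log snippet above; adjust for other failing pods.

# Sketch: list the events of the affected pod, assuming kubeconfig access.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "namespace-test-2b6c084cb1204a9786183bfe3"
pod_name = "pod-test-cephfs-1041dd086ca44f2b979eda89"

events = v1.list_namespaced_event(
    namespace,
    field_selector=f"involvedObject.name={pod_name},involvedObject.kind=Pod",
)
for event in events.items:
    # e.g. "Warning FailedKillPod error killing pod: failed to \"KillContainer\" ..."
    print(event.type, event.reason, event.message)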
Humble, any further inputs on this bug? Is there anything we need to check on the ROKS platform as to why pods mounting CephFS PVCs take time to be deleted?
(In reply to Sahina Bose from comment #4)
> Humble, any further inputs on this bug? Is there anything we need to check
> on the ROKS platform as to why the pods mounting CephFS PVC take time to be
> deleted?

Sahina, as Mudit mentioned, we have seen delays in deleting the pods due to the in-flight IO sync/flush process taking time. Mudit, anything else to add on that front?

Apart from that, I recently looked into various locking issues we have in the driver to improve certain operations. One of the fixes introduced there improved the "delete" performance of pods to a good extent: the time to delete 1000 pods improved from 980 seconds to ~280 seconds with the fix in place.

https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124

However, if the same test suite (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]) does not exhibit issues on other platforms and this reported BZ is only on the "ROKS" platform, I am not sure why that is the case.

Meanwhile, I would also like to ask: do we see the same issue on OCS 4.8/4.9 builds while running on ROKS?
(In reply to Humble Chirammal from comment #5)
> Sahina, as Mudit mentioned we have seen delay in deleting the PODS due to
> the in flight IO sync/flush process taking time. Mudit, anything else to add
> on that front?

Since the tests are the same on all platforms, is this an issue with the underlying storage performance on ROKS (sync/flush taking time)?

> Apart from that, recently I was looking into various locking issues we have
> in the driver to improve certain operations. One of the fix introduced here
> improved the "delete" performance of PODs to a good extent, that said, there
> was an improvement of delete time ie 1000 pods delete time has been
> improved from 980seconds to ~280 seconds with the fix in place.
>
> https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124
>
> however if the same test suite
> (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::
> TestDaemonKillDuringPodPvcDeletion::
> test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-
> mgr]) does not exhibit issues on other platforms and this reported BZ is
> only on "ROKS" platform, I am not sure why thats the case.

These tests seem to fail only on the ROKS platform. @Sidhant, can you confirm?

> meanwhile, I would also like to ask, do we see the same issue on OCS 4.8/4.9
> builds while running on ROKS?

IBM ROKS is testing OCS 4.7.2. Humble, is there a change in OCS 4.8/4.9 that could help here?
(In reply to Sahina Bose from comment #6)
> These tests seem to fail only on ROKS platform. @Sidhant, can you confirm?

Yes, the tests seem to fail only on the ROKS platform.
Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9 for now; a fix (if any) can be backported to 4.8.z if required.
Sidhant/Sahina, can you please confirm whether we hit this issue in every test run on this platform, or only frequently?

(In reply to Mudit Agarwal from comment #9)
> Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9
> for now, fix(if any) can be backported to 4.8.z if required.

As discussed in the CSI team bug triage call, Yug will be taking a look at this bug; changing ownership.
We have seen these failures in a recent run of Tier4c:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

1) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
2) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
3) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]
4) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]
5) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-operator]
(In reply to ggeddam from comment #13)
> We have seen these failures in recent run of Tier4c,
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

Hey Gangadhara,

Does this have the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1980873#c12?
Hi Yug Gupta,

I ran this job on the 17th:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/

So I think the run included this fix.
(In reply to ggeddam from comment #15)
> Hi Yug Gupta
>
> I ran this job on the 17th:
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
> So I think the run included this fix.

On checking the job run, I see that the Ceph cluster is not healthy and we are seeing some network failures:

`ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1/3 mons down, quorum a,c`

See: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/tests.manage.z_cluster.nodes.test_worker_nodes_network_failures/TestWorkerNodesFailure/test_all_worker_nodes_short_network_failure_colocated_/

Can you try to reproduce this with the fix on a HEALTHY cluster? The unhealthy cluster might be causing the issue here. A quick way to confirm cluster health before re-running is sketched below.
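A minimal sketch for checking Ceph health before re-running, by reading the Rook CephCluster CR status with the kubernetes Python client. The namespace and CR name below are the usual OCS defaults and are assumptions for this sketch; adjust them for the actual cluster.

# Sketch: fail fast if the Ceph cluster reported by Rook is not HEALTH_OK.
from kubernetes import client, config

config.load_kube_config()
custom_api = client.CustomObjectsApi()

cephcluster = custom_api.get_namespaced_custom_object(
    group="ceph.rook.io",
    version="v1",
    namespace="openshift-storage",            # assumed OCS namespace
    plural="cephclusters",
    name="ocs-storagecluster-cephcluster",    # assumed CephCluster CR name
)

health = cephcluster.get("status", {}).get("ceph", {}).get("health")
print(f"Ceph health: {health}")
if health != "HEALTH_OK":
    raise RuntimeError(f"Ceph cluster health is not OK: {health}")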
(In reply to Yug Gupta from comment #17)
> On checking the job run, I see that the Ceph Cluster is not Healthy, and we
> are seeing some network failures
> `ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK.
> Health: HEALTH_WARN 1/3 mons down, quorum a,c`
>
> Can you try to reproduce with the fix on a cluster that is HEALTHY? As that
> might be causing the issue here.

Yeah, cluster health and slow response times have been suspicious from the start, so it's good to have more attention on this track.
Please reopen if this is reproducible on a healthy cluster (with the CSI fix mentioned in the above comments in place).