Bug 1980873

Summary: [IBM Cloud] Pods using CephFS PVC taking longer time to delete due to KillContainerError
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Sidhant Agrawal <sagrawal>
Component: csi-driver
Assignee: Yug Gupta <ygupta>
Status: CLOSED WORKSFORME
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: high
Version: 4.7
CC: ggeddam, hchiramm, madam, muagarwa, ocs-bugs, odf-bz-bot, sabose
Target Milestone: ---
Flags: ygupta: needinfo? (ggeddam)
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-21 13:30:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sidhant Agrawal 2021-07-09 16:51:56 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
In an IBM Cloud 1AZ RHEL cluster, some ocs-ci tier4b tests in test_ceph_daemon_kill_during_resource_deletion.py[1] are failing due to a timeout while waiting for a pod to be deleted.
When deleting multiple pods in bulk, pod deletion does not start within 60 seconds and the pods take more than ~180 seconds to be deleted.
The following error message was also observed in events related to the pods, indicating a problem with killing the containers.

---
  message: 'error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container ''stopped'' status 0953bf51c35d17ae518a865102afadc8b1adccdcdd7b77739bd484d24fe08c45: failed to get container stopped status: 30s timeout reached"'
---

Although this error occurs while killing the pods, they do ultimately get deleted before the teardown phase.
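
For triage, the KillContainerError can be pulled straight from the pod's events; a minimal sketch, where the namespace and pod name are placeholders:

# Show recent events for a pod that is slow to delete (placeholder names).
oc -n <test-namespace> describe pod <pod-name>

# Or filter the namespace events down to kill failures only.
oc -n <test-namespace> get events --field-selector reason=FailedKillPod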

Raising this against the csi-driver component for initial analysis; feel free to change the component if required.

[1] https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py

Version of all relevant components (if applicable):
OCP: 4.7.12
OCS: 4.7.2-429.ci
Ceph: 14.2.11-181.el8cp (68fea1005601531fe60d2979c56ea63bc073c84f) nautilus (stable)

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No
However, pod deletion should work as expected, without error messages.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Not sure

If this is a regression, please provide more details to justify this:


Steps to Reproduce:

Run the following test from test_ceph_daemon_kill_during_resource_deletion.py:
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]

OR
Manual steps:

The above test includes the following steps:

1. Create 12 CephFS PVCs (6 RWO + 6 RWX mode)
2. Create one pod using each RWO PVC and two pods using each RWX PVC
3. Start IO on each pod (each RWX PVC is used by two pods)
4. Start deleting the pods in bulk and wait for pod deletion to start.
Observe that pod deletion does not start within 60 seconds and that a "KillContainerError" message appears in the pod events (a simplified shell sketch of these steps follows).
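
For manual reproduction outside ocs-ci, a simplified, hedged sketch of the same flow with a single RWX PVC and pod instead of the full set of 12. The namespace, the nginx image, and the ocs-storagecluster-cephfs storage class name are assumptions and may need adjusting; the image must have fio installed (plain nginx is only a placeholder):

# Create a test namespace, a CephFS PVC, and a pod mounting it.
oc create namespace cephfs-delete-test

cat <<EOF | oc -n cephfs-delete-test apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc-1
spec:
  accessModes: ["ReadWriteMany"]
  resources:
    requests:
      storage: 3Gi
  storageClassName: ocs-storagecluster-cephfs
---
apiVersion: v1
kind: Pod
metadata:
  name: web-server-1
spec:
  containers:
  - name: web-server
    image: nginx   # placeholder; must contain fio for the IO step below
    volumeMounts:
    - name: data
      mountPath: /var/lib/www/html
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: cephfs-pvc-1
EOF

# Start IO on the pod (same fio workload the test runs), then delete the pod
# without waiting and watch how long it takes and which events fire.
oc -n cephfs-delete-test rsh web-server-1 fio --name=fio-rand-readwrite \
  --filename=/var/lib/www/html/test_io --readwrite=randrw --bs=4K --direct=1 \
  --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 \
  --fsync_on_close=1 --rwmixread=75 --ioengine=libaio &
oc -n cephfs-delete-test delete pod web-server-1 --wait=false
oc -n cephfs-delete-test get events -w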


Actual results:
Pods take a prolonged time to be deleted and a KillContainerError message appears in the pod events.

Expected results:
Pods should be deleted within the expected time, without any errors.

Additional info:

The following tests are failing due to a timeout while waiting for a pod to be deleted:
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mgr]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-osd]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-osd]
tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003icm1r3-t4b/j003icm1r3-t4b_20210625T110119/logs/failed_testcase_ocs_logs_1624627393/

test execution logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j003icm1r3-t4b/j003icm1r3-t4b_20210625T110119/logs/ocs-ci-logs-1624627393/by_outcome/failed/tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py/

Note: The test mentioned in "Steps to Reproduce" fails at a stage before any disruptive operation (daemon kill) is performed.

Comment 2 Mudit Agarwal 2021-07-13 07:07:28 UTC
This looks expected to me:

>> io_size = self.pvc_size - 1
>> pod_obj.run_io(
>>     storage_type="fs", size=f"{io_size}G", fio_filename=f"{pod_obj.name}_io"
>> )

The run_io() function, which is used to run IOs, has a parameter fio_sync which should be set to 1 in order for the sync to happen after fio finishes.

If the pod deletion is attempted just after the IO is run on the pod, it will wait for the data to flush to the disk before killing/stopping the container.
Can we enable the sync and then try to reproduce this?
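
For illustration, a hedged sketch of the same workload with the flush forced explicitly at the end of the run (--fsync_on_close and --end_fsync are standard fio options; the pod and namespace names are placeholders):

# Run the fio workload and force a final fsync before fio exits.
oc -n <test-namespace> rsh <pod-name> fio --name=fio-rand-readwrite \
  --filename=/var/lib/www/html/<pod-name>_io --readwrite=randrw --bs=4K \
  --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 \
  --fsync_on_close=1 --end_fsync=1 --rwmixread=75 --ioengine=libaio \
  --output-format=json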

Also, can you please point me to the specific pods which we failed to delete?

Comment 3 Sidhant Agrawal 2021-07-14 08:48:05 UTC
@Mudit
Did you mean this parameter, "fsync_on_close"?
fsync_on_close=1 is already set in the fio command.

Consider this failure case in test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mon]

>> Fio started with fsync_on_close = 1 on pod "pod-test-cephfs-1041dd086ca44f2b979eda89"

2021-06-26 16:05:16,624 - ThreadPoolExecutor-627_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 rsh pod-test-cephfs-1041dd086ca44f2b979eda89 fio --name=fio-rand-readwrite --filename=/var/lib/www/html/pod-test-cephfs-1041dd086ca44f2b979eda89_io --readwrite=randrw --bs=4K --direct=1 --numjobs=1 --time_based=1 --runtime=60 --size=2G --iodepth=4 --invalidate=1 --fsync_on_close=1 --rwmixread=75 --ioengine=libaio --rate=1m --rate_process=poisson --output-format=json


>> Delete command
2021-06-26 16:05:30,876 - ThreadPoolExecutor-625_0 - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 delete Pod pod-test-cephfs-1041dd086ca44f2b979eda89 --wait=false


>> Last check if pod got deleted  --> Pod was still present
2021-06-26 16:10:09,961 - MainThread - INFO - ocs_ci.utility.utils.exec_cmd.486 - Executing command: oc -n namespace-test-2b6c084cb1204a9786183bfe3 get Pod pod-test-cephfs-1041dd086ca44f2b979eda89 -n namespace-test-2b6c084cb1204a9786183bfe3 -o yaml

>> Pod events
...
  Normal   Started                 36m    kubelet                  Started container web-server
  Normal   Killing                 4m46s  kubelet                  Stopping container web-server
  Warning  FailedKillPod           10s    kubelet                  error killing pod: failed to "KillContainer" for "web-server" with KillContainerError: "rpc error: code = Unknown desc = failed to get container 'stopped' status 0932c6a0a5c161028a4899814c065df585e2489e5422b8d85b95b2c1bb5cf27d: failed to get container stopped status: 30s timeout reached"


>> In the teardown phase, the pod was not found; it had already been deleted by that time.
2021-06-26 16:44:43,093 - MainThread - WARNING - ocs_ci.utility.utils.exec_cmd.505 - Command stderr: Error from server (NotFound): pods "pod-test-cephfs-1041dd086ca44f2b979eda89" not found
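
For what it's worth, the deletion latency can also be measured directly with oc, outside the test framework; a minimal sketch with placeholder names:

# Issue the delete without waiting, then time how long the pod actually takes to go away.
oc -n <test-namespace> delete pod <pod-name> --wait=false
time oc -n <test-namespace> wait --for=delete pod/<pod-name> --timeout=300s

# In parallel, watch for the FailedKillPod warning while the delete is in flight.
oc -n <test-namespace> get events -w --field-selector involvedObject.name=<pod-name>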

Comment 4 Sahina Bose 2021-08-03 06:20:40 UTC
Humble, any further inputs on this bug? Is there anything we need to check on the ROKS platform as to why the pods mounting CephFS PVC take time to be deleted?

Comment 5 Humble Chirammal 2021-08-03 07:58:36 UTC
(In reply to Sahina Bose from comment #4)
> Humble, any further inputs on this bug? Is there anything we need to check
> on the ROKS platform as to why the pods mounting CephFS PVC take time to be
> deleted?

Sahina, as Mudit mentioned, we have seen delays in deleting the pods because the in-flight IO sync/flush process takes time. Mudit, anything else to add on that front?

Apart from that, I was recently looking into various locking issues in the driver to improve certain operations. One of the fixes introduced there improved pod "delete" performance to a good extent: with the fix in place, the time to delete 1000 pods improved from ~980 seconds to ~280 seconds.

https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124

However, if the same test suite (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::TestDaemonKillDuringPodPvcDeletion::test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]) does not exhibit issues on other platforms and this BZ is reported only on the "ROKS" platform, I am not sure why that is the case.


Meanwhile, I would also like to ask: do we see the same issue with OCS 4.8/4.9 builds when running on ROKS?

Comment 6 Sahina Bose 2021-08-03 15:35:32 UTC
(In reply to Humble Chirammal from comment #5)
> (In reply to Sahina Bose from comment #4)
> > Humble, any further inputs on this bug? Is there anything we need to check
> > on the ROKS platform as to why the pods mounting CephFS PVC take time to be
> > deleted?
> 
> Sahina, as Mudit mentioned we have seen delay in deleting the PODS due to
> the in flight IO sync/flush process taking time. Mudit, anything else to add
> on that front? 

Since the tests are the same on all platforms, is this an issue with the underlying storage performance on ROKS (sync/flush taking time)?

> 
> Apart from that, recently I was looking into various locking issues we have
> in the driver to improve certain operations. One of the fix introduced here
> improved the "delete" performance of PODs to a good extent, that said, there
> was an improvement of delete time  ie 1000 pods delete time has been
> improved from 980seconds to ~280 seconds with the fix in place.
> 
> https://github.com/ceph/ceph-csi/pull/2149/#issuecomment-872922124
> 
> however if the same test suite
> (tests/manage/pv_services/test_ceph_daemon_kill_during_resource_deletion.py::
> TestDaemonKillDuringPodPvcDeletion::
> test_ceph_daemon_kill_during_pod_pvc_deletion[CephFileSystem-delete_pods-
> mgr]) does not exhibit issues on other platforms and this reported BZ is
> only on "ROKS" platform, I am not sure why thats the case.

These tests seem to fail only on the ROKS platform. @Sidhant, can you confirm?

> 
> 
> meanwhile, I would also like to ask, do we see the same issue on OCS 4.8/4.9
> builds while running on ROKS?

IBM ROKS is testing OCS 4.7.2. Humble, is there a change in OCS 4.8/4.9 that could help here?

Comment 7 Sidhant Agrawal 2021-08-04 04:28:07 UTC
(In reply to Sahina Bose from comment #6)
> 
> These tests seem to fail only on ROKS platform. @Sidhant, can you confirm?

Yes, the tests seem to fail only on the ROKS platform.

Comment 9 Mudit Agarwal 2021-08-09 08:02:37 UTC
Sorry, I missed this. I will take a look at the logs again. Targeting it for 4.9 for now; a fix (if any) can be backported to 4.8.z if required.

Comment 10 Humble Chirammal 2021-08-16 14:18:05 UTC
Sidhant/Sahina, can you please confirm whether we hit this issue in every test run on this platform, or only intermittently?

(In reply to Mudit Agarwal from comment #9)
> Sorry, missed this. Will take a look at the logs again. Targeting it for 4.9
> for now, fix(if any) can be backported to 4.8.z if required.

As discussed in CSI team bug triage call, Yug will be taking a look at this bug, changing ownership.

Comment 13 ggeddam 2021-08-20 12:45:35 UTC
We have seen these failures in a recent run of Tier4c: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/

1) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mgr]

2) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pvcs-mon]

3) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-mds]

4) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-cephfsplugin]

5) tests/manage/pv_services/test_resource_deletion_during_pod_pvc_deletion.py::TestDeleteResourceDuringPodPvcDeletion::test_disruptive_during_pod_pvc_deletion[CephFileSystem-delete_pods-operator]

Comment 14 Yug Gupta 2021-08-20 12:50:34 UTC
(In reply to ggeddam from comment #13)
> We have seen these failures in recent run of Tier4c,
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/
> qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/


Hey Gangadhara,
Does this run have the fix mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1980873#c12?

Comment 15 ggeddam 2021-08-20 13:12:05 UTC
Hi Yug Gupta,

I ran this job on the 17th: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
So I think this run included the fix.

Comment 17 Yug Gupta 2021-08-23 07:05:35 UTC
 (In reply to ggeddam from comment #15)
> HI Yug Gupta
> 
> I have run this job on 17th,
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/
> qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
> So I think I ran on this fix

On checking the job run, I see that the Ceph cluster is not healthy and we are seeing some network failures:
`ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK. Health: HEALTH_WARN 1/3 mons down, quorum a,c`

See: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/tests.manage.z_cluster.nodes.test_worker_nodes_network_failures/TestWorkerNodesFailure/test_all_worker_nodes_short_network_failure_colocated_/

Can you try to reproduce this with the fix on a cluster that is healthy, as the cluster state might be causing the issue here?
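
For reference, a hedged sketch of confirming cluster health before re-running. It assumes the rook-ceph toolbox; the enableCephTools patch is the usual way to deploy it on OCS, but verify the exact resource names on your cluster:

# Enable the rook-ceph toolbox if it is not already deployed (resource names assumed).
oc patch ocsinitialization ocsinit -n openshift-storage --type json \
  --patch '[{"op": "replace", "path": "/spec/enableCephTools", "value": true}]'

# Confirm overall health and mon quorum before kicking off the tier4 run again.
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph health detail
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph quorum_status -f json-pretty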

Comment 18 Humble Chirammal 2021-08-23 07:10:51 UTC
(In reply to Yug Gupta from comment #17)
>  (In reply to ggeddam from comment #15)
> > HI Yug Gupta
> > 
> > I have run this job on 17th,
> > https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/
> > qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/
> > So I think I ran on this fix
> 
> On checking the job run, I see that the Ceph Cluster is not Healthy, and we
> are seeing some network failures
> `ocs_ci.ocs.exceptions.CephHealthException: Ceph cluster health is not OK.
> Health: HEALTH_WARN 1/3 mons down, quorum a,c`
> 
> See:
> https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/view/Tier4/job/
> qe-trigger-ibmcloud-managed-1az-rhel-3w-tier4c/9/testReport/tests.manage.
> z_cluster.nodes.test_worker_nodes_network_failures/TestWorkerNodesFailure/
> test_all_worker_nodes_short_network_failure_colocated_/
> 
> Can you try to reproduce with the fix on a cluster that is HEALTHY? As that
> might be causing the issue here.

Yeah, cluster health and slow response times have been suspect from the start, so it is good to have more attention on this track.

Comment 19 Mudit Agarwal 2021-09-21 13:30:40 UTC
Please reopen if this is reproducible on a healthy cluster (with the CSI fix mentioned in the above comments).