Bug 2225420 - Do not run rbd sparsify when volume is in use
Summary: Do not run rbd sparsify when volume is in use
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Rakshith
QA Contact: krishnaram Karthick
URL:
Whiteboard:
Depends On: 2225431 2225433 2225436
Blocks:
 
Reported: 2023-07-25 09:38 UTC by Rakshith
Modified: 2023-09-14 09:40 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-09-01 07:39:59 UTC
Embargoed:




Links
Github ceph/ceph-csi pull 3985 (Merged): rbd: do not execute rbd sparsify when volume is in use (last updated 2023-07-25 09:52:43 UTC)
Github red-hat-storage/ceph-csi pull 170 (Merged): BUG 2225420: rbd: do not execute rbd sparsify when volume is in use (last updated 2023-09-01 07:10:05 UTC)

Description Rakshith 2023-07-25 09:38:41 UTC
This bug was initially created as a copy of Bug #2214838

I am copying this bug because: 



Description of problem:

Attempting to restart a VMI that uses OCS storage and hot-plugged disks results in the pod remaining in 'Terminating' indefinitely, with virt-launcher reporting "Failed to terminate process 68 with SIGTERM: Device or resource busy".

The pod cannot be killed and restarted, leaving the VM down.


Version-Release number of selected component (if applicable):


How reproducible:

Not reliably reproducible; the issue appears to be intermittent.


Steps to Reproduce:

Reboot the worker node to clear previously hung pods.

Restart the VMI multiple times.


Actual results:

After restarting VMI: 

# oc get events:  
1h5m       Normal   Killing               pod/hp-volume-kzn4h                Stopping container hotplug-disk
5m56s      Normal   Killing               pod/virt-launcher-awrhdv500-qhs86  Stopping container compute
5m56s      Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
13m        Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox 3ca982adc188e0fd08c40f49a9d2593b476d66d6084142a0b24fa6de119df262: failed to stop container k8s_compute_virt-launcher-awrhdv500-qhs86_XXXX-os-images_19db6081-760e-42f4-859e-fe2b79239275_0: context deadline exceeded"]
20m        Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

# omg logs virt-launcher-awrhdv500-qhs86
2023-06-13T16:35:31.597516920Z {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi shutdown","name":"awrhdv500","namespace":"XXX-os-images","pos":"server.go:311","timestamp":"2023-06-13T16:35:31.597448Z","uid":"42e57a49-2ade-4a26-ba02-b4d4adeb43bc"}
2023-06-13T16:35:47.026189598Z {"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}
2023-06-13T16:35:47.030654658Z {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 6 with reason 2 received","pos":"client.go:435","timestamp":"2023-06-13T16:35:47.030612Z"}
2023-06-13T16:35:48.786034568Z {"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}




Expected results:

VM restarts cleanly. 


Additional info:

This is the second time this has happened in a week.  Previously we entered the node to find a defunct qemu-kvm process and the 'conmon' process still running: 

ps -ef | grep -i awrhdv500 -> 
conmon process still running 

sh-4.4# ps -ef | grep -i awrhdv500
root     1297341 1292627  0 18:40 ?        00:00:00 grep -i awrhdv500
root     3588286       1  0 Jun08 ?        00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2/userdata -c d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2 --exit-dir /var/run/crio/exits -l /var/log/pods/XXX-os-images_virt-launcher-awrhdv500-kvkf8_57ab818f-aea7-4a5c-ad20-46fdc5e547ee/compute/


We tried to force-delete the pod virt-launcher-awrhdv500-kvkf8. The pod was removed, but we still saw the /usr/bin/conmon -b /run/containers/storage/overlay-containers/ process running on the node. This process was never cleaned up.

We then tried starting the VM awrhdv500 and it got stuck in a "Pending" state.

The only way we found to clear it was to reboot the worker node.

Comment 5 krishnaram Karthick 2023-08-14 15:06:41 UTC
Has there been any request to backport the fix to 4.10?
As far as I know, CNV doesn't use 4.10, and 4.10 will be out of support soon, so I propose closing this bug.

Comment 9 Niels de Vos 2023-09-01 07:39:59 UTC
This is addressed in:

ODF-4.11 bug 2225431
ODF-4.12 bug 2225433
ODF-4.13 bug 2225436

There is no customer support case attached to this BZ, so I am closing this one out. If a fix is needed in ODF-4.10, please re-open this bug with a justification.

