Bug 2225431 - Failed to restart VMI in cnv - Failed to terminate process Device or resource busy
Summary: Failed to restart VMI in cnv - Failed to terminate process Device or resource busy
Keywords:
Status: ON_QA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ODF 4.11.11
Assignee: Rakshith
QA Contact: Vishakha Kathole
URL:
Whiteboard:
Depends On:
Blocks: 2225420
 
Reported: 2023-07-25 09:55 UTC by Rakshith
Modified: 2023-09-20 12:49 UTC
CC List: 4 users

Fixed In Version: 4.11.11-2
Doc Type: Bug Fix
Doc Text:
Cause: The `rbd sparsify` command was run as part of the ReclaimSpace operation even when the RBD PVC was attached to a pod.
Consequence: Running `rbd sparsify` while the RBD PVC is attached to a pod may impact I/O and is therefore not optimal.
Fix: The `rbd sparsify` command is now skipped when the RBD PVC is found to be attached to a pod.
Result: Any negative impact of running `rbd sparsify` on an RBD PVC attached to a pod is mitigated.
(An illustrative sketch of this guard follows the Links section below.)
Clone Of:
Environment:
Last Closed:
Embargoed:
kramdoss: needinfo+


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-csi pull 3985 0 None Merged rbd: do not execute rbd sparsify when volume is in use 2023-07-25 09:55:45 UTC
Github red-hat-storage ceph-csi pull 171 0 None Merged BUG 2225431: rbd: do not execute rbd sparsify when volume is in use 2023-07-25 14:49:49 UTC
Github red-hat-storage ceph-csi pull 186 0 None open BUG 2225431: rbd: do not execute rbd sparsify when volume is in use 2023-08-30 11:03:05 UTC
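
The guard described in the Doc Text above (and implemented by the linked ceph-csi pull requests) can be illustrated with a minimal Go sketch. This is not the actual ceph-csi code: the function names (imageInUse, reclaimSpace), the use of the rbd CLI, and the parsing of `rbd status` output for a "Watchers: none" line are all illustrative assumptions; see the linked PRs for the real change.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// imageInUse reports whether an RBD image currently has watchers by parsing
// the output of `rbd status`. A watcher normally means a client still has the
// image mapped or opened (for example, the PVC is attached to a pod).
// Assumption: the rbd CLI is available and "Watchers: none" appears when no
// client has the image open.
func imageInUse(pool, image string) (bool, error) {
	out, err := exec.Command("rbd", "status", fmt.Sprintf("%s/%s", pool, image)).CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("rbd status failed: %w: %s", err, out)
	}
	return !strings.Contains(string(out), "Watchers: none"), nil
}

// reclaimSpace runs `rbd sparsify` only when the image is not in use,
// mirroring the behaviour described in the Doc Text above.
func reclaimSpace(pool, image string) error {
	inUse, err := imageInUse(pool, image)
	if err != nil {
		return err
	}
	if inUse {
		// Skip sparsify to avoid impacting I/O on an attached volume.
		fmt.Printf("image %s/%s is in use, skipping rbd sparsify\n", pool, image)
		return nil
	}
	return exec.Command("rbd", "sparsify", fmt.Sprintf("%s/%s", pool, image)).Run()
}

func main() {
	// Hypothetical pool/image names for illustration only.
	if err := reclaimSpace("replicapool", "csi-vol-example"); err != nil {
		fmt.Println(err)
	}
}

Per the Doc Text, the shipped fix behaves the same way: when the volume is detected as in use, the sparsify step is skipped so the I/O of the attached workload is not affected.
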

Description Rakshith 2023-07-25 09:55:46 UTC
This bug was initially created as a copy of Bug #2214838

I am copying this bug because: 



Description of problem:

Attempting to restart a VMI that uses OCS storage and hot-plug disks results in the pod stuck 'Terminating' indefinitely, with virt-launcher reporting "Failed to terminate process 68 with SIGTERM: Device or resource busy".

The pod cannot be killed and restarted, leaving the VM down.


Version-Release number of selected component (if applicable):


How reproducible:

Not fully reproducible as it appears to be intermittent.


Steps to Reproduce:

Reboot the worker node to clear previously hung pods.

Restart the VMI multiple times.


Actual results:

After restarting VMI: 

# oc get events:  
1h5m       Normal   Killing               pod/hp-volume-kzn4h                Stopping container hotplug-disk
5m56s      Normal   Killing               pod/virt-launcher-awrhdv500-qhs86  Stopping container compute
5m56s      Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
13m        Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox 3ca982adc188e0fd08c40f49a9d2593b476d66d6084142a0b24fa6de119df262: failed to stop container k8s_compute_virt-launcher-awrhdv500-qhs86_XXXX-os-images_19db6081-760e-42f4-859e-fe2b79239275_0: context deadline exceeded"]
20m        Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

# omg logs virt-launcher-awrhdv500-qhs86
2023-06-13T16:35:31.597516920Z {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi shutdown","name":"awrhdv500","namespace":"XXX-os-images","pos":"server.go:311","timestamp":"2023-06-13T16:35:31.597448Z","uid":"42e57a49-2ade-4a26-ba02-b4d4adeb43bc"}
2023-06-13T16:35:47.026189598Z {"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}
2023-06-13T16:35:47.030654658Z {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 6 with reason 2 received","pos":"client.go:435","timestamp":"2023-06-13T16:35:47.030612Z"}
2023-06-13T16:35:48.786034568Z {"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}




Expected results:

VM restarts cleanly. 


Additional info:

This is the second time this has happened in a week. Previously, we logged in to the node and found a defunct qemu-kvm process with its 'conmon' process still running:

ps -ef | grep -i awrhdv500 -> 
conmon process still running 

sh-4.4# ps -ef | grep -i awrhdv500
root     1297341 1292627  0 18:40 ?        00:00:00 grep -i awrhdv500
root     3588286       1  0 Jun08 ?        00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2/userdata -c d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2 --exit-dir /var/run/crio/exits -l /var/log/pods/XXX-os-images_virt-launcher-awrhdv500-kvkf8_57ab818f-aea7-4a5c-ad20-46fdc5e547ee/compute/


We tried to force-delete the pod virt-launcher-awrhdv500-kvkf8. The pod was removed, but the /usr/bin/conmon -b /run/containers/storage/overlay-containers/ process was still running on the node and was not cleaned up.

We then tried starting the VM awrhdv500, and it got stuck in a "Pending" state.

The only way we found to clear it was to reboot the worker node.

Comment 8 Niels de Vos 2023-08-30 11:03:05 UTC
https://github.com/red-hat-storage/ceph-csi/pull/186 has been created to address this bug. Once the merge window for ODF 4.11.11 opens, leave a comment

/bugzilla refresh

in the PR to get it merged.

