Bug 2214838 - Failed to restart VMI in cnv - Failed to terminate process Device or resource busy
Summary: Failed to restart VMI in cnv - Failed to terminate process Device or resource busy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: csi-driver
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.14.0
Assignee: Rakshith
QA Contact: Vishakha Kathole
URL:
Whiteboard:
Depends On:
Blocks: 2244409
 
Reported: 2023-06-13 19:37 UTC by Sean Haselden
Modified: 2024-02-20 06:05 UTC (History)
CC List: 13 users

Fixed In Version: 4.14.0-105
Doc Type: Bug Fix
Doc Text:
.Mitigation of negative impact while running the reclaim space operation on an RBD PVC attached to a pod
Previously, during the reclaim space operation, IO and performance were impacted when the rbd sparsify command was executed on the RADOS block device (RBD) persistent volume claim (PVC) while it was attached to a pod. With this fix, execution of the rbd sparsify command is skipped when the RBD PVC is found to be attached to a pod during the operation. As a result, the negative impact of running the reclaim space operation on an RBD PVC attached to a pod is mitigated.
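
For illustration, the behavior the fix implements can be approximated manually from the rook-ceph toolbox. This is a hedged sketch only, using the typical ODF pool name and a hypothetical CSI image name; it is not the actual ceph-csi code path:

# Check whether the RBD image backing the PVC has any watchers;
# a watcher indicates the image is mapped, i.e. the PVC is attached to a pod.
rbd status ocs-storagecluster-cephblockpool/csi-vol-<uuid>

# Only run sparsify when "Watchers: none" is reported; with the fix in
# 4.14.0-105 the reclaim space operation skips this step automatically
# whenever a watcher is present.
rbd sparsify ocs-storagecluster-cephblockpool/csi-vol-<uuid>
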
Clone Of:
: 2215917 (view as bug list)
Environment:
Last Closed: 2023-11-08 18:51:26 UTC
Embargoed:
shaselde: needinfo+
kramdoss: needinfo+




Links
System ID / Status / Summary / Last Updated
Github ceph ceph-csi pull 3985 / Merged / rbd: do not execute rbd sparsify when volume is in use / 2023-07-19 07:58:08 UTC
Github red-hat-storage ocs-ci pull 9154 / Merged / GSS BZ-2214838 Automation, Test Rook Reclaim Namespace / 2024-02-20 06:05:55 UTC
Red Hat Issue Tracker CNV-29868 / None / None / 2023-06-13 19:47:44 UTC

Description Sean Haselden 2023-06-13 19:37:47 UTC
Description of problem:

Attempting to restart a VMI that uses OCS storage and hot-plug disks results in the pod staying in 'Terminating' indefinitely, with virt-launcher logging "Failed to terminate process 68 with SIGTERM: Device or resource busy".

The pod cannot be killed and restarted, which leaves the VM down.


Version-Release number of selected component (if applicable):


How reproducible:

Not reliably reproducible; the issue appears to be intermittent.


Steps to Reproduce:

1. Reboot the worker node to clear any previously hung pods.
2. Restart the VMI multiple times.
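
For reference, the restart can be driven with commands along these lines (a hedged sketch; it assumes the VM is named awrhdv500 as in the logs below, that virtctl is installed, and <namespace> is a placeholder):

# Restart the VM, then watch whether the old virt-launcher pod ever
# finishes terminating:
virtctl restart awrhdv500 -n <namespace>
oc get pods -n <namespace> | grep virt-launcher-awrhdv500
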


Actual results:

After restarting VMI: 

# oc get events:  
1h5m       Normal   Killing               pod/hp-volume-kzn4h                Stopping container hotplug-disk
5m56s      Normal   Killing               pod/virt-launcher-awrhdv500-qhs86  Stopping container compute
5m56s      Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
13m        Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox 3ca982adc188e0fd08c40f49a9d2593b476d66d6084142a0b24fa6de119df262: failed to stop container k8s_compute_virt-launcher-awrhdv500-qhs86_XXXX-os-images_19db6081-760e-42f4-859e-fe2b79239275_0: context deadline exceeded"]
20m        Warning  FailedKillPod         pod/virt-launcher-awrhdv500-qhs86  error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

# omg logs virt-launcher-awrhdv500-qhs86
2023-06-13T16:35:31.597516920Z {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi shutdown","name":"awrhdv500","namespace":"XXX-os-images","pos":"server.go:311","timestamp":"2023-06-13T16:35:31.597448Z","uid":"42e57a49-2ade-4a26-ba02-b4d4adeb43bc"}
2023-06-13T16:35:47.026189598Z {"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}
2023-06-13T16:35:47.030654658Z {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 6 with reason 2 received","pos":"client.go:435","timestamp":"2023-06-13T16:35:47.030612Z"}
2023-06-13T16:35:48.786034568Z {"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}




Expected results:

VM restarts cleanly. 


Additional info:

This is the second time this has happened in a week. Previously, we logged in to the node and found a defunct qemu-kvm process with its 'conmon' process still running:

ps -ef | grep -i awrhdv500 showed the conmon process still running:

sh-4.4# ps -ef | grep -i awrhdv500
root     1297341 1292627  0 18:40 ?        00:00:00 grep -i awrhdv500
root     3588286       1  0 Jun08 ?        00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2/userdata -c d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2 --exit-dir /var/run/crio/exits -l /var/log/pods/XXX-os-images_virt-launcher-awrhdv500-kvkf8_57ab818f-aea7-4a5c-ad20-46fdc5e547ee/compute/


We tried to force-delete the pod virt-launcher-awrhdv500-kvkf8. The pod was removed, but the /usr/bin/conmon -b /run/containers/storage/overlay-containers/ process was still running on the node and was never cleaned up.

We then tried starting the VM awrhdv500 and it got stuck in a "Pending" state.

The only way we found to clear it was to reboot the worker node.
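
A hedged sketch of that recovery sequence, using standard OpenShift commands (<worker-node> and <namespace> are placeholders; the debug step assumes cluster-admin access):

# Force-delete the stuck launcher pod (this only removes it from the API server):
oc delete pod virt-launcher-awrhdv500-kvkf8 -n <namespace> --force --grace-period=0

# Check from the node whether the conmon process is still running:
oc debug node/<worker-node> -- chroot /host ps -ef | grep -i awrhdv500

# If it is, drain the worker node, reboot it out of band, then bring it back:
oc adm drain <worker-node> --ignore-daemonsets --delete-emptydir-data
oc adm uncordon <worker-node>
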

Comment 2 Fabian Deutsch 2023-06-14 12:41:31 UTC
Moving this to storage for further investigation.

Comment 54 errata-xmlrpc 2023-11-08 18:51:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.14.0 security, enhancement & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6832

