This bug was initially created as a copy of Bug #2214838

I am copying this bug because:

Description of problem:
Attempt to restart VMI using OCS storage and hot plug disks results in pod 'terminating' indefinitely and virt-launcher "Failed to terminate process 68 with SIGTERM: Device or resource busy"
Cannot kill pod and restart it, rendering the VM down.

Version-Release number of selected component (if applicable):

How reproducible:
Not fully reproducible as it appears to be intermittent.

Steps to Reproduce:
Rebooted worker node in order to clear previously hung pods.
Restart VMI multiple times.

Actual results:
After restarting VMI:

# oc get events:
1h5m Normal Killing pod/hp-volume-kzn4h Stopping container hotplug-disk
5m56s Normal Killing pod/virt-launcher-awrhdv500-qhs86 Stopping container compute
5m56s Warning FailedKillPod pod/virt-launcher-awrhdv500-qhs86 error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
13m Warning FailedKillPod pod/virt-launcher-awrhdv500-qhs86 error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container for pod sandbox 3ca982adc188e0fd08c40f49a9d2593b476d66d6084142a0b24fa6de119df262: failed to stop container k8s_compute_virt-launcher-awrhdv500-qhs86_XXXX-os-images_19db6081-760e-42f4-859e-fe2b79239275_0: context deadline exceeded"]XX
20m Warning FailedKillPod pod/virt-launcher-awrhdv500-qhs86 error killing pod: [failed to "KillContainer" for "compute" with KillContainerError: "rpc error: code = Unknown desc = failed to stop container 21e9a094da73c02fbf5e6c4c21b5892b7ae6a161e9ac79a358ebe1040fec826a: context deadline exceeded", failed to "KillPodSandbox" for "19db6081-760e-42f4-859e-fe2b79239275" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]

# omg logs virt-launcher-awrhdv500-qhs86
2023-06-13T16:35:31.597516920Z {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled vmi shutdown","name":"awrhdv500","namespace":"XXX-os-images","pos":"server.go:311","timestamp":"2023-06-13T16:35:31.597448Z","uid":"42e57a49-2ade-4a26-ba02-b4d4adeb43bc"}
2023-06-13T16:35:47.026189598Z {"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}
2023-06-13T16:35:47.030654658Z {"component":"virt-launcher","level":"info","msg":"DomainLifecycle event 6 with reason 2 received","pos":"client.go:435","timestamp":"2023-06-13T16:35:47.030612Z"}
2023-06-13T16:35:48.786034568Z {"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}

Expected results:
VM restarts cleanly.

Additional info:
This is the second time this has happened in a week. Previously we entered the node to find a defunct qemu-kvm process and the 'conmon' process still running:

ps -ef | grep -i awrhdv500 -> conmon process still running

sh-4.4# ps -ef | grep -i awrhdv500
root 1297341 1292627 0 18:40 ? 00:00:00 grep -i awrhdv500
root 3588286 1 0 Jun08 ? 00:00:00 /usr/bin/conmon -b /run/containers/storage/overlay-containers/d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2/userdata -c d129e13673e5d2280e3a931f07c2f58e64d880da4ef615274e09553576a3a1c2 --exit-dir /var/run/crio/exits -l /var/log/pods/XXX-os-images_virt-launcher-awrhdv500-kvkf8_57ab818f-aea7-4a5c-ad20-46fdc5e547ee/compute/

We tried to force delete the pod virt-launcher-awrhdv500-kvkf8. The pod was removed, but we still saw the /usr/bin/conmon -b /run/containers/storage/overlay-containers/ process running on the node. This process wasn't deleted.
We then tried starting the vm awrhdv500 and it got stuck in a "Pending state".
Only way we found to clear it was a reboot of the worker node.
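
For anyone triaging a similar hang, the virt-launcher log above can be scanned for the libvirt termination failure. This is only a rough sketch: it assumes each log line is a timestamp, a space, then a JSON object, as in the excerpt above, and the function name find_sigterm_failures is made up for illustration (it is not part of KubeVirt or any tool).

```python
import json


def find_sigterm_failures(log_lines):
    """Return (timestamp, msg) pairs for error-level entries that report
    a failed process termination. Hypothetical helper, not a KubeVirt API."""
    hits = []
    for line in log_lines:
        # Assumed layout: "<timestamp> <JSON payload>"; skip anything else.
        _, _, payload = line.partition(" ")
        try:
            entry = json.loads(payload)
        except ValueError:
            continue
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        if entry.get("level") == "error" and "Failed to terminate process" in msg:
            hits.append((entry.get("timestamp"), msg))
    return hits


# Two lines copied from the log excerpt in this report.
sample = [
    '2023-06-13T16:35:47.026189598Z {"component":"virt-launcher","level":"error","msg":"Failed to terminate process 68 with SIGTERM: Device or resource busy","pos":"virProcessKillPainfullyDelay:472","subcomponent":"libvirt","thread":"26","timestamp":"2023-06-13T16:35:47.025000Z"}',
    '2023-06-13T16:35:48.786034568Z {"component":"virt-launcher","level":"info","msg":"Grace Period expired, shutting down.","pos":"monitor.go:165","timestamp":"2023-06-13T16:35:48.785937Z"}',
]

for timestamp, msg in find_sigterm_failures(sample):
    print(timestamp, msg)
```

A hit only shows that libvirt could not terminate the process; it does not by itself confirm the hot plug disk as the cause.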
Steps to reproduce are in comment 46 of the original bug: https://bugzilla.redhat.com/show_bug.cgi?id=2214838#c46
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.13.3 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:5376