Description of problem:

Virt-launcher pods are slow to terminate in some cases that are not yet well understood. This is what is observed:
1. The launcher is notified to shut down gracefully. (Note: it appears we do not try to forcefully shut down the domain after the graceful shutdown.)
2. "gracefully closed notify pipe connection for vmi" is logged after the domain shuts down.
3. Many loops follow with "detected unresponsive virt-launcher command socket" <- this is the main reason we don't clean up.
4. Final clean-up is performed and the pod terminates shortly afterwards.

The most notable recent change in this area was safepath handling, which might be the cause of the clean-up taking a different path.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
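The repeated "detected unresponsive virt-launcher command socket" message is the clearest signal of the stuck state. A quick way to quantify the loop is to count that message in a captured log. This is an illustrative sketch: the file name `handler.log` and the sample JSON lines are made up, only the two quoted log messages come from the report; against a live cluster the log would come from `oc logs` on the relevant virt-handler pod.

```shell
# Build a tiny sample log (hypothetical lines; the quoted messages are real).
cat > handler.log <<'EOF'
{"component":"virt-handler","msg":"gracefully closed notify pipe connection for vmi default/vma"}
{"component":"virt-handler","msg":"detected unresponsive virt-launcher command socket (default/vma)"}
{"component":"virt-handler","msg":"detected unresponsive virt-launcher command socket (default/vma)"}
{"component":"virt-handler","msg":"detected unresponsive virt-launcher command socket (default/vma)"}
EOF

# Count how often the unresponsive-socket loop fired.
grep -c 'detected unresponsive virt-launcher command socket' handler.log   # prints 3 for the sample above
```

A count that keeps growing for minutes after "gracefully closed notify pipe connection" would match the behavior described in step 3.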
We saw this bug in several different network tests, all of which include a Linux bridge, a NAD, and a VM.

Steps to reproduce (for example, from 'test_veth_removed_from_host_after_vm_deleted'):
1. Create a NAD of type "linux bridge": oc create -f br1test_nad.yaml
2. Create a Linux bridge policy (NNCP) on worker 1: oc create -f br1test_nncp.yaml
3. Create a VM (fedora) connected to the NAD: oc create -f vma.yaml
4. Wait for the VM to be Running: oc get VM -w
5. Delete the VM: oc delete vm vma

The virt-launcher pod is stuck in Terminating status for about 8 minutes. In a similar scenario with a bond NNCP, the pods do not behave this way and terminate quickly.
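The steps above can be wrapped in a small helper that times the tear-down, which makes the ~8-minute stall easy to capture in test output. This is a sketch, not the test suite's actual code: the `oc` invocations mirror the report, while the `measure_seconds` helper name and the `kubevirt.io/domain=vma` pod selector are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print how many seconds a command takes to run.
measure_seconds() {
  local start
  start=$(date +%s)
  "$@"
  echo "$(( $(date +%s) - start ))s"
}

# Hypothetical usage against a live cluster, mirroring steps 1-5 above:
#   oc create -f br1test_nad.yaml     # NAD of type "linux bridge"
#   oc create -f br1test_nncp.yaml    # Linux bridge NNCP on worker 1
#   oc create -f vma.yaml             # fedora VM connected to the NAD
#   oc get VM -w                      # wait until the VM is Running
#   oc delete vm vma
#   measure_seconds oc wait pod -l kubevirt.io/domain=vma --for=delete --timeout=600s

measure_seconds sleep 1   # local smoke test of the timer itself
```

In the buggy linux-bridge scenario the final `oc wait` would report roughly 480s; in the bond NNCP scenario it should finish in seconds.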
Deferring to 4.13 due to capacity and lack of clarity about the root cause.
Deferring to 4.14 due to capacity.