Bug 2128808

Summary: [4.9.z] Virt-launcher Pods are slow to terminate
Product: Container Native Virtualization (CNV) Reporter: lpivarc
Component: Virtualization Assignee: Itamar Holder <iholder>
Status: CLOSED WORKSFORME QA Contact: Kedar Bidarkar <kbidarka>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.9.6 CC: acardace, awax, fdeutsch, sgott
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-07-13 11:00:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description lpivarc 2022-09-21 14:37:07 UTC
Description of problem:
Virt-launcher pods are slow to terminate in some cases that are not yet well understood. The observed sequence is:
1. The launcher is notified to shut down gracefully. (Note: it seems we do not try to forcefully shut down the domain after the graceful shutdown.)
2. "gracefully closed notify pipe connection for vmi" is observed after the domain shuts down.
3. Many loops follow with "detected unresponsive virt-launcher command socket". This is the main reason why we don't clean up.
4. The final clean-up is performed and the Pod terminates shortly afterwards.
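
To gauge how long step 3 lasts in a given log, one could count the "detected unresponsive virt-launcher command socket" messages that appear after the notify pipe is closed. The following is a minimal sketch, not part of KubeVirt: it assumes the log is plain text with one message per line and matches only on the two message strings quoted above. The function name and the sample lines are hypothetical.

```python
PIPE_CLOSED = "gracefully closed notify pipe connection for vmi"
UNRESPONSIVE = "detected unresponsive virt-launcher command socket"


def count_unresponsive_after_close(lines):
    """Count UNRESPONSIVE messages seen after the first PIPE_CLOSED message.

    Returns None if PIPE_CLOSED never appears in the input.
    """
    seen_close = False
    count = 0
    for line in lines:
        if not seen_close:
            # Skip everything up to and including the pipe-closed line.
            seen_close = PIPE_CLOSED in line
            continue
        if UNRESPONSIVE in line:
            count += 1
    return count if seen_close else None


if __name__ == "__main__":
    # Hypothetical sample; real log lines carry more fields than shown here.
    sample = [
        "signaled graceful shutdown",
        "gracefully closed notify pipe connection for vmi",
        "detected unresponsive virt-launcher command socket",
        "detected unresponsive virt-launcher command socket",
        "final clean-up",
    ]
    print(count_unresponsive_after_close(sample))
```

A high count relative to the polling interval would indicate that most of the termination delay is spent in step 3.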

The most notable change in this area was the safepath handling, which might be the cause of different clean-up paths.

Comment 1 awax 2022-09-21 23:19:43 UTC
We saw this bug in several different network tests, all of which include a Linux bridge, a NAD and a VM.
Steps to reproduce (for example, from 'test_veth_removed_from_host_after_vm_deleted'):
1. Create a NAD with the type "linux bridge":
oc create -f br1test_nad.yaml

2. Create a linux bridge policy (NNCP) on worker 1:
oc create -f br1test_nncp.yaml

3. Create a VM (fedora) connected to the NAD:
oc create -f vma.yaml

4. Wait for the VM to be Running:
oc get VM -w

5. Delete the VM:
oc delete vm vma

The virt-launcher pod is stuck in Terminating status for about 8 minutes.
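
To put a number on the delay, step 5 could be timed with a small polling script. This is only a sketch under assumptions: it selects launcher pods with the label kubevirt.io=virt-launcher (so it expects no other VMs in the namespace), and the namespace argument, the helper names, and the 15-minute timeout are hypothetical choices, not taken from the test suite.

```python
import subprocess
import sys
import time


def launcher_pods(namespace):
    """Return the names of virt-launcher pods still present in the namespace."""
    out = subprocess.run(
        ["oc", "get", "pods", "-n", namespace,
         "-l", "kubevirt.io=virt-launcher", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def seconds_until_gone(list_pods, timeout=900.0, interval=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll list_pods() until it returns nothing; return the elapsed seconds.

    Raises TimeoutError if pods are still listed once `timeout` has elapsed.
    """
    start = clock()
    while list_pods():
        if clock() - start >= timeout:
            raise TimeoutError("virt-launcher pod still present after timeout")
        sleep(interval)
    return clock() - start


def main(namespace, vm_name):
    # Delete the VM without blocking, then time the launcher pod's removal.
    subprocess.run(
        ["oc", "delete", "vm", vm_name, "-n", namespace, "--wait=false"],
        check=True,
    )
    elapsed = seconds_until_gone(lambda: launcher_pods(namespace))
    print(f"virt-launcher pod gone after {elapsed:.0f}s")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python time_termination.py <namespace> <vm-name>
    main(sys.argv[1], sys.argv[2])
```

Running the same script in the Linux bridge and bond scenarios would give directly comparable termination times.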

In a similar scenario with a bond NNCP, the pods do not show this behavior and terminate quickly.

Comment 8 Antonio Cardace 2022-10-28 12:51:00 UTC
Deferring to 4.13 due to capacity and lack of clarity about the root cause.

Comment 10 Antonio Cardace 2023-03-03 16:47:35 UTC
Deferring to 4.14 due to capacity.