Bug 2005693

Summary: SSH connection to VM failed once in a while after migration
Product: Container Native Virtualization (CNV) Reporter: Israel Pinto <ipinto>
Component: Networking Assignee: Edward Haas <edwardh>
Status: NEW --- QA Contact: Meni Yakove <myakove>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.9.0 CC: jpeimer, phoracek, rnetser
Target Milestone: --- Keywords: Reopened, TestBlocker
Target Release: future Flags: phoracek: needinfo? (ipinto)
edwardh: needinfo? (ipinto)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-06-29 12:33:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Israel Pinto 2021-09-19 14:57:26 UTC
Description of problem:
While testing VM migration we noticed that the SSH connection to the VM sometimes fails.
Ping the VM right after migration from one of the nodes and see if we didn't witness packet loss.

Version-Release number of selected component (if applicable):
CNV 4.9

Steps to Reproduce:
1. Create a migratable VM with OCS storage and create an SSH service for the VM
2. Migrate the VM
3. Connect via SSH
4. Pause and unpause the VM
5. Connect to the VM via SSH again

Actual results:
Once in a while the SSH connection fails; we see it a lot in the automation runs.

Additional info:
(from the mail thread)
1. 
We ran an automated test to gather statistics on the issue:
Ran a loop of VM migration + SSH connection for an hour (after each migration, performed ssh_vm-pause-unpause-ssh_vm 10 times):
---------------------------------------------------
import time

# VM under test
vm = golden_image_vm_object_from_template_multi_fedora_os_multi_storage_scope_class
iter_pass = 0
iter_fail = 0

with open('test.log', 'w') as ff:
    # Runs until interrupted; the run described above lasted about an hour
    while True:
        ff.write("-----------------Migrate VM-----------------\n")
        migrate_vm_and_verify(vm=vm, check_ssh_connectivity=True)
        # After each migration, run the SSH / pause / unpause / SSH check 10 times
        for _ in range(10):
            try:
                validate_pause_unpause_linux_vm(vm=vm, pre_pause_pid=ping_process_in_fedora_os)
                iter_pass += 1
            except Exception:
                ff.write("FAIL!!!\n")
                iter_fail += 1
                time.sleep(1)
            # Log the running totals after every iteration
            ff.write(f"PASSED: {iter_pass}\n")
            ff.write(f"FAILED: {iter_fail}\n")

1) migrate_vm_and_verify migrates the VM, checks that the migration succeeded, and checks the SSH connection
2) validate_pause_unpause_linux_vm connects via SSH and creates a ping process, pauses and unpauses the VM, then connects via SSH again and checks the process ID

(The counters are for validate_pause_unpause_linux_vm.)
The result is:
PASSED: 396
FAILED: 14

Meaning:
validate_pause_unpause_linux_vm fails rarely, and ONLY on the first iteration after a migration (the remaining 9 succeed)


SSH failure: socket.timeout: 10.1.156.18: timeout(10.0)
2. 
Ran a loop of 400 SSH connections to the VM --> all passed.
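
For anyone who wants to repeat the connection loop without the cnv-tests suite, here is a minimal sketch. It only checks that a TCP connection to the SSH port opens within the timeout, which is weaker than a full SSH login. The helper names (probe_tcp, run_probe_loop, failure_rate) are hypothetical and are not part of any test suite mentioned in this bug; the host and port would be the node IP and the SSH nodePort of the service.

```python
import socket


def probe_tcp(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts and refused connections
        return False


def run_probe_loop(host, port, attempts=400, timeout=10.0):
    """Probe the port `attempts` times and return (passed, failed)."""
    passed = sum(probe_tcp(host, port, timeout) for _ in range(attempts))
    return passed, attempts - passed


def failure_rate(passed, failed):
    """Fraction of failed attempts; 0.0 when nothing was attempted."""
    total = passed + failed
    return failed / total if total else 0.0


# Counters reported above for the migrate + pause/unpause loop
print(f"{failure_rate(396, 14):.1%}")  # 3.4%
```

With the counters from item 1 (396 passed, 14 failed) this gives a failure rate of about 3.4% over 410 checks.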

Comment 1 Ruth Netser 2021-09-19 15:25:51 UTC
Also seen in flows that do not involve pause/unpause but instead have a successful SSH connection to the VM after migration, followed by a failed one (Unable to connect to port <ssh nodePort> on <node ip>)

Comment 2 Petr Horáček 2021-09-20 07:58:06 UTC
Israel, please share the VM definition used, the virt-handler and virt-launcher logs, how often it reproduces (in numbers), and a manual reproducer. We cannot work with the cnv-tests suite, and the provided info does not give us much insight. Thanks.

Comment 3 Petr Horáček 2022-01-20 10:48:57 UTC
Update from an offline discussion:

There is a suspicion that this may be caused by us declaring the migration as complete before the source VM becomes inactive and stops receiving traffic.

Comment 5 Petr Horáček 2023-06-29 12:33:08 UTC
Feel free to reopen if this happens again and we are able to attach required logs.

Comment 6 Petr Horáček 2023-08-03 09:27:29 UTC
I shouldn't have closed this as INSUFFICIENT_DATA; there was a mail thread where additional information was shared (Subject: "https://bugzilla.redhat.com/show_bug.cgi?id=2005693").

Moving this to NEW, so we can reevaluate this and decide on the next steps.

FWIW, there is another open live-migration-related BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2186372

Comment 7 Edward Haas 2023-08-03 10:39:59 UTC
Based on the reporter (Israel):

> we see that after migration the SSH connection to VM is not responsive for 3-5 seconds,
> but it not happening all the time (14 times it failed to connect out of 400 tries)

Things have been working like this for a long time and we have not seen users complaining about it much.
This means that it is rare enough not to cause a substantial problem.

My suggestion is to accept this as a limitation and close it with WONTFIX, reasoning that it is
rare and once it occurs it gets resolved in 2-3 seconds.
The alternative, trying to fix this, would require a large investment and may cause
side effects we cannot anticipate.

We can reconsider handling this issue once more feedback is received from the field.
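
If this is accepted as a limitation, a client or test that connects right after a migration could tolerate the short outage by retrying for a few seconds. Below is a minimal sketch of that idea; wait_until is a hypothetical helper, and the deadline and interval values are assumptions loosely based on the durations mentioned above, not measured figures.

```python
import time


def wait_until(check, deadline_s=10.0, interval_s=0.5):
    """Call check() repeatedly until it returns True or the deadline passes.

    Returns True on success, False if the deadline expired first.
    """
    end = time.monotonic() + deadline_s
    while True:
        if check():
            return True
        # Stop if another sleep would overrun the deadline
        if time.monotonic() + interval_s > end:
            return False
        time.sleep(interval_s)
```

Here check would be any callable that attempts the SSH (or TCP) connection once and returns True on success.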