Bug 1565064

Summary: Many more failures when migrating back concurrently; error: "unable to connect to server: Connection timed out"
Product: Red Hat Enterprise Linux 7 Reporter: chhu
Component: libvirt Assignee: Jiri Denemark <jdenemar>
Status: CLOSED DUPLICATE QA Contact: chhu
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.6 CC: chhu, dyuan, fjin, jdenemar, lmen, xuzhang, yafu, yalzhang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-06 13:24:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
mig_3.log none
mig_4.log none
Time out related log in libvirtd_source log none
libvirtd_log.tar.gz.0 none
libvirtd_log.tar.gz.1 none
libvirtd_log.tar.gz.2 none
libvirtd_log.tar.gz.3 none
libvirtd_log.tar.gz.4 none
libvirtd_log.tar.gz.5 none
libvirtd_log.tar.gz.6 none
libvirtd_log.tar.gz.7 none
libvirtd_log.tar.gz.8 none

Description chhu 2018-04-09 09:53:10 UTC
Description of problem:
Migrate 600 guests from source to target in 10 batches of 60 concurrent migrations, then migrate them back the same way. Many more migrations fail on the way back than on the way out, with the error "unable to connect to server ****: Connection timed out".

Version-Release number of selected component (if applicable):
libvirt-3.9.0-14.el7_5.2.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
kernel-3.10.0-861.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create 600 guests on source host
2. Configure both the source and target servers:
- libvirtd.conf
listen_tls = 1
auth_tls = "none"
max_clients = 5000
max_queued_clients = 1000
min_workers = 500
max_workers = 1000
max_client_requests = 1000
keepalive_interval = -1

- qemu.conf
lock_manager = "lockd"
max_processes = 65535
max_files = 65535
keepalive_interval = -1

- Restart libvirtd service

2. Migrate 60 guests concurrently, 10 times (600 guests in total), from source to target.
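# cat migcon60.sh
#!/bin/bash
# Migrate guest-1 .. guest-600 in 10 batches of 60 concurrent migrations.
for i in {0..9}; do
    j=$((i * 60 + 1))
    k=$((j + 60))
    while [ "$j" -lt "$k" ]; do
        virsh migrate --live --p2p --undefinesource --persistent --verbose "guest-$j" qemu+tls://****/system &
        j=$((j + 1))
    done
    # Wait for the whole batch to finish before starting the next one.
    wait
done
# sh migcon60.sh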

3. Check mig_3.log: 4 guests failed to migrate due to a timeout and are still running on the source host.
Migration: [ 96 %]error: unable to connect to server at '****:49166': Connection timed out

4. Migrate the remaining 4 guests to the target host manually.

5. Migrate 60 guests concurrently, 10 times (600 guests in total), back to the source host.

6. Check mig_4.log: 22 guests failed to migrate due to a timeout.
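
The failure counts in steps 3 and 6 can be reproduced from the logs with a small sketch like the one below. It assumes each failed migration leaves exactly one log line containing "Connection timed out", as in the error shown in step 3; the helper name count_timeouts is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count log lines that report a connection timeout.
# Assumes one such line per failed migration.
count_timeouts() {
    grep -c 'Connection timed out' "$1"
}

# Report only on logs that exist in the current directory.
for log in mig_3.log mig_4.log; do
    if [ -f "$log" ]; then
        echo "$log: $(count_timeouts "$log") timed-out migrations"
    fi
done
```

If several errors end up on the same line (for example because of carriage-return progress output), the count would be too low, so it is worth spot-checking against the raw log.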


Actual results:
In step 6: many more guests failed to migrate back concurrently, with the error "unable to connect to server ****: Connection timed out".

Expected results:
In step 6: all guests are migrated back successfully.

Additional info:
 - file: mig_3.log, mig_4.log, libvirtd logs
 - Current env: run 4 times
    - 1: from hostA to hostB, one guest failed to migrate
    - 2: migrate back, 26 guests failed to migrate
    - 3: from hostA to hostB, 4 guests failed to migrate 
    - 4: migrate back, 22 guests failed to migrate
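
To compare the four runs, the counts above can be turned into failure rates out of 600 migrations per run. This is only an illustrative sketch; the helper name fail_rate is hypothetical and the numbers are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: print failures as a percentage of total migrations.
fail_rate() {
    awk -v f="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", 100 * f / t }'
}

total=600
# Failure counts for runs 1-4 as reported above.
run=1
for failed in 1 26 4 22; do
    echo "run $run: $failed/$total failed ($(fail_rate "$failed" "$total"))"
    run=$((run + 1))
done
```

The runs toward hostB stay below 1% while the runs back are around 4%.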

Comment 2 chhu 2018-04-09 09:55:19 UTC
Created attachment 1419172 [details]
mig_3.log

Comment 3 chhu 2018-04-09 09:56:37 UTC
Created attachment 1419173 [details]
mig_4.log

Comment 4 chhu 2018-04-09 10:28:59 UTC
Created attachment 1419180 [details]
Time out related log in libvirtd_source log

Comment 5 Jiri Denemark 2018-04-09 10:31:54 UTC
So apparently the hosts are quite loaded and so is the network. Either the
network is so loaded that the packets just don't come to the destination
libvirtd and back in time or libvirtd is not able to accept the connection in
time. Could you please capture the TCP connection attempts on both hosts and
attach the pcap files and corresponding debug logs from libvirtd from both
hosts?
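
One possible way to capture the TCP connection attempts is sketched below. It is an assumption-laden example, not a prescribed procedure: it assumes the migration connections use libvirt's default QEMU migration port range (49152-49215, which includes port 49166 from the reported error) and that tcpdump is installed; the function names and the output file name are hypothetical.

```shell
#!/bin/bash
# Assumption: default migration port range; adjust if migration_port_min
# or migration_port_max were changed in qemu.conf.
PORT_MIN=49152
PORT_MAX=49215

# Build a capture filter that matches only TCP segments with the SYN flag
# set (connection attempts and their SYN-ACK replies) in that port range.
syn_filter() {
    echo "tcp portrange ${PORT_MIN}-${PORT_MAX} and tcp[tcpflags] & tcp-syn != 0"
}

# Hypothetical wrapper: write matching packets to a per-host pcap file
# until interrupted with Ctrl-C.
capture_syns() {
    tcpdump -i any -n -w "migration-syn-$(hostname -s).pcap" "$(syn_filter)"
}

# Run as root on both hosts before starting the migration script:
#   capture_syns
```

Capturing only SYN segments keeps the pcap files small even with hundreds of concurrent migrations, while still showing whether a connection attempt left the source and when it reached the destination.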

Comment 6 chhu 2018-04-12 01:46:24 UTC
Created attachment 1420615 [details]
libvirtd_log.tar.gz.0

Comment 7 chhu 2018-04-12 02:04:12 UTC
Created attachment 1420620 [details]
libvirtd_log.tar.gz.1

Comment 8 chhu 2018-04-12 02:05:52 UTC
Created attachment 1420621 [details]
libvirtd_log.tar.gz.2

Comment 9 chhu 2018-04-12 02:07:20 UTC
Created attachment 1420622 [details]
libvirtd_log.tar.gz.3

Comment 10 chhu 2018-04-12 02:09:25 UTC
Created attachment 1420623 [details]
libvirtd_log.tar.gz.4

Comment 11 chhu 2018-04-12 02:10:57 UTC
Created attachment 1420624 [details]
libvirtd_log.tar.gz.5

Comment 12 chhu 2018-04-12 02:12:24 UTC
Created attachment 1420625 [details]
libvirtd_log.tar.gz.6

Comment 13 chhu 2018-04-12 02:13:46 UTC
Created attachment 1420626 [details]
libvirtd_log.tar.gz.7

Comment 14 chhu 2018-04-12 02:14:59 UTC
Created attachment 1420628 [details]
libvirtd_log.tar.gz.8

Comment 15 chhu 2018-04-12 02:17:23 UTC
Hi, Jiri

I sent you an email about the additional libvirtd logs from the environment. I'll try to capture the TCP connection attempts on both hosts and attach the pcap files later. Thank you!


Regards,
chhu

Comment 16 Jiri Denemark 2018-09-06 13:24:58 UTC
We're still waiting for the pcap files.

Anyway, this seems to be pretty similar to bug 1614182, where some packets
were delayed a lot. If it's the packet which is initiating a new connection,
the connection times out.

*** This bug has been marked as a duplicate of bug 1614182 ***