Bug 1018695
| Summary: | qemu live migration port conflicts with other users of ephemeral port(s) | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Stefan Hajnoczi <stefanha> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 6.5 | CC: | aavati, areis, barumuga, berrange, cdhouch, chorn, clalancette, dani-rh, dshetty, dyuan, gianluca.cecchi, herrold, itamar, jforbes, jherrman, juzhang, kkeithle, laine, libvirt-maint, mzhan, rbalakri, rhodain, s.kieske, veillard, virt-maint, ydu, zpeng |
| Target Milestone: | rc | Keywords: | Upstream, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | libvirt-0.10.2-37.el6 | Doc Type: | Bug Fix |
| Doc Text: | Prior to this update, migrating a virtual machine failed when the libvirtd service used a Transmission Control Protocol (TCP) port that was already in use. Now, it is possible to predefine a custom migration TCP port range in case the default port is in use. In addition, libvirtd now ensures that the port it chooses from the custom range is not used by another process. | | |
| Story Points: | --- | | |
| Clone Of: | 1018530 | | |
| Clones: | 1019237 1340368 1340479 (view as bug list) | Environment: | |
| Last Closed: | 2014-10-14 04:17:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1018178, 1018383, 1045196, 1061468, 1340368, 1340479 | | |
Description
Stefan Hajnoczi
2013-10-14 08:11:09 UTC
QEMU is not choosing the port number. It is libvirt that builds the QEMU command line, including the -incoming tcp:[::]:49152 option: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/qemu/qemu_migration.c;h=38edadb9742dd787f1cc58008f45ae1da6c032ac;hb=HEAD#l2553

The problem is that libvirt uses an internal variable ('port') to keep track of which ephemeral port to use. (It also seems like there might be a problem if more than 64 incoming guests are migrating at the same time.)

Note that it may be possible to override the incoming migration URI, including the port number, in the virsh migrate command. See the --desturi and --migrateuri options. This may be a usable temporary workaround.

As a long-term fix, libvirt and QEMU should do a real search for a free port number by binding to a port. To avoid race conditions, either QEMU needs to do this or libvirt must use file descriptor passing to hand QEMU the already-bound socket.

*** Bug 1019058 has been marked as a duplicate of this bug. ***

This upstream patch looks like it would probably solve this issue: https://www.redhat.com/archives/libvir-list/2013-October/msg00652.html

(In reply to Daniel Berrange from comment #11)
> Libvirt can *not* change the port range used for migration by default,
> because that will likely cause regressions for existing customers, due to
> the need for them to now change their firewall to open a different range of
> ports.

Odds are users running gluster will have to change their firewall settings anyway. Can libvirt selectively change migration ports only if the current ones are busy (my understanding is that this is what the current patch does, no?), or maybe only when using gluster?

In summary: you presented the problem; any ideas for the solution or workarounds?

The solution is the upstream patch to libvirt that makes the port range configurable and makes libvirt check whether the port is in use. This explicitly does not change the default range, so it is backward-compatibility safe.
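The bind-based search suggested above can be sketched in Python. This is an illustrative model, not libvirt or QEMU code: the function name and the 64-port range (49152-49215) are assumptions taken from the report. The key point is that the socket stays bound, so the port remains claimed until the descriptor is handed to the target process.

```python
import socket

def find_free_migration_port(start=49152, end=49215):
    """Claim a free TCP port by actually binding it.

    Returns (bound_socket, port). Keeping the socket open, rather than
    closing it and reporting only the number, is what avoids the race:
    no other process can bind the port while we still hold it.
    """
    for port in range(start, end + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))  # fails with EADDRINUSE if already taken
        except OSError:
            s.close()
            continue
        return s, port
    raise RuntimeError(f"no free port in {start}-{end}")
```

The bound descriptor could then be inherited by the child process, e.g. `subprocess.Popen(cmd, pass_fds=(s.fileno(),))`, with QEMU pointed at the inherited socket via its fd: incoming-migration syntax, so QEMU never has to bind the port itself.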
This is now fixed upstream by v1.1.3-188-g0196845 and v1.1.3-189-ge3ef20d:

commit 0196845d3abd0d914cf11f7ad6c19df8b47c32ed
Author: Wang Yufei <james.wangyufei>
Date: Fri Oct 11 11:27:13 2013 +0800

    qemu: Avoid assigning unavailable migration ports

    https://bugzilla.redhat.com/show_bug.cgi?id=1019053

    When we migrate vms concurrently, there's a chance that libvirtd on
    destination assigns the same port for different migrations, which
    will lead to migration failure during prepare phase on destination.
    So we use virPortAllocator here to solve the problem.

    Signed-off-by: Wang Yufei <james.wangyufei>
    Signed-off-by: Jiri Denemark <jdenemar>

commit e3ef20d7f7fee595ac4fc6094e04b7d65ee0583a
Author: Jiri Denemark <jdenemar>
Date: Tue Oct 15 15:26:52 2013 +0200

    qemu: Make migration port range configurable

    https://bugzilla.redhat.com/show_bug.cgi?id=1019053

One more patch is needed to fully support configurable migration ports:

commit d9be5a7157515eeae99379e9544c34b34c5e5198
Author: Michal Privoznik <mprivozn>
Date: Fri Oct 18 18:28:14 2013 +0200

    qemu: Fix augeas support for migration ports

    Commit e3ef20d7 allows user to configure migration ports range via
    qemu.conf. However, it forgot to update augeas definition file and
    even the test data was malicious.

    Signed-off-by: Michal Privoznik <mprivozn>

And one more upstream commit is required:

commit c92ca769af2bacefdd451802d7eb1adac5e6597c
Author: Zeng Junliang <zengjunliang>
Date: Wed Nov 6 11:36:57 2013 +0800

    qemu: clean up migration ports when migration cancelled

    If there's a migration cancelled, the bitmap of migration port
    should be cleaned up too.

    Signed-off-by: Zeng Junliang <zengjunliang>
    Signed-off-by: Jiri Denemark <jdenemar>

Any news on this, so we are able to test? Thanks, Gianluca

As you can see in the three comments above, this bug has been fixed upstream. This bug will get further updates when appropriate for the RHEL release in which this bug will be addressed.
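The combined effect of these commits can be modeled with a small allocator: a shared set tracks ports already handed out, acquisition probes that a port is genuinely bindable (so a port held by another process, such as glusterfsd on 49152, is skipped), and a cancelled migration releases its port. This is a hedged Python sketch of virPortAllocator-style behavior, with invented names, not libvirt's actual implementation.

```python
import socket

class PortAllocator:
    """Toy model of migration-port allocation, not libvirt code."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.used = set()  # ports handed out and not yet released

    def _bindable(self, port):
        # Probe the OS so ports held by other processes are skipped too.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))
            return True
        except OSError:
            return False
        finally:
            s.close()

    def acquire(self):
        for port in range(self.start, self.end + 1):
            if port not in self.used and self._bindable(port):
                self.used.add(port)
                return port
        raise RuntimeError("no migration port available")

    def release(self, port):
        # Must run on completion *and* cancellation; forgetting the
        # cancel path is the kind of leak commit c92ca769 fixes.
        self.used.discard(port)
```

Conceptually, the configurable range from commit e3ef20d7 would correspond to constructing this allocator from migration_port_min/migration_port_max in qemu.conf. Note the probe-then-close bind check still leaves a small race window with unrelated processes, which is why the report also discusses file descriptor passing.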
Hi, as this is fixed upstream and target release is 6.5 and it's marked as "urgent": when will this get backported? Thanks for your work.

Target release is not 6.5 and never was, because the bug came in too late to be incorporated in 6.5. The "Version" Bugzilla field says in what version of the product the issue was observed. If you want to have this bug fixed earlier than in the next minor update (6.6), please talk to Red Hat customer support and provide them with a business justification for putting this bug in an Extended Update Support release. Note that running a gluster node and VMs on the same host is not an officially supported configuration, so another justification will likely be requested.

I'll take that as a "no, we won't backport this for EL 6.5", as I don't run direct RH EL 6.5 but a clone you might know.

I can reproduce this with build libvirt-0.10.2-35.el6.x86_64 and get this error message:

    # virsh migrate --live rhel qemu+ssh://10.66.100.102/system --verbose
    error: internal error Process exited while reading console log output: char device redirected to /dev/pts/2
    qemu-kvm: Migrate: socket bind failed: Address already in use
    Migration failed. Exit code tcp:[::]:49152(-1), exiting.

Verified with build libvirt-0.10.2-37.el6.x86_64. Steps:

S1:
1. Prepare a gluster server and client (on the migration source and destination).
2. Mount the gluster pool on both source and destination:

       10.66.100.103:/gluster-vol1 on /var/lib/libvirt/migrate type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

3. Prepare a guest with gluster storage.
4. Check the ports on the destination:

       tcp 0 0 0.0.0.0:49152 0.0.0.0:* LISTEN 0 194240 29866/glusterfsd
       tcp 0 0 10.66.100.102:49152 10.66.100.103:1015 ESTABLISHED 0 194475 29866/glusterfsd
       tcp 0 0 10.66.100.102:49152 10.66.100.102:1021 ESTABLISHED 0 194462 29866/glusterfsd
       tcp 0 0 10.66.100.102:1016 10.66.100.102:49152 ESTABLISHED 0 194748 30008/glusterfs
       tcp 0 0 10.66.100.102:1018 10.66.100.103:49152 ESTABLISHED 0 194464 29879/glusterfs
       tcp 0 0 10.66.100.102:1015 10.66.100.103:49152 ESTABLISHED 0 194751 30008/glusterfs
       tcp 0 0 10.66.100.102:49152 10.66.100.103:1017 ESTABLISHED 0 194466 29866/glusterfsd
       tcp 0 0 10.66.100.102:49152 10.66.100.102:1016 ESTABLISHED 0 194749 29866/glusterfsd
       tcp 0 0 10.66.100.102:1021 10.66.100.102:49152 ESTABLISHED 0 194461 29879/glusterfs

5. Do the migration.
6. Repeat 20 times; no error occurred.

S2:
1. Prepare a guest the same as in S1.
2. Do a live migration, then cancel it:

       # virsh migrate rhel qemu+ssh://10.66.100.102/system --verbose
       Migration: [ 2 %]^Cerror: operation aborted: migration job: canceled by client

3. Before the migration is canceled, check the port on the destination:

       tcp 0 0 :::49153 :::* LISTEN 107 212413 931/qemu-kvm
       tcp 0 0 ::ffff:10.66.100.102:49153 ::ffff:10.66.100.103:52244 ESTABLISHED 107 212487 931/qemu-kvm

   After it is canceled, check again:

       # netstat -laputen | grep 49153

   No output. If the job is cancelled, the port is cleaned up and can be reused in the next migration.

S3:
1. Edit /etc/libvirt/qemu.conf and restart libvirtd:

       migration_port_min = 51152
       migration_port_max = 51251

2. Do the migration.
3. Check the destination port:

       # netstat -laputen | grep 51
       tcp 0 0 10.66.100.102:1015 10.66.100.103:49152 ESTABLISHED 0 194751 30008/glusterfs
       tcp 0 0 :::51152 :::* LISTEN 107 214187 1179/qemu-kvm
       tcp 0 0 ::ffff:10.66.100.102:51152 ::ffff:10.66.100.103:56922 ESTABLISHED 107 214260 1179/qemu-kvm

       # virsh migrate rhel qemu+ssh://10.66.100.102/system --verbose
       Migration: [100 %]

   Migration worked well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1374.html

*** Bug 1340368 has been marked as a duplicate of this bug. ***