596016 – Live Migration of KVMs uses wrong interface in cluster

Summary: Live Migration of KVMs uses wrong interface in cluster

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	rgmanager
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:	595992
Blocks:

Reported:	2010-05-26 06:37 UTC by Masahiro Matsuya
Modified:	2018-11-14 19:10 UTC (History)
CC List:	12 users
Fixed In Version:	rgmanager-2.0.52-6.13.el5
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	596918
Environment:
Last Closed:	2011-01-13 23:26:42 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
proposed patch (859 bytes, patch) 2010-05-26 06:48 UTC, Masahiro Matsuya	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:0134	0	normal	SHIPPED_LIVE	rgmanager bug fix and enhancement update	2011-01-12 19:20:47 UTC

Description Masahiro Matsuya 2010-05-26 06:37:04 UTC

Description of problem:

This is a two-node cluster. The nodes' hostnames are 'sk010001' and 'sk010002'. Each node has two bonded network interfaces, one for the public network and one for the private (interconnect) network. Each node's hostname resolves to its IP address on the public network.

Node1: sk010001
 bond0 (for public network) : 172.22.51.1    sk010001
 bond2 (for private network): 172.22.48.131  sk010001-hb

Node2: sk010002
 bond0 (for public network) : 172.22.51.2    sk010002
 bond2 (for private network): 172.22.48.132  sk010002-hb

They specified the migration_mapping option in cluster.conf so that migration uses the private network. During a live migration, the traffic should go over the -hb interfaces, but bond0 is used instead.

rgmanager uses the following command for a live migration from sk010001 to sk010002:

  virsh migrate --live su21k003 qemu+ssh://sk010002-hb/system

However, this is not sufficient: the guest image is still not transferred over the private network during migration. The --migrateuri option of 'virsh migrate' is needed for that, so the following command should be executed instead:

  virsh migrate --live su21k003 qemu+ssh://sk010002-hb/system tcp:172.22.48.132

The migration_mapping option in cluster.conf is only used to replace the hostname in the --desturi option. It does not affect the --migrateuri option at all.

I created a patch for this, though it has not been tested yet.
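
As a rough illustration of the behaviour requested here (not the attached patch itself), the sketch below resolves a target through a mapping string and builds a command that carries both the destination URI and the migration URI. It assumes the mapping uses a comma-separated member:target format; the function names map_target and build_migrate_cmd are hypothetical, and the command is only printed, never executed.

```shell
#!/bin/sh
# Hypothetical helper: look up a member in a "member:target,member2:target2"
# mapping string (format assumed). Prints the mapped target, or the member
# name unchanged when no entry matches.
map_target() {
    member=$1
    mapping=$2
    old_ifs=$IFS
    IFS=,
    for pair in $mapping; do
        if [ "${pair%%:*}" = "$member" ]; then
            IFS=$old_ifs
            printf '%s\n' "${pair#*:}"
            return 0
        fi
    done
    IFS=$old_ifs
    printf '%s\n' "$member"
}

# Build (but do not run) a migrate command that passes the mapped host
# both in the destination URI and in the migration URI.
build_migrate_cmd() {
    guest=$1
    target=$(map_target "$2" "$3")
    printf 'virsh migrate --live %s qemu+ssh://%s/system tcp:%s\n' \
        "$guest" "$target" "$target"
}

build_migrate_cmd su21k003 sk010002 "sk010001:sk010001-hb,sk010002:sk010002-hb"
```

With the mapping shown, the printed command targets sk010002-hb in both URIs, so the guest image transfer would follow the private network as well.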

Version-Release number of selected component (if applicable):

rgmanager-2.0.52-6.el5 

How reproducible:

Always

Steps to Reproduce:

1. Set the migration_mapping option to specify the interface on the private network.
2. Execute a migration on the cluster.
3. Watch the RX traffic on the destination node with ifconfig.
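
One possible way to make step 3 less manual is to read the kernel's per-interface RX byte counter before and after the migration window, instead of comparing ifconfig output by eye. This is only a sketch: the function names rx_bytes and rx_delta and the SYSFS_NET override are hypothetical, and it assumes the counters are exposed under /sys/class/net on the destination node.

```shell
#!/bin/sh
# Read the cumulative RX byte counter for an interface.
# SYSFS_NET is a hypothetical override so the function can be pointed
# at a different directory (for example in a test).
rx_bytes() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/rx_bytes"
}

# Print how many bytes the interface received while the given command ran.
# The command's own stdout is sent to stderr so only the delta is printed.
rx_delta() {
    iface=$1
    shift
    before=$(rx_bytes "$iface")
    "$@" >&2
    after=$(rx_bytes "$iface")
    echo $((after - before))
}

# Example, run on the destination node while the migration is triggered:
#   rx_delta bond0 sleep 60
#   rx_delta bond2 sleep 60
```

If the migration really uses the private network, the bond2 delta should be roughly the size of the guest's memory image, while bond0 should show only background traffic.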
  
Actual results:

The private network is not used for migration

Expected results:

The private network is used for migration

Additional info:

My proposed patch depends on a bug fix in the libvirt package (BZ595992). Currently, the migration fails unless a port number is specified in the --migrateuri option, as in tcp:172.22.48.132:5000. The libvirt fix allows the migration to complete properly without a port number being specified.
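
The two forms of the migration URI discussed here can be sketched as follows. The helper name migrate_uri is hypothetical; it only composes the string and makes no claim about which ports libvirt will accept.

```shell
#!/bin/sh
# Compose a tcp: migration URI; the port is appended only when one is given.
migrate_uri() {
    host=$1
    port=${2:-}
    if [ -n "$port" ]; then
        printf 'tcp:%s:%s\n' "$host" "$port"
    else
        printf 'tcp:%s\n' "$host"
    fi
}

migrate_uri 172.22.48.132 5000   # explicit port: the form that currently works
migrate_uri 172.22.48.132        # no port: the form that needs the BZ595992 fix
```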

Comment 1 Masahiro Matsuya 2010-05-26 06:48:22 UTC

Created attachment 416684
proposed patch

Comment 2 Lon Hohberger 2010-05-26 19:40:15 UTC

My understanding is the destination is taken from the URI, so either there is a DNS or /etc/hosts mixup causing sk010002-hb to be confused with sk010002, or there is a bug in libvirt requiring it to be specified twice for some reason:

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization_Guide/sect-Virtualization-KVM_live_migration-Live_KVM_migration_with_virsh.html

Comment 4 Cole Robinson 2010-05-26 21:35:29 UTC

The hostname libvirt is trying to migrate to is the output of 'virsh hostname' on the destination host, which in this case is sk010002. As was recently discussed upstream, this is a deficiency of the libvirt migration protocol.

Possible solutions:

- Have virsh always build a --migrateuri for qemu if the user doesn't specify one, using the hostname from the destination connection. Generating a port will suck though, and be much less safe than the destination host doing it. This is basically what virt-manager already does.

- Put a qemu specific hack in virDomainMigratePrepare2 in libvirt.c, which takes the URI the destination threw back at us, and splice in the hostname from the dest URI.

- Find some way to fix the remote libvirt driver so it sees the libvirt URI we are using on the source host. No idea if this is even possible.

Yeah, all these solutions suck pretty bad.
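
To make the second option more concrete, the hostname splice can be sketched in shell as plain string manipulation. This is not libvirt code: the function names uri_host and splice_host are hypothetical, the port 49152 is just an example value, and IPv6 literals are not handled.

```shell
#!/bin/sh
# Extract the host part from a connection URI such as qemu+ssh://host/system.
uri_host() {
    rest=${1#*://}                # drop the scheme
    rest=${rest%%/*}              # drop the path
    rest=${rest#*@}               # drop an optional user@ prefix
    printf '%s\n' "${rest%%:*}"   # drop an optional :port
}

# Replace the host in a "tcp:host[:port]" migration URI returned by the
# destination with the host taken from the connection URI, keeping the port.
splice_host() {
    desturi=$1
    returned=$2
    host=$(uri_host "$desturi")
    tail=${returned#tcp:}
    case $tail in
        *:*) printf 'tcp:%s:%s\n' "$host" "${tail##*:}" ;;
        *)   printf 'tcp:%s\n' "$host" ;;
    esac
}

splice_host qemu+ssh://sk010002-hb/system tcp:sk010002:49152
```

The destination still chooses the port, which addresses the safety concern raised for the first option, while the host part follows whatever name the caller used to connect.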

Comment 6 Lon Hohberger 2010-05-27 13:51:30 UTC

Ok, so it's something we need to work around in rgmanager.

Cole, do you know if this is still a problem in F12 or RHEL6 beta?

Comment 7 Cole Robinson 2010-05-27 14:17:03 UTC

It's not really solved upstream yet, so it's a problem for all libvirt versions.

Adding a way for the user to specify the --migrateuri option in rgmanager would be useful anyways: there may be times when the user explicitly does not want to use the same hostname/interface that the libvirt URI is using (which is why we have the option). Libvirt should still do the intuitive thing by default though.

Comment 8 Lon Hohberger 2010-05-27 18:37:59 UTC

Ok.

We'll work around it in rgmanager.  To that end, Masahiro's patch looks correct, though it might be slightly more correct to use the original hostname in the migrate-uri instead of the target hostname.

Comment 9 Lon Hohberger 2010-05-27 18:41:59 UTC

I.e. this would, I think, be the most correct:

   virsh migrate --live su21k003 \
         qemu+ssh://sk010002/system tcp:sk010002-hb
                    ^^^^^^^^

Masahiro's patch would provide the following, which is 100% acceptable I believe:

   virsh migrate --live su21k003 \
         qemu+ssh://sk010002-hb/system tcp:sk010002-hb
                    ^^^^^^^^^^^

Since the latter requires the fewest code changes and the work is already done, I vote that we use Masahiro's patch ;)

Comment 11 Lon Hohberger 2010-07-29 18:17:14 UTC

http://git.fedorahosted.org/git?p=cluster.git;a=commit;h=94085eace39e248040cf7069c7294178c6f944ce

Comment 14 errata-xmlrpc 2011-01-13 23:26:42 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html
