Bug 596016

Summary: Live Migration of KVMs uses wrong interface in cluster
Product: Red Hat Enterprise Linux 5 Reporter: Masahiro Matsuya <mmatsuya>
Component: rgmanager Assignee: Lon Hohberger <lhh>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: urgent    
Version: 5.5 CC: clalance, cluster-maint, crobinso, djansa, edamato, liko, robin, stanislav.polasek, tao, tdunnon, virt-maint, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: rgmanager-2.0.52-6.13.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 596918 (view as bug list) Environment:
Last Closed: 2011-01-13 23:26:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 595992    
Bug Blocks:    
Attachments:
Description Flags
proposed patch none

Description Masahiro Matsuya 2010-05-26 06:37:04 UTC
Description of problem:

There is a two-node cluster. The hostnames of the nodes are 'sk010001' and 'sk010002'. Each node has two bonded network interfaces, one for the public network and one for the private (interconnect) network. Each node's hostname is the name assigned to its IP address on the public network.

Node1: sk010001
 bond0 (for public network) : 172.22.51.1    sk010001
 bond2 (for private network): 172.22.48.131  sk010001-hb

Node2: sk010002
 bond0 (for public network) : 172.22.51.2    sk010002
 bond2 (for private network): 172.22.48.132  sk010002-hb

The migration_mapping option is set in cluster.conf so that migration uses the private network. During a live migration the traffic should go over the -hb interfaces, but bond0 is used instead.

rgmanager uses the following command for a live migration from sk010001 to sk010002:

  virsh migrate --live su21k003 qemu+ssh://sk010002-hb/system

But this is not enough. The private network is not used for transferring the guest image during migration; the --migrateuri option of 'virsh migrate' is needed for that. So the following command should be executed instead:

  virsh migrate --live su21k003 qemu+ssh://sk010002-hb/system tcp:172.22.48.132

The migration_mapping option in cluster.conf is only used to replace the hostname in the --desturi option. It does not affect the --migrateuri option at all.
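
To illustrate the intended behavior, here is a minimal sketch of applying the mapping to both URIs. It is not the code from the attached patch. The function names map_target and build_migrate_cmd are hypothetical, and the mapping string is assumed to be a comma-separated list of member:target pairs. The sketch only prints the command; it does not run virsh.

```shell
#!/bin/bash
# Hypothetical helper: look up a cluster member in a mapping string made of
# comma-separated "member:target" pairs (this format is an assumption).
map_target() {
    local member=$1 mapping=$2 pair
    local IFS=,
    for pair in $mapping; do
        if [ "${pair%%:*}" = "$member" ]; then
            echo "${pair#*:}"
            return 0
        fi
    done
    # No entry for this member: fall back to the member name itself.
    echo "$member"
}

# Hypothetical helper: print (but do not run) a migrate command in which the
# mapped host is used for both the connection URI and the migration URI.
build_migrate_cmd() {
    local guest=$1 member=$2 mapping=$3 target
    target=$(map_target "$member" "$mapping")
    echo "virsh migrate --live $guest qemu+ssh://$target/system tcp:$target"
}

build_migrate_cmd su21k003 sk010002 \
    "sk010001:sk010001-hb,sk010002:sk010002-hb"
```

With the mapping shown, the printed command uses sk010002-hb in both URIs, so the guest image transfer would follow the private network as well as the control connection.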

I created a patch for this, though it has not been tested yet.

Version-Release number of selected component (if applicable):

rgmanager-2.0.52-6.el5 

How reproducible:

Always

Steps to Reproduce:

1. set migration_mapping option to specify the interface on the private network
2. execute migration on cluster
3. watch the RX traffic on the destination node with ifconfig
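
As an alternative to reading ifconfig output by eye in step 3, the received byte counters can be sampled before and after the migration. This is a rough sketch that assumes the kernel exposes per-interface counters under /sys/class/net/<interface>/statistics/rx_bytes. The helper names rx_bytes and rx_delta and the SYSFS_NET override are hypothetical and exist only for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the received-byte counter of an interface.
# SYSFS_NET can be overridden (for example in tests); the default path is
# an assumption about where the kernel exposes interface statistics.
rx_bytes() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/statistics/rx_bytes"
}

# Hypothetical helper: bytes received between two samples.
rx_delta() {
    echo $(( $2 - $1 ))
}

# Example use on the destination node (interface names from this report):
#   before_pub=$(rx_bytes bond0); before_priv=$(rx_bytes bond2)
#   ... trigger the migration from the source node ...
#   echo "bond0: $(rx_delta "$before_pub" "$(rx_bytes bond0)") bytes"
#   echo "bond2: $(rx_delta "$before_priv" "$(rx_bytes bond2)") bytes"
```

If the migration used the private network, the bond2 delta should be roughly the size of the guest's memory image, while bond0 should show only background traffic.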
  
Actual results:

The private network is not used for migration

Expected results:

The private network is used for migration

Additional info:

My proposed patch depends on a libvirt bug fix (BZ595992). Currently, the migration fails unless a port number is given in the --migrateuri option, as in tcp:172.22.48.132:5000. The libvirt fix allows the migration to complete without an explicit port number.

Comment 1 Masahiro Matsuya 2010-05-26 06:48:22 UTC
Created attachment 416684 [details]
proposed patch

Comment 2 Lon Hohberger 2010-05-26 19:40:15 UTC
My understanding is the destination is taken from the URI, so either there is a DNS or /etc/hosts mixup causing sk010002-hb to be confused with sk010002, or there is a bug in libvirt requiring it to be specified twice for some reason:

http://www.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Virtualization_Guide/sect-Virtualization-KVM_live_migration-Live_KVM_migration_with_virsh.html

Comment 4 Cole Robinson 2010-05-26 21:35:29 UTC
The hostname libvirt is trying to migrate to is the output of 'virsh hostname' on the destination host, which in this case is sk010002. As was recently discussed upstream, this is a deficiency of the libvirt migration protocol.

Possible solutions:

- Have virsh always build a --migrateuri for qemu if the user doesn't specify one, using the hostname from the destination connection. Generating a port will suck though, and be much less safe than the destination host doing it. This is basically what virt-manager already does.

- Put a qemu-specific hack in virDomainMigratePrepare2 in libvirt.c, which takes the URI the destination threw back at us and splices in the hostname from the dest URI.

- Find some way to fix the remote libvirt driver so it sees the libvirt URI we are using on the source host. No idea if this is even possible.

Yeah, all these solutions suck pretty bad.
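
For illustration only, the hostname splice described in the second option might look like the following at the string level. This is a shell sketch, not libvirt code. It assumes the destination hands back a migration URI of the form tcp:host:port; the function names uri_host and splice_host and the port 49152 are made up for the example, and IPv6 literals are not handled.

```shell
#!/bin/bash
# Hypothetical helper: extract the host part of a connection URI such as
# qemu+ssh://user@host:port/system.
uri_host() {
    local rest=${1#*://}   # drop the scheme
    rest=${rest%%/*}       # drop the path
    rest=${rest#*@}        # drop an optional user@ prefix
    echo "${rest%%:*}"     # drop an optional :port suffix
}

# Hypothetical helper: replace the host in tcp:host[:port], keeping the port
# that the destination chose.
splice_host() {
    local new_host=$2 tail=${1#tcp:}
    case $tail in
        *:*) echo "tcp:$new_host:${tail##*:}" ;;
        *)   echo "tcp:$new_host" ;;
    esac
}

# The destination reported its own hostname; splice in the one we connected to.
splice_host tcp:sk010002:49152 "$(uri_host qemu+ssh://sk010002-hb/system)"
```

The point of the splice is that the destination still picks the port, which avoids the unsafe port generation mentioned in the first option.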

Comment 6 Lon Hohberger 2010-05-27 13:51:30 UTC
Ok, so it's something we need to work around in rgmanager.

Cole, do you know if this is still a problem in F12 or RHEL6 beta?

Comment 7 Cole Robinson 2010-05-27 14:17:03 UTC
It's not really solved upstream yet, so it's a problem for all libvirt versions.

Adding a way for the user to specify the --migrateuri option in rgmanager would be useful anyways: there may be times when the user explicitly does not want to use the same hostname/interface that the libvirt URI is using (which is why we have the option). Libvirt should still do the intuitive thing by default though.

Comment 8 Lon Hohberger 2010-05-27 18:37:59 UTC
Ok.

We'll work around it in rgmanager.  To that end, Masahiro's patch looks correct, though it might be slightly more correct to use the original hostname in the migrate-uri instead of the target hostname.

Comment 9 Lon Hohberger 2010-05-27 18:41:59 UTC
I.e. this would, I think, be the most correct:

   virsh migrate --live su21k003 \
         qemu+ssh://sk010002/system tcp:sk010002-hb
                    ^^^^^^^^

Masahiro's patch would provide the following, which is 100% acceptable I believe:

   virsh migrate --live su21k003 \
         qemu+ssh://sk010002-hb/system tcp:sk010002-hb
                    ^^^^^^^^^^^

Since the latter requires the fewest code changes and the work is done, I vote that we use Masahiro's patch ;)

Comment 14 errata-xmlrpc 2011-01-13 23:26:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0134.html