Bug 1081461

Summary: Dropped guest network connection during migration (before it finished)
Product: Red Hat Enterprise Linux 7 Reporter: Dr. David Alan Gilbert <dgilbert>
Component: libvirt Assignee: Laine Stump <laine>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0 CC: amit.shah, dgilbert, dyuan, hhuang, huding, jsuchane, juzhang, kchamart, knoel, laine, mjrosato, mzhan, qzhang, rbalakri, shyu, virt-maint, ydu, zpeng
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-1.2.8-11.el7 Doc Type: Bug Fix
Doc Text:
In previous versions of libvirt, when migrating a guest that used macvtap ("direct") network interfaces, network connections to the guest would often be reset as soon as the migration started, potentially leaving the guest unreachable until migration was finished. This was due to the destination host beginning to send out packets with the guest's MAC address while the guest was still running on the source host. libvirt now ensures that the destination host keeps the guest's macvtap devices inactive until the guest has been stopped on the source host, thus eliminating any interruptions in guest connectivity.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-03-05 07:33:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1099210    

Description Dr. David Alan Gilbert 2014-03-27 11:52:00 UTC
Description of problem:
My ssh to my guest *sometimes* drops during (but before the end of) migration.
It reconnects fine.

Version-Release number of selected component (if applicable):
qemu-kvm-1.5.3-57.el7.x86_64


How reproducible:
~20%

Steps to Reproduce:
1. Take two RHEL7 boxes on the same switch
2. Start a guest (again running RHEL7) on one host, configured with macvtap
3. ssh (IPv4) into the guest and start something that hammers memory (e.g. specjbb - so that the migration won't complete)
4. Start a migration to another host with something like:
   migrate --live --compressed --verbose --desturi qemu+ssh://b/system --domain rhel7b1-v3

Actual results:
Sometimes the ssh connection to the guest drops; I can re-ssh in fine.

Expected results:
The ssh connection carries on fine.

Additional info:
I'm sshing into the guest from far-away on a VPN, so definitely not the same subnet.
[mst/laine suggest this is a known macvtap-ism]

<domain type='kvm'>
  <name>rhel7b1-v3</name>
  <uuid>7340a501-4171-4230-8000-48c9da15d4cd</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>6</vcpu>
  <os>
    <type arch='x86_64' machine='pc-i440fx-rhel7.0.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/home/vms/rhel7b1-v3.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <interface type='direct'>
      <mac address='52:54:00:70:10:01'/>
      <source dev='eno1' mode='bridge'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='spicevmc'>
      <target type='virtio' name='com.redhat.spice.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <graphics type='spice' autoport='yes'/>
    <sound model='ich6'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </sound>
    <video>
      <model type='qxl' ram='65536' vram='65536' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </memballoon>
  </devices>
</domain>

Comment 2 Laine Stump 2014-04-01 18:46:42 UTC
Actually this will need to be fixed in libvirt, by creating the macvtap devices "down" and not ifconfiging them up until the destination is ready to be started.

I'm reassigning it to libvirt.

Comment 3 Dr. David Alan Gilbert 2014-04-01 18:49:13 UTC
(In reply to Laine Stump from comment #2)
> Actually this will need to be fixed in libvirt, by creating the macvtap
> devices "down" and not ifconfiging them up until the destination is ready to
> be started.
> 
> I'm reassigning it to libvirt.

That's a shame, since it can't be good for the downtime.

Comment 4 Laine Stump 2014-04-02 11:12:29 UTC
(In reply to Dr. David Alan Gilbert from comment #3)
> (In reply to Laine Stump from comment #2)
> > Actually this will need to be fixed in libvirt, by creating the macvtap
> > devices "down" and not ifconfiging them up until the destination is ready to
> > be started.
> > 
> > I'm reassigning it to libvirt.
> 
> That's a shame, since it can't be good for the downtime.

I think you may have misunderstood what I'm proposing - the macvtap device on the *destination host* (not the guest) will be created and left in the "down" state, while the macvtap device on the source host (and the network device in the guest) will remain "up". At the point where the destination host is ready to start up the guest, we shut off the guest in the source host, remove the macvtap interface from the source host, set the macvtap device on the destination host "up", then start the guest on the destination. The time required to up an interface that is already created is negligible in relation to the amount of time it takes to coordinate stopping the guest on source and starting it on destination.
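
As a rough illustration of the ordering proposed above, the following Python sketch models the two hosts and records which host's macvtap device is "up" with the guest's MAC address after each step. It is only a toy model: the `Host` class and the `claimants` and `migrate` functions are hypothetical names invented for this sketch, and nothing here corresponds to actual libvirt code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    name: str
    guest_running: bool = False
    macvtap_mac: Optional[str] = None  # None means no macvtap device exists
    macvtap_up: bool = False


def claimants(hosts: List[Host], mac: str) -> List[str]:
    """Names of hosts whose macvtap device is up and carries `mac`."""
    return [h.name for h in hosts if h.macvtap_up and h.macvtap_mac == mac]


def migrate(src: Host, dst: Host, mac: str) -> List[List[str]]:
    """Apply the proposed ordering; return who claims `mac` after each step."""
    hosts = [src, dst]
    trace = []

    # 1. Destination: create the macvtap device but leave it "down".
    dst.macvtap_mac, dst.macvtap_up = mac, False
    trace.append(claimants(hosts, mac))

    # (guest memory is transferred here while the source guest keeps running)

    # 2. Source: shut off the guest.
    src.guest_running = False
    trace.append(claimants(hosts, mac))

    # 3. Source: remove the macvtap interface.
    src.macvtap_mac, src.macvtap_up = None, False
    trace.append(claimants(hosts, mac))

    # 4. Destination: set the macvtap device "up".
    dst.macvtap_up = True
    trace.append(claimants(hosts, mac))

    # 5. Destination: start the guest.
    dst.guest_running = True
    trace.append(claimants(hosts, mac))
    return trace


MAC = "52:54:00:70:10:01"
source = Host("source", guest_running=True, macvtap_mac=MAC, macvtap_up=True)
dest = Host("dest")
trace = migrate(source, dest, MAC)
for step, owners in enumerate(trace, start=1):
    print(step, owners)
```

In this model the MAC address is never claimed by both hosts at once; there is only a short window (after step 3) in which nobody claims it.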

Comment 5 Dr. David Alan Gilbert 2014-04-02 16:44:35 UTC
(In reply to Laine Stump from comment #4)
> (In reply to Dr. David Alan Gilbert from comment #3)
> > (In reply to Laine Stump from comment #2)
> > > Actually this will need to be fixed in libvirt, by creating the macvtap
> > > devices "down" and not ifconfiging them up until the destination is ready to
> > > be started.
> > > 
> > > I'm reassigning it to libvirt.
> > 
> > That's a shame, since it can't be good for the downtime.
> 
> I think you may have misunderstood what I'm proposing - the macvtap device
> on the *destination host* (not the guest) will be created and left in the
> "down" state, while the macvtap device on the source host (and the network
> device in the guest) will remain "up". At the point where the destination
> host is ready to start up the guest, we shut off the guest in the source
> host, remove the macvtap interface from the source host, set the macvtap
> device on the destination host "up", then start the guest on the
> destination. The time required to up an interface that is already created is
> negligible in relation to the amount of time it takes to coordinate stopping
> the guest on source and starting it on destination.

That's about what I expected.

I don't know my networking well enough, but suggestions of things to check:
   1) What happens in that short gap when we have neither macvtap interface up, does anything anywhere start rejecting packets?
   2) Doesn't the qemu on the destination do some form of announce to say 'hey that MAC is over here'? - I guess you would have to make sure that's done after the point you bring its macvtap up.

Dave

Comment 6 Laine Stump 2014-04-03 10:01:55 UTC
(In reply to Dr. David Alan Gilbert from comment #5)

> I don't know my networking well enough, but suggestions of things to check:
>    1) What happens in that short gap when we have neither macvtap interface
> up, does anything anywhere start rejecting packets?

Since nothing is claiming that MAC address, there is nobody to reject any packets. They will just silently disappear and be re-transmitted by the sender (if the protocol in use provides for that).

This is no different from the current behavior with guest interfaces backed by a tap interface, and migration with tap-backed network interfaces works without problems: since the MAC address on the guest is different from the MAC address on the tap interface, between the time the source guest is stopped and the destination guest is started there is nobody to accept or reject any packets with that MAC address. (The reason for the different behavior with macvtap is that, unlike a tap interface, a macvtap interface has the same MAC address as the guest's own interface.)

>    2) Doesn't the qemu on the destination do some form of announce to say
> 'hey that MAC is over here'? - I guess you would have to make sure that's
> done after the point you bring it's macvtap up.

I'm not aware of any explicit action to do that, and don't think it needs to be done. Just like when you unplug a physical host from one port of a switch and plug it into another, the switch will correct as soon as it sees a packet with that MAC as source coming from the other port, or (probably) after some short timeout of the entry in its MAC table.
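
A toy model may make the switch behavior described here more concrete. The Python sketch below is an illustration only: the `LearningSwitch` class and the port numbers are hypothetical, entry timeouts are not modelled, and real switches differ in detail. It shows that the MAC table entry follows whichever port last sent a frame with the guest's MAC as source, which is also why a frame sent too early from the destination host can pull traffic away from a guest that is still running on the source.

```python
from typing import Dict, Optional


class LearningSwitch:
    """Toy MAC-learning switch: remembers the last port each source MAC was seen on."""

    def __init__(self) -> None:
        self.mac_table: Dict[str, int] = {}

    def receive(self, in_port: int, src_mac: str) -> None:
        # Learning step: any incoming frame re-points src_mac at its ingress port.
        self.mac_table[src_mac] = in_port

    def out_port(self, dst_mac: str) -> Optional[int]:
        # None stands for "unknown destination, flood to all ports".
        return self.mac_table.get(dst_mac)


GUEST_MAC = "52:54:00:70:10:01"
SRC_HOST_PORT, DST_HOST_PORT = 1, 2  # hypothetical switch port numbers

switch = LearningSwitch()

# The guest sends traffic while running on the source host.
switch.receive(SRC_HOST_PORT, GUEST_MAC)
before = switch.out_port(GUEST_MAC)

# The first frame carrying the guest's MAC arrives from the destination host.
switch.receive(DST_HOST_PORT, GUEST_MAC)
after = switch.out_port(GUEST_MAC)

print(before, after)
```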

Comment 7 Dr. David Alan Gilbert 2014-04-03 10:15:25 UTC
(In reply to Laine Stump from comment #6)
> (In reply to Dr. David Alan Gilbert from comment #5)
> 
> > I don't know my networking well enough, but suggestions of things to check:
> >    1) What happens in that short gap when we have neither macvtap interface
> > up, does anything anywhere start rejecting packets?
> 
> Since nothing is claiming that MAC address, there is nobody to reject any
> packets. They will just silently disappear and be re-transmitted by the
> sender (if the protocol in use provides for that.
> 
> This is no different from current behavior with guest interfaces backed by a
> tap interface, and migration with tap-backed network interfaces works
> without problem - since the MAC address on the guest is different from the
> MAC address on the tap interface, between the time the source guest is
> stopped and the destination guest is started, there is nobody to accept or
> reject any packets with that mac address. (the reason for the different
> behavior with macvtap is that, unlike a tap interface, a macvtap interface
> has the same MAC address as the guest's own interface)

OK.

> >    2) Doesn't the qemu on the destination do some form of announce to say
> > 'hey that MAC is over here'? - I guess you would have to make sure that's
> > done after the point you bring it's macvtap up.
> 
> I'm not aware of any explicit action to do that, and don't think it needs to
> be done. Just like when you unplug a physical host from one port of a switch
> and plug it into another, the switch will correct as soon as it sees a
> packet with that MAC as source coming from the other port, or (probably)
> after some short timeout of the entry in its MAC table.

Hmm, I've found the code I was thinking of; see qemu_announce_self/announce_self_create in savevm.c (I think that's a gratuitous RARP?)
I think the intention is to avoid that short timeout, and to allow inbound traffic to arrive at the destination immediately even if the destination hasn't sent any other outgoing packets.
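
For readers unfamiliar with this kind of announcement, here is a minimal Python sketch of what a broadcast RARP frame for the guest's MAC could look like. It follows the generic RARP layout from RFC 903 and is not copied from savevm.c: the function name `build_rarp_announce`, the zeroed IP address fields, and the padding to 60 bytes are assumptions of this sketch.

```python
import struct

ETH_P_RARP = 0x8035       # EtherType assigned to RARP
ARP_HTYPE_ETH = 1         # hardware type: Ethernet
ARP_PTYPE_IP = 0x0800     # protocol type: IPv4
RARP_REQUEST_REVERSE = 3  # RFC 903 "request reverse" opcode


def build_rarp_announce(mac: str) -> bytes:
    """Build a broadcast RARP frame announcing `mac` (hypothetical helper)."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("expected a 6-byte MAC address")

    # Ethernet header: broadcast destination, guest MAC as source.
    eth = b"\xff" * 6 + hw + struct.pack("!H", ETH_P_RARP)

    # RARP header: hardware type, protocol type, address lengths, opcode.
    rarp = struct.pack("!HHBBH", ARP_HTYPE_ETH, ARP_PTYPE_IP, 6, 4,
                       RARP_REQUEST_REVERSE)
    # Sender and target hardware address are both the guest MAC;
    # the protocol (IP) addresses are left as zeros in this sketch.
    rarp += hw + b"\x00" * 4 + hw + b"\x00" * 4

    # Pad to 60 bytes, the minimum Ethernet frame size without the FCS.
    return (eth + rarp).ljust(60, b"\x00")


frame = build_rarp_announce("52:54:00:70:10:01")
print(len(frame), frame[:14].hex())
```

Because the frame carries the guest's MAC as its Ethernet source address, any switch that sees it can update its MAC table right away, without waiting for the guest itself to transmit.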

Comment 8 Laine Stump 2014-04-03 10:53:37 UTC
Interesting to know. Actually that call isn't ideally placed in the case (which is our case) where the guest is created in a paused state, then the CPUs are manually started later. I haven't looked into the exact timing, but this could lead to the destination sending the RARP before the management (libvirt) has even stopped the source guest, much less started the destination guest.

I don't know if this causes any trouble in practice, but it could be that we would lose slightly fewer packets if qemu delayed sending these RARPs until just as the CPUs are unpaused.

Comment 9 Dr. David Alan Gilbert 2014-06-23 17:27:08 UTC
Is there a reason we can't get qemu to run one of the network scripts when the interface needs to be up as opposed to when the device is created?
It looks like at the moment on an incoming migration the script is run right at the start before the migration has actually begun.

I have another reason to ask: for pre+postcopy migration the switchover happens at some point during the migration process; if qemu can do that 'up' itself, then it should all just work nicely.

Dave

Comment 10 Laine Stump 2014-06-24 08:05:04 UTC
Someone upstream submitted a libvirt patch that simply delays ifup'ing the device until just before the guest CPU is started; I believe they claimed to have tested it and found it to work. It had a bit of hard coded stuff that would break other scenarios though, which I pointed out in review. I'm now waiting for a V3 of that patch.

Comment 11 Dr. David Alan Gilbert 2014-08-28 14:14:02 UTC
There are at least two sets of fixes that might be relevant here:

  Matthew Rosato's 'Bring netdevs online later' fixes the problem described here, where the two netdevs are up at the same time.

The other possibility is zhanghailiang's 'Forbid dealing with packets when VM is not running' (on qemu-devel) that is reported to also have had the symptom of losing networking due to inconsistent state.

Comment 12 Laine Stump 2014-12-11 15:38:15 UTC
This patch was pushed upstream:

commit 82977058f5b1d143a355079900029e9cbfee2fe4
Author: Matthew Rosato <mjrosato.ibm.com>
Date:   Tue Sep 16 16:50:53 2014 -0400

    network: Bring netdevs online later
    
    Currently, MAC registration occurs during device creation, which is
    early enough that, during live migration, you end up with duplicate
    MAC addresses on still-running source and target devices, even though
    the target device isn't actually being used yet.
    This patch proposes to defer MAC registration until right before
    the guest can actually use the device -- In other words, right
    before starting guest CPUs.

It also requires this patch as a prerequisite when backporting:

commit 7199d2c523feb71be44836e3e3a609b631a26947
Author: Matthew Rosato <mjrosato.ibm.com>
Date:   Wed Aug 27 10:34:13 2014 -0400

    util: Introduce flags field for macvtap creation

Comment 13 Laine Stump 2014-12-14 03:25:33 UTC
Two additional patches needed:

commit 879c13d6cc92b6c3d97168e822201e5d00c4a1bc
Author: Laine Stump <laine>
Date:   Thu Dec 11 14:49:13 2014 -0500

    qemu: always call qemuInterfaceStartDevices() when starting CPUs
    
commit c5a54917d5ae97653d29dbfe4995f2efcf5717d6
Author: Laine Stump <laine>
Date:   Thu Dec 11 15:11:10 2014 -0500

    qemu: add a qemuInterfaceStopDevices(), called when guest CPUs stop
    
So the full list, in order, is:

7199d2c523feb71be44836e3e3a609b631a26947
82977058f5b1d143a355079900029e9cbfee2fe4
879c13d6cc92b6c3d97168e822201e5d00c4a1bc
c5a54917d5ae97653d29dbfe4995f2efcf5717d6

Comment 16 zhe peng 2014-12-18 07:04:49 UTC
I can reproduce this with libvirt-1.2.8-10.el7.

Verified with build libvirt-1.2.8-11.el7.x86_64.

Steps:
1. Take two RHEL7 boxes on the same switch
2. Start a rhel7 guest on one host with xml:
.....
<interface type='direct'>
  <mac address='52:54:00:70:10:01'/>
  <source dev='eno1' mode='bridge'/>
  <target dev='macvtap0'/>
  <model type='virtio'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
</interface>
.....
3. ssh into the guest and start something that hammers memory (e.g. specjbb - so that the migration won't complete)

My test configuration for specjbb:
In the SPECjbb.props file: 

input.jvm_instances=1

input.starting_number_warehouses=0

input.increment_number_warehouses=1

input.ending_number_warehouses=4

input.sequence_of_number_of_warehouses=1 2 3 4

...

run.sh:
# cat run.sh 

##
## This is an example of what a run sh script might look like
##

date
echo $CLASSPATH
CLASSPATH=./jbb.jar:./check.jar:$CLASSPATH
echo $CLASSPATH
export CLASSPATH

java -fullversion

java  -Xms8192k -Xmx512m -Xss256k -XX:+UseParallelOldGC -XX:+AggressiveOpts -XX:+UseBiasedLocking  -XX:SurvivorRatio=24 spec.jbb.JBBmain -propfile SPECjbb.props

date 

4. Do a live migration on the source host:
# virsh migrate rhel7 --live --compressed qemu+ssh://target_IP/system --verbose

5. The ssh connection is no longer lost during migration.
   I tried 20 times and got the same result each time.
So moving to verified.

Comment 18 errata-xmlrpc 2015-03-05 07:33:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-0323.html