Bug 1410076

Summary: [SR-IOV] - in-guest bond with virtio+passthrough slaves loses connectivity after hotunplug/hotplug of the passthrough slave
Product: [oVirt] vdsm
Reporter: Michael Burman <mburman>
Component: Core
Assignee: Leon Goldberg <lgoldber>
Status: CLOSED CURRENTRELEASE
QA Contact: Michael Burman <mburman>
Severity: high
Docs Contact:
Priority: medium
Version: 4.19.1
CC: bugs, danken, gklein, lgoldber, mburman, myakove, yfu
Target Milestone: ovirt-4.1.1
Flags: lgoldber: needinfo-
       rule-engine: ovirt-4.1+
       rule-engine: blocker+
Target Release: 4.19.5
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-21 09:51:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Network
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1341248
Bug Blocks: 868811

Description Michael Burman 2017-01-04 12:07:12 UTC
Description of problem:
[SR-IOV] - Connectivity lost when migrating guest with active-backup bond over virtIo+passthrough interfaces.

As part of the SR-IOV migration feature - http://www.ovirt.org/develop/release-management/features/network/liveMigrationSupportForSRIOV/ - we create a bond in the guest OS, using nmcli, between the virtIO and passthrough vNICs.

During the migration the virtIO vNIC should become the active slave in order to keep connectivity. Once the passthrough vNIC is plugged back we have no connection and we can't get an IP again.


Version-Release number of selected component (if applicable):
vdsm-4.19.1-1.el7ev.x86_64

Steps to Reproduce:
1. Create bond mode 1 using nmcli:
 [1] - Edit the virtIO vNIC profile (eth0) to use the 'No Filter' network filter
 [2] - Run a VM with 2 NICs: virtIO and SR-IOV (passthrough)
 [3] - Remove the NM_CONTROLLED=no line from the ifcfg-eth0 file that was generated via the dracut installation
 [4] - nmcli con reload eth0
 [5] - nmcli connection add type ethernet con-name ens3 ifname ens3
 [6] - nmcli connection add type bond con-name bond0 ifname bond0 mode active-backup primary ens3
 [7] - nmcli connection modify id bond0 ipv4.method auto ipv6.method ignore
 [8] - nmcli connection modify id ens3 ipv4.method disabled ipv6.method ignore
 [9] - nmcli connection modify id eth0 ipv4.method disabled ipv6.method ignore
 [10] - nmcli connection modify id ens3 connection.slave-type bond connection.master bond0 connection.autoconnect yes
 [11] - nmcli connection modify id eth0 connection.slave-type bond connection.master bond0 connection.autoconnect yes
 [12] - nmcli connection down id ens3; nmcli con up id ens3; nmcli con down id eth0; nmcli con up id eth0; nmcli con up bond0

2. Ping the guest IP
3. Migrate the VM
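
(Not part of the original steps - a minimal sanity check, assuming the bond is named bond0 as configured above: while step 3 runs, watch the bond's active slave from inside the guest and keep the ping from step 2 running from another machine.)

 # inside the guest - show which slave is currently carrying traffic
 watch -n1 'grep "Currently Active Slave" /proc/net/bonding/bond0'

 # from another machine - confirm connectivity is kept throughout the migration
 ping <guest IP>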

Actual results:
No connection to the VM

Expected results:
Connectivity to the VM is kept during the migration (the virtIO slave takes over) and is restored once the passthrough vNIC is plugged back.

Comment 1 Dan Kenigsberg 2017-01-04 14:26:44 UTC
I suspect that this is the same issue we see with hotunplug/hotplug (regardless of migration), and that it shows up - sometimes - even when NM is stopped and masked. right?

Comment 2 Michael Burman 2017-01-04 14:31:59 UTC
Correct.

Comment 3 Meni Yakove 2017-01-11 07:02:27 UTC
This only happens when the hosts have more than one VF enabled, and it is not related to plugging/unplugging the vNIC.

It seems that if the vNIC gets a different VF (slot) we lose connectivity.

BOND created with: (Ping)
<interface type='hostdev'>
	  <mac address='00:1a:4a:16:20:76'/>
	  <driver name='vfio'/>
	  <source>
		<address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x2'/>
	  </source>
	  <alias name='hostdev0'/>
	  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
	</interface>


After migration: (No ping)
<interface type='hostdev'>
	  <mac address='00:1a:4a:16:20:76'/>
	  <driver name='vfio'/>
	  <source>
		<address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x0'/>
	  </source>
	  <alias name='hostdev0'/>
	  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
	</interface>


After migrate back to original host: (No ping)
<interface type='hostdev'>
	  <mac address='00:1a:4a:16:20:76'/>
	  <driver name='vfio'/>
	  <source>
		<address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x0'/>
	  </source>
	  <alias name='hostdev0'/>
	  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
	</interface>

Comment 4 Dan Kenigsberg 2017-01-11 08:00:13 UTC
(In reply to Meni Yakove from comment #3)
> This only happens when the hosts have more than one VFs enabled and it's not
> related to plug/unplug vNIC.

Not sure I understand. During migration we do plug/unplug, and that's the only way to change the VF connected to the guest.

> 
> It seems that if the vNIC get different VF (slot) we lost connection.

Could it be that what kills traffic is having a stale VF on the host with the same MAC as the one in the guest?

Comment 5 Meni Yakove 2017-01-15 11:50:41 UTC
1.  Create 3 VMs 
2.  Enable 2 VFs on the host
3.  Add vNIC (passthrough) to VM 1 and start the VM
4.  Make sure that VM 1 got an IP and has connectivity
5.  Check which VF the VM gets (a host-side view of the VFs is sketched after these steps):
    virsh -r dumpxml <vm-name> | grep -A8 "<interface type='hostdev'>"
    <source>
       <address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x2'/>
    </source>
    VM 1 got function='0x2'
6.  Unplug the vNIC from VM 1
7.  Add vNIC (passthrough) to VM 2 and start the VM
8.  Add vNIC (passthrough) to VM 3 and start the VM
9.  Check which VM (2 or 3) got the same source VF that was on VM 1
10. Stop the VM that didn't get the same source VF that was on VM 1
11. Plug the vNIC back on VM 1
12. VM 1 should get a different source VF and end up with no connectivity.
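
A host-side way to see the PF's VFs and the MAC each one kept (a minimal sketch; the PF name enp5s0f0 below is hypothetical - use the actual PF interface name):

 # list the VFs of the physical function together with their current MAC addresses
 ip link show enp5s0f0
 # illustrative output - one line per VF:
 #   vf 0 MAC 00:1a:4a:16:20:76, spoof checking on, link-state auto
 #   vf 1 MAC 00:00:00:00:00:00, spoof checking on, link-state auto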

Comment 6 Dan Kenigsberg 2017-01-18 09:48:39 UTC
I am guessing that our problem is due to the former VF staying in the host with the same MAC as the VF owned by the VM (bug 1341248). Can you try changing its MAC address, or taking the VF down?
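
For reference, a rough sketch of how that could be tried from the host (the PF name enp5s0f0 and VF index 0 are hypothetical - the index should be the stale VF found via 'ip link show' on the PF, as sketched in comment 5):

 # give the stale VF a different (dummy) MAC so it no longer shadows the MAC used inside the guest
 ip link set enp5s0f0 vf 0 mac 00:1a:4a:ff:ff:01
 # ...or administratively force the VF's link down
 ip link set enp5s0f0 vf 0 state disable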

Comment 7 Dan Kenigsberg 2017-01-18 09:50:30 UTC
To verify my guess, could you repeat the steps with a different NIC type and driver?

Comment 8 Dan Kenigsberg 2017-01-19 22:50:58 UTC
Leon, could you try the tricky workaround suggested in https://bugzilla.redhat.com/show_bug.cgi?id=1341248#c17 ?

Comment 9 Yanan Fu 2017-01-26 00:28:09 UTC
Hi Michael and Dan,
I tested on the qemu side, following comment 5.

Test version:
qemu: qemu-kvm-rhev-2.6.0-27.el7.x86_64
kernel: kernel-3.10.0-514.el7.x86_64
nic driver: qlcnic

Test steps:
1. Create 2 VFs and prepare 3 VMs.
2. Add VF 1 to VM 1  ---> VF 1 gets an IP in VM 1 and works well
3. Hot unplug VF 1 from VM 1.
4. Add VF 1 to VM 2.  ---> gets an IP, works well
5. Add VF 2 to VM 3.  ---> gets an IP, works well
6. Hot unplug VF 2 from VM 3, then hot plug it into VM 1.
7. VM 1 works with VF 2, gets an IP and works well.

In step 2:
VM 1 + VF 1:   ip:10.73.33.183,   Mac:  8a:ea:c2:7b:6e:f1

In step 4:
VM 2 + VF 1:   ip:10.73.33.183,   Mac:  8a:ea:c2:7b:6e:f1

In step 5:
VM 3 + VF 2:   ip:10.73.33.190,   Mac:  66:e3:ce:63:d7:64

In step 6:
VM 1 + VF 2:   ip:10.73.33.190,   Mac:  66:e3:ce:63:d7:64


Conclusion:
The VF's MAC does not change when it is added to different VMs, and the VMs can get an IP normally.
It seems this bug cannot be reproduced with the qlcnic driver. Thanks!

Comment 10 Dan Kenigsberg 2017-02-13 11:49:17 UTC
Let us verify this bug only when https://gerrit.ovirt.org/#/c/72135/ is in.

Comment 11 Michael Burman 2017-02-20 09:56:27 UTC
Verified on - 4.1.1.2-0.1.el7