Bug 1199782

Summary: same pci addr is stored for two vNICs if they are plugged to a running VM one at a time
Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 3.5.1
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
Reporter: Michael Burman <mburman>
Assignee: Marcin Mirecki <mmirecki>
QA Contact: Michael Burman <mburman>
CC: bazulay, danken, gklein, lpeer, lsurette, mburman, myakove, rbalakri, Rhev-m-bugs, srevivo, ykaul, ylavi
Target Milestone: ovirt-3.6.2
Target Release: 3.6.2
oVirt Team: Network
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-04-20 01:28:43 UTC
Attachments:
  Logs
  new logs
  3.6 logs
  New fail logs for Lior
  vdsm log

Description Michael Burman 2015-03-08 09:53:12 UTC
Created attachment 999287 [details]
Logs

Description of problem:
Can't hot-plug a vNIC. The operation fails with "Error while executing action Edit VM Interface properties: Failed to activate VM Network Interface" because of a libvirtError: internal error: Attempted double use of PCI slot.

libvirt.log:

<interface type="bridge">
        <address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci"/>
        <mac address="00:1a:4a:16:88:5f"/>
        <model type="virtio"/>
        <source bridge="rhevm"/>
        <link state="up"/>
        <boot order="2"/>
        <bandwidth/>
</interface>
<interface type="bridge">
        <address bus="0x00" domain="0x0000" function="0x0" slot="0x08" type="pci"/>
        <mac address="00:1a:4a:16:88:60"/>
        <model type="virtio"/>
        <source bridge="qbrb966e777-a4"/>
        <link state="up"/>
        <boot order="3"/>
        <bandwidth/>
        <target dev="tapb966e777-a4"/>
</interface>
<interface type="bridge">
        <address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci"/>
        <mac address="00:1a:4a:16:88:61"/>
        <model type="virtio"/>
        <source bridge="br-int"/>
        <link state="up"/>
        <boot order="4"/>
        <bandwidth/>
        <virtualport type="openvswitch">
                <parameters interfaceid="8be1902c-1eb3-4001-b28b-0044c4bd3773"/>
        </virtualport>
</interface>


vdsm.log:

Thread-1138::ERROR::2015-03-08 11:43:30,806::vm::3421::vm.Vm::(hotplugNic) vmId=`038dd653-dc16-48df-a06b-40338a7c98f3`::Hotplug failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3419, in hotplugNic
    self._dom.attachDevice(nicXml)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 419, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: internal error: Attempted double use of PCI slot 0000:00:03.0 (may need "multifunction='on'" for device on function 0)
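
For context, the same libvirt-level failure can be reproduced directly with the libvirt Python bindings. The following is a minimal sketch only; the connection URI, domain name, bridge and MAC address are placeholders, not values taken from the attached logs:

import libvirt

# Interface XML pinned to PCI slot 0x03; if the running guest already has a
# device on that slot, attachDevice() raises the same "Attempted double use
# of PCI slot" error seen in the traceback above.
NIC_XML = """
<interface type='bridge'>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
  <mac address='00:1a:4a:00:00:01'/>
  <source bridge='rhevm'/>
  <model type='virtio'/>
</interface>
"""

conn = libvirt.open('qemu:///system')   # hypervisor connection (assumed URI)
dom = conn.lookupByName('example-vm')   # placeholder domain name
try:
    dom.attachDevice(NIC_XML)           # hot-plug the vNIC into the running VM
except libvirt.libvirtError as e:
    print('hotplug failed:', e)
finally:
    conn.close()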


  
Version-Release number of selected component (if applicable):
3.5.1-0.1.el6ev

How reproducible:
100%

Steps to Reproduce:
1. Run a VM and add a vNIC with the 'rhevm' profile
2. Hot-unplug the vNIC
3. Add a new vNIC to the VM with the 'rhevm' profile
4. Try to hot-plug the first vNIC back

Actual results:
Fails with the error:
Error while executing action Edit VM Interface properties: Failed to activate VM Network Interface.

Expected results:
Operation should succeed.

Comment 1 Lior Vernia 2015-03-08 14:01:55 UTC
The first collision I see in the libvirt log is while running the VM with two NICs sharing the same PCI address, one with an external network.

Could you reproduce this on a deployment with no external networks, and state the exact steps to reproduce?

Comment 2 Michael Burman 2015-03-08 14:11:36 UTC
Hi Lior,

I wrote the exact steps in the Description above.

As I said, this is not related to external networks. Yes, I did manage to reproduce it without an external network.

Comment 3 Lior Vernia 2015-03-08 14:42:10 UTC
Can I see logs from such a deployment?

Comment 4 Michael Burman 2015-03-08 15:13:11 UTC
Created attachment 999346 [details]
new logs

Comment 5 Michael Burman 2015-03-08 15:14:00 UTC
Sure, Lior, logs from such a deployment are attached.

Comment 6 Michael Burman 2015-03-09 07:53:38 UTC
Lior,

The same for 3.6.0-0.0.master.20150307182246.git2776c91.el6

2015-03-09 09:50:28,080 ERROR [org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand] (ajp--127.0.0.1-8702-3) [65101358] Command 'org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand' failed: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HotPlugNicVDS, error = internal error: Attempted double use of PCI slot 0000:00:03.0 (may need "multifunction='on'" for device on function 0), code = 49 (Failed with error ACTIVATE_NIC_FAILED and code 49)

- No external networks. Very simple steps to reproduce.

Comment 7 Michael Burman 2015-03-09 07:58:27 UTC
Created attachment 999442 [details]
3.6 logs

Comment 8 Michael Burman 2015-03-09 08:46:22 UTC
The same for 3.5.0-0.33.el6ev (ASYNC)

Comment 9 Dan Kenigsberg 2015-03-09 10:52:55 UTC
Which OS does the guest run? (if there's no guest, or the OS does not support hot-plug, it's NOTABUG).

Does the guest report the unplugged nic? What's its state?

Comment 10 Michael Burman 2015-03-09 11:56:51 UTC
Hi Dan,

RHEL 6.5, 6.6 and RHEL 7.

The guest doesn't report the unplugged NIC.

Comment 11 Michael Burman 2015-03-09 12:03:31 UTC
libvirt-1.1.1-29.el7_0.7.x86_64
vdsm-4.16.8.1-7.el7ev.x86_64
vdsm-4.16.12-2.el7ev.x86_64

Comment 12 Michael Burman 2015-03-09 12:41:55 UTC
Created attachment 999515 [details]
New fail logs for Lior

Comment 13 Lior Vernia 2015-03-09 12:47:36 UTC
Dan, the latest logs are from a run I conducted together with Michael, so it is well controlled. We created a VM named lior and ran it with one vNIC; its state was dumped into lior.xml. Then we hot-unplugged nic1 (MAC *22:02), hot-plugged nic2 (MAC *22:03), and then tried to hot-plug nic1 again.

Somehow nic2 got the PCI slot that had been allocated to nic1, 0x03. As far as I could see, neither engine, nor vdsm, nor libvirt "asked" for that slot - so it seems to me that the guest OS (RHEL, according to Michael either 7* or 6.6) was the one that re-allocated it to nic2.

Do you agree with the analysis? Who can we talk to on the platform side?
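
For reference, a quick way to double-check which vNIC holds which slot is to read the live domain XML. A minimal sketch with the libvirt Python bindings (the connection URI is an assumption; the domain name matches the test VM from this run):

import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open('qemu:///system')   # hypervisor connection (assumed URI)
dom = conn.lookupByName('lior')         # the VM used in this run
root = ET.fromstring(dom.XMLDesc(0))    # live domain XML

# Print the MAC address and PCI slot of every interface in the running domain.
for iface in root.findall('./devices/interface'):
    mac = iface.find('mac').get('address')
    addr = iface.find("address[@type='pci']")
    slot = addr.get('slot') if addr is not None else 'n/a'
    print(mac, slot)

conn.close()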

Comment 14 Dan Kenigsberg 2015-03-09 13:04:58 UTC
nic1's address is kept allocated in Engine, but it is completely freed and forgotten in libvirt once the unplug has succeeded.

I do not see a way to solve this in libvirt or vdsm. Engine might be able to blank out the PCI address of the unplugged nic1 if it notices that that address is already taken by another device (but only then, since we DO like to persist the former address of nic1).

Another option is complete control of PCI addresses in Engine, which is not a simple feature to add.
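
As a purely illustrative sketch of that first option (this is not actual engine code; the nic/device objects and their pci_address field are hypothetical):

# Hypothetical sketch of the "blank out" approach: keep the persisted PCI
# address of an unplugged vNIC unless another device of the same VM has taken
# that slot in the meantime, in which case let libvirt pick a new one.
def pci_address_for_hotplug(nic, vm_devices):
    """Return the address to send when hot-plugging `nic`, or None to let
    libvirt allocate a free slot."""
    if not nic.pci_address:
        return None  # nothing persisted yet
    for dev in vm_devices:
        if dev is not nic and dev.pci_address == nic.pci_address:
            return None  # the former slot was re-used; forget it
    return nic.pci_address  # still free; keep the former address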

Comment 15 Lior Vernia 2015-03-09 13:13:46 UTC
Is this consistent with previous behavior?... Didn't RHEL use to avoid allocating the same PCI address to another network interface (unless it had no choice)? I vaguely remember this behavior from our discussions concerning vNIC ordering.

Comment 16 Dan Kenigsberg 2015-03-09 14:39:28 UTC
(In reply to Lior Vernia from comment #15)
> Is this consistent with previous behavior?... Didn't RHEL use to avoid
> allocating the same PCI address to another network interface (unless it had
> no choice)? I vaguely remember this behavior from our discussions concerning
> vNIC ordering.

You might be recalling the RHEL *guest*'s persistence of the pciaddr+mac->guest_nicname mapping, which is kept on the guest disk in case the vNIC is plugged again.

The allocation of PCI addresses happens in libvirt; I don't believe it has ever attempted to maintain a history of previously-installed devices.

BTW, I am guessing that the very same bug can happen regardless of *hot*plugging:
- run a VM with nic1; pci1 is persisted in engine.
- stop the VM. unplug nic1. plug nic2.
- run VM with nic2; pci2 is allocated by libvirt, and is most likely to equal pci1.
- plug both nics and attempt to run the VM. I expect Engine is sending the same address to both nics, which breaks in libvirt.

Is that the case, Michael?

Comment 17 Michael Burman 2015-03-09 15:16:19 UTC
Yes, that is the case. Failed run VM.

libvirt.log:
XML error: Attempted double use of PCI slot 0000:00:03.0 (may need "multifunction='on'" for device on function 0)

<interface type="bridge">
        <address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci"/>
        <mac address="00:1a:4a:16:88:5c"/>
        <model type="virtio"/>
        <source bridge="rhevm"/>
        <link state="up"/>
        <bandwidth/>
</interface>
<interface type="bridge">
        <address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci"/>
        <mac address="00:1a:4a:16:88:5e"/>
        <model type="virtio"/>
        <source bridge="rhevm"/>
        <link state="up"/>
        <bandwidth/>
</interface>

Comment 18 Lior Vernia 2015-03-09 15:27:31 UTC
If we indeed do not want the engine to stop caring about previous PCI addresses, then this should probably be solved by exposing vNIC PCI address management to users - which seems like the right way to go for Bug 1108926 as well.

Comment 19 Lior Vernia 2015-03-10 13:11:00 UTC
Lowering priority as there's an easy workaround - just remove the vNIC and re-create it.

Comment 21 Michael Burman 2015-12-10 10:40:11 UTC
Tested and failed QA on 3.6.1.2-0.1.el6 with vdsm-4.17.13-1.el7ev.noarch

Thread-356::ERROR::2015-12-10 12:32:59,754::vm::758::virt.vm::(_startUnderlyingVm) vmId=`404f96db-b224-4163-a21e-eeb8eb084d7b`::The vm start process failed
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 702, in _startUnderlyingVm
    self._run()
  File "/usr/share/vdsm/virt/vm.py", line 1889, in _run
    self._connection.createXML(domxml, flags),
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 3611, in createXML
    if ret is None:raise libvirtError('virDomainCreateXML() failed', conn=self)
libvirtError: XML error: Attempted double use of PCI slot 0000:00:03.0 (may need "multifunction='on'" for device on function 0)


I tested my steps from the description ^^
1. Run a VM and add a vNIC with the 'rhevm' profile
2. Hot-unplug the vNIC
3. Add a new vNIC to the VM with the 'rhevm' profile
4. Try to hot-plug the first vNIC back
and somehow it succeeded.

I tested Dan's steps from comment 16 ^^
- run a VM with nic1; pci1 is persisted in engine.
- stop the VM. unplug nic1. plug nic2.
- run VM with nic2; pci2 is allocated by libvirt, and is most likely to equal pci1.
- plug both nics and attempt to run the VM. I expect Engine is sending the same address to both nics, which breaks in libvirt.
And it failed with a libvirtError, the same as in the original report.

Comment 22 Michael Burman 2015-12-10 10:49:35 UTC
Created attachment 1104298 [details]
vdsm log

Comment 23 Marcin Mirecki 2015-12-10 11:13:01 UTC
The fix is only for hot-plugging/hot-unplugging.
The stop/start case would require another patch.

Comment 24 Dan Kenigsberg 2015-12-10 12:59:26 UTC
We can wait for this patch until 3.6.2.

Comment 25 Marcin Mirecki 2015-12-11 08:44:12 UTC
I think the problem can affect not only NICs, but also other PCI devices (such as disks).

Comment 26 Michael Burman 2015-12-28 12:55:36 UTC
Verified on - 3.6.2-0.1.el6