Created attachment 1239344 [details]
vdsm log

Description of problem:
[SR-IOV] - the same PCI address is stored for two vNICs:
libvirtError: XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction='on'" for device on function 0)

This looks like the same bug we had on 3.6 for virtio vNICs - BZ 1199782. The fix was not applied to SR-IOV vNICs, and we are now facing the same issue there.

engine.log:

2017-01-11 08:46:49,539+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-10) [7c2b3d00] Failed in 'HotPlugNicVDS' method
2017-01-11 08:46:49,552+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-10) [7c2b3d00] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command failed: XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction='on'" for device on function 0)
2017-01-11 08:46:49,552+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-10) [7c2b3d00] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand' return value 'StatusOnlyReturn [status=Status [code=49, message=XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction='on'" for device on function 0)]]'

vdsm.log:

2017-01-11 08:46:49,403 INFO (jsonrpc/7) [virt.vm] (vmId='dd9b1f83-cb5d-4afe-81ea-224050a01531') Hotplug NIC xml: <?xml version='1.0' encoding='UTF-8'?>
<interface managed="no" type="hostdev">
    <address bus="0x00" domain="0x0000" function="0x0" slot="0x08" type="pci" />
    <mac address="00:1a:4a:16:20:7a" />
    <source>
        <address bus="5" domain="0" function="2" slot="16" type="pci" />
    </source>
    <driver name="vfio" />
</interface>
 (vm:2085)
2017-01-11 08:46:49,403 INFO (jsonrpc/7) [virt.vm] (vmId='dd9b1f83-cb5d-4afe-81ea-224050a01531') Detaching device pci_0000_05_10_2 from the host. (network:214)
2017-01-11 08:46:49,457 ERROR (jsonrpc/7) [virt.vm] (vmId='dd9b1f83-cb5d-4afe-81ea-224050a01531') Hotplug failed (vm:2091)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2089, in hotplugNic
    self._dom.attachDevice(nicXml)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 536, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction='on'" for device on function 0)
2017-01-11 08:46:49,458 INFO (jsonrpc/7) [vdsm.api] FINISH hotplugNic return={'status': {'message': 'XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction=\'on\'" for device on function 0)', 'code': 49}} (api:43)

Version-Release number of selected component (if applicable):
4.1.0-0.4.master.20170109222652.git53fd6cb.el7.centos

How reproducible:
100%

Steps to Reproduce:
1. Host with two VFs enabled
2. Start VM with one SR-IOV vNIC
3. Unplug the vNIC
4. Add another SR-IOV vNIC
5. Try to plug the first vNIC back

Actual results:
Plug fails with:
libvirtError: XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction='on'" for device on function 0)

Expected results:
We should not try to reuse the stale PCI address.
Created attachment 1247815 [details] logs in debug
During some debugging sessions we established that a possible cause of this bug is a delayed update of the device data in the DB by VM monitoring. When a device is plugged, the "address" field is at first empty; it is only populated on the next VM monitoring pass (it is also quite possible that the delay is not in VM monitoring itself, but in vdsm reading the data and passing it in the result of Host.getAllVmStats). There is therefore a time window (between the plug and the VM monitoring pass that updates the device) during which the "address" remains empty, even though the device already occupies a PCI slot. If we manage to plug in an unplugged device with the same PCI address during this window, we get the described error.

========================
Notes:
- We only managed to reproduce the bug on one specific env; I did not manage to reproduce it on my environment.
- It only occurred for SR-IOV devices.
- It took quite a lot of attempts to reproduce (since we rely on the time gap between the plug and VM monitoring).
=========================

Description of the debug session.

Steps:
Plug in NIC2. Unplug NIC2. The state of the DB will be (irrelevant rows removed):

engine=# select device_id, name, is_plugged, address from vm_device as d, vm_interface as i where d.device_id = i.id and d.type='interface' and d.device='hostdev' order by d.is_plugged, name, d.host_device;

              device_id               | name | is_plugged |                           address
--------------------------------------+------+------------+--------------------------------------------------------------
 508932a7-0e5f-42bd-9a10-db830b2fa804 | nic2 | f          | {slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}

The address of NIC2 is {slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}. We will use this NIC2 to plug in and cause the conflict later.

Now plug in another nic, and use virsh to check whether it is using the same PCI address.
#virsh edit vm

Should show the new device xml to be:

<interface type='hostdev'>
  <mac address='00:1a:4a:16:91:df'/>
  <driver name='vfio'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x2'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x19' function='0x0'/>
</interface>

Check DB state, it should show the address to be empty (you have to rush to be faster than VM monitoring):

              device_id               | name | is_plugged |                           address
--------------------------------------+------+------------+--------------------------------------------------------------
 508932a7-0e5f-42bd-9a10-db830b2fa804 | nic2 | f          | {slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}
 24b6c16f-118d-4378-aed8-c263fabaa394 | nic7 | t          |

Quickly plug in NIC2 again.

==================
Results:
----------------------------
engine.log

{"jsonrpc": "2.0", "id": "bb379e89-bce5-42e9-bdaf-c3485a8bce7c", "error": {"message": "XML error: Attempted double use of PCI slot 0000:00:19.0 (may need \"multifunction='on'\" for device on function 0)", "code": 49}}\00
2017-02-07 14:56:53,780+02 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "bb379e89-bce5-42e9-bdaf-c3485a8bce7c", "error": {"message": "XML error: Attempted double use of PCI slot 0000:00:19.0 (may need \"multifunction='on'\" for device on function 0)", "code": 49}}
2017-02-07 14:56:53,781+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Failed in 'HotPlugNicVDS' method
2017-02-07 14:56:53,802+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-3) [4929fbc7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command HotPlugNicVDS failed: XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0)
2017-02-07 14:56:53,802+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand' return value 'StatusOnlyReturn [status=Status [code=49, message=XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0)]]'
2017-02-07 14:56:53,802+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] HostName = host_mixed_1
2017-02-07 14:56:53,802+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Command 'HotPlugNicVDSCommand(HostName = host_mixed_1, VmNicDeviceVDSParameters:{runAsync='true', hostId='3a461812-281d-48db-8b7a-bc7a5849c1fe', vm.vm_name='golden_env_mixed_virtio_0', nic='VmNetworkInterface:{id='508932a7-0e5f-42bd-9a10-db830b2fa804', networkName='sr', vnicProfileName='null', vnicProfileId='8e6e49d9-a52e-4ed1-b979-3286f84aab57', speed='1000', type='5', macAddress='00:1a:4a:16:91:e5', active='true', linked='true', portMirroring='false', vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4', vmName='golden_env_mixed_virtio_0', vmTemplateId='null', QoSName='null', remoteNetworkName='null'}', vmDevice='VmDevice:{id='VmDeviceId:{deviceId='508932a7-0e5f-42bd-9a10-db830b2fa804', vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4'}', device='hostdev', type='INTERFACE', bootOrder='0', specParams='[]', address='{slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}', managed='true', plugged='false', readOnly='false', deviceAlias='hostdev0', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='pci_0000_05_10_0'}'})' execution failed: VDSGenericException: VDSErrorException: Failed to HotPlugNicVDS, error = XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0), code = 49
2017-02-07 14:56:53,803+02 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HotPlugNicVDS, error = XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0), code = 49
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createDefaultConcreteException(VdsBrokerCommand.java:76) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.createException(BrokerCommandBase.java:222) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:192) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand.executeVdsBrokerCommand(HotPlugNicVDSCommand.java:14) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
    at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2158) [bll.jar:]
    at org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand.executePlugOrUnplug(ActivateDeactivateVmNicCommand.java:285) [bll.jar:]
    at org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand.plugNic(ActivateDeactivateVmNicCommand.java:255) [bll.jar:]
    at org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand.executeVmCommand(ActivateDeactivateVmNicCommand.java:203) [bll.jar:]
    at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120) [bll.jar:]

--------------------------------------
vdsm.log

2017-02-07 14:56:53,254 INFO (jsonrpc/6) [virt.vm] (vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4') Hotplug NIC xml: <?xml version='1.0' encoding='UTF-8'?>
<interface managed="no" type="hostdev">
    <address bus="0x00" domain="0x0000" function="0x0" slot="0x19" type="pci" />
    <mac address="00:1a:4a:16:91:e5" />
    <source>
        <address bus="5" domain="0" function="0" slot="16" type="pci" />
    </source>
    <driver name="vfio" />
</interface>
 (vm:2086)
2017-02-07 14:56:53,254 INFO (jsonrpc/6) [virt.vm] (vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4') Detaching device pci_0000_05_10_0 from the host. (network:214)
2017-02-07 14:56:53,317 ERROR (jsonrpc/6) [virt.vm] (vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4') Hotplug failed (vm:2092)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 2090, in hotplugNic
    self._dom.attachDevice(nicXml)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 536, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0)
2017-02-07 14:56:53,321 INFO (jsonrpc/6) [vdsm.api] FINISH hotplugNic return={'status': {'message': 'XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction=\'on\'" for device on function 0)', 'code': 49}} (api:43)
2017-02-07 14:56:53,321 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.hotplugNic failed (error 49) in 0.20 seconds (__init__:515)
I don't quite like the suggested patch, even though it was approved by maintainers. It feels racy, and it can lose information (the old PCI address) when that is not really necessary. Let us take it into 4.2 only. In any case this bug, which pops up only in a quick plug/unplug scenario, is not really urgent.
I did not like the patch, but the issue is not important enough to keep fighting for a pure solution. Let QE judge if it is good enough for us.
Verified on - 4.2.0-0.0.master.20170725202023.git561151b.el7.centos and vdsm-4.20.1-241.giteb37c05.el7.centos.x86_64
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.