Bug 1412074
| Summary: | [SR-IOV] - same pci addr is stored for two vNICs - libvirtError: XML error: Attempted double use of PCI slot 0000:00:08.0 (may need "multifunction='on'" for device on function 0) |
|---|---|
| Product: | [oVirt] ovirt-engine |
| Component: | BLL.Network |
| Version: | 4.1.0 |
| Status: | CLOSED CURRENTRELEASE |
| Severity: | medium |
| Priority: | medium |
| Reporter: | Michael Burman <mburman> |
| Assignee: | Marcin Mirecki <mmirecki> |
| QA Contact: | Michael Burman <mburman> |
| Docs Contact: | |
| CC: | bugs, danken |
| Flags: | rule-engine: ovirt-4.2+ |
| Target Milestone: | ovirt-4.2.0 |
| Target Release: | 4.2.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | If docs needed, set a value |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2017-12-20 11:08:22 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | Network |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 1471667, 1472286 |
| Bug Blocks: | |
| Attachments: | |
Description (Michael Burman, 2017-01-11 07:25:34 UTC)

Created attachment 1247815 [details]
logs in debug
During some debugging sessions we established that a possible cause of this bug is a delayed update of the device data in the DB by VM monitoring.
When a device is plugged, the "address" field is initially empty; it is only populated by the next VM monitoring pass (it is also quite possible that the delay is not in VM monitoring itself, but in vdsm reading the data and passing it on in the result of Host.getAllVmStats).
There is therefore a time window, between the plug and the VM monitoring pass that updates the device, during which the "address" remains empty even though the device already occupies a PCI slot. If we manage to plug in an unplugged device with the same PCI address during this window, we get the described error.
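To make the window concrete, here is a minimal, self-contained sketch (hypothetical data and function names, not actual engine code) of the state described above: the unplugged NIC keeps its old address in the DB, the freshly plugged NIC's address is still empty, so no DB lookup can reveal the conflict and the plug is only rejected by libvirt.

```python
# Hypothetical model of the relevant engine DB rows during the window.
DB = {
    # nic2 was unplugged, but its old guest PCI address is still stored
    "nic2": {"plugged": False, "address": "0000:00:19.0"},
    # nic7 was just plugged; VM monitoring has not written its address back yet
    "nic7": {"plugged": True, "address": ""},
}

# What libvirt actually knows: nic7 already occupies nic2's former slot.
LIBVIRT_SLOTS = {"0000:00:19.0": "nic7"}


def engine_conflict_check(nic):
    # The engine can only compare against addresses currently stored in the DB.
    addr = DB[nic]["address"]
    return [name for name, dev in DB.items()
            if name != nic and dev["plugged"] and dev["address"] == addr]


def libvirt_attach(nic):
    # libvirt rejects the attach because the requested slot is already taken.
    addr = DB[nic]["address"]
    if addr in LIBVIRT_SLOTS:
        raise RuntimeError("XML error: Attempted double use of PCI slot " + addr)
    LIBVIRT_SLOTS[addr] = nic


print(engine_conflict_check("nic2"))  # [] -> the DB shows no conflict
libvirt_attach("nic2")                # raises the error seen in vdsm.log
```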
========================
Notes:
- We only managed to reproduce the bug on one specific environment; I could not reproduce it on mine.
- It only occurred for SR-IOV devices.
- It took quite a lot of attempts to reproduce, since the reproduction relies on the time gap between the plug and the next VM monitoring pass.
=========================
Description of the debug session.
Steps:
Plug in NIC2.
Unplug NIC2.
The state of the DB will be (irrelevant rows removed):
engine=# select device_id, name, is_plugged, address from vm_device as d, vm_interface as i where d.device_id = i.id and d.type='interface' and d.device='hostdev' order by d.is_plugged, name, d.host_device;
device_id | name | is_plugged | address
--------------------------------------+------+------------+--------------------------------------------------------------
508932a7-0e5f-42bd-9a10-db830b2fa804 | nic2 | f | {slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}
The stored address of NIC2 is {slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}.
We will plug NIC2 back in later to cause the conflict.
Now plug in another NIC and use virsh to check whether it was given the same PCI address:
#virsh edit vm
It should show the new device XML as:
<interface type='hostdev'>
<mac address='00:1a:4a:16:91:df'/>
<driver name='vfio'/>
<source>
<address type='pci' domain='0x0000' bus='0x05' slot='0x10' function='0x2'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x19' function='0x0'/>
</interface>
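Duplicate guest-side PCI addresses can also be spotted programmatically. The short helper below is only an illustration (it is not part of vdsm or the engine; it assumes local virsh access and uses the read-only `virsh dumpxml` instead of `virsh edit`):

```python
# Hypothetical helper: list guest PCI addresses that appear more than once
# in a domain's XML.
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter


def guest_pci_addresses(vm_name):
    xml = subprocess.check_output(["virsh", "dumpxml", vm_name])
    root = ET.fromstring(xml)
    addrs = []
    for dev in root.find("devices"):
        # Only a direct <address> child describes the guest-side address;
        # <source><address> is the host VF and is deliberately skipped.
        addr = dev.find("address")
        if addr is not None and addr.get("type") == "pci":
            addrs.append("%s:%s:%s.%s" % (addr.get("domain"), addr.get("bus"),
                                          addr.get("slot"), addr.get("function")))
    return addrs


dups = [a for a, n in Counter(guest_pci_addresses("vm")).items() if n > 1]
print(dups or "no duplicate guest PCI addresses")
```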
Check the DB state; it should show the address as empty (you have to be faster than the next VM monitoring pass):
device_id | name | is_plugged | address
--------------------------------------+------+------------+--------------------------------------------------------------
508932a7-0e5f-42bd-9a10-db830b2fa804 | nic2 | f | {slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}
24b6c16f-118d-4378-aed8-c263fabaa394 | nic7 | t |
Quickly plug in NIC2 again.
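Hitting the window by hand is fiddly; a small polling script along these lines can help signal the right moment. It is a hypothetical helper that assumes local access to the engine PostgreSQL database with the psycopg2 driver and the same schema as the query above:

```python
# Hypothetical timing helper: wait until the new NIC's address is (still)
# empty in the engine DB, i.e. before VM monitoring has filled it in.
import time
import psycopg2

conn = psycopg2.connect("dbname=engine user=engine")  # adjust credentials/host as needed
cur = conn.cursor()

QUERY = """
    select d.address
    from vm_device as d, vm_interface as i
    where d.device_id = i.id and d.device = 'hostdev' and i.name = %s
"""

deadline = time.time() + 15
while time.time() < deadline:
    cur.execute(QUERY, ("nic7",))
    row = cur.fetchone()
    if row is not None and not row[0]:
        print("address still empty -- re-plug nic2 now to hit the race")
        break
    time.sleep(0.2)
else:
    print("VM monitoring already updated the address; unplug/replug and try again")
```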
==================
Results:
----------------------------
engine.log
{"jsonrpc": "2.0", "id": "bb379e89-bce5-42e9-bdaf-c3485a8bce7c", "error": {"message": "XML error: Attempted double use of PCI slot 0000:00:19.0 (may need \"multifunction='on'\" for device on function 0)", "code": 49}}\00
2017-02-07 14:56:53,780+02 DEBUG [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Message received: {"jsonrpc": "2.0", "id": "bb379e89-bce5-42e9-bdaf-c3485a8bce7c", "error": {"message": "XML error: Attempted double use of PCI slot 0000:00:19.0 (may need \"multifunction='on'\" for device on function 0)", "code": 49}}
2017-02-07 14:56:53,781+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Failed in 'HotPlugNicVDS' method
2017-02-07 14:56:53,802+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-3) [4929fbc7] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM host_mixed_1 command HotPlugNicVDS failed: XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0)
2017-02-07 14:56:53,802+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand' return value 'StatusOnlyReturn [status=Status [code=49, message=XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0)]]'
2017-02-07 14:56:53,802+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] HostName = host_mixed_1
2017-02-07 14:56:53,802+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Command 'HotPlugNicVDSCommand(HostName = host_mixed_1, VmNicDeviceVDSParameters:{runAsync='true', hostId='3a461812-281d-48db-8b7a-bc7a5849c1fe', vm.vm_name='golden_env_mixed_virtio_0', nic='VmNetworkInterface:{id='508932a7-0e5f-42bd-9a10-db830b2fa804', networkName='sr', vnicProfileName='null', vnicProfileId='8e6e49d9-a52e-4ed1-b979-3286f84aab57', speed='1000', type='5', macAddress='00:1a:4a:16:91:e5', active='true', linked='true', portMirroring='false', vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4', vmName='golden_env_mixed_virtio_0', vmTemplateId='null', QoSName='null', remoteNetworkName='null'}', vmDevice='VmDevice:{id='VmDeviceId:{deviceId='508932a7-0e5f-42bd-9a10-db830b2fa804', vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4'}', device='hostdev', type='INTERFACE', bootOrder='0', specParams='[]', address='{slot=0x19, bus=0x00, domain=0x0000, type=pci, function=0x0}', managed='true', plugged='false', readOnly='false', deviceAlias='hostdev0', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='pci_0000_05_10_0'}'})' execution failed: VDSGenericException: VDSErrorException: Failed to HotPlugNicVDS, error = XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0), code = 49
2017-02-07 14:56:53,803+02 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand] (default task-3) [4929fbc7] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to HotPlugNicVDS, error = XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0), code = 49
at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.createDefaultConcreteException(VdsBrokerCommand.java:76) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.createException(BrokerCommandBase.java:222) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:192) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.HotPlugNicVDSCommand.executeVdsBrokerCommand(HotPlugNicVDSCommand.java:14) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:111) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:73) [vdsbroker.jar:]
at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
at org.ovirt.engine.core.vdsbroker.vdsbroker.DefaultVdsCommandExecutor.execute(DefaultVdsCommandExecutor.java:14) [vdsbroker.jar:]
at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:407) [vdsbroker.jar:]
at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2158) [bll.jar:]
at org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand.executePlugOrUnplug(ActivateDeactivateVmNicCommand.java:285) [bll.jar:]
at org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand.plugNic(ActivateDeactivateVmNicCommand.java:255) [bll.jar:]
at org.ovirt.engine.core.bll.network.vm.ActivateDeactivateVmNicCommand.executeVmCommand(ActivateDeactivateVmNicCommand.java:203) [bll.jar:]
at org.ovirt.engine.core.bll.VmCommand.executeCommand(VmCommand.java:120) [bll.jar:]
--------------------------------------
vdsm.log
2017-02-07 14:56:53,254 INFO (jsonrpc/6) [virt.vm] (vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4') Hotplug NIC xml: <?xml version='1.0' encoding='UTF-8'?>
<interface managed="no" type="hostdev">
<address bus="0x00" domain="0x0000" function="0x0" slot="0x19" type="pci" />
<mac address="00:1a:4a:16:91:e5" />
<source>
<address bus="5" domain="0" function="0" slot="16" type="pci" />
</source>
<driver name="vfio" />
</interface>
(vm:2086)
2017-02-07 14:56:53,254 INFO (jsonrpc/6) [virt.vm] (vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4') Detaching device pci_0000_05_10_0 from the host. (network:214)
2017-02-07 14:56:53,317 ERROR (jsonrpc/6) [virt.vm] (vmId='1674025e-8abf-4b2a-b702-2731e0e88ce4') Hotplug failed (vm:2092)
Traceback (most recent call last):
File "/usr/share/vdsm/virt/vm.py", line 2090, in hotplugNic
self._dom.attachDevice(nicXml)
File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
ret = attr(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 941, in wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 536, in attachDevice
if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirtError: XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction='on'" for device on function 0)
2017-02-07 14:56:53,321 INFO (jsonrpc/6) [vdsm.api] FINISH hotplugNic return={'status': {'message': 'XML error: Attempted double use of PCI slot 0000:00:19.0 (may need "multifunction=\'on\'" for device on function 0)', 'code': 49}} (api:43)
2017-02-07 14:56:53,321 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.hotplugNic failed (error 49) in 0.20 seconds (__init__:515)
I don't quite like the suggested patch, even though it was approved by maintainers. It feels racy and can lose information (the old PCI address) when that is not really necessary. Let us take it into 4.2 only. In any case, this bug, which pops up only in a quick plug/unplug scenario, is not really urgent.

I did not like the patch, but the issue is not important enough to keep fighting for a purer solution. Let QE judge if it is good enough for us.

Verified on 4.2.0-0.0.master.20170725202023.git561151b.el7.centos and vdsm-4.20.1-241.giteb37c05.el7.centos.x86_64.

This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.