Bug 1717434

Summary: [downstream clone - 4.3.4] The PCI address of a NIC is not stored in the database after hotplug of a passthrough NIC, resulting in a changed network device name in the VM after a reboot

Product: Red Hat Enterprise Virtualization Manager
Component: ovirt-engine
Version: 4.2.8-4
Reporter: RHV bug bot <rhv-bugzilla-bot>
Assignee: Dominik Holler <dholler>
QA Contact: Michael Burman <mburman>
Status: CLOSED ERRATA
Severity: high
Priority: high
Keywords: ZStream
CC: dholler, emarcus, mkalinin, mtessun, Rhev-m-bugs, sfroemer
Target Milestone: ovirt-4.3.4
Hardware: All
OS: Linux
Fixed In Version: ovirt-engine-4.3.4.1
Doc Type: Bug Fix
Doc Text:
In some scenarios, the PCI address of a hotplugged SR-IOV vNIC was overwritten with an empty value, and as a result the NIC name in the virtual machine changed after a reboot. In this release, the vNIC PCI address is stored in the database and the NIC name persists across virtual machine reboots.
Clone Of: 1703112
Last Closed: 2019-06-20 14:48:33 UTC
oVirt Team: Network
Bug Depends On: 1703112
Bug Blocks: 1703275

Description RHV bug bot 2019-06-05 13:15:29 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1703112 +++
======================================================================

Description of problem:

Sometimes, when an SR-IOV NIC is hotplugged, the PCI device address is lost from the database. Because the engine then does not pass the stored address when starting the VM, libvirt generates a new one, and the NIC name inside the VM changes after a reboot.

In my test setup, I started a while loop to list the vm_device entries during hot plugging, and also enabled PostgreSQL statement logging (log_statement = 'all').

====
while true; do
    date
    /usr/share/ovirt-engine/dbscripts/engine-psql.sh \
        -c "select address, host_device from vm_device where vm_id='3d2859c2-31a3-4ce3-97fc-52b09e60ce8c' and type='interface';"
    sleep 1
done

Thu Apr 25 08:53:55 EDT 2019
                           address                            |   host_device    
--------------------------------------------------------------+------------------
 {type=pci, slot=0x07, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_12_1
 {type=pci, slot=0x08, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_11_5
 {type=pci, slot=0x09, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_10_1
                                                              | 


Thu Apr 25 08:53:56 EDT 2019
                           address                            |   host_device    
--------------------------------------------------------------+------------------
 {type=pci, slot=0x07, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_12_1
 {type=pci, slot=0x08, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_11_5
 {type=pci, slot=0x09, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_10_1
 {type=pci, slot=0x0a, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_10_5          ====> Address was saved here.
(4 rows)


Thu Apr 25 08:53:58 EDT 2019
                           address                            |   host_device    
--------------------------------------------------------------+------------------
 {type=pci, slot=0x07, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_12_1
 {type=pci, slot=0x08, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_11_5
 {type=pci, slot=0x09, bus=0x00, domain=0x0000, function=0x0} | pci_0000_07_10_1
                                                              | pci_0000_07_10_5           ====> About 2 seconds later, the address was removed.
====
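
The queries above were captured with the statement logging mentioned earlier. A minimal sketch of that setup, assuming a default PostgreSQL data directory (RHV installs may ship an SCL PostgreSQL with a different path and service name):

====
# Log every SQL statement, then reload PostgreSQL to apply the change.
echo "log_statement = 'all'" >> /var/lib/pgsql/data/postgresql.conf
systemctl reload postgresql
====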

In the PostgreSQL logs, I can see two updatevmdevice requests during the hotplug. The second one carried an _empty_ address.

===
updatevmdevice with the address:

2019-04-25 12:53:56.211 UTC LOG:  execute S_81: select * from public.updatevmdevice(CAST ($1 AS uuid),CAST ($2 AS uuid),CAST ($3 AS varchar),CAST ($4 AS varchar),CAST ($5 AS varchar),CAST ($6 AS text),CAST ($7 AS bool),CAST ($8 AS bool),CAST ($9 AS bool),CAST ($10 AS varchar),CAST ($11 AS text),CAST ($12 AS uuid),CAST ($13 AS varchar),CAST ($14 AS varchar)) as result
2019-04-25 12:53:56.211 UTC DETAIL:  parameters: $1 = 'a644c267-368b-4193-8935-7d1c3daf265b', $2 = '3d2859c2-31a3-4ce3-97fc-52b09e60ce8c', $3 = 'hostdev', $4 = 'interface', $5 = '{type=pci, slot=0x0a, bus=0x00, domain=0x0000, function=0x0}', $6 = '{ }', $7 = 't', $8 = 't', $9 = 'f', $10 = 'ua-a644c267-368b-4193-8935-7d1c3daf265b', $11 = '{ }', $12 = NULL, $13 = NULL, $14 = 'pci_0000_07_10_5' 

updatevmdevice with an _empty_ address, 2 seconds later:

2019-04-25 12:53:58.747 UTC LOG:  execute <unnamed>: select * from updatevmdevice($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14) as result
2019-04-25 12:53:58.747 UTC DETAIL:  parameters: $1 = 'a644c267-368b-4193-8935-7d1c3daf265b', $2 = '3d2859c2-31a3-4ce3-97fc-52b09e60ce8c', $3 = 'hostdev', $4 = 'interface', $5 = '', $6 = '{ }', $7 = 't', $8 = 't', $9 = 'f', $10 = 'ua-a644c267-368b-4193-8935-7d1c3daf265b', $11 = '{ }', $12 = NULL, $13 = NULL, $14 = 'pci_0000_07_10_5'
===

The engine log shows nothing corresponding to the second request.

===
2019-04-25 08:53:56,183-04 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] method: runVdsCommand, params: [DumpXmls, Params:{hostId='bd982e52-cbea-4631-bfaf-31475f5683c7', vmIds='[3d2859c2-31a3-4ce3-97fc-52b09e60ce8c]'}], timeElapsed: 226ms
2019-04-25 08:53:56,210-04 DEBUG [org.ovirt.engine.core.dal.dbbroker.BatchProcedureExecutionConnectionCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-1) [] Executing batch for procedure UpdateVmDevice

2019-04-25 08:53:58,740-04 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-14) [49ff84ad] Before releasing a lock 'EngineLock:{exclusiveLocks='[bd982e52-cbea-4631-bfaf-31475f5683c7=HOST_DEVICES]', sharedLocks=''}'
2019-04-25 08:53:58,740-04 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (default task-14) [49ff84ad] The exclusive lock for key 'bd982e52-cbea-4631-bfaf-31475f5683c7HOST_DEVICES' is released and lock is removed from map
2019-04-25 08:53:58,769-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-14) [49ff84ad] EVENT_ID: NETWORK_ACTIVATE_VM_INTERFACE_SUCCESS(1,012), Network Interface nic9 (PCI Passthrough) was plugged to VM vm_nijin. (User: admin@internal-authz)
===

Now, if I reboot this VM, the NIC name changes within the VM, because the engine does not pass the PCI address and libvirt generates another one.

===
2019-04-25 08:55:46,137-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2517) [1df55bc1-4bf5-47f8-a8e0-458cd234a72c] START, CreateBrokerVDSCommand(HostName = , CreateVDSCommandParameters:{hostId='bd982e52-cbea-4631-bfaf-31475f5683c7', vmId='3d2859c2-31a3-4ce3-97fc-52b09e60ce8c', vm='VM [vm_nijin]'}), log id: 4cb9e33

2019-04-25 08:55:46,154-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2517) [1df55bc1-4bf5-47f8-a8e0-458cd234a72c] VM <?xml version="1.0" encoding="UTF-8"?><domain type="kvm" xmlns:ovirt-tune="http://ovirt.org/vm/tune/1.0" xmlns:ovirt-vm="http://ovirt.org/vm/1.0">
------
------
------
    <interface type="hostdev" managed="no">
      <driver name="vfio"/>
      <source>
        <address bus="7" domain="0" function="1" slot="17" type="pci"/>
      </source>
      <alias name="ua-a644c267-368b-4193-8935-7d1c3daf265b"/>
      <mac address="00:1a:4a:16:01:29"/>
      <bandwidth/>
    </interface>
===

Since the PCI address changed (note that the <interface> element above carries no guest-side <address> child, only the host address under <source>), the network device name changes within the VM.

====
Apr 25 08:53:52 dhcp181-122 kernel: igbvf 0000:00:0a.0: Address: 00:1a:4a:16:01:29
Apr 25 08:53:52 dhcp181-122 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ens10: link becomes ready
Apr 25 08:53:52 dhcp181-122 kernel: igbvf 0000:00:0a.0: Link is Up 1000 Mbps Full Duplex

After reboot.

Apr 25 08:56:02 dhcp181-122 kernel: igbvf 0000:00:06.0: Address: 00:1a:4a:16:01:29
Apr 25 08:56:06 dhcp181-122 kernel: igbvf 0000:00:06.0: Link is Up 1000 Mbps Full Duplex
Apr 25 08:56:06 dhcp181-122 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ens6: link becomes ready
====

This can break configuration inside the VM that references the old device name.

Version-Release number of selected component (if applicable):

rhvm-4.2.8.7-0.1.el7ev.noarch

How reproducible:

I can reproduce this in roughly 10% of attempts when hotplugging the NIC to the VM in the test environment. It appears to happen only with passthrough NICs, since I was unable to reproduce it with virtio NICs.

Steps to Reproduce:

1. Hotplug a passthrough SR-IOV NIC into the VM and watch the address column in the vm_device table.


Actual results:

The PCI address of the NIC is not stored in the database after hotplug of a passthrough NIC, resulting in a changed network device name in the VM after a reboot.

Expected results:

The PCI address of the NIC should persist across reboots.
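
As a quick verification sketch, reusing the query from above (the vm_id is the one from this report; adapt as needed), no interface row should be left with an empty address once the hotplug has settled:

====
# Count interface devices whose PCI address was lost; expected result: 0.
/usr/share/ovirt-engine/dbscripts/engine-psql.sh \
    -c "select count(*) from vm_device where vm_id='3d2859c2-31a3-4ce3-97fc-52b09e60ce8c' and type='interface' and (address is null or address='');"
====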

Additional info:

(Originally by Nijin Ashok)

Comment 7 RHV bug bot 2019-06-05 13:15:41 UTC
Just from reading the source code, I expect that bug 1703112 will be
reproducible by creating a before_get_caps hook on the host.
The hook should just sleep for a short time, between 1 and 15 seconds.
I hope this would change the timing of the threads in Engine in a way
that triggers the bug.
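
A minimal sketch of such a hook, assuming the standard VDSM hook layout (the path and file name are illustrative):

====
# Install a before_get_caps hook that delays getCapabilities by 1-15
# seconds to perturb the timing of the engine's monitoring threads.
cat > /usr/libexec/vdsm/hooks/before_get_caps/50_sleep <<'EOF'
#!/bin/bash
sleep $(( (RANDOM % 15) + 1 ))
EOF
chmod +x /usr/libexec/vdsm/hooks/before_get_caps/50_sleep
====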

(Originally by Dominik Holler)

Comment 8 RHV bug bot 2019-06-05 13:15:42 UTC
(In reply to Dominik Holler from comment #7)
> Just from reading the source code, I expect that bug 1703112 will be
> reproducible by creating a before_get_caps hook on the host.
> The hook should just sleep for a short time, between 1 and 15 seconds.
> I hope this would change the timing of the threads in Engine in a way
> that triggers the bug.

Maybe a high VdsRefreshRate would work, too.
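
A sketch of adjusting that setting, assuming the key is exposed through engine-config (the value is the host refresh interval in seconds, and the engine must be restarted for it to take effect):

====
# Show the current refresh interval, raise it, and restart the engine.
engine-config -g VdsRefreshRate
engine-config -s VdsRefreshRate=15
systemctl restart ovirt-engine
====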

(Originally by Dominik Holler)

Comment 9 RHV bug bot 2019-06-05 13:15:44 UTC
4.3.4.1 is available for QE

(Originally by Dominik Holler)

Comment 12 RHV bug bot 2019-06-05 13:15:49 UTC
Verified on - 4.3.4.1-0.1.el7

(Originally by Michael Burman)

Comment 14 Dominik Holler 2019-06-05 13:17:26 UTC
*** Bug 1717356 has been marked as a duplicate of this bug. ***

Comment 15 Dominik Holler 2019-06-05 13:25:31 UTC
Copy verified from https://bugzilla.redhat.com/show_bug.cgi?id=1703112#c12

Comment 18 errata-xmlrpc 2019-06-20 14:48:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:1566