Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1861753

Summary: HotUnplug an additional nic of HE VM is processed with error about removing not pluggable device from libvirt
Product: [oVirt] ovirt-engine Reporter: Polina <pagranat>
Component: BLL.VirtAssignee: Arik <ahadas>
Status: CLOSED WONTFIX QA Contact: meital avital <mavital>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.4.1.10CC: ahadas, amusil, arachman, asocha, bugs, dholler, mperina, msobczyk, sbonazzo
Target Milestone: ---Keywords: Automation
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-13 10:23:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Virt RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
logs, xml output
none
logs_unplugged_nic_after restart
none
logs + vm.conf none

Description Polina 2020-07-29 13:26:08 UTC
Created attachment 1702807 [details]
logs, xml output

Description of problem:
HotUnplug an additional nic of HE VM is processed with error about removing not pluggable device from libvirt

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.1.10-0.1.el8ev.noarch

How reproducible: always


Steps to Reproduce:
1. add and additional network to the engine and additional nic to HE VM (test_network in the attached get_networks.xml. enp10s0 in the get_nics.xml attached)
2. Send rest API to unplug the nic

PUT https://{{host}}/ovirt-engine/api/vms/70ef8cf8-0109-474e-ad87-1777a8ef844f/nics/11bf670a-1770-473d-8a64-04a1a72a870c

<nic>
    <plugged>false</plugged>
</nic>

The error in engine.log (see line 12253)

020-07-29 15:44:28,646+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [] VM '70ef8cf8-0109-474e-ad87-1777a8ef844f' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='5c67ad87-9baa-451c-897c-dc037d0299d8', vmId='70ef8cf8-0109-474e-ad87-1777a8ef844f'}', device='virtio-scsi', type='CONTROLLER', specParams='[ioThreadId=1]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='null'}'

There is another scenario:
If before the unlug the HE VM was restarted (by cli hosted-engine --vm-poweroff/--vm-start) we waited for the DC is UP, hosts have max score, the engine is in healthy state , then sending the same unplug is resulted with 

2020-07-29 15:44:12,441+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-9) [] Command 'GetAllVmStatsVDSCommand(HostName = host_mixed_3, VdsIdVDSCommandParametersBase:{hostId='bc5e06d2-ce92-4737-813b-4514dc041af5'})' execution failed: VDSGenericException: VDSNetworkException: Connection timeout for host 'lynx11.lab.eng.tlv2.redhat.com', last response arrived 22542 ms ago.

11796 2020-07-29 15:44:13,444+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-93) [] Command 'GetAllVmStatsVDSCom      mand(HostName = host_mixed_1, VdsIdVDSCommandParametersBase:{hostId='0287afb9-3b69-48cd-80c5-b3ee4b2c3b18'})' execution failed: VDSGenericException: VDSNetworkException: Connection timeout for host 'lynx09      .lab.eng.tlv2.redhat.com', last response arrived 22543 ms ago.

2020-07-29 15:44:13,672+03 ERROR [org.ovirt.engine.api.restapi.resource.AbstractBackendResource] (default task-6) [] Operation Failed: [Network error during communication with the Host.]


please look in the engine.log from 

  11776 2020-07-29 15:43:55,927+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.HotUnplugNicVDSCommand] (default task-6) [659288c] NIC hot-unplug: <?xml version="1.0" encoding="UTF-8"?><hotunplug>
  11777   <devices>
  11778     <interface>
  11779       <alias name="ua-11bf670a-1770-473d-8a64-04a1a72a870c"/>
  11780     </interface>
  11781   </devices>
  11782 </hotunplug>



This was found by automation test , and in such scenario the additional nic just could not be removed from the engine by test.
Expected results: nic
unplugged without error


Additional info:

Comment 1 Arik 2020-07-30 08:37:22 UTC
(In reply to Polina from comment #0)
> 020-07-29 15:44:28,646+03 ERROR
> [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) []
> VM '70ef8cf8-0109-474e-ad87-1777a8ef844f' managed non pluggable device was
> removed unexpectedly from libvirt:
> 'VmDevice:{id='VmDeviceId:{deviceId='5c67ad87-9baa-451c-897c-dc037d0299d8',
> vmId='70ef8cf8-0109-474e-ad87-1777a8ef844f'}', device='virtio-scsi',
> type='CONTROLLER', specParams='[ioThreadId=1]', address='', managed='true',
> plugged='false', readOnly='false', deviceAlias='', customProperties='[]',
> snapshotId='null', logicalName='null', hostDevice='null'}'

This error appears long before the hot-unplug operation (e.g., 2020-07-29 13:42:30) so it's unlikely to be related.
The attached log doesn't seem to cover the interesting period of time in which the device got unplugged.
Please file a separate bug on that. We need log that covers the VM creation until the first occurrence of this error.

> There is another scenario:
> If before the unlug the HE VM was restarted (by cli hosted-engine
> --vm-poweroff/--vm-start) we waited for the DC is UP, hosts have max score,
> the engine is in healthy state , then sending the same unplug is resulted
> with 
...
> This was found by automation test , and in such scenario the additional nic
> just could not be removed from the engine by test.
> Expected results: nic
> unplugged without error

The funny part is that the hot-unplug actually succeeded - it's just that the engine was unable to get the response and therefore thought it failed.
I believe this is an infra related bug, something to do with the jsonrpc library:
In 2020-07-29 15:43:55,728+03 the engine managed to transmit the request for unplug to VDSM
In 2020-07-29 15:44:10,913+03 we see: "No interaction with host 'lynx09.lab.eng.tlv2.redhat.com' for 20000 ms."
And probably as a result of that the engine didn't get VDSM response (that was generated on 2020-07-29 15:44:03,344+0300).
Martin, could someone from the infra team take a look?

Comment 5 Polina 2020-09-17 07:16:59 UTC
Created attachment 1715179 [details]
logs_unplugged_nic_after restart

Hi, making some order after an additional investigation. 

1. The error msg "020-07-29 15:44:28,646+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [] VM '70ef8cf8-0109-474e-ad87-1777a8ef844f' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='5c67ad87-9baa-451c-897c-dc037d0299d8', vmId='70ef8cf8-0109-474e-ad87-1777a8ef844f'}', device='virtio-scsi', type='CONTROLLER', specParams='[ioThreadId=1]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='null'}'" 

doesn't relate to plug/unplug network sceario. It is reported by the engine every time the HE VM is restarted:
hosted-engine --vm-poweroff
hosted-engine --vm-start

and it is there.

2. About the plugging network . the scenario I do :

  - create a new test_network (new network in network menu),
  - assign the new network to the interface by SetupHostNetworks in Host/Network Interfaces tab
  - add new nic with this test_network to the HE VM
    check on HE VM guest that new interface appeared (ls /sys/class/net)
  - wait a minute, restart HE VM by hosted-engine --vm-poweroff/--vm-start
all these actions could be performed sending REST APIs like in http://pastebin.test.redhat.com/903007

Result when the HE VM is up, this additional nic is unplugged. the situation is not straightforward, not always reproducible.
I attach some logs (logs_unplugged_after_restart.tar) from the last reproduce when the HE VM was restarted and after restart the nic is unplugged.
Arik , could you please look if there is some hint  why after restart the nic is unplugged. please let me know if some DEBUG in logs will be helpfull, i'll supply

3. About the VDSNetworkException I have to reproduce it in production environment as it is available for me and I'll supply info.I suspect that is could be related to some specific setup.

Comment 6 Arik 2020-09-17 11:01:04 UTC
(In reply to Polina from comment #5)
> 2. About the plugging network . the scenario I do :
> 
>   - create a new test_network (new network in network menu),
>   - assign the new network to the interface by SetupHostNetworks in
> Host/Network Interfaces tab
>   - add new nic with this test_network to the HE VM
>     check on HE VM guest that new interface appeared (ls /sys/class/net)
>   - wait a minute, restart HE VM by hosted-engine --vm-poweroff/--vm-start
> all these actions could be performed sending REST APIs like in
> http://pastebin.test.redhat.com/903007
> 
> Result when the HE VM is up, this additional nic is unplugged. the situation
> is not straightforward, not always reproducible.
> I attach some logs (logs_unplugged_after_restart.tar) from the last
> reproduce when the HE VM was restarted and after restart the nic is
> unplugged.
> Arik , could you please look if there is some hint  why after restart the
> nic is unplugged. please let me know if some DEBUG in logs will be helpfull,
> i'll supply

I see that before the restart we got the domain xml with both interfaces:

    <interface type='bridge'>
      <mac address='00:1a:4a:16:10:64'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='00:1a:4a:16:10:39'/>
      <source bridge='test_network'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-714be89d-f3a5-4c91-be40-7e7a2d3cf57e'/>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </interface>

and afterwards we get a single one:
   <interface type='bridge'>
      <mac address='00:1a:4a:16:10:64'/>
      <source bridge='ovirtmgmt'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <driver name='vhost' queues='4'/>
      <filterref filter='vdsm-no-mac-spoofing'/>
      <link state='up'/>
      <mtu size='1500'/>
      <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </interface>

the interface with the test_network doesn't appear also in the metadata section.
so it seems that the ha-broker didn't use the latest configuration of the vm when restarting it.

I see that after plugging the NIC we've updated the OVFSTORE:
2020-09-16 23:11:14,697+03 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [] Attempting to update VMs/Templates Ovf.

so it's hard to tell why the ha-broker didn't use the updated configuration - maybe there's another component on the host that fetched the configuration from the OVFSTORE and prepares the configuration to use on next restart of the hosted engine which didn't do its job before the hosted engine was restarted.

Asaf, can you please shed more light on this flow?

Comment 7 Dominik Holler 2020-09-17 12:25:21 UTC
Can you please share the /run/ovirt-hosted-engine-ha/vm.conf ?

Comment 8 Dominik Holler 2020-09-17 12:32:07 UTC
 Can you please share the  /var/log/ovirt-hosted-engine-ha/agent.log ?

Comment 9 Polina 2020-09-17 13:08:56 UTC
Created attachment 1715220 [details]
logs + vm.conf

added logs including vm.conf

Comment 11 Arik 2020-10-08 07:13:02 UTC
Sandro, is there anything else the virt team can provide to proceed with this? we suspect this could be an issue with the ha-broker (comment 6)

Comment 12 Asaf Rachmani 2020-10-08 08:00:21 UTC
(In reply to Arik from comment #6)
> (In reply to Polina from comment #5)
> > 2. About the plugging network . the scenario I do :
> > 
> >   - create a new test_network (new network in network menu),
> >   - assign the new network to the interface by SetupHostNetworks in
> > Host/Network Interfaces tab
> >   - add new nic with this test_network to the HE VM
> >     check on HE VM guest that new interface appeared (ls /sys/class/net)
> >   - wait a minute, restart HE VM by hosted-engine --vm-poweroff/--vm-start
> > all these actions could be performed sending REST APIs like in
> > http://pastebin.test.redhat.com/903007
> > 
> > Result when the HE VM is up, this additional nic is unplugged. the situation
> > is not straightforward, not always reproducible.
> > I attach some logs (logs_unplugged_after_restart.tar) from the last
> > reproduce when the HE VM was restarted and after restart the nic is
> > unplugged.
> > Arik , could you please look if there is some hint  why after restart the
> > nic is unplugged. please let me know if some DEBUG in logs will be helpfull,
> > i'll supply
> 
> I see that before the restart we got the domain xml with both interfaces:
> 
>     <interface type='bridge'>
>       <mac address='00:1a:4a:16:10:64'/>
>       <source bridge='ovirtmgmt'/>
>       <target dev='vnet0'/>
>       <model type='virtio'/>
>       <driver name='vhost' queues='4'/>
>       <filterref filter='vdsm-no-mac-spoofing'/>
>       <link state='up'/>
>       <mtu size='1500'/>
>       <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00'
> function='0x0'/>
>     </interface>
>     <interface type='bridge'>
>       <mac address='00:1a:4a:16:10:39'/>
>       <source bridge='test_network'/>
>       <target dev='vnet1'/>
>       <model type='virtio'/>
>       <driver name='vhost' queues='4'/>
>       <filterref filter='vdsm-no-mac-spoofing'/>
>       <link state='up'/>
>       <mtu size='1500'/>
>       <alias name='ua-714be89d-f3a5-4c91-be40-7e7a2d3cf57e'/>
>       <address type='pci' domain='0x0000' bus='0x09' slot='0x00'
> function='0x0'/>
>     </interface>
> 
> and afterwards we get a single one:
>    <interface type='bridge'>
>       <mac address='00:1a:4a:16:10:64'/>
>       <source bridge='ovirtmgmt'/>
>       <target dev='vnet0'/>
>       <model type='virtio'/>
>       <driver name='vhost' queues='4'/>
>       <filterref filter='vdsm-no-mac-spoofing'/>
>       <link state='up'/>
>       <mtu size='1500'/>
>       <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00'
> function='0x0'/>
>     </interface>
> 
> the interface with the test_network doesn't appear also in the metadata
> section.
> so it seems that the ha-broker didn't use the latest configuration of the vm
> when restarting it.
> 
> I see that after plugging the NIC we've updated the OVFSTORE:
> 2020-09-16 23:11:14,697+03 INFO 
> [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater]
> (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) []
> Attempting to update VMs/Templates Ovf.
> 
> so it's hard to tell why the ha-broker didn't use the updated configuration
> - maybe there's another component on the host that fetched the configuration
> from the OVFSTORE and prepares the configuration to use on next restart of
> the hosted engine which didn't do its job before the hosted engine was
> restarted.
> 
> Asaf, can you please shed more light on this flow?

Once the engine VM starts, ovirt-ha-agent uses `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it needs.
This file is regenerated by parsing the OVF_STORE disks, there is some latency in reading/writing OVF_STORE.

Comment 13 Arik 2020-10-08 08:25:24 UTC
(In reply to Asaf Rachmani from comment #12)
> Once the engine VM starts, ovirt-ha-agent uses
> `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it
> needs.
> This file is regenerated by parsing the OVF_STORE disks, there is some
> latency in reading/writing OVF_STORE.

Yeah, that makes sense and this latency is what we're interested in - can we estimate the maximal time it may take for a change to the OVF_STORE to propagate to vm.conf?

Comment 14 Asaf Rachmani 2020-10-08 09:01:47 UTC
(In reply to Arik from comment #13)
> (In reply to Asaf Rachmani from comment #12)
> > Once the engine VM starts, ovirt-ha-agent uses
> > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it
> > needs.
> > This file is regenerated by parsing the OVF_STORE disks, there is some
> > latency in reading/writing OVF_STORE.
> 
> Yeah, that makes sense and this latency is what we're interested in - can we
> estimate the maximal time it may take for a change to the OVF_STORE to
> propagate to vm.conf?

I think it depends on the environment (networking/storage/load etc.).

Comment 15 Arik 2020-10-08 09:43:17 UTC
(In reply to Asaf Rachmani from comment #14)
> (In reply to Arik from comment #13)
> > (In reply to Asaf Rachmani from comment #12)
> > > Once the engine VM starts, ovirt-ha-agent uses
> > > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it
> > > needs.
> > > This file is regenerated by parsing the OVF_STORE disks, there is some
> > > latency in reading/writing OVF_STORE.
> > 
> > Yeah, that makes sense and this latency is what we're interested in - can we
> > estimate the maximal time it may take for a change to the OVF_STORE to
> > propagate to vm.conf?
> 
> I think it depends on the environment (networking/storage/load etc.).

yes, but for the automation tests to be deterministic we don't have to be super accurate here -
let's say that the ha-broker samples the OVF_STORE every minute
so we can tell QE, listen - you need to make sure the engine says it managed to update the OVF-STORE, then wait for 90 seconds before proceeding with restarting the hosted-engine - as in the worst case the vm.conf will be updated after a minute and we take another 30 sec as a buffer for whatever storage/load issue that may appear

we don't want to tell them to wait too much, right? but that's ok to give a rough estimation for this
is it realistic that after a minute that vm.conf will be updated?

Comment 16 Asaf Rachmani 2020-10-08 10:31:44 UTC
(In reply to Arik from comment #15)
> (In reply to Asaf Rachmani from comment #14)
> > (In reply to Arik from comment #13)
> > > (In reply to Asaf Rachmani from comment #12)
> > > > Once the engine VM starts, ovirt-ha-agent uses
> > > > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it
> > > > needs.
> > > > This file is regenerated by parsing the OVF_STORE disks, there is some
> > > > latency in reading/writing OVF_STORE.
> > > 
> > > Yeah, that makes sense and this latency is what we're interested in - can we
> > > estimate the maximal time it may take for a change to the OVF_STORE to
> > > propagate to vm.conf?
> > 
> > I think it depends on the environment (networking/storage/load etc.).
> 
> yes, but for the automation tests to be deterministic we don't have to be
> super accurate here -
> let's say that the ha-broker samples the OVF_STORE every minute
> so we can tell QE, listen - you need to make sure the engine says it managed
> to update the OVF-STORE, then wait for 90 seconds before proceeding with
> restarting the hosted-engine - as in the worst case the vm.conf will be
> updated after a minute and we take another 30 sec as a buffer for whatever
> storage/load issue that may appear
> 
> we don't want to tell them to wait too much, right? but that's ok to give a
> rough estimation for this
> is it realistic that after a minute that vm.conf will be updated?

Sounds good

Comment 17 Arik 2020-10-08 12:23:06 UTC
(In reply to Arik from comment #15)
> yes, but for the automation tests to be deterministic we don't have to be
> super accurate here -
> let's say that the ha-broker samples the OVF_STORE every minute
> so we can tell QE, listen - you need to make sure the engine says it managed
> to update the OVF-STORE, then wait for 90 seconds before proceeding with
> restarting the hosted-engine - as in the worst case the vm.conf will be
> updated after a minute and we take another 30 sec as a buffer for whatever
> storage/load issue that may appear

Polina, would that work as a workaround in the automation tests for this?

Comment 20 Arik 2020-10-13 10:35:22 UTC
To reflect on why this bug was closed - 
we cannot avoid the race between updating the configuration of the hosted engine VM and restarting it as it takes time to generate the configuration that the hosted engine VM will restart with.
The probability of this to happen in real environment is rather low and by trying to reduce the probability of this to happen we may introduce undesired side effects so it's better to go with the workaround for the automation tests that was proposed in comment 17 and seems to work.