Bug 1861753
| Summary: | HotUnplug an additional nic of HE VM is processed with error about removing not pluggable device from libvirt | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Polina <pagranat> | ||||||||
| Component: | BLL.Virt | Assignee: | Arik <ahadas> | ||||||||
| Status: | CLOSED WONTFIX | QA Contact: | meital avital <mavital> | ||||||||
| Severity: | medium | Docs Contact: | |||||||||
| Priority: | unspecified | ||||||||||
| Version: | 4.4.1.10 | CC: | ahadas, amusil, arachman, asocha, bugs, dholler, mperina, msobczyk, sbonazzo | ||||||||
| Target Milestone: | --- | Keywords: | Automation | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | x86_64 | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2020-10-13 10:23:16 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
|
||||||||||
|
Description
Polina
2020-07-29 13:26:08 UTC
(In reply to Polina from comment #0) > 020-07-29 15:44:28,646+03 ERROR > [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] > (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [] > VM '70ef8cf8-0109-474e-ad87-1777a8ef844f' managed non pluggable device was > removed unexpectedly from libvirt: > 'VmDevice:{id='VmDeviceId:{deviceId='5c67ad87-9baa-451c-897c-dc037d0299d8', > vmId='70ef8cf8-0109-474e-ad87-1777a8ef844f'}', device='virtio-scsi', > type='CONTROLLER', specParams='[ioThreadId=1]', address='', managed='true', > plugged='false', readOnly='false', deviceAlias='', customProperties='[]', > snapshotId='null', logicalName='null', hostDevice='null'}' This error appears long before the hot-unplug operation (e.g., 2020-07-29 13:42:30) so it's unlikely to be related. The attached log doesn't seem to cover the interesting period of time in which the device got unplugged. Please file a separate bug on that. We need log that covers the VM creation until the first occurrence of this error. > There is another scenario: > If before the unlug the HE VM was restarted (by cli hosted-engine > --vm-poweroff/--vm-start) we waited for the DC is UP, hosts have max score, > the engine is in healthy state , then sending the same unplug is resulted > with ... > This was found by automation test , and in such scenario the additional nic > just could not be removed from the engine by test. > Expected results: nic > unplugged without error The funny part is that the hot-unplug actually succeeded - it's just that the engine was unable to get the response and therefore thought it failed. I believe this is an infra related bug, something to do with the jsonrpc library: In 2020-07-29 15:43:55,728+03 the engine managed to transmit the request for unplug to VDSM In 2020-07-29 15:44:10,913+03 we see: "No interaction with host 'lynx09.lab.eng.tlv2.redhat.com' for 20000 ms." And probably as a result of that the engine didn't get VDSM response (that was generated on 2020-07-29 15:44:03,344+0300). Martin, could someone from the infra team take a look? Created attachment 1715179 [details] logs_unplugged_nic_after restart Hi, making some order after an additional investigation. 1. The error msg "020-07-29 15:44:28,646+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmDevicesMonitoring] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-55) [] VM '70ef8cf8-0109-474e-ad87-1777a8ef844f' managed non pluggable device was removed unexpectedly from libvirt: 'VmDevice:{id='VmDeviceId:{deviceId='5c67ad87-9baa-451c-897c-dc037d0299d8', vmId='70ef8cf8-0109-474e-ad87-1777a8ef844f'}', device='virtio-scsi', type='CONTROLLER', specParams='[ioThreadId=1]', address='', managed='true', plugged='false', readOnly='false', deviceAlias='', customProperties='[]', snapshotId='null', logicalName='null', hostDevice='null'}'" doesn't relate to plug/unplug network sceario. It is reported by the engine every time the HE VM is restarted: hosted-engine --vm-poweroff hosted-engine --vm-start and it is there. 2. About the plugging network . the scenario I do : - create a new test_network (new network in network menu), - assign the new network to the interface by SetupHostNetworks in Host/Network Interfaces tab - add new nic with this test_network to the HE VM check on HE VM guest that new interface appeared (ls /sys/class/net) - wait a minute, restart HE VM by hosted-engine --vm-poweroff/--vm-start all these actions could be performed sending REST APIs like in http://pastebin.test.redhat.com/903007 Result when the HE VM is up, this additional nic is unplugged. the situation is not straightforward, not always reproducible. I attach some logs (logs_unplugged_after_restart.tar) from the last reproduce when the HE VM was restarted and after restart the nic is unplugged. Arik , could you please look if there is some hint why after restart the nic is unplugged. please let me know if some DEBUG in logs will be helpfull, i'll supply 3. About the VDSNetworkException I have to reproduce it in production environment as it is available for me and I'll supply info.I suspect that is could be related to some specific setup. (In reply to Polina from comment #5) > 2. About the plugging network . the scenario I do : > > - create a new test_network (new network in network menu), > - assign the new network to the interface by SetupHostNetworks in > Host/Network Interfaces tab > - add new nic with this test_network to the HE VM > check on HE VM guest that new interface appeared (ls /sys/class/net) > - wait a minute, restart HE VM by hosted-engine --vm-poweroff/--vm-start > all these actions could be performed sending REST APIs like in > http://pastebin.test.redhat.com/903007 > > Result when the HE VM is up, this additional nic is unplugged. the situation > is not straightforward, not always reproducible. > I attach some logs (logs_unplugged_after_restart.tar) from the last > reproduce when the HE VM was restarted and after restart the nic is > unplugged. > Arik , could you please look if there is some hint why after restart the > nic is unplugged. please let me know if some DEBUG in logs will be helpfull, > i'll supply I see that before the restart we got the domain xml with both interfaces: <interface type='bridge'> <mac address='00:1a:4a:16:10:64'/> <source bridge='ovirtmgmt'/> <target dev='vnet0'/> <model type='virtio'/> <driver name='vhost' queues='4'/> <filterref filter='vdsm-no-mac-spoofing'/> <link state='up'/> <mtu size='1500'/> <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </interface> <interface type='bridge'> <mac address='00:1a:4a:16:10:39'/> <source bridge='test_network'/> <target dev='vnet1'/> <model type='virtio'/> <driver name='vhost' queues='4'/> <filterref filter='vdsm-no-mac-spoofing'/> <link state='up'/> <mtu size='1500'/> <alias name='ua-714be89d-f3a5-4c91-be40-7e7a2d3cf57e'/> <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/> </interface> and afterwards we get a single one: <interface type='bridge'> <mac address='00:1a:4a:16:10:64'/> <source bridge='ovirtmgmt'/> <target dev='vnet0'/> <model type='virtio'/> <driver name='vhost' queues='4'/> <filterref filter='vdsm-no-mac-spoofing'/> <link state='up'/> <mtu size='1500'/> <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/> <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/> </interface> the interface with the test_network doesn't appear also in the metadata section. so it seems that the ha-broker didn't use the latest configuration of the vm when restarting it. I see that after plugging the NIC we've updated the OVFSTORE: 2020-09-16 23:11:14,697+03 INFO [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [] Attempting to update VMs/Templates Ovf. so it's hard to tell why the ha-broker didn't use the updated configuration - maybe there's another component on the host that fetched the configuration from the OVFSTORE and prepares the configuration to use on next restart of the hosted engine which didn't do its job before the hosted engine was restarted. Asaf, can you please shed more light on this flow? Can you please share the /run/ovirt-hosted-engine-ha/vm.conf ? Can you please share the /var/log/ovirt-hosted-engine-ha/agent.log ? Created attachment 1715220 [details]
logs + vm.conf
added logs including vm.conf
Sandro, is there anything else the virt team can provide to proceed with this? we suspect this could be an issue with the ha-broker (comment 6) (In reply to Arik from comment #6) > (In reply to Polina from comment #5) > > 2. About the plugging network . the scenario I do : > > > > - create a new test_network (new network in network menu), > > - assign the new network to the interface by SetupHostNetworks in > > Host/Network Interfaces tab > > - add new nic with this test_network to the HE VM > > check on HE VM guest that new interface appeared (ls /sys/class/net) > > - wait a minute, restart HE VM by hosted-engine --vm-poweroff/--vm-start > > all these actions could be performed sending REST APIs like in > > http://pastebin.test.redhat.com/903007 > > > > Result when the HE VM is up, this additional nic is unplugged. the situation > > is not straightforward, not always reproducible. > > I attach some logs (logs_unplugged_after_restart.tar) from the last > > reproduce when the HE VM was restarted and after restart the nic is > > unplugged. > > Arik , could you please look if there is some hint why after restart the > > nic is unplugged. please let me know if some DEBUG in logs will be helpfull, > > i'll supply > > I see that before the restart we got the domain xml with both interfaces: > > <interface type='bridge'> > <mac address='00:1a:4a:16:10:64'/> > <source bridge='ovirtmgmt'/> > <target dev='vnet0'/> > <model type='virtio'/> > <driver name='vhost' queues='4'/> > <filterref filter='vdsm-no-mac-spoofing'/> > <link state='up'/> > <mtu size='1500'/> > <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/> > <address type='pci' domain='0x0000' bus='0x02' slot='0x00' > function='0x0'/> > </interface> > <interface type='bridge'> > <mac address='00:1a:4a:16:10:39'/> > <source bridge='test_network'/> > <target dev='vnet1'/> > <model type='virtio'/> > <driver name='vhost' queues='4'/> > <filterref filter='vdsm-no-mac-spoofing'/> > <link state='up'/> > <mtu size='1500'/> > <alias name='ua-714be89d-f3a5-4c91-be40-7e7a2d3cf57e'/> > <address type='pci' domain='0x0000' bus='0x09' slot='0x00' > function='0x0'/> > </interface> > > and afterwards we get a single one: > <interface type='bridge'> > <mac address='00:1a:4a:16:10:64'/> > <source bridge='ovirtmgmt'/> > <target dev='vnet0'/> > <model type='virtio'/> > <driver name='vhost' queues='4'/> > <filterref filter='vdsm-no-mac-spoofing'/> > <link state='up'/> > <mtu size='1500'/> > <alias name='ua-23f63029-dcd7-4644-b20f-092061c2589e'/> > <address type='pci' domain='0x0000' bus='0x02' slot='0x00' > function='0x0'/> > </interface> > > the interface with the test_network doesn't appear also in the metadata > section. > so it seems that the ha-broker didn't use the latest configuration of the vm > when restarting it. > > I see that after plugging the NIC we've updated the OVFSTORE: > 2020-09-16 23:11:14,697+03 INFO > [org.ovirt.engine.core.bll.storage.ovfstore.OvfDataUpdater] > (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [] > Attempting to update VMs/Templates Ovf. > > so it's hard to tell why the ha-broker didn't use the updated configuration > - maybe there's another component on the host that fetched the configuration > from the OVFSTORE and prepares the configuration to use on next restart of > the hosted engine which didn't do its job before the hosted engine was > restarted. > > Asaf, can you please shed more light on this flow? Once the engine VM starts, ovirt-ha-agent uses `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it needs. This file is regenerated by parsing the OVF_STORE disks, there is some latency in reading/writing OVF_STORE. (In reply to Asaf Rachmani from comment #12) > Once the engine VM starts, ovirt-ha-agent uses > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it > needs. > This file is regenerated by parsing the OVF_STORE disks, there is some > latency in reading/writing OVF_STORE. Yeah, that makes sense and this latency is what we're interested in - can we estimate the maximal time it may take for a change to the OVF_STORE to propagate to vm.conf? (In reply to Arik from comment #13) > (In reply to Asaf Rachmani from comment #12) > > Once the engine VM starts, ovirt-ha-agent uses > > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it > > needs. > > This file is regenerated by parsing the OVF_STORE disks, there is some > > latency in reading/writing OVF_STORE. > > Yeah, that makes sense and this latency is what we're interested in - can we > estimate the maximal time it may take for a change to the OVF_STORE to > propagate to vm.conf? I think it depends on the environment (networking/storage/load etc.). (In reply to Asaf Rachmani from comment #14) > (In reply to Arik from comment #13) > > (In reply to Asaf Rachmani from comment #12) > > > Once the engine VM starts, ovirt-ha-agent uses > > > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it > > > needs. > > > This file is regenerated by parsing the OVF_STORE disks, there is some > > > latency in reading/writing OVF_STORE. > > > > Yeah, that makes sense and this latency is what we're interested in - can we > > estimate the maximal time it may take for a change to the OVF_STORE to > > propagate to vm.conf? > > I think it depends on the environment (networking/storage/load etc.). yes, but for the automation tests to be deterministic we don't have to be super accurate here - let's say that the ha-broker samples the OVF_STORE every minute so we can tell QE, listen - you need to make sure the engine says it managed to update the OVF-STORE, then wait for 90 seconds before proceeding with restarting the hosted-engine - as in the worst case the vm.conf will be updated after a minute and we take another 30 sec as a buffer for whatever storage/load issue that may appear we don't want to tell them to wait too much, right? but that's ok to give a rough estimation for this is it realistic that after a minute that vm.conf will be updated? (In reply to Arik from comment #15) > (In reply to Asaf Rachmani from comment #14) > > (In reply to Arik from comment #13) > > > (In reply to Asaf Rachmani from comment #12) > > > > Once the engine VM starts, ovirt-ha-agent uses > > > > `/var/run/ovirt-hosted-engine-ha/vm.conf` to get all the information it > > > > needs. > > > > This file is regenerated by parsing the OVF_STORE disks, there is some > > > > latency in reading/writing OVF_STORE. > > > > > > Yeah, that makes sense and this latency is what we're interested in - can we > > > estimate the maximal time it may take for a change to the OVF_STORE to > > > propagate to vm.conf? > > > > I think it depends on the environment (networking/storage/load etc.). > > yes, but for the automation tests to be deterministic we don't have to be > super accurate here - > let's say that the ha-broker samples the OVF_STORE every minute > so we can tell QE, listen - you need to make sure the engine says it managed > to update the OVF-STORE, then wait for 90 seconds before proceeding with > restarting the hosted-engine - as in the worst case the vm.conf will be > updated after a minute and we take another 30 sec as a buffer for whatever > storage/load issue that may appear > > we don't want to tell them to wait too much, right? but that's ok to give a > rough estimation for this > is it realistic that after a minute that vm.conf will be updated? Sounds good (In reply to Arik from comment #15) > yes, but for the automation tests to be deterministic we don't have to be > super accurate here - > let's say that the ha-broker samples the OVF_STORE every minute > so we can tell QE, listen - you need to make sure the engine says it managed > to update the OVF-STORE, then wait for 90 seconds before proceeding with > restarting the hosted-engine - as in the worst case the vm.conf will be > updated after a minute and we take another 30 sec as a buffer for whatever > storage/load issue that may appear Polina, would that work as a workaround in the automation tests for this? To reflect on why this bug was closed - we cannot avoid the race between updating the configuration of the hosted engine VM and restarting it as it takes time to generate the configuration that the hosted engine VM will restart with. The probability of this to happen in real environment is rather low and by trying to reduce the probability of this to happen we may introduce undesired side effects so it's better to go with the workaround for the automation tests that was proposed in comment 17 and seems to work. |