Bug 1598781 - Upgrading RHV-H is bringing back libvirt network file which causes issues in starting of VM
Summary: Upgrading RHV-H is bringing back libvirt network file which causes issues in ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: imgbased
Version: 4.2.4
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: ovirt-4.2.6
: ---
Assignee: Ryan Barry
QA Contact: Yaning Wang
URL:
Whiteboard:
Depends On:
Blocks: 1622025
TreeView+ depends on / blocked
 
Reported: 2018-07-06 12:45 UTC by nijin ashok
Modified: 2022-07-09 10:05 UTC (History)
13 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 13:43:29 UTC
oVirt Team: Node
Target Upstream Version:
Embargoed:
peyu: testing_plan_complete+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-38478 0 None None None 2021-12-10 16:47:35 UTC
Red Hat Product Errata RHSA-2018:2626 0 None None None 2018-09-04 13:43:54 UTC
oVirt gerrit 92892 0 master MERGED osupdater: remove broken symlinks when syncing layers 2020-12-06 17:25:01 UTC
oVirt gerrit 93834 0 ovirt-4.2 MERGED osupdater: remove broken symlinks when syncing layers 2020-12-06 17:25:27 UTC

Description nijin ashok 2018-07-06 12:45:49 UTC
Description of problem:

From 4.2, the libvirt display network is created and deleted on demand as per https://gerrit.ovirt.org/#/c/74390/. So in a host which is not having any VMs, the libvirt display network will be absent. So before the upgrade of the host, the files vdsm-rhevm.xml will be absent from both "/etc/libvirt/qemu/networks/" and "/etc/libvirt/qemu/networks/autostart". The /etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml is a symlink to /etc/libvirt/qemu/networks/vdsm-rhevm.xml by default.

However, it will be available in the older 4.1 layer. As per the current logic, we are rsyncing the changed files from the older layer to the next newer layer and then will remove the files which only exist in the older layer.

So here the files are copied from older to newer layer.

===
2018-06-12 20:22:26,340 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.1-0.20180425.0+1', u'/tmp/mnt.58SJU'],) {'close_fds': True, 'stderr': -2}
2018-06-12 20:22:26,424 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.0-0.20180518.0+1', u'/tmp/mnt.OBuuY'],) {'close_fds': True, 'stderr': -2}

2018-06-12 20:22:26,733 [DEBUG] (remediate_etc) Calling: (['rsync', '-pogAXlHrx', '-SWc', '--no-i-r', '--checksum', '--update', '--exclude', 'mnt.*/*', '--exclude', '*targeted/active/modules*', '--exclude', '*network-scripts/ifcfg-*', u'/tmp/mnt.58SJU//etc/', u'/tmp/mnt.OBuuY//etc'],) {'close_fds': True, 'stderr': -2}
===

And the libvirt network files are getting deleted...

===
2018-06-12 20:22:26,520 [DEBUG] (remediate_etc) Planning to remove /etc/libvirt/qemu/networks/vdsm-rhevm.xml
2018-06-12 20:22:26,521 [DEBUG] (remediate_etc) Planning to remove /etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml

2018-06-12 20:22:28,866 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.OBuuY////etc/libvirt/qemu/networks/vdsm-rhevm.xml)
2018-06-12 20:22:28,868 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.OBuuY////etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml)
===

However, file under "networks" is getting deleted first before "autostart". So the source file is deleted first before the link. So the file under autostart will become often and will fail the check "os.path.isfile".

===
                log.debug("os.unlink({})".format(filename))
                if os.path.isfile(filename):
                    os.unlink(filename)
===

So actually the autostart file is not deleted although logs say that because the log line is before the "os.path.isfile" check.

Now, this "often" link will be copied to all newer layer.

===
It's still there in rhvh-4.2.3.0-0.20180518.0+1

2018-06-12 20:22:29,007 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.0-0.20180518.0+1', u'/tmp/mnt.jFRxF'],) {'close_fds': True, 'stderr': -2}
2018-06-12 20:22:29,088 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.1-0.20180531.0+1', u'/tmp/mnt.8cBVC'],) {'close_fds': True, 'stderr': -2}


2018-06-12 20:22:29,476 [DEBUG] (remediate_etc) Calling: (['rsync', '-pogAXlHrx', '-SWc', '--no-i-r', '--checksum', '--update', '--exclude', 'mnt.*/*', '--exclude', '*targeted/active/modules*', '--exclude', '*network-scripts/ifcfg-*', u'/tmp/mnt.jFRxF//etc/', u'/tmp/mnt.8cBVC//etc'],) {'close_fds': True, 'stderr': -2}

2018-06-12 20:22:29,187 [DEBUG] (remediate_etc) Planning to remove /etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml
2018-06-12 20:22:30,779 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.8cBVC////etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml)

2018-06-12 20:22:41,180 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.6gtnb///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml', u'/tmp/mnt.S0oRd///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml'],) {'close_fds': True, 'stderr': -2}

===

Because of this, these files will be copied to the latest layer in every upgrade.

===
2018-07-05 01:46:22,249 [DEBUG] (migrate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.1-0.20180531.0+1', u'/tmp/mnt.jgYj1'],) {'close_fds': True, 'stderr': -2}
2018-07-05 01:46:22,189 [DEBUG] (migrate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.4.3-0.20180627.0+1', u'/tmp/mnt.iRhHh'],) {'close_fds': True, 'stderr': -2}

2018-07-05 01:46:26,093 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.jgYj1///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml', u'/tmp/mnt.iRhHh///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml'],) {'close_fds': True, 'stderr': -2}
===

Because of the presence of these files, starting of VM will fail with the error below while creating the libvirt network again.

===
2018-06-20 18:37:18,988+0000 ERROR (vm/55c24427) [virt.vm] (vmId='uuid') Failed to setup device vnc (vm:2616)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2613, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup
    displaynetwork.create_network(display_network, self.vmid)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network
    libvirtnetwork.create_network(netname, display_device, user_reference)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network
    _createNetwork(createNetworkDef(netname, bridged, iface))
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork
    net.setAutostart(1)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart
    if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self)
libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml' to '/etc/libvirt/qemu/networks/vdsm-rhevm.xml': File exists

2018-06-20 18:37:18,988+0000 INFO  (vm/55c24427) [virt.vm] (vmId='UUID') Changed state to Down: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml' to '/etc/libvirt/qemu/networks/vdsm-rhevm.xml': File exists (code=1) (vm:1683)
====


Version-Release number of selected component (if applicable):

imgbased-1.0.20
rhvh-4.2.4.3-0.20180627.0+1

How reproducible:

100%

Steps to Reproduce:

Please see above.

Actual results:

Often libvirt network symlinks are moved incorrectly from older layer to newer layer which is causing the issue in starting VM after the upgrade.

Expected results:

VM should start fine after the upgrade.

Additional info:

Comment 2 Ryan Barry 2018-07-06 13:29:41 UTC
Thanks, Nijin!

Not sure how this wasn't caught in testing, but this may be the best but report I've ever seen

Comment 3 Ryan Barry 2018-07-08 14:16:17 UTC
I can't actually reproduce this.

In both cases (updating from 4.1->4.2 prior to this patch, updating from 4.1->4.2 after this patch), both symlinks survive. 

The logging is corrected with this patch, and behavior should be correct based on the report in comment#1, but since I can't reproduce the issue, I can't be sure.

Can you please provide exact steps to reproduce?

Comment 5 nijin ashok 2018-07-09 13:30:41 UTC
(In reply to Ryan Barry from comment #3)
> I can't actually reproduce this.
> 
> In both cases (updating from 4.1->4.2 prior to this patch, updating from
> 4.1->4.2 after this patch), both symlinks survive. 
> 
> The logging is corrected with this patch, and behavior should be correct
> based on the report in comment#1, but since I can't reproduce the issue, I
> can't be sure.
> 
> Can you please provide exact steps to reproduce?

Thank you Ryan for checking this.

I am able to reproduce the issue. It indeed needs 3 layer copy.

[1] Installed rhvh-4.1-0.20180425.0 and added it in 4.1 manager 

imgbase w
[INFO] You are on rhvh-4.1-0.20170417.0+1

Files are present

ls /etc/libvirt/qemu/networks/
autostart  vdsm-ovirtmgmt.xml

ls /etc/libvirt/qemu/networks/autostart/
vdsm-ovirtmgmt.xml

[2] Upgraded manager to 4.2.

[3] Upgraded hypervisor to rhvh-4.2.3.0-0.20180518.0+1. Here the issue was not visible because, in the end, both the files will be copied from the old layer. Even though an "often" link was created in between, it will automatically get fixed in the end. 

===
2018-07-09 11:37:18,457 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.0-0.20180518.0+1', u'/tmp/mnt.CJFuB'],) {'close_fds': True, 'stderr': -2}


2018-07-09 11:37:27,156 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.CJFuB////etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml)
2018-07-09 11:37:27,156 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.CJFuB////etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml)

2018-07-09 11:37:48,797 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.rfCeG///etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml', u'/tmp/mnt.X5uYa///etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml'],) {'close_fds': True, 'stderr': -2}

2018-07-09 11:37:48,801 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.rfCeG///etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml', u'/tmp/mnt.X5uYa///etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml'],) {'close_fds': True, 'stderr': -2}
===

[4] Updated cluster to 4.2 and start and stop the VM in this host. This will clear the vdsm-ovirtmgmt.xml from both the directories.

[5] Put the host into maintenance mode and upgraded the host to rhvh-4.2.4.3-0.20180627.0.

imgbase w
You are on rhvh-4.2.4.3-0.20180627.0+1

The vdsm-ovirtmgmt.xml in the autostart directory is back from dead.

ls /etc/libvirt/qemu/networks/autostart/
vdsm-ovirtmgmt.xml

ls /etc/libvirt/qemu/networks/
autostart  default.xml

[6] Started the VM on this host and failed with error below.

2018-07-09 12:25:53,067+0530 ERROR (vm/3c2a0c41) [virt.vm] (vmId='3c2a0c41-682c-413b-8c1c-75a8c726f4ff') The vm start process failed (vm:943)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2789, in _run
    self._setup_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2613, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup
    displaynetwork.create_network(display_network, self.vmid)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network
    libvirtnetwork.create_network(netname, display_device, user_reference)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network
    _createNetwork(createNetworkDef(netname, bridged, iface))
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork
    net.setAutostart(1)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart
    if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self)
libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml' to '/etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml': File exists

This autostart file get automatically cleared after this failure as a part of VM teardown process. So the next VM start will work.

However, for the customer, the impact was larger has he migrated the VM  and the VM status went into down status in the portal although qemu-kvm process exist in the host. I was not able to reproduce it though.

Comment 6 Ryan Barry 2018-07-09 15:41:33 UTC
Thanks - this gave me a reproducer, and I confirmed that the patch resolves

Comment 7 Germano Veit Michel 2018-08-16 04:16:45 UTC
Another hit with 2 VMs.

Raising this to urgent because on migration VDSM stops reporting the VM to RHV-M even though the qemu-kvm process exists and is running (the VM is fine). But RHV-M sees it as down. This open space for ugly split brains and corruptions.

Comment 8 Germano Veit Michel 2018-08-16 04:21:36 UTC
Just to show what I meant above:

On incoming migration:

2018-08-16 10:15:39,271+1000 ERROR (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') The vm start process failed (vm:942)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 871, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2791, in _run
    self._setup_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup
    displaynetwork.create_network(display_network, self.vmid)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network
    libvirtnetwork.create_network(netname, display_device, user_reference)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network
    _createNetwork(createNetworkDef(netname, bridged, iface))
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork
    net.setAutostart(1)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart
    if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self)
libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-production.xml' to '/etc/libvirt/qemu/networks/vdsm-production.xml': File exists
2018-08-16 10:15:39,271+1000 INFO  (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Changed state to Down: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-production.xml' to '/etc/libvirt/qemu/networks/vdsm-production.xml': File exists (code=1) (vm:1682)
2018-08-16 10:15:39,277+1000 INFO  (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Stopping connection (guestagent:438)

Then VDSM does not report the VM, but the VM is running happily on the host.

Comment 10 Yaning Wang 2018-08-28 03:09:55 UTC
Verified on build 4.2-20180827.0.el7_5

steps:

1. from 4.1-20180426.0 upgrade 4.2-20180531.0
2. upgrade to 4.2-20180827.0.el7_5

Actual results:

libvirt network symlinks are removed correctly after multi-upgrade.

Comment 12 errata-xmlrpc 2018-09-04 13:43:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2626

Comment 13 Daniel Gur 2019-08-28 13:13:50 UTC
sync2jira

Comment 14 Daniel Gur 2019-08-28 13:18:04 UTC
sync2jira


Note You need to log in before you can comment on or make changes to this bug.