Description of problem: From 4.2, the libvirt display network is created and deleted on demand as per https://gerrit.ovirt.org/#/c/74390/. So in a host which is not having any VMs, the libvirt display network will be absent. So before the upgrade of the host, the files vdsm-rhevm.xml will be absent from both "/etc/libvirt/qemu/networks/" and "/etc/libvirt/qemu/networks/autostart". The /etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml is a symlink to /etc/libvirt/qemu/networks/vdsm-rhevm.xml by default. However, it will be available in the older 4.1 layer. As per the current logic, we are rsyncing the changed files from the older layer to the next newer layer and then will remove the files which only exist in the older layer. So here the files are copied from older to newer layer. === 2018-06-12 20:22:26,340 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.1-0.20180425.0+1', u'/tmp/mnt.58SJU'],) {'close_fds': True, 'stderr': -2} 2018-06-12 20:22:26,424 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.0-0.20180518.0+1', u'/tmp/mnt.OBuuY'],) {'close_fds': True, 'stderr': -2} 2018-06-12 20:22:26,733 [DEBUG] (remediate_etc) Calling: (['rsync', '-pogAXlHrx', '-SWc', '--no-i-r', '--checksum', '--update', '--exclude', 'mnt.*/*', '--exclude', '*targeted/active/modules*', '--exclude', '*network-scripts/ifcfg-*', u'/tmp/mnt.58SJU//etc/', u'/tmp/mnt.OBuuY//etc'],) {'close_fds': True, 'stderr': -2} === And the libvirt network files are getting deleted... === 2018-06-12 20:22:26,520 [DEBUG] (remediate_etc) Planning to remove /etc/libvirt/qemu/networks/vdsm-rhevm.xml 2018-06-12 20:22:26,521 [DEBUG] (remediate_etc) Planning to remove /etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml 2018-06-12 20:22:28,866 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.OBuuY////etc/libvirt/qemu/networks/vdsm-rhevm.xml) 2018-06-12 20:22:28,868 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.OBuuY////etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml) === However, file under "networks" is getting deleted first before "autostart". So the source file is deleted first before the link. So the file under autostart will become often and will fail the check "os.path.isfile". === log.debug("os.unlink({})".format(filename)) if os.path.isfile(filename): os.unlink(filename) === So actually the autostart file is not deleted although logs say that because the log line is before the "os.path.isfile" check. Now, this "often" link will be copied to all newer layer. === It's still there in rhvh-4.2.3.0-0.20180518.0+1 2018-06-12 20:22:29,007 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.0-0.20180518.0+1', u'/tmp/mnt.jFRxF'],) {'close_fds': True, 'stderr': -2} 2018-06-12 20:22:29,088 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.1-0.20180531.0+1', u'/tmp/mnt.8cBVC'],) {'close_fds': True, 'stderr': -2} 2018-06-12 20:22:29,476 [DEBUG] (remediate_etc) Calling: (['rsync', '-pogAXlHrx', '-SWc', '--no-i-r', '--checksum', '--update', '--exclude', 'mnt.*/*', '--exclude', '*targeted/active/modules*', '--exclude', '*network-scripts/ifcfg-*', u'/tmp/mnt.jFRxF//etc/', u'/tmp/mnt.8cBVC//etc'],) {'close_fds': True, 'stderr': -2} 2018-06-12 20:22:29,187 [DEBUG] (remediate_etc) Planning to remove /etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml 2018-06-12 20:22:30,779 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.8cBVC////etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml) 2018-06-12 20:22:41,180 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.6gtnb///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml', u'/tmp/mnt.S0oRd///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml'],) {'close_fds': True, 'stderr': -2} === Because of this, these files will be copied to the latest layer in every upgrade. === 2018-07-05 01:46:22,249 [DEBUG] (migrate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.1-0.20180531.0+1', u'/tmp/mnt.jgYj1'],) {'close_fds': True, 'stderr': -2} 2018-07-05 01:46:22,189 [DEBUG] (migrate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.4.3-0.20180627.0+1', u'/tmp/mnt.iRhHh'],) {'close_fds': True, 'stderr': -2} 2018-07-05 01:46:26,093 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.jgYj1///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml', u'/tmp/mnt.iRhHh///etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml'],) {'close_fds': True, 'stderr': -2} === Because of the presence of these files, starting of VM will fail with the error below while creating the libvirt network again. === 2018-06-20 18:37:18,988+0000 ERROR (vm/55c24427) [virt.vm] (vmId='uuid') Failed to setup device vnc (vm:2616) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2613, in _setup_devices dev_object.setup() File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup displaynetwork.create_network(display_network, self.vmid) File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network libvirtnetwork.create_network(netname, display_device, user_reference) File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network _createNetwork(createNetworkDef(netname, bridged, iface)) File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork net.setAutostart(1) File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self) libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml' to '/etc/libvirt/qemu/networks/vdsm-rhevm.xml': File exists 2018-06-20 18:37:18,988+0000 INFO (vm/55c24427) [virt.vm] (vmId='UUID') Changed state to Down: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-rhevm.xml' to '/etc/libvirt/qemu/networks/vdsm-rhevm.xml': File exists (code=1) (vm:1683) ==== Version-Release number of selected component (if applicable): imgbased-1.0.20 rhvh-4.2.4.3-0.20180627.0+1 How reproducible: 100% Steps to Reproduce: Please see above. Actual results: Often libvirt network symlinks are moved incorrectly from older layer to newer layer which is causing the issue in starting VM after the upgrade. Expected results: VM should start fine after the upgrade. Additional info:
Thanks, Nijin! Not sure how this wasn't caught in testing, but this may be the best but report I've ever seen
I can't actually reproduce this. In both cases (updating from 4.1->4.2 prior to this patch, updating from 4.1->4.2 after this patch), both symlinks survive. The logging is corrected with this patch, and behavior should be correct based on the report in comment#1, but since I can't reproduce the issue, I can't be sure. Can you please provide exact steps to reproduce?
(In reply to Ryan Barry from comment #3) > I can't actually reproduce this. > > In both cases (updating from 4.1->4.2 prior to this patch, updating from > 4.1->4.2 after this patch), both symlinks survive. > > The logging is corrected with this patch, and behavior should be correct > based on the report in comment#1, but since I can't reproduce the issue, I > can't be sure. > > Can you please provide exact steps to reproduce? Thank you Ryan for checking this. I am able to reproduce the issue. It indeed needs 3 layer copy. [1] Installed rhvh-4.1-0.20180425.0 and added it in 4.1 manager imgbase w [INFO] You are on rhvh-4.1-0.20170417.0+1 Files are present ls /etc/libvirt/qemu/networks/ autostart vdsm-ovirtmgmt.xml ls /etc/libvirt/qemu/networks/autostart/ vdsm-ovirtmgmt.xml [2] Upgraded manager to 4.2. [3] Upgraded hypervisor to rhvh-4.2.3.0-0.20180518.0+1. Here the issue was not visible because, in the end, both the files will be copied from the old layer. Even though an "often" link was created in between, it will automatically get fixed in the end. === 2018-07-09 11:37:18,457 [DEBUG] (remediate_etc) Calling: (['mount', u'/dev/rhvh/rhvh-4.2.3.0-0.20180518.0+1', u'/tmp/mnt.CJFuB'],) {'close_fds': True, 'stderr': -2} 2018-07-09 11:37:27,156 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.CJFuB////etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml) 2018-07-09 11:37:27,156 [DEBUG] (remediate_etc) os.unlink(/tmp/mnt.CJFuB////etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml) 2018-07-09 11:37:48,797 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.rfCeG///etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml', u'/tmp/mnt.X5uYa///etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml'],) {'close_fds': True, 'stderr': -2} 2018-07-09 11:37:48,801 [DEBUG] (migrate_etc) Calling: (['cp', '-a', '-r', u'/tmp/mnt.rfCeG///etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml', u'/tmp/mnt.X5uYa///etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml'],) {'close_fds': True, 'stderr': -2} === [4] Updated cluster to 4.2 and start and stop the VM in this host. This will clear the vdsm-ovirtmgmt.xml from both the directories. [5] Put the host into maintenance mode and upgraded the host to rhvh-4.2.4.3-0.20180627.0. imgbase w You are on rhvh-4.2.4.3-0.20180627.0+1 The vdsm-ovirtmgmt.xml in the autostart directory is back from dead. ls /etc/libvirt/qemu/networks/autostart/ vdsm-ovirtmgmt.xml ls /etc/libvirt/qemu/networks/ autostart default.xml [6] Started the VM on this host and failed with error below. 2018-07-09 12:25:53,067+0530 ERROR (vm/3c2a0c41) [virt.vm] (vmId='3c2a0c41-682c-413b-8c1c-75a8c726f4ff') The vm start process failed (vm:943) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 872, in _startUnderlyingVm self._run() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2789, in _run self._setup_devices() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2613, in _setup_devices dev_object.setup() File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup displaynetwork.create_network(display_network, self.vmid) File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network libvirtnetwork.create_network(netname, display_device, user_reference) File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network _createNetwork(createNetworkDef(netname, bridged, iface)) File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork net.setAutostart(1) File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self) libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-ovirtmgmt.xml' to '/etc/libvirt/qemu/networks/vdsm-ovirtmgmt.xml': File exists This autostart file get automatically cleared after this failure as a part of VM teardown process. So the next VM start will work. However, for the customer, the impact was larger has he migrated the VM and the VM status went into down status in the portal although qemu-kvm process exist in the host. I was not able to reproduce it though.
Thanks - this gave me a reproducer, and I confirmed that the patch resolves
Another hit with 2 VMs. Raising this to urgent because on migration VDSM stops reporting the VM to RHV-M even though the qemu-kvm process exists and is running (the VM is fine). But RHV-M sees it as down. This open space for ugly split brains and corruptions.
Just to show what I meant above: On incoming migration: 2018-08-16 10:15:39,271+1000 ERROR (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') The vm start process failed (vm:942) Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 871, in _startUnderlyingVm self._run() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2791, in _run self._setup_devices() File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _setup_devices dev_object.setup() File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup displaynetwork.create_network(display_network, self.vmid) File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network libvirtnetwork.create_network(netname, display_device, user_reference) File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network _createNetwork(createNetworkDef(netname, bridged, iface)) File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork net.setAutostart(1) File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self) libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-production.xml' to '/etc/libvirt/qemu/networks/vdsm-production.xml': File exists 2018-08-16 10:15:39,271+1000 INFO (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Changed state to Down: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-production.xml' to '/etc/libvirt/qemu/networks/vdsm-production.xml': File exists (code=1) (vm:1682) 2018-08-16 10:15:39,277+1000 INFO (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Stopping connection (guestagent:438) Then VDSM does not report the VM, but the VM is running happily on the host.
Verified on build 4.2-20180827.0.el7_5 steps: 1. from 4.1-20180426.0 upgrade 4.2-20180531.0 2. upgrade to 4.2-20180827.0.el7_5 Actual results: libvirt network symlinks are removed correctly after multi-upgrade.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2626
sync2jira