Bug 1627289 - [downstream clone - 4.2.7] startUnderlyingVm fails with exception resulting in split-brain
Summary: [downstream clone - 4.2.7] startUnderlyingVm fails with exception resulting i...
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: urgent
Target Milestone: ovirt-4.2.7
Assignee: Milan Zamazal
QA Contact: Liran Rotenberg
URL:
Whiteboard: QE Sanity Only
Depends On: 1617745
Blocks:
 
Reported: 2018-09-10 15:08 UTC by RHV bug bot
Modified: 2021-12-10 17:31 UTC
CC List: 10 users

Fixed In Version: v4.20.40
Doc Type: Bug Fix
Doc Text:
In certain cases, when a VM does not successfully complete migration to its destination host, it still boots and runs on the destination host, without VDSM managing it properly. This unmanaged VM can cause split-brain. With this release, a migrating VM is always prevented from starting if the VM setup fails on the destination host.
Clone Of: 1617745
Environment:
Last Closed: 2018-10-03 12:05:15 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44275 0 None None None 2021-12-10 17:31:28 UTC
oVirt gerrit 94055 0 master MERGED virt: Wait until migration is fully prepared before reporting success 2020-07-06 19:02:31 UTC
oVirt gerrit 94161 0 ovirt-4.2 MERGED virt: Wait until migration is fully prepared before reporting success 2020-07-06 19:02:31 UTC

Description RHV bug bot 2018-09-10 15:08:21 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1617745 +++
======================================================================

Description of problem:

An exception during _startUnderlyingVm on an incoming migration leaves the VM running happily on the host while vdsm does NOT report it: the VM is Down on RHV-M but Up on the host. This leaves room for data corruption if the VM has no lease.

The exception/error itself is being handled in BZ1598781. What is requested here is to handle the exceptions better, so that if some other error occurs, the VM is at least killed to prevent a split brain.
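
Illustration only: a minimal standalone Python sketch (not vdsm code; kill_leaked_domain, start_incoming_vm, connection and vm_id are hypothetical names) of the behaviour requested above, i.e. if setup of an incoming migration fails, destroy whatever libvirt domain exists for the VM instead of leaving it running unmanaged.

import libvirt

def kill_leaked_domain(connection, vm_id):
    """Destroy a domain that vdsm no longer tracks, ignoring 'not found' errors."""
    try:
        dom = connection.lookupByUUIDString(vm_id)
    except libvirt.libvirtError:
        return  # no domain was created, nothing to clean up
    try:
        dom.destroy()
    except libvirt.libvirtError:
        pass  # already gone or shutting down; this is best effort only

def start_incoming_vm(connection, vm_id, setup):
    """Run VM setup for an incoming migration; never leak an unmanaged domain."""
    try:
        setup()  # device setup, display network creation, etc.
    except Exception:
        # On any failure, make sure no domain survives unmanaged while the VM
        # is reported as Down, which is the split-brain condition described here.
        kill_leaked_domain(connection, vm_id)
        raise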

1. During an incoming migration, VDSM failed to "start" the VM:

2018-08-16 10:15:39,271+1000 ERROR (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') The vm start process failed (vm:942)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 871, in _startUnderlyingVm
    self._run()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2791, in _run
    self._setup_devices()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 2615, in _setup_devices
    dev_object.setup()
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vmdevices/graphics.py", line 91, in setup
    displaynetwork.create_network(display_network, self.vmid)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/displaynetwork.py", line 27, in create_network
    libvirtnetwork.create_network(netname, display_device, user_reference)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 89, in create_network
    _createNetwork(createNetworkDef(netname, bridged, iface))
  File "/usr/lib/python2.7/site-packages/vdsm/virt/libvirtnetwork.py", line 108, in _createNetwork
    net.setAutostart(1)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 2981, in setAutostart
    if ret == -1: raise libvirtError ('virNetworkSetAutostart() failed', net=self)
libvirtError: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-production.xml' to '/etc/libvirt/qemu/networks/vdsm-production.xml': File exists
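
For context on the underlying error only (it is handled separately in BZ1598781): a minimal sketch, assuming the plain libvirt Python bindings rather than vdsm's libvirtnetwork module, of how the autostart step could be made idempotent so that a pre-existing autostart symlink does not abort VM setup. ensure_network_autostart is a hypothetical helper.

import libvirt

def ensure_network_autostart(conn, netname):
    """Mark a libvirt network as autostarted, tolerating an already-set flag."""
    net = conn.networkLookupByName(netname)
    if net.autostart():
        return  # autostart already enabled, nothing to do
    try:
        net.setAutostart(1)
    except libvirt.libvirtError:
        # e.g. the autostart symlink already exists on disk; treat it as done
        pass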

2. VM is "down"

2018-08-16 10:15:39,271+1000 INFO  (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Changed state to Down: Failed to create symlink '/etc/libvirt/qemu/networks/autostart/vdsm-production.xml' to '/etc/libvirt/qemu/networks/vdsm-production.xml': File exists (code=1) (vm:1682)

2018-08-16 10:15:39,277+1000 INFO  (vm/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Stopping connection (guestagent:438)

3. But migration finishes and VM runs:

Libvirt event saying the migration finished and the VM runs:

2018-08-16 10:15:48,621+1000 INFO  (libvirt/events) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') CPU running: onResume (vm:6157)
2018-08-16 10:15:48,635+1000 INFO  (libvirt/events) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') CPU running: onResume (vm:6157)

4. Destroy is called

2018-08-16 10:15:53,729+1000 INFO  (jsonrpc/0) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:192.168.18.179,57282, vmId=2e6bb483-ff14-4743-8fc1-dcd41d644f15 (api:46)

2018-08-16 10:15:53,730+1000 INFO  (jsonrpc/0) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Release VM resources (vm:5237)

2018-08-16 10:15:53,730+1000 WARN  (jsonrpc/0) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') trying to set state to Powering down when already Down (vm:606)

2018-08-16 10:15:53,730+1000 INFO  (jsonrpc/0) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Stopping connection (guestagent:438)

5. VM is still running hours later

$ cat date 
Thu Aug 16 11:48:38 AEST 2018

$ grep 2e6bb483-ff14-4743-8fc1-dcd41d644f15 ps | wc -l
1

$ cat sos_commands/virsh/virsh_-r_list_--all  | grep NAS01 | wc -l
1
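
Illustration only: a small standalone script (hypothetical, not part of vdsm or the sosreport) that lists domains actually running under libvirt on a host so they can be compared with what vdsm/RHV-M reports; this is the same check done manually above with ps and virsh.

import libvirt

def running_domains(uri='qemu:///system'):
    """Return (uuid, name) for every domain libvirt reports as running."""
    conn = libvirt.openReadOnly(uri)
    try:
        return [(dom.UUIDString(), dom.name())
                for dom in conn.listAllDomains()
                if dom.isActive()]
    finally:
        conn.close()

if __name__ == '__main__':
    for uuid, name in running_domains():
        print(uuid, name)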

Version-Release number of selected component (if applicable):
vdsm-4.20.35-1.el7ev.x86_64

Actual results:
VDSM does not report a VM that is actually Up on the host.

Expected results:
VDSM kills the VM or reports it as Up, avoiding a split brain.

(Originally by Germano Veit Michel)

Comment 2 RHV bug bot 2018-09-10 15:08:32 UTC
Do you have logs from the source host? The source host should have cancelled the migration on seeing this error on the destination. There are 9 seconds between the error and the successful migration, so it shouldn't be a race.

(Originally by michal.skrivanek)

Comment 3 RHV bug bot 2018-09-10 15:08:37 UTC
Also, can anyone take a look at the underlying networking error? Dan?

(Originally by michal.skrivanek)

Comment 4 RHV bug bot 2018-09-10 15:08:42 UTC
(In reply to Michal Skrivanek from comment #3)
> Also, can anyone take a look at the underlying networking error? Dan?

Michal, that is this: BZ1598781.

Btw, I just requested the sosreport for the source host. Will upload the logs once I get them.

(Originally by Germano Veit Michel)

Comment 5 RHV bug bot 2018-09-10 15:08:47 UTC
Looks like the source did not notice the error.

2018-08-16 10:15:48,430+1000 INFO  (libvirt/events) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') CPU stopped: onSuspend (vm:6157)                                                   
2018-08-16 10:15:49,169+1000 INFO  (migsrc/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') migration took 12 seconds to complete (migration:514)                             
2018-08-16 10:15:49,170+1000 INFO  (migsrc/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Changed state to Down: Migration succeeded (code=4) (vm:1682)    

Attaching the logs...

(Originally by Germano Veit Michel)

Comment 7 RHV bug bot 2018-09-10 15:08:58 UTC
ok, no need to bother Dan then. Thanks

(Originally by michal.skrivanek)

Comment 8 RHV bug bot 2018-09-10 15:09:02 UTC
Yeah, it seems we do not wait for initialization failures during destination VM creation and return success to the source. When this happens fast, before we start the actual migration from the source, there is no domain to destroy yet; the libvirt migration then creates one we do not track and completes the migration.

We should probably wait a bit longer in the case of migrationCreate and return success only once we start waiting for the incoming migration, in waitForMigrationDestinationPrepare.
I would move self._pathsPreparedEvent.set() a bit later, after device initialization (and rename it, since it is not used for anything other than migration synchronization).
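
A minimal standalone sketch of that synchronization, with hypothetical names (IncomingVm, migration_create, and the placeholder methods), not the actual vdsm patch; the real change is in the gerrit links above. The idea: the destination's creation thread signals an event only after device setup, and the handler for the source's migrationCreate call reports success only once that event fires, or propagates the setup error so the source cancels the migration.

import threading

class IncomingVm:
    def __init__(self):
        self._incoming_migration_prepared = threading.Event()
        self._setup_error = None

    def _start_underlying_vm(self):
        """Runs in the per-VM creation thread on the destination host."""
        try:
            self._setup_devices()                    # may raise, as in step 1 above
            self._incoming_migration_prepared.set()  # signal success only after setup
            self._wait_for_incoming_migration()
        except Exception as exc:
            self._setup_error = exc
            self._incoming_migration_prepared.set()  # unblock the waiter with an error

    def migration_create(self, timeout=30):
        """Handler for the source host's migrationCreate call."""
        if not self._incoming_migration_prepared.wait(timeout):
            raise RuntimeError("destination VM preparation timed out")
        if self._setup_error is not None:
            raise self._setup_error  # source sees the failure and cancels
        return {"status": "ok"}

    # placeholders so the sketch is self-contained
    def _setup_devices(self):
        pass

    def _wait_for_incoming_migration(self):
        pass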

(Originally by michal.skrivanek)

Comment 9 Liran Rotenberg 2018-09-12 07:49:52 UTC
I couldn't reproduce this bug.
I tried to trigger the same error and succeeded by using a hwrng device and making the /etc/udev/rules.d directory on the destination host immutable (chattr +i).
Although I do get the same errors, the split-brain did not occur for me (with and without Milan's patch).
No regression in VM migration is noticeable with this patch.

Comment 10 Martin Tessun 2018-09-12 11:52:19 UTC
As we cannot reproduce the issue, the fix is QAed for Sanity only. In case the issue happens again, please reopen this BZ or open a new one.

