Bug 1617745
| Summary: | startUnderlyingVm fails with exception resulting in split-brain | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | vdsm | Assignee: | Milan Zamazal <mzamazal> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Liran Rotenberg <lrotenbe> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.2.5 | CC: | danken, gveitmic, lsurette, mavital, michal.skrivanek, mtessun, mzamazal, rbarry, sgoodman, srevivo, ycui |
| Target Milestone: | ovirt-4.3.0 | Keywords: | Reopened, ZStream |
| Target Release: | 4.3.0 | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | QE Sanity Only | | |
| Fixed In Version: | v4.30.3 | Doc Type: | Bug Fix |
| Doc Text: | Previously, when a migrating virtual machine was not properly set up on the destination host, it could still start there under certain circumstances, then run unnoticed and without VDSM supervision. This situation sometimes resulted in split-brain. Now migration is always prevented from starting if the virtual machine setup fails on the destination host. | Story Points: | --- |
| Clone Of: | | | |
| : | 1625646 1627289 (view as bug list) | Environment: | |
| Last Closed: | 2018-11-20 13:02:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1625646, 1627289 | | |
|
Description
Germano Veit Michel
2018-08-16 04:34:18 UTC
Do you have logs from the source host? The source host should have cancelled the migration on seeing this error on the destination. There are 9 seconds between the error and the successful migration, so it shouldn't be a race.

Also, can anyone take a look at the underlying networking error? Dan?

(In reply to Michal Skrivanek from comment #3)
> Also, can anyone take a look at the underlying networking error? Dan?

Michal, that is BZ1598781. Btw, I just requested the sosreport for the source host. Will upload the logs once I get them.

Looks like the source did not notice the error:

2018-08-16 10:15:48,430+1000 INFO (libvirt/events) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') CPU stopped: onSuspend (vm:6157)
2018-08-16 10:15:49,169+1000 INFO (migsrc/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') migration took 12 seconds to complete (migration:514)
2018-08-16 10:15:49,170+1000 INFO (migsrc/2e6bb483) [virt.vm] (vmId='2e6bb483-ff14-4743-8fc1-dcd41d644f15') Changed state to Down: Migration succeeded (code=4) (vm:1682)

Attaching the logs...

OK, no need to bother Dan then. Thanks.

Yeah, it seems we do not wait for initialization failures during destination VM creation and return success to the source. When this happens fast, before we start the actual migration from the source, there is no domain to destroy; the libvirt migration then creates a domain we do not track and completes the migration. We should probably wait a bit longer in migrationCreate and return success only once we start waiting for the incoming migration, in waitForMigrationDestinationPrepare.

I would move self._pathsPreparedEvent.set() a bit later, after device initialization (and rename it; it is not used for anything other than migration synchronization).

I couldn't reproduce this bug. I tried to get the same error and succeeded by using a hwrng device and making the directory /etc/udev/rules.d on the destination host immutable (chattr +i).
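The ordering change discussed above, signaling migration readiness only after device initialization has succeeded, so a setup failure is reported to the source instead of being silently swallowed, can be sketched with a toy model. The class, method, and attribute names below are illustrative only (not VDSM's actual API); the comments merely mention `self._pathsPreparedEvent` and `waitForMigrationDestinationPrepare`:

```python
import threading


class IncomingVm:
    """Toy model of destination-side VM setup ordering (hypothetical names)."""

    def __init__(self, device_setup_ok=True):
        self._device_setup_ok = device_setup_ok
        self._setup_ok = False
        # Set only once setup has finished, successfully or not; readiness is
        # signaled *after* device initialization, not merely after path setup.
        self._setup_done = threading.Event()

    def start_underlying_vm(self):
        try:
            self._prepare_paths()
            self._initialize_devices()
            self._setup_ok = True
        except Exception:
            self._setup_ok = False
        finally:
            self._setup_done.set()

    def _prepare_paths(self):
        pass  # placeholder for storage path preparation

    def _initialize_devices(self):
        if not self._device_setup_ok:
            raise RuntimeError("device initialization failed")

    def wait_for_migration_destination_prepare(self, timeout=5.0):
        """Return True only if the destination completed setup.

        The source should abort the migration on False, so no untracked
        domain can be created on the destination (the split-brain case).
        """
        if not self._setup_done.wait(timeout):
            return False
        return self._setup_ok


# Healthy destination: setup succeeds, source may proceed.
ok_vm = IncomingVm(device_setup_ok=True)
threading.Thread(target=ok_vm.start_underlying_vm).start()
print(ok_vm.wait_for_migration_destination_prepare())   # True

# Broken destination: setup fails, source must cancel the migration.
bad_vm = IncomingVm(device_setup_ok=False)
threading.Thread(target=bad_vm.start_underlying_vm).start()
print(bad_vm.wait_for_migration_destination_prepare())  # False
```

The key design point matches the comment above: the "ready" signal moves after device initialization, so a fast failure on the destination can never race past the source's readiness check.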
Although I do get the same errors, the split-brain did not occur for me (with and without Milan's patch). No regression in VM migration is noticeable with this patch. As we cannot reproduce the issue, the fix is QAed for sanity only. In case the issue happens again, please reopen this BZ or open a new one.