Bug 1029035
Summary: | [engine] vm appears in 2 different rhev-h hosts (split brain) | ||
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | rhev-integ |
Component: | ovirt-engine | Assignee: | Roy Golan <rgolan> |
Status: | CLOSED ERRATA | QA Contact: | Pavel Novotny <pnovotny> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 3.2.0 | CC: | acathrow, cboyle, iheim, lpeer, lyarwood, mavital, michal.skrivanek, mkalinin, ofrenkel, rgolan, Rhev-m-bugs, yeylon |
Target Milestone: | --- | Keywords: | ZStream |
Target Release: | 3.2.5 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | virt | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
The problem was seen when a host hypervisor went into non-responding mode. A VM that was running on it stopped responding, and SPICE could not be used to access that VM's console. While trying to stop/reboot the problematic host, the VM appeared on 2 different hosts (in the UI and on the virsh command line). Eventually this corrupted the file system on that VM, which then had to be reinstalled. This was caused by two fence commands running at the same time. The fix makes fence commands mutually exclusive.
* Cause: a race between 2 fence operations, one triggered from the UI and one triggered automatically
* Consequence: the last step of a fence command is to start the VMs; because of the race, the second flow starts a VM without knowing it has already been started
* Fix: make fence operations exclusive by taking an engine lock
* Result: the 2nd fence operation should now fail Can Do Action validation, since one is already in progress
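
The cause/fix pair above can be illustrated with a minimal sketch, assuming a non-blocking, per-host lock. `FenceCoordinator`, its methods, and the host id `host-1` are hypothetical names used only for illustration; this is not ovirt-engine code and does not reflect the engine's actual locking API.

```python
import threading


class FenceCoordinator:
    """Hypothetical stand-in for a per-host engine lock (not oVirt code)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the set below
        self._in_progress = set()       # host ids with a fence operation running
        self.vm_starts = []             # records every "start VMs" step

    def try_acquire(self, host_id):
        """Take the per-host lock without blocking; False if already held."""
        with self._guard:
            if host_id in self._in_progress:
                return False
            self._in_progress.add(host_id)
            return True

    def release(self, host_id):
        with self._guard:
            self._in_progress.discard(host_id)

    def fence(self, host_id, source):
        """Run one fence flow; reject it if another is already running."""
        if not self.try_acquire(host_id):
            return f"rejected ({source}): fence already in progress for {host_id}"
        try:
            # Final step of the flow: start the VMs exactly once.
            self.vm_starts.append((host_id, source))
            return f"fenced ({source}): VMs started for {host_id}"
        finally:
            self.release(host_id)


coordinator = FenceCoordinator()

# The automatic fence flow holds the lock while it is still running.
assert coordinator.try_acquire("host-1")

# A manual restart requested at the same time is rejected, not run in parallel.
manual = coordinator.fence("host-1", "manual")

# Once the automatic flow finishes, a retry goes through.
coordinator.release("host-1")
retry = coordinator.fence("host-1", "manual")
```

The lock is taken without blocking so that the second request fails fast instead of queueing behind the first and starting the VMs a second time.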
|
Story Points: | --- |
Clone Of: | 1026811 | Environment: | |
Last Closed: | 2013-12-18 14:10:37 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1026811 | ||
Bug Blocks: |
Comment 3
Charlie
2013-11-28 00:42:55 UTC
FailedQA in rhevm-3.2.5-0.48.el6ev.noarch (sf22).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems, for example, by blocking the VDSM port 54321 from the host.
   - Now RHEVM tries to connect to the host and after a while it initiates reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, try to reboot the host also manually in Webadmin via Power Management -> Restart.

Results:
1. Host is in Reboot state: Error dialog: "Cannot restart ${type}. Related operation is currently in progress. Please try again later."
   ^^^ the ${type} should be interpolated with proper value - "host" in this case.
2. Host is then switched to Non Responsive state, manual reboot attempts will return error dialog: "Another Power Management operation is still running, please retry in 157 Sec. Cannot restart Host. Fence operation failed."

No strange behavior on the affected VM has been seen. The VM can be run again and is fully operational.

Conclusion: The basic problem of 2 concurrent fence flows has been fixed, however the first error message (Results #1) is not formatted correctly.

i agree the error message is wrong, but since this only happens in 3.2.5 (and not in 3.3, according to original bug verification) i don't think its worth fixing for z stream, as this is really minor text issue.
the important thing is that split brain cannot happen.
moving back to on_qa to verify this, if you think we should consider fixing the text message please open a new bug in order not to block the version using this important bug.

(In reply to Omer Frenkel from comment #5)
> i agree the error message is wrong, but since this only happens in 3.2.5
> (and not in 3.3, according to original bug verification) i don't think its
> worth fixing for z stream, as this is really minor text issue.

Yes, it's happening only in 3.2.5 Z-Stream, not in 3.3. I filed new bug 1036784 for the text formatting issue, let PM decide, if it's worth fixing in some future Z-Stream version or not.

> the important thing is that split brain cannot happen.
> moving back to on_qa to verify this,
> if you think we should consider fixing the text message please open a new
> bug in order not to block the version using this important bug.

I agree it's not necessary to block this bug since it resolves the core problem, verifying again.

Verified in rhevm-3.2.5-0.48.el6ev.noarch (sf22).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems, for example, by blocking the VDSM port 54321 from the host.
   - Now RHEVM tries to connect to the host and after a while it initiates reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, try to reboot the host also manually in Webadmin via Power Management -> Restart.

Results:
1. Host is in Reboot state: Error dialog: "Cannot restart ${type}. Related operation is currently in progress. Please try again later."
   ^^^ problem with not substituted variable ${type} is tracked separately under bug 1036784
2. Host is then switched to Non Responsive state, manual reboot attempts will return error dialog: "Another Power Management operation is still running, please retry in 157 Sec. Cannot restart Host. Fence operation failed."

No strange behavior on the affected VM has been seen. The VM can be run again and is fully operational.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1831.html