Bug 1029035

Summary: [engine] vm appears in 2 different rhev-h hosts (split brain)
Product: Red Hat Enterprise Virtualization Manager Reporter: rhev-integ
Component: ovirt-engine Assignee: Roy Golan <rgolan>
Status: CLOSED ERRATA QA Contact: Pavel Novotny <pnovotny>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 3.2.0 CC: acathrow, cboyle, iheim, lpeer, lyarwood, mavital, michal.skrivanek, mkalinin, ofrenkel, rgolan, Rhev-m-bugs, yeylon
Target Milestone: --- Keywords: ZStream
Target Release: 3.2.5   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: virt
Fixed In Version: Doc Type: Bug Fix
Doc Text:
The problem was seen when a hypervisor host went into non-responding mode. A VM that was running on it stopped responding, and SPICE could not be used to access that VM's console. While trying to stop or reboot the problematic host, the VM appeared on 2 different hosts (in the UI and on the virsh command line). Eventually this corrupted the file system on that VM, and the VM had to be reinstalled. This was caused by two fence commands running at the same time. The fix makes fence commands mutually exclusive.
* Cause: a race between 2 fence operations, one triggered from the UI and one triggered automatically.
* Consequence: the last step of a fence command is to start the VMs again; because of the race, the second flow started VMs without knowing they had already been started.
* Fix: fence operations are made exclusive by taking an engine lock.
* Result: the 2nd fence operation now fails with Can Do Action, since one is already in progress.
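
The exclusive-lock idea in the Doc Text can be illustrated with a minimal sketch. This is not the engine's code (ovirt-engine is written in Java); the names FenceLockManager, FenceInProgressError and fence_host are hypothetical and only show the general technique: try to take a per-host lock without blocking, and reject the second fence flow if the lock is already held.

```python
import threading


class FenceInProgressError(Exception):
    """Raised when a fence flow is already running for the same host."""


class FenceLockManager:
    """Hypothetical per-host exclusive lock registry (illustration only)."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the set below
        self._locked_hosts = set()       # hosts with a fence flow in progress

    def try_acquire(self, host_id):
        """Take the lock for host_id without blocking; return False if held."""
        with self._guard:
            if host_id in self._locked_hosts:
                return False
            self._locked_hosts.add(host_id)
            return True

    def release(self, host_id):
        with self._guard:
            self._locked_hosts.discard(host_id)


def fence_host(locks, host_id, restart_vms):
    """Run one fence flow; fail fast if another one holds the host lock."""
    if not locks.try_acquire(host_id):
        raise FenceInProgressError(
            f"Cannot restart host {host_id}: related operation is in progress"
        )
    try:
        # Last step of the fence flow: start the VMs again (exactly once).
        restart_vms(host_id)
    finally:
        locks.release(host_id)
```

With such a guard, a second fence request that arrives while the first is still running is rejected up front instead of starting the VMs a second time.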
Story Points: ---
Clone Of: 1026811 Environment:
Last Closed: 2013-12-18 14:10:37 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1026811    
Bug Blocks:    

Comment 3 Charlie 2013-11-28 00:42:55 UTC
This bug is currently attached to errata RHBA-2013:16431. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 4 Pavel Novotny 2013-11-28 14:24:54 UTC
FailedQA in rhevm-3.2.5-0.48.el6ev.noarch (sf22).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems, for example, by blocking the VDSM port 54321 from the host.
  - RHEVM then tries to connect to the host and, after a while, initiates a reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, also try to reboot the host manually in Webadmin via Power Management -> Restart.
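
When reproducing step 2, it can help to confirm that the VDSM port is really unreachable before waiting for the automatic fence. A minimal sketch, assuming that a plain TCP connection attempt is a good enough check; the function name port_reachable and the host name in the comment are hypothetical.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (hypothetical host name); expected to return False
# once port 54321 is blocked:
#   port_reachable("rhevh-host.example.com", 54321)
```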

Results:
1. Host is in Reboot state:
Error dialog: "Cannot restart ${type}. Related operation is currently in progress. Please try again later."
^^^ ${type} should be replaced with the proper value, "host" in this case.

2. Host is then switched to Non Responsive state; manual reboot attempts return the error dialog:
"Another Power Management operation is still running, please retry in 157 Sec.
Cannot restart Host. Fence operation failed."

No strange behavior on the affected VM has been seen. The VM can be run again and is fully operational.

Conclusion: The basic problem of 2 concurrent fence flows has been fixed; however, the first error message (Results #1) is not formatted correctly.
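
The formatting problem in Results #1 is the usual symptom of a message template rendered without a value for one of its placeholders. A minimal sketch of the effect, using Python's string.Template only because it happens to share the ${name} syntax; the engine's real message resolution is Java code and is not shown here, and the helper names render and unresolved are hypothetical.

```python
import re
from string import Template

MESSAGE = Template(
    "Cannot restart ${type}. Related operation is currently in progress. "
    "Please try again later."
)


def render(template, **values):
    """Fill in placeholders; unknown ones are left untouched, not raised on."""
    return template.safe_substitute(**values)


def unresolved(text):
    """Return the names of ${...} placeholders still present in text."""
    return re.findall(r"\$\{(\w+)\}", text)


broken = render(MESSAGE)               # no value supplied for ${type}
fixed = render(MESSAGE, type="Host")   # value supplied

print(broken)
print(fixed)
print(unresolved(broken), unresolved(fixed))
```

A check like unresolved() could be run over rendered messages in a test to catch this class of problem before it reaches a dialog.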

Comment 5 Omer Frenkel 2013-12-02 14:03:36 UTC
I agree the error message is wrong, but since this only happens in 3.2.5 (and not in 3.3, according to the original bug verification), I don't think it's worth fixing for Z-Stream, as this is a really minor text issue.

The important thing is that split brain cannot happen.
Moving back to on_qa to verify this.
If you think we should consider fixing the text message, please open a new bug so that this important bug does not block the version.

Comment 6 Pavel Novotny 2013-12-02 15:55:14 UTC
(In reply to Omer Frenkel from comment #5)
> I agree the error message is wrong, but since this only happens in 3.2.5
> (and not in 3.3, according to the original bug verification), I don't think
> it's worth fixing for Z-Stream, as this is a really minor text issue.
> 

Yes, it's happening only in 3.2.5 Z-Stream, not in 3.3.
I filed new bug 1036784 for the text formatting issue; PM can decide whether it's worth fixing in a future Z-Stream version.

> The important thing is that split brain cannot happen.
> Moving back to on_qa to verify this.
> If you think we should consider fixing the text message, please open a new
> bug so that this important bug does not block the version.

I agree it's not necessary to block this bug, since it resolves the core problem. Verifying again.

Comment 7 Pavel Novotny 2013-12-02 15:59:04 UTC
Verified in rhevm-3.2.5-0.48.el6ev.noarch (sf22).

Verification steps:
1. Have a VM and run it on a host with power management.
2. Simulate host problems, for example, by blocking the VDSM port 54321 from the host.
  - RHEVM then tries to connect to the host and, after a while, initiates a reboot via PM.
3. As soon as the host reboot is initiated by RHEVM, also try to reboot the host manually in Webadmin via Power Management -> Restart.

Results:
1. Host is in Reboot state:
Error dialog: "Cannot restart ${type}. Related operation is currently in progress. Please try again later."
^^^ the problem with the unsubstituted variable ${type} is tracked separately under bug 1036784

2. Host is then switched to Non Responsive state; manual reboot attempts return the error dialog:
"Another Power Management operation is still running, please retry in 157 Sec.
Cannot restart Host. Fence operation failed."

No strange behavior on the affected VM has been seen. The VM can be run again and is fully operational.
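
The second dialog reports a remaining wait time ("please retry in 157 Sec."). A minimal sketch of how such a countdown could be computed, assuming a fixed quiet period between power management operations on the same host. The 180-second value and the names QUIET_PERIOD_SEC, seconds_until_retry and check_fence_allowed are assumptions made for illustration and are not taken from this bug.

```python
QUIET_PERIOD_SEC = 180  # assumed value, for illustration only


def seconds_until_retry(last_fence_ts, now_ts, quiet_period=QUIET_PERIOD_SEC):
    """Return whole seconds left before another fence is allowed (0 if none)."""
    remaining = quiet_period - (now_ts - last_fence_ts)
    return max(0, int(remaining))


def check_fence_allowed(last_fence_ts, now_ts):
    """Return an error message if the quiet period is still active, else None."""
    wait = seconds_until_retry(last_fence_ts, now_ts)
    if wait > 0:
        return (
            "Another Power Management operation is still running, "
            f"please retry in {wait} Sec."
        )
    return None
```

Under these assumptions, a manual restart attempted 23 seconds after the previous fence operation would be told to retry in 157 seconds.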

Comment 9 errata-xmlrpc 2013-12-18 14:10:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1831.html