Bug 1103165 - In case of node crash, master storage domain becomes inactive causing manual fencing to fail
Keywords:
Status: CLOSED DUPLICATE of bug 1082365
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.4.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.1
Assignee: Liron Aravot
QA Contact: Aharon Canan
URL:
Whiteboard: storage
Depends On:
Blocks:
Reported: 2014-05-30 11:55 UTC by Mahendra Takwale
Modified: 2016-02-10 18:33 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-02 10:55:38 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Description Mahendra Takwale 2014-05-30 11:55:04 UTC
Description of problem: If a node on which a virtual machine is running crashes, the master storage domain goes into the "inactive" state. The master storage domain fails to come online even if an alternate host is available and in the Up state. Because the master storage domain is inactive, any attempt to manually fence the host (by selecting "Confirm Host has been rebooted") fails with the following error:

"Manual fence did not revoke the selected SPM
(north.example.com) since the master storage domain was 
not active or could not use another host for the fence operation."



Version-Release number of selected component (if applicable): 3.4


How reproducible:   


Steps to Reproduce:
1. Start a virtual machine on an SPM node and pull the power cord.
2. Once the host goes into "Non-Responsive" state, initiate manual fencing by selecting "Confirm Host has been rebooted"
3. Check the state of the master storage domain

Actual results:
1. The master storage domain goes into the "inactive" state, causing the data center to go into the "inactive" state as well.
2. The attempt to initiate manual fencing fails with the error quoted above.


Expected results:
1. The other node is still up, so the master storage domain should not go into the inactive state.
2. Manual fencing should successfully initiate SPM failover.


Additional info:

Comment 1 Barak 2014-06-01 12:31:25 UTC
Allon, shouldn't the engine detect that there is no SPM using the other node?

Comment 2 Allon Mureinik 2014-06-01 13:21:19 UTC
(In reply to Barak from comment #1)
> Allon, shouldn't the engine detect that there is no SPM using the other node?

If the SPM node went to NonResponsive you cannot assume the SPM is down - it may very well be up, just inaccessible.
However, the fact that "Confirm Host has been rebooted" doesn't work is troublesome.

Liron - can you take a look please?
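
A minimal sketch of the point made in comment 2, not the engine's actual logic: a Non-Responsive SPM host cannot be assumed to be down, so the SPM role should only move once an administrator has confirmed the reboot and another host is Up. The names `Host`, `HostStatus`, `reboot_confirmed`, `may_reassign_spm` and the host `south.example.com` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class HostStatus(Enum):
    UP = "Up"
    NON_RESPONSIVE = "Non-Responsive"


@dataclass
class Host:
    name: str
    status: HostStatus
    # Hypothetical flag: set when the admin selects "Confirm Host has been rebooted".
    reboot_confirmed: bool = False


def may_reassign_spm(spm: Host, candidates: Sequence[Host]) -> bool:
    """Return True only when moving the SPM role to another host looks safe."""
    # A reachable SPM needs no failover.
    if spm.status is not HostStatus.NON_RESPONSIVE:
        return False
    # Non-Responsive alone proves nothing: the host may be up but unreachable.
    if not spm.reboot_confirmed:
        return False
    # A replacement SPM requires at least one other host in the Up state.
    return any(h.status is HostStatus.UP for h in candidates)


# Hypothetical two-host setup mirroring the report.
north = Host("north.example.com", HostStatus.NON_RESPONSIVE)
south = Host("south.example.com", HostStatus.UP)
```

In this sketch `may_reassign_spm(north, [south])` stays false until `north.reboot_confirmed` is set, which is the step that failed in the reported environment.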

Comment 3 Liron Aravot 2014-06-01 14:15:03 UTC
Hi Mahendra,
Could you please attach the logs so we can provide a better RCA of the issue?

There was an issue with the fence that was fixed in:
http://gerrit.ovirt.org/#/c/27341/
http://gerrit.ovirt.org/#/c/27340/

Additionally, please confirm which version you are testing.

Thanks!

Liron

Comment 4 Mahendra Takwale 2014-06-02 09:18:21 UTC
Hi Liron,

Thanks for analyzing the issue.

Here is the information you asked for:

1. RHEV version

rhevm-3.4.0-0.16.rc.el6ev.noarch

2. VDSM version

vdsm-4.14.7-0.2.rc.el6ev.x86_64

The log file is 59M, so the upload failed. Please let me know an alternate location where I can copy it.

Thanks & Regards,
Mahendra Takwale

Comment 5 Liron Aravot 2014-06-02 10:55:38 UTC
Hi Mahendra,
It seems that the issue was resolved in a vdsm build later than the one you are using. Please try upgrading, and let me know if there are any further issues.

For now I'm closing this bug, as the issue should be resolved by the patches linked in comment 3.

*** This bug has been marked as a duplicate of bug 1082365 ***

