Bug 1092631 - failure to recover after executing fenceSpmStorage
Summary: failure to recover after executing fenceSpmStorage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Liron Aravot
QA Contact: Ori Gofen
URL:
Whiteboard: storage
Depends On:
Blocks: 1082365
 
Reported: 2014-04-29 15:32 UTC by Liron Aravot
Modified: 2016-05-26 01:48 UTC
CC List: 15 users

Fixed In Version: vdsm-4.14.7-1.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, the Storage Pool Manager (SPM) role was not transferred to another host when the host holding that role was manually fenced. This was caused by an error in the logic that marks the SPM role as free when the host holding it is manually fenced. The logic has now been revised so that manually fencing the host acting as SPM transfers the SPM role to another host.
Clone Of: 1082365
Environment:
Last Closed: 2014-06-09 13:30:29 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System                 | ID             | Private | Priority | Status       | Summary                                   | Last Updated
Red Hat Product Errata | RHBA-2014:0504 | 0       | normal   | SHIPPED_LIVE | vdsm 3.4.0 bug fix and enhancement update | 2014-06-09 17:21:35 UTC
oVirt gerrit           | 27226          | 0       | None     | None         | None                                      | Never
oVirt gerrit           | 27340          | 0       | None     | None         | None                                      | Never

Description Liron Aravot 2014-04-29 15:32:15 UTC
Description of problem:

After manually fencing the SPM host (via the "confirm host has been rebooted" button), the system does not start the SPM on another host.


Version-Release number of selected component (if applicable):

- 3.4.0

How reproducible:
Always

Steps to Reproduce:

- Create a 2-node cluster with an SPM and an HSM host.
- Block the SPM host's network.
- The host becomes non-responsive.
- Click the "confirm host has been rebooted" button.
- The other host is not selected as the SPM (one way to check this is sketched after this list).
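One way to watch for the takeover from the surviving host is to poll vdsm's getSpmStatus verb, the same call the engine makes (see comment 1). The sketch below is an assumption-heavy illustration: it assumes vdsm's legacy XML-RPC client (vdsm.vdscli) and guesses the reply field names ('spm_st', 'spmStatus', 'spmId', 'spmLver') from typical vdsm logs; the pool UUID is a placeholder.

    # Sketch only: assumes vdsm's legacy XML-RPC client and guesses the
    # getSpmStatus reply layout; adjust both to the vdsm version in use.
    import time
    from vdsm import vdscli

    SP_UUID = '<storage-pool-uuid>'  # placeholder

    def wait_for_new_spm(timeout=300, interval=10):
        server = vdscli.connect()  # local vdsm on the surviving (HSM) host
        deadline = time.time() + timeout
        while time.time() < deadline:
            reply = server.getSpmStatus(SP_UUID)
            st = reply.get('spm_st', {})
            print('spmStatus=%s spmId=%s spmLver=%s' % (
                st.get('spmStatus'), st.get('spmId'), st.get('spmLver')))
            if st.get('spmStatus') == 'SPM':
                return True   # the surviving host took over the SPM role
            time.sleep(interval)
        return False          # with this bug: times out, no host becomes SPM

With this bug present, the loop never sees spmStatus == 'SPM' on the surviving host.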

Actual results:

- The other host is not selected as the SPM.

Expected results:

- The SPM should be started on the other host.

Comment 1 Liron Aravot 2014-04-29 15:36:26 UTC
I'll add logs.
Basically, the issue seems to be that when we "fence" in this scenario, the pool metadata is updated with spmId = -1 and lver = -1.
The problem is that when the engine runs getSpmStatus, the stats are retrieved from sanlock, which was not updated and still contains the previous SPM id/lver (see the sketch below).
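A minimal sketch of that mismatch and of the intended behaviour (plain illustrative Python; the names and data layout are assumptions, not vdsm's actual code):

    # Illustrative only: names and structures are assumptions, not vdsm code.
    FREE = -1

    pool_metadata = {'spmId': FREE, 'lver': FREE}  # updated by the manual fence
    sanlock_stats = {'spmId': 2, 'lver': 7}        # stale, never refreshed

    def get_spm_status_buggy():
        # Pre-fix: the reply is built from the stale sanlock stats, so the
        # engine keeps seeing the old SPM id/lver and never frees the role.
        return sanlock_stats

    def get_spm_status_fixed():
        # Post-fix: a manually fenced pool (spmId == -1) is reported as free,
        # so the engine can start the SPM on another host.
        if pool_metadata['spmId'] == FREE:
            return {'spmId': FREE, 'lver': FREE}
        return sanlock_stats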

This bug and https://bugzilla.redhat.com/show_bug.cgi?id=1082365 are, in a sense, blocking each other: 1082365 cannot be "completely" verified with a successful "fence", while this bug cannot be solved until the attribute errors from 1082365 are fixed.
This bug was opened for sanity testing of the complete scenario.

Comment 3 Ori Gofen 2014-05-12 15:34:01 UTC
Verified on av9. Steps taken:

1. Create a 2-node cluster with an SPM and an HSM host (shared DC).
2. Block the SPM host's network.
3. Wait for the host to become non-responsive.
4. Stop the vdsmd service on the blocked SPM.
5. Click the "confirm host has been rebooted" button.

The HSM host gains the SPM role as expected (a sketch of this scenario follows).
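For reference, the same scenario written as a sketch in which every helper is a hypothetical placeholder (block_network, stop_vdsmd, confirm_host_rebooted and current_spm are not real vdsm or engine APIs; they stand in for iptables, service control, the manual fence action and getSpmStatus):

    # Hypothetical end-to-end sketch of the verification above; each helper is
    # a placeholder to be wired to real tooling, not an existing API.
    def block_network(host):
        raise NotImplementedError

    def stop_vdsmd(host):
        raise NotImplementedError

    def confirm_host_rebooted(host):       # the manual fence action
        raise NotImplementedError

    def current_spm(pool):                 # e.g. derived from getSpmStatus
        raise NotImplementedError

    def verify_spm_failover(pool, spm_host, hsm_host):
        block_network(spm_host)            # step 2
        stop_vdsmd(spm_host)               # step 4
        confirm_host_rebooted(spm_host)    # step 5
        assert current_spm(pool) == hsm_host  # expected: HSM gains SPM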

Comment 4 errata-xmlrpc 2014-06-09 13:30:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html

