Bug 1009610

Summary: [RFE] Provide clear warning when SPM becomes inaccessible and needs fencing
Product: [Retired] oVirt Reporter: Markus Stockhausen <mst>
Component: ovirt-engine-core Assignee: Tal Nisan <tnisan>
Status: CLOSED WONTFIX QA Contact: Aharon Canan <acanan>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.3 CC: amureini, bugs, iheim, mst, nsoffer, rbalakri, yeylon
Target Milestone: --- Keywords: Improvement
Target Release: 3.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-03-22 15:47:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine log none

Description Markus Stockhausen 2013-09-18 17:17:18 UTC
Description of problem:

The data center fails if the host acting as storage pool manager switches into a pathologic state.

Version-Release number of selected component (if applicable):

ovirt 3.3.0-4

How reproducible:

100%

Steps to Reproduce:

1. start hypervisor nodes

2. start ovirt-engine

3. observe logs which node acts as storage pool manager

4. wait until everything is up

5. put storage pool manager host into error state with command
   iptables -I INPUT -p tcp --destination-port 54321 -j DROP
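   (To undo the simulated failure after the test, the matching delete should
   work - just a sketch, assuming no other firewall rules were changed in
   between:)

   # remove the DROP rule for the VDSM port (54321) inserted in step 5
   iptables -D INPUT -p tcp --destination-port 54321 -j DROP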

Actual results:

The data center is set offline and the system cannot recover from that situation. Even 20 minutes after the error the data center and storage domains are still offline.

Expected results:

Recovery takes place: the storage pool manager is switched to another host and the data center becomes available again.

Additional info:

ovirt-engine log attached, Web error log follows
	
2013-Sep-18, 18:49 Try to recover Data Center Collogia. 
                   Setting status to Non Responsive.
2013-Sep-18, 18:46 Fencing failed on Storage Pool Manager colovn3 
                   for Data Center Collogia. Setting status to Non-Operational.
2013-Sep-18, 18:46 Host colovn3 is non responsive.
2013-Sep-18, 18:44 Host colovn3 is non responsive.
2013-Sep-18, 18:43 Invalid status on Data Center Collogia. Setting Data 
                   Center status to Non Responsive (On host colovn3, 
                   Error: Network error during communication with the Host.).
<----- close firewall on colovn3 here <--------
2013-Sep-18, 18:42 Storage Pool Manager runs on Host colovn3 (Address:
                   192.168.10.53).
2013-Sep-18, 18:42 State was set to Up for host colovn1.

Comment 1 Markus Stockhausen 2013-09-18 17:18:02 UTC
Created attachment 799500 [details]
engine log

Comment 2 Itamar Heim 2013-09-18 21:59:07 UTC
did you configure fencing (power management)?
if you kill the SPM node, the engine can't elect another SPM without knowing the first SPM was fenced (via power management, or manually by you)

Comment 3 Markus Stockhausen 2013-09-19 06:14:36 UTC
Phew, 

it seems as if I did not get this setup right using only the wiki documentation. Therefore I read the "power management deep dive" presentation, hopefully with success to some extent. I'd better sum it up in my own words to ensure I got it right.

- Only one hypervisor host acts as the SPM (storage pool manager)
- It is designed to check the connections to the storage domains periodically
- If it fails, the cluster has no more information about the storage domains
- To get the DC up and running again, the SPM has to be relocated
- That can only work if ovirt-engine knows for sure that the failed SPM is dead

That's where power management kicks in. If it is configured correctly, the host can be killed even if it is in a pathologic state. Without it, the admin must switch off the host and relocate the SPM manually. That would be my scenario.
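For completeness, a rough way to check from a host whether it currently holds the SPM role - only a sketch, assuming vdsClient provides the verbs below (names and flags may differ between vdsm versions):

   # list the storage pools this host is connected to
   vdsClient -s 0 getConnectedStoragePoolsList
   # ask vdsm whether this host is the SPM for a given pool
   vdsClient -s 0 getSpmStatus <storage-pool-uuid>

If I understand it correctly, the manual path without power management is the "Confirm 'Host has been Rebooted'" action in webadmin, after making sure the failed host is really down.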

In this situation I would expect a webadmin message with better readability, e.g. "SPM node failed. Power management is disabled. Fix the situation and relocate the SPM manually."

Only an idea to prevent future questions ...

Comment 4 Itamar Heim 2013-09-22 07:20:05 UTC
you got the gist of it...
allon - thoughts on above?
(also, now that we are using sanlock for SPM, maybe we can consider dropping the fencing requirement?)

Comment 5 Nir Soffer 2013-09-29 09:33:00 UTC
This looks like a duplicate of bug 1007010.

Seems that we need better instructions on how to test fencing - which document did you use to perform this test?

Comment 6 Markus Stockhausen 2013-10-02 18:08:11 UTC
Hello,

sorry for the late reply. I did not test fencing. I just tried to simulate a node failure to get an idea of how oVirt behaves afterwards. My self-chosen test consists of blocking TCP input to the node's VDSM port (see above). The outcome was that all storage domains were offline, and the logs did not give any good indication of what was going on.

As such a situation may arise only once in the lifetime of a cluster, I guess the average admin will struggle to find a solution. A simple hint that manually moving the SPM to another node will resolve the issue would be of much help.

Markus

Comment 7 Nir Soffer 2013-10-02 18:51:59 UTC
Since the SPM is critical, it should be easy to detect that it became inaccessible.

Stuff to check:
1. Check if the logs contain errors when the SPM becomes inaccessible (see the example after this list)
2. Check if there is an easy way to detect such a situation in the user interface
3. Check the related documentation
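For item 1, a quick scan of the engine log could look like this - a sketch only, the grep patterns are guesses at the relevant message strings and may need tuning per version:

   # default engine log location on the engine machine
   grep -iE 'spm|fencing|non responsive' /var/log/ovirt-engine/engine.log | tail -n 50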

Comment 8 Itamar Heim 2014-02-13 18:31:27 UTC
pushing to target release 3.5, assuming it's not planned for 3.4 at this point...

Comment 9 Itamar Heim 2015-03-22 15:47:18 UTC
Closing old bugs. If this issue is still relevant/important in current version, please re-open the bug.