Red Hat Bugzilla – Bug 1009610
[RFE] Provide a clear warning when the SPM becomes inaccessible and needs fencing
Last modified: 2016-02-10 14:43:53 EST
Description of problem:
The data center fails if the host running the storage pool manager enters a pathological state.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. start hypervisor nodes
2. start ovirt-engine
3. observe logs which node acts as storage pool manager
4. wait until everything is up
5. put the storage pool manager host into an error state with the command:
iptables -I INPUT -p tcp --destination-port 54321 -j DROP
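The simulated outage from the steps above, together with the matching rule removal, can be sketched as follows. This assumes root on the SPM host; the helper function names are illustrative, not part of any oVirt tooling:

```shell
#!/bin/sh
# Sketch of the failure simulation in the steps above (assumes root on the
# SPM host). VDSM listens on TCP port 54321, so dropping that traffic makes
# the engine lose contact with the SPM without actually rebooting the host.
VDSM_PORT=54321
BLOCK_RULE="INPUT -p tcp --destination-port $VDSM_PORT -j DROP"

# Start the simulated outage.
block_spm() { iptables -I $BLOCK_RULE; }

# End it by deleting the exact same rule.
unblock_spm() { iptables -D $BLOCK_RULE; }
```

Deleting with `-D` and the identical rule specification is what undoes the `-I` insertion, so the host becomes reachable again without a reboot.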
Actual results:
The data center is set offline. The system cannot recover from this situation; even 20 minutes after the error, the data center and storage domains are still offline.
Expected results:
Recovery takes place: the storage pool manager is switched to another host and the data center becomes available again.
ovirt-engine log attached, Web error log follows
2013-Sep-18, 18:49 Try to recover Data Center Collogia.
Setting status to Non Responsive.
2013-Sep-18, 18:46 Fencing failed on Storage Pool Manager colovn3
for Data Center Collogia. Setting status to Non-Operational.
2013-Sep-18, 18:46 Host colovn3 is non responsive.
2013-Sep-18, 18:44 Host colovn3 is non responsive.
2013-Sep-18, 18:43 Invalid status on Data Center Collogia. Setting Data
Center status to Non Responsive (On host colovn3,
Error: Network error during communication with the Host.).
<----- close firewall on colovn3 here <--------
2013-Sep-18, 18:42 Storage Pool Manager runs on Host colovn3 (Address:
2013-Sep-18, 18:42 State was set to Up for host colovn1.
Created attachment 799500
Did you configure fencing (power management)?
If you kill the SPM node, the engine can't elect another SPM without knowing the first SPM was fenced (via power management, or manually by you).
It seems I did not get this setup right using only the wiki documentation, so I read the "power management deep dive" presentation, hopefully with some success. I'd better sum it up in my own words to make sure I got it right.
- Only one hypervisor host acts as the SPM (storage pool manager)
- It is designed to check the connections to the storage domains periodically
- If it fails, the cluster no longer has any information about the storage domains
- To get the DC up and running again, the SPM has to be relocated
- That can only work if ovirt-engine knows for sure that the failed SPM is dead
That's where power management kicks in. If it is configured correctly, the host can be killed even if it is in a pathological state. Without it, the admin must switch off the host and relocate the SPM manually. That would be my scenario.
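The manual fallback described above could look roughly like this, using the stock fence-agents tool for IPMI-capable hosts. The BMC address and credentials are placeholders, and the function names are only illustrations; once the host is verifiably powered off, the admin confirms this to the engine so the SPM role can move:

```shell
#!/bin/sh
# Sketch of manually fencing a stuck SPM host out of band via its BMC,
# using fence_ipmilan from the fence-agents package. Address and
# credentials below are placeholders.
BMC_ADDR=192.0.2.10
BMC_USER=admin
BMC_PASS=secret

# Ask the BMC whether the host is currently powered on.
check_spm_power() { fence_ipmilan -a "$BMC_ADDR" -l "$BMC_USER" -p "$BMC_PASS" -o status; }

# Power the host off so the engine can safely relocate the SPM role.
kill_spm() { fence_ipmilan -a "$BMC_ADDR" -l "$BMC_USER" -p "$BMC_PASS" -o off; }
```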
In this situation I would expect a more readable webadmin message, e.g. "SPM node failed. Power management disabled. Fix situation and relocate SPM manually."
Only an idea to prevent future questions ...
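The kind of warning suggested above could be approximated today with a simple log scan. The grep pattern comes from the event log quoted earlier in this report; the file path, function name, and message wording are illustrative, not actual engine behavior:

```shell
#!/bin/sh
# Sketch of the clearer warning requested above: scan a dump of engine
# events for the fencing-failure pattern seen in this report and print an
# actionable hint for the admin.
spm_fence_warning() {
    # $1: path to a file containing engine events
    if grep -q "Fencing failed on Storage Pool Manager" "$1" 2>/dev/null; then
        echo "SPM node failed. Power management disabled. Fix situation and relocate SPM manually."
    fi
}
```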
you got the gist of it...
allon - thoughts on above?
(also, now that we are using sanlock for SPM, maybe we can consider dropping the fencing requirement?)
This looks like a duplicate of bug 1007010.
It seems we need better instructions on how to test fencing - which document did you use to perform this test?
Sorry for the late reply. I ran no tests with fencing; I just tried to simulate a node failure to get an idea of how oVirt behaves afterwards. My self-chosen test consisted of blocking all TCP input to the oVirt node (see above). The outcome was that all storage domains were offline, and the logs gave no useful hint about what was going on.
As such a situation may arise only once in the lifetime of a cluster, I guess the average admin will be hard-pressed to find a solution. A simple hint that manually moving the SPM to another node will resolve the issue would be of much help.
Since the SPM is critical, it should be easy to detect that it became inaccessible.
Stuff to check:
1. Check whether the logs contain errors when the SPM becomes inaccessible
2. Check whether there is an easy way to detect such a situation in the user interface
3. Check the related documentation
Pushing to target release 3.5, assuming it's not planned for 3.4 at this point...
Closing old bugs. If this issue is still relevant/important in current version, please re-open the bug.