Bug 1009610
Summary: | [RFE] Provide clear warning when SPM become inaccessible and needs fencing | | |
---|---|---|---|
Product: | [Retired] oVirt | Reporter: | Markus Stockhausen <mst> |
Component: | ovirt-engine-core | Assignee: | Tal Nisan <tnisan> |
Status: | CLOSED WONTFIX | QA Contact: | Aharon Canan <acanan> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | | |
Version: | 3.3 | CC: | amureini, bugs, iheim, mst, nsoffer, rbalakri, yeylon |
Target Milestone: | --- | Keywords: | Improvement |
Target Release: | 3.6.0 | | |
Hardware: | Unspecified | | |
OS: | Unspecified | | |
Whiteboard: | storage | | |
Fixed In Version: | | Doc Type: | Enhancement |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2015-03-22 15:47:18 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Markus Stockhausen
2013-09-18 17:17:18 UTC
Created attachment 799500
engine log
did you configure fencing (power management)? if you kill the SPM node, engine can't elect another SPM without knowing the first SPM was fenced (via power management, or manually by you)

Phew, it seems as if I did not get this setup right using only the wiki documentation. Therefore I read the "power management deep dive" presentation. Hopefully with success to some extent. I better sum it up in my own words to ensure I got it right.

- Only one hypervisor host acts as an SPM (storage pool manager)
- It is designed to check connections to the storage domains periodically
- If it fails, the cluster has no more information about the storage domains
- To get the DC up and running again, the SPM has to be relocated
- That can only work if ovirt-engine knows for sure that the failed SPM is dead

That's where power management kicks in. If it is configured correctly, the host can be killed even if it is in a pathological state. Without it, the admin must switch off the host and relocate the SPM manually. That would be my scenario. In this situation I would expect some webadmin message with better readability, e.g. "SPM node failed. Power management disabled. Fix situation and relocate SPM manually." Only an idea to prevent future questions ...

you got the gist of it... allon - thoughts on above? (also, now that we are using sanlock for SPM, maybe we can consider dropping the fencing requirement?)

This looks like a duplicate of bug 1007010. Seems that we need better instructions on how to test fencing - which document did you use to perform this test?

Hello, sorry for the late reply. I did no tests with fencing. I just tried to simulate a node failure to get an idea of how ovirt behaves afterwards. My self-chosen test consisted of blocking all tcp input to the ovirt node (see above). The outcome was that all storage domains were offline. None of the logs showed any good advice about what was going on.

As such a situation may arise only once in the lifetime of a cluster, I guess the normal admin will struggle to find a solution. A simple hint that manually moving the SPM to another node will resolve the issue should be of much help.

Markus

Since SPM is critical, it should be easy to detect that it became inaccessible. Stuff to check:

1. Check if logs contain errors when the SPM becomes inaccessible
2. Check if there is an easy way to detect such a situation in the user interface
3. Check related documentation

pushing to target release 3.5, assuming it's not planned for 3.4 at this point...

Closing old bugs. If this issue is still relevant/important in current version, please re-open the bug.
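
For illustration only, the kind of warning requested in this report could be sketched as below. This is not oVirt engine code: `HostState`, `spm_warning`, and all field names are hypothetical, and the logic simply follows the summary in the description (an unreachable SPM can only be replaced once it is known to be dead, either through power management or by the admin). The message for the case without power management is the wording suggested in the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostState:
    """Hypothetical snapshot of what the engine knows about one host."""
    name: str
    is_spm: bool
    reachable: bool
    power_management_enabled: bool
    confirmed_down: bool = False  # admin confirmed the host was switched off


def spm_warning(host: HostState) -> Optional[str]:
    """Return a readable warning for the admin, or None if none is needed."""
    # Only an unreachable SPM host is of interest here.
    if not host.is_spm or host.reachable:
        return None
    # The host is known to be dead, so relocating the SPM is safe.
    if host.confirmed_down:
        return f"SPM node {host.name} confirmed down. SPM can be relocated."
    # Power management is available, so the host can be fenced first.
    if host.power_management_enabled:
        return (f"SPM node {host.name} failed. "
                "Fencing via power management before relocating SPM.")
    # No power management: the admin has to act manually.
    return (f"SPM node {host.name} failed. Power management disabled. "
            "Fix situation and relocate SPM manually.")
```

For example, `spm_warning(HostState("node1", True, False, False))` yields the manual-relocation message, which is the scenario described in this report.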