Bug 927151
| Summary: | engine: after SPM election of a host, if it suddenly becomes non-responsive we try to contend a new host as SPM (contending will be blocked by sanlock) | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Dafna Ron <dron> |
| Component: | ovirt-engine | Assignee: | Ayal Baron <abaron> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Dafna Ron <dron> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.1.3 | CC: | acathrow, dyasny, hateya, iheim, lpeer, Rhev-m-bugs, scohen, yeylon, ykaul |
| Target Milestone: | --- | | |
| Target Release: | 3.2.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | storage | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-04-07 07:21:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
```
[root@gold-vdsc ~]# vdsClient -s 0 getSpmStatus f92efae9-b68b-4e10-be1b-39ab7ab4e026
	spmId = 1
	spmStatus = SPM
	spmLver = 0
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus f92efae9-b68b-4e10-be1b-39ab7ab4e026
	spmId = 1
	spmStatus = Contend
	spmLver = 0
[root@gold-vdsd ~]# vdsClient -s 0 getSpmStatus f92efae9-b68b-4e10-be1b-39ab7ab4e026
	spmId = 1
	spmStatus = Free
	spmLver = 0
```

Dafna, I don't understand the use case. What's the end result, and what's the impact on the pool? If I understand correctly, the old SPM host is now up and running; does it become SPM? Do you now need to manually fence the new one so that the old one becomes SPM?

1. There could be two SPMs if we have more than one domain. I was using only one domain in my tests, which is why sanlock was blocking the contention on that domain. But with more than one domain we can combine reconstruct with this action, and since sanlock will not block the new master domain we can end up with two SPMs.
2. As far as the user is concerned, the domains move to non-operational.

(In reply to comment #3)
> 1. there could be two SPM's if we have more than one domain - I was using
> only 1 domain in my tests this is why the sanlock was locking the domain.

Can you run this with 2 domains and actually reach this state?

(In reply to comment #3)
> 2. as far as the user is considered we move the domains to non-operational.

To be clear, the moment the engine tries to run SPM on the new host, it should not try to run reconstruct before fencing that host, so I don't think we'll reach this state.
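The vdsClient output above (SPM → Contend → Free) can be illustrated with a minimal Python toy model of a sanlock-style lease. This is an illustrative sketch only, not the real sanlock API: with a single storage domain, the old SPM still holds the master lease, so the new host's contention attempt is rejected.

```python
import threading

class StorageDomainLease:
    """Toy stand-in for a sanlock resource lease on a storage domain.
    (Illustrative model only; real sanlock leases live on shared storage.)"""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_acquire(self, host):
        # A lease is granted only while no live host holds it.
        if self._lock.acquire(blocking=False):
            self.owner = host
            return True
        return False  # contention is blocked

    def release(self):
        self.owner = None
        self._lock.release()

# Single-domain pool: the old SPM holds the master lease, so the
# contender fails and its spmStatus falls back from Contend to Free.
master = StorageDomainLease()
assert master.try_acquire("gold-vdsc")       # old SPM holds the lease
assert not master.try_acquire("gold-vdsd")   # new contender is blocked
```

With more than one domain, the contention described in comment #3 would target a different (new master) domain whose lease is free, which is why two SPMs could result in that scenario.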
Created attachment 715912 [details]
logs

Description of problem:
After I rebooted my SPM and selected "Confirm host has been rebooted", as soon as the old SPM started I blocked the new SPM from the engine with an iptables REJECT rule, and the engine automatically contended the new SPM. It appears to come from this command:

```
2013-03-25 10:21:10,428 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (QuartzScheduler_Worker-42) [795ce3ba] Irs placed on server null failed. Proceed Failover
```

Version-Release number of selected component (if applicable):
3.1.3

How reproducible:
100%

Steps to Reproduce:
1. In a two-host cluster with NFS storage, reboot the SPM.
2. When the host becomes non-responsive, select "Confirm host has been rebooted".
3. When the old SPM host changes state to Up, block connectivity between the engine and the host using iptables with REJECT.

Actual results:
The engine tries to contend SPM again although the host is non-responsive; we should not be contending SPM in such cases.

Expected results:
We should not contend SPM if we have a network connectivity issue.

Additional info:
logs attached
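The expected behaviour can be sketched as a small guard on the failover path. This is a hypothetical illustration, not the actual engine code: the names `should_contend_spm`, `host_status`, and `failure_is_network` are invented for this sketch. The point is that a failure caused by lost engine-to-host connectivity (the iptables REJECT in the steps above) should not trigger a new SPM contention.

```python
def should_contend_spm(host_status, failure_is_network):
    """Hypothetical guard, not the real ovirt-engine logic: skip SPM
    contention when the failure is a connectivity problem or the
    current SPM host is non-responsive (fence it first instead)."""
    if host_status == "NonResponsive" or failure_is_network:
        return False
    return True

# The scenario from the steps to reproduce: engine-to-host traffic is
# REJECTed by iptables, so the failure is a network issue and the
# engine should not proceed to contend SPM.
assert should_contend_spm("NonResponsive", True) is False
# A genuine SPM failure with a responsive host may proceed to failover.
assert should_contend_spm("Up", False) is True
```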