Bug 1007010
Summary: | Automatic SPM fencing does not work | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | vvyazmin <vvyazmin>
Component: | vdsm | Assignee: | Nobody <nobody>
Status: | CLOSED NOTABUG | QA Contact: | vvyazmin <vvyazmin>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 3.3.0 | CC: | abaron, acanan, bazulay, hateya, iheim, lpeer, nsoffer, vvyazmin, yeylon
Target Milestone: | --- | Keywords: | Regression, Triaged
Target Release: | 3.3.0 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | storage | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2013-09-24 08:48:31 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Did you actually reboot the SPM host? (It seems like you didn't, judging by the "Cannot obtain lock" message.)

I didn't reboot the SPM host or any VDSM servers; I just selected the option "Confirm 'Host has been Rebooted'".

(In reply to vvyazmin from comment #0)
> Steps to Reproduce:
> 1. Create Data Center (DC) with multiple Hosts (all with different SPM priority)
> 2. Block connection SPM Host from RHEVM to the host via iptables
> 3. Host will become Non-Responsive.

This is not supposed to cause an automatic SPM fence. While the SPM is Non Responsive it may still be writing to the storage, so the engine cannot make another host the SPM. Only when the host becomes Non Operational does the engine try to fence it.

To test this feature correctly, do this:

1. Block the connection from the SPM host to one of the storage domains, via iptables or physically.
2. Wait a minute or so and check whether the SPM was changed.

Expected results:

1. The original SPM becomes "Non Operational"*
2. One or more of the other hosts become "Contending"*
3. Finally, one of the other hosts becomes "SPM"*

* Actual text may differ.

(In reply to vvyazmin from comment #3)
> I didn't reboot the SPM host or any VDSM servers, just selected the option
> "Confirm 'Host has been Rebooted'"

So you've confirmed the host was rebooted without actually rebooting it. That will not work. This is not a bug.
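The corrected test procedure above can be sketched with iptables. This is an illustration under stated assumptions, not the reporter's actual commands: STORAGE_IP is a placeholder for the storage domain's address, and the script defaults to a dry run that only prints the rules (applying them requires root on the SPM host).

```shell
#!/bin/sh
# Sketch of step 1 above: cut the SPM host's path to one storage domain
# (not the engine's path to the host). STORAGE_IP is a placeholder
# (TEST-NET-1 range); DRY_RUN=1 (the default) prints the rules instead
# of applying them, so the sketch is safe to run anywhere.
STORAGE_IP=${STORAGE_IP:-192.0.2.20}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"          # dry run: show the command only
    else
        "$@"               # real run: requires root on the SPM host
    fi
}

# Drop all outgoing traffic from this host to the storage domain:
run iptables -I OUTPUT -d "$STORAGE_IP" -j DROP

# After observing the SPM role move, restore connectivity:
run iptables -D OUTPUT -d "$STORAGE_IP" -j DROP
```

With the connection blocked, the original SPM should go Non Operational within a minute or so, after which another host contends for and takes the SPM role.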
Created attachment 796460 [details]
Logs: rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Automatic SPM fencing does not work.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS13 environment:
RHEVM: rhevm-3.3.0-0.19.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.13-1.el6ev.noarch
VDSM: vdsm-4.12.0-105.git0da1561.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.7.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a Data Center (DC) with multiple hosts (all with different SPM priority).
2. Block the connection from RHEVM to the SPM host via iptables.
3. The host becomes Non Responsive.
4. Select the option "Confirm 'Host has been Rebooted'".

Actual results:
1. Automatic SPM fencing does not work.
2. Manual SPM fencing does not work.

Expected results:
The host with the next-highest SPM priority should become SPM.

Impact on user:
DC non-responsive.

Workaround:
Restore the connection from RHEVM to the SPM host.

Additional info:

/var/log/ovirt-engine/engine.log:

2013-09-11 18:59:53,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-98) Failed in HSMGetTaskStatusVDS method
2013-09-11 18:59:53,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-98) Error code AcquireLockFailure and error message VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot obtain lock
2013-09-11 18:59:53,351 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-98) spmStart polling ended: taskId = 10215d5c-cde3-4d5b-ba18-db98c71bce8d task status = finished
2013-09-11 18:59:53,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-98) Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot obtain lock

/var/log/vdsm/vdsm.log:

4a459049-f1fd-40d7-a10a-3d1dcb99d086::INFO::2013-09-11 18:59:38,054::clusterlock::225::SANLock::(acquire) Acquiring cluster lock for domain ba797f62-fbab-45bc-b7f3-8021f9ef1110 (id: 3)
4a459049-f1fd-40d7-a10a-3d1dcb99d086::ERROR::2013-09-11 18:59:38,070::task::850::TaskManager.Task::(_setError) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm
    self.masterDomain.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 487, in acquireClusterLock
    self._clusterLock.acquire(hostID)
  File "/usr/share/vdsm/storage/clusterlock.py", line 244, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: "id=ba797f62-fbab-45bc-b7f3-8021f9ef1110, rc=-243, out=Cannot acquire cluster lock, err=(-243, 'Sanlock resource not acquired', 'Sanlock exception')"
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,071::task::869::TaskManager.Task::(_run) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::Task._run: 4a459049-f1fd-40d7-a10a-3d1dcb99d086 () {} failed - stopping task
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,071::task::1194::TaskManager.Task::(stop) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::stopping in state running (force False)
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,072::task::974::TaskManager.Task::(_decref) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::ref 1 aborting True
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,072::task::900::TaskManager.Task::(_runJobs) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::aborting: Task is aborted: 'Cannot o
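The sanlock error in the vdsm log above (rc=-243, "Sanlock resource not acquired") means the cluster lock on the master domain is still held, consistent with the old SPM never having actually been rebooted. A minimal diagnostic sketch, under stated assumptions (sanlock and vdsm's vdsClient installed on the hosts, SP_UUID a placeholder for the real storage pool UUID); the commands are echoed rather than executed so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Read-only checks for a stuck SPM lock. Commands are echoed, not run -
# copy each one to the host named in the comment. SP_UUID is a
# placeholder; substitute the real storage pool UUID.
SP_UUID=${SP_UUID:-00000000-0000-0000-0000-000000000000}

# On the old SPM host: list the resources sanlock still holds.
echo "sanlock client status"

# On any host: ask vdsm whether it currently holds the SPM role.
echo "vdsClient -s 0 getSpmStatus $SP_UUID"
```

If the old SPM still appears as the lock holder, that confirms the engine was right to refuse an SPM switch until the host was genuinely fenced or rebooted.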