Bug 1007010 - Automatic SPM fencing does not work
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.3.0
Assigned To: nobody nobody
QA Contact: vvyazmin@redhat.com
Whiteboard: storage
Keywords: Regression, Triaged
Depends On:
Blocks:
Reported: 2013-09-11 13:39 EDT by vvyazmin@redhat.com
Modified: 2016-02-10 13:36 EST (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-09-24 04:48:31 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm (4.00 MB, application/x-gzip)
2013-09-11 13:39 EDT, vvyazmin@redhat.com

Description vvyazmin@redhat.com 2013-09-11 13:39:59 EDT
Created attachment 796460
## Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Automatic SPM fencing does not work.

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS13 environment:

RHEVM:  rhevm-3.3.0-0.19.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.13-1.el6ev.noarch
VDSM:  vdsm-4.12.0-105.git0da1561.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.7.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create a Data Center (DC) with multiple hosts (each with a different SPM priority).
2. Block the connection from RHEVM to the SPM host via iptables.
3. The host becomes Non-Responsive.
4. Select the option “Confirm host has been Rebooted”.

Actual results:
1. Automatic SPM fencing does not work.
2. Manual SPM fencing does not work.

Expected results:
The remaining host with the highest SPM priority should become SPM.

Impact on user:
The DC becomes non-responsive.

Workaround:
Restore the connection from RHEVM to the SPM host.

Additional info:

/var/log/ovirt-engine/engine.log

2013-09-11 18:59:53,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-98) Failed in HSMGetTaskStatusVDS method
2013-09-11 18:59:53,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.HSMGetTaskStatusVDSCommand] (DefaultQuartzScheduler_Worker-98) Error code AcquireLockFailure and error message VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot obtain lock
2013-09-11 18:59:53,351 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-98) spmStart polling ended: taskId = 10215d5c-cde3-4d5b-ba18-db98c71bce8d task status = finished
2013-09-11 18:59:53,351 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (DefaultQuartzScheduler_Worker-98) Start SPM Task failed - result: cleanSuccess, message: VDSGenericException: VDSErrorException: Failed to HSMGetTaskStatusVDS, error = Cannot obtain lock

/var/log/vdsm/vdsm.log
4a459049-f1fd-40d7-a10a-3d1dcb99d086::INFO::2013-09-11 18:59:38,054::clusterlock::225::SANLock::(acquire) Acquiring cluster lock for domain ba797f62-fbab-45bc-b7f3-8021f9ef1110 (id: 3)
4a459049-f1fd-40d7-a10a-3d1dcb99d086::ERROR::2013-09-11 18:59:38,070::task::850::TaskManager.Task::(_setError) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 318, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/sp.py", line 273, in startSpm
    self.masterDomain.acquireClusterLock(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 487, in acquireClusterLock
    self._clusterLock.acquire(hostID)
  File "/usr/share/vdsm/storage/clusterlock.py", line 244, in acquire
    "Cannot acquire cluster lock", str(e))
AcquireLockFailure: Cannot obtain lock: "id=ba797f62-fbab-45bc-b7f3-8021f9ef1110, rc=-243, out=Cannot acquire cluster lock, err=(-243, 'Sanlock resource not acquired', 'Sanlock exception')"
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,071::task::869::TaskManager.Task::(_run) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::Task._run: 4a459049-f1fd-40d7-a10a-3d1dcb99d086 () {} failed - stopping task
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,071::task::1194::TaskManager.Task::(stop) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::stopping in state running (force False)
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,072::task::974::TaskManager.Task::(_decref) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::ref 1 aborting True
4a459049-f1fd-40d7-a10a-3d1dcb99d086::DEBUG::2013-09-11 18:59:38,072::task::900::TaskManager.Task::(_runJobs) Task=`4a459049-f1fd-40d7-a10a-3d1dcb99d086`::aborting: Task is aborted: 'Cannot o
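
When triaging a log like this one, it can help to pull the storage domain ID and the sanlock return code out of the AcquireLockFailure line. The following is a minimal sketch, assuming the message keeps the layout shown in the traceback above; the regular expression and the function name `parse_lock_failure` are hypothetical helpers written for this excerpt, not part of vdsm.

```python
import re

# Sample line copied from the vdsm.log excerpt in this report.
LINE = (
    'AcquireLockFailure: Cannot obtain lock: '
    '"id=ba797f62-fbab-45bc-b7f3-8021f9ef1110, rc=-243, '
    'out=Cannot acquire cluster lock, '
    "err=(-243, 'Sanlock resource not acquired', 'Sanlock exception')\""
)

# Hypothetical pattern: it only targets the message layout seen above.
PATTERN = re.compile(
    r'AcquireLockFailure: Cannot obtain lock: '
    r'"id=(?P<domain>[0-9a-f-]+), rc=(?P<rc>-?\d+), out=(?P<out>[^,]+),'
)


def parse_lock_failure(line):
    """Return domain ID, return code and message, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "domain": match.group("domain"),
        "rc": int(match.group("rc")),
        "out": match.group("out"),
    }


result = parse_lock_failure(LINE)
print(result)
```

A line that does not contain the failure message simply yields `None`, so the helper can be run over a whole log file line by line.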
Comment 2 Ayal Baron 2013-09-15 07:31:49 EDT
Did you actually reboot the SPM host? (Judging by the 'Cannot obtain lock' message, it seems you didn't.)
Comment 3 vvyazmin@redhat.com 2013-09-22 10:21:45 EDT
I didn't reboot the SPM host or any VDSM servers; I just selected the option “Confirm host has been Rebooted”.
Comment 4 Nir Soffer 2013-09-23 09:32:49 EDT
(In reply to vvyazmin@redhat.com from comment #0)
> Steps to Reproduce:
> 1. Create a Data Center (DC) with multiple hosts (each with a different SPM
> priority).
> 2. Block the connection from RHEVM to the SPM host via iptables.
> 3. The host becomes Non-Responsive.

This is not supposed to cause an automatic SPM fence.

When the SPM is non-responsive it may still be writing to the storage, so the engine cannot make another host the SPM.

Only when the host becomes non-operational does the engine try to fence it.

To test this feature correctly do this:

1. Block the connection from the SPM host to the storage server of one of the storage domains, via iptables or physically.
2. Wait a minute or so and check whether the SPM has changed.

Expected results:

1. The original SPM becomes "Non Operational"*
2. One or more other hosts become "Contending"*
3. Finally, one of the other hosts becomes "SPM"*

* Actual text may differ
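
The expected results above describe an ordered sequence of status changes. A minimal sketch of how a test could check that order, assuming the status changes have already been collected as (host, status) pairs: the host names, the event list, and the helper `follows_expected_order` are all hypothetical, and the status strings are the ones quoted above (the actual text may differ).

```python
def follows_expected_order(observed, expected):
    """Return True if `expected` occurs in `observed` as an ordered subsequence.

    Other events may appear in between; only the relative order matters.
    """
    remaining = iter(observed)
    # `step in remaining` advances the iterator until it finds `step`,
    # so each expected step must appear after the previous one.
    return all(step in remaining for step in expected)


# Hypothetical events: host-a is the original SPM, host-b and host-c are peers.
observed = [
    ("host-a", "Non Operational"),
    ("host-b", "Contending"),
    ("host-c", "Contending"),
    ("host-b", "SPM"),
]
expected = [
    ("host-a", "Non Operational"),
    ("host-b", "Contending"),
    ("host-b", "SPM"),
]

print(follows_expected_order(observed, expected))
```

The subsequence check tolerates extra events, such as a second host contending, which matches the "one or more other hosts" wording of step 2.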
Comment 5 Ayal Baron 2013-09-24 04:48:31 EDT
(In reply to vvyazmin@redhat.com from comment #3)
> I didn't reboot the SPM host or any VDSM servers; I just selected the option
> “Confirm host has been Rebooted”.

So you've confirmed you've rebooted it without rebooting it. That will not work.
This is not a bug.
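
The reasoning in this thread can be illustrated with a toy model: as long as the original SPM host is still running, it keeps holding the cluster lock, so another host's attempt to start SPM fails with "Cannot obtain lock" no matter what was confirmed in the UI. The sketch below is a deliberate simplification and not how sanlock works internally; the class `ToyLease`, the host names, and the explicit `release_on_reboot` step are assumptions made for illustration.

```python
class ToyLease:
    """Toy stand-in for the cluster lock on the master domain (not sanlock)."""

    def __init__(self):
        self.owner = None

    def acquire(self, host):
        # A second host cannot take the lock while another host holds it.
        if self.owner is not None and self.owner != host:
            raise RuntimeError("Cannot obtain lock: held by %s" % self.owner)
        self.owner = host

    def release_on_reboot(self, host):
        # Assumption: an actual reboot makes the host stop holding the lock.
        if self.owner == host:
            self.owner = None


def try_start_spm(lease, candidate):
    """Return True if `candidate` managed to take the lock, False otherwise."""
    try:
        lease.acquire(candidate)
        return True
    except RuntimeError:
        return False


lease = ToyLease()
lease.acquire("host-a")  # host-a is the original SPM

# Reboot confirmed in the UI only; host-a keeps running and holding the lock.
confirmed_only = try_start_spm(lease, "host-b")

# host-a is actually rebooted, so the lock becomes free.
lease.release_on_reboot("host-a")
after_reboot = try_start_spm(lease, "host-b")

print(confirmed_only, after_reboot)  # False True
```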
