Description of problem:
- 1 host, 1 iSCSI storage domain
- start an async task of post-zeroing a disk
- the vdsmd service is restarted; the following error appears:

Traceback (most recent call last):
  File "/usr/share/vdsm/storage/image.py", line 116, in delete
    self._delete(sdUUID, imgUUID, postZero, force)
  File "/usr/share/vdsm/storage/image.py", line 176, in _delete
    mod.deleteMultipleVolumes(sdUUID, volumes, postZero)
  File "/usr/share/vdsm/storage/blockVolume.py", line 675, in deleteMultipleVolumes
    _postZero(sdUUID, volumes)
  File "/usr/share/vdsm/storage/blockVolume.py", line 669, in _postZero
    raise se.VolumesZeroingError(lv.name)
VolumesZeroingError: Cannot zero out volume: ('_remove_me_YAiPSVOD_8567929d-19ce-43b1-962c-c7a3feaf44fc',)

2e935445-1be9-44d1-902d-260fe7f5873a::ERROR::2012-08-23 14:25:35,190::task::853::TaskManager.Task::(_setError) Task=`2e935445-1be9-44d1-902d-260fe7f5873a`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 861, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/storage/task.py", line 320, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/share/vdsm/storage/securable.py", line 63, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 1797, in deleteImage
    img.delete(sdUUID=sdUUID, imgUUID=imgUUID, postZero=postZero, force=force)
  File "/usr/share/vdsm/storage/image.py", line 116, in delete
    self._delete(sdUUID, imgUUID, postZero, force)
  File "/usr/share/vdsm/storage/image.py", line 176, in _delete
    mod.deleteMultipleVolumes(sdUUID, volumes, postZero)
  File "/usr/share/vdsm/storage/blockVolume.py", line 675, in deleteMultipleVolumes
    _postZero(sdUUID, volumes)
  File "/usr/share/vdsm/storage/blockVolume.py", line 669, in _postZero
    raise se.VolumesZeroingError(lv.name)
VolumesZeroingError: Cannot zero out volume: ('_remove_me_YAiPSVOD_8567929d-19ce-43b1-962c-c7a3feaf44fc',)

Thread-766::INFO::2012-08-23 14:25:35,199::storage_mailbox::446::Storage.MailBox.HsmMailMonitor::(run) HSM_MailboxMonitor - Incoming mail monitoring thread stopped, clearing outgoing mail
Thread-711::DEBUG::2012-08-23 14:25:35,217::domainMonitor::182::Storage.DomainMonitor::(_monitorDomain) Monitoring for domain bb9d4516-3af4-4fec-b730-988d21d6b6ba is stopping
2e935445-1be9-44d1-902d-260fe7f5873a::DEBUG::2012-08-23 14:25:35,219::task::872::TaskManager.Task::(_run) Task=`2e935445-1be9-44d1-902d-260fe7f5873a`::Task._run: 2e935445-1be9-44d1-902d-260fe7f5873a () {} failed - stopping task
Thread-766::INFO::2012-08-23 14:25:35,219::storage_mailbox::340::Storage.MailBox.HsmMailMonitor::(_sendMail) HSM_MailMonitor sending mail to SPM - ['/bin/dd', 'of=/rhev/data-center/4882ff40-ed1a-11e1-bcb8-001a4a16970e/mastersd/dom_md/inbox', 'iflag=fullblock', 'oflag=direct', 'conv=notrunc', 'bs=512', 'seek=8']
Thread-711::INFO::2012-08-23 14:25:35,220::safelease::183::SANLock::(releaseHostId) Releasing host id for domain bb9d4516-3af4-4fec-b730-988d21d6b6ba (id: 1)
2e935445-1be9-44d1-902d-260fe7f5873a::DEBUG::2012-08-23 14:25:35,220::task::1199::TaskManager.Task::(stop) Task=`2e935445-1be9-44d1-902d-260fe7f5873a`::stopping in state running (force False)
Thread-766::DEBUG::2012-08-23 14:25:35,221::__init__::1164::Storage.Misc.excCmd::(_log) '/bin/dd of=/rhev/data-center/4882ff40-ed1a-11e1-bcb8-001a4a16970e/mastersd/dom_md/inbox iflag=fullblock oflag=direct conv=notrunc bs=512 seek=8' (cwd None)
2e935445-1be9-44d1-902d-260fe7f5873a::DEBUG::2012-08-23 14:25:35,222::task::978::TaskManager.Task::(_decref) Task=`2e935445-1be9-44d1-902d-260fe7f5873a`::ref 1 aborting True
Thread-711::DEBUG::2012-08-23 14:25:35,229::domainMonitor::191::Storage.DomainMonitor::(_monitorDomain) Unable to release the host id 1 for the domain bb9d4516-3af4-4fec-b730-988d21d6b6ba
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/domainMonitor.py", line 188, in _monitorDomain
    domain.releaseHostId(hostId, unused=True)
  File "/usr/share/vdsm/storage/sd.py", line 413, in releaseHostId
    self._clusterLock.releaseHostId(hostId, async, unused)
  File "/usr/share/vdsm/storage/safelease.py", line 190, in releaseHostId
    raise se.ReleaseHostIdFailure(self._sdUUID, e)
ReleaseHostIdFailure: Cannot release host id: ('bb9d4516-3af4-4fec-b730-988d21d6b6ba', SanlockException(16, 'Sanlock lockspace remove failure', 'Device or resource busy'))
2e935445-1be9-44d1-902d-260fe7f5873a::DEBUG::2012-08-23 14:25:35,230::task::903::TaskManager.Task::(_runJobs) Task=`2e935445-1be9-44d1-902d-260fe7f5873a`::aborting: Task is aborted: 'Cannot zero out volume' - code 374

I suspect it is related to the LV deactivation we perform during prepareForShutdown, after which we get a series of exceptions.

User impact: none.
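To illustrate the first failure mode, here is a minimal sketch of the zeroing pattern: each volume is zeroed with dd and any failure (for example dd being killed when the service restarts) is collected and reported as a VolumesZeroingError. All names here are illustrative stand-ins, not vdsm's actual _postZero implementation.

```python
import subprocess


class VolumesZeroingError(Exception):
    """Stand-in for vdsm's se.VolumesZeroingError (hypothetical)."""


def post_zero(device_paths, size_bytes=1024 * 1024):
    """Zero the first size_bytes of each device with dd, collecting
    failures. A dd interrupted mid-run (e.g. by a vdsmd restart) exits
    non-zero, which surfaces here as VolumesZeroingError."""
    failed = []
    for path in device_paths:
        rc = subprocess.call(
            ["dd", "if=/dev/zero", "of=" + path, "bs=512",
             "count=%d" % (size_bytes // 512), "conv=fsync,notrunc"],
            stderr=subprocess.DEVNULL)
        if rc != 0:
            failed.append(path)
    if failed:
        raise VolumesZeroingError(tuple(failed))
```

The key point is that the exception carries only the volume names; whether dd failed because of I/O errors or because the process tree was torn down at shutdown looks identical to the caller.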
Created attachment 606514 [details] vdsm log
From what I see, the problem is that HSM.prepareForShutdown() doesn't include an explicit stopSpm call. The domainMonitor tries to release the lockspace, but this fails because the SPM resource is still in use. With safelease, HSM.__releaseLocks() (called in the prepareForShutdown flow) does something very similar to stopSpm.
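The ordering problem can be modeled with a toy lockspace: the host id cannot be released while a resource (the SPM lease) is still held in it, so a shutdown path must release the lease first. This is a sketch of the constraint only; the class and function names are illustrative, not sanlock's or vdsm's real API.

```python
class Lockspace:
    """Toy model of a sanlock lockspace: host id release fails with
    EBUSY while any resource (e.g. the SPM lease) is still held."""

    def __init__(self):
        self.resources = set()
        self.host_id_acquired = True

    def acquire_resource(self, name):
        self.resources.add(name)

    def release_resource(self, name):
        self.resources.discard(name)

    def release_host_id(self):
        if self.resources:
            # errno 16, matching the 'Device or resource busy' in the log
            raise OSError(16, "Sanlock lockspace remove failure")
        self.host_id_acquired = False


def prepare_for_shutdown(lockspace, spm_started):
    """Shutdown sketch: stop the SPM (releasing its lease) before the
    domain monitor releases the host id, avoiding the EBUSY failure."""
    if spm_started:
        lockspace.release_resource("SPM")
    lockspace.release_host_id()
```

Without the explicit release of the SPM resource, release_host_id() fails exactly the way ReleaseHostIdFailure does in the attached log.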
Fede, does this still happen?
(In reply to comment #8)
> Fede, does this still happen?

Yes. At the moment this message indicates that the vdsm service was restarted without stopping the SPM first (which is safe anyway, since sanlock will notice that the process has gone away and release the lockspace). In the future, when volume leases are in use, this message might become more frequent (on HSMs too), and we might want to find a way to silence it.
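One possible way to silence it, sketched under the assumption that the failure is identifiable by errno: downgrade the expected EBUSY during an unclean shutdown to a debug message, since sanlock cleans up the lockspace itself once the process is gone. The helper and the exception class here are hypothetical stand-ins, not vdsm or sanlock code.

```python
import errno
import logging

log = logging.getLogger("Storage.DomainMonitor")


class SanlockException(Exception):
    """Minimal stand-in for sanlock's exception: (errno, message, strerror)."""

    def __init__(self, err, msg, strerror):
        super().__init__(err, msg, strerror)
        self.errno = err


def release_host_id_quietly(release_fn, sd_uuid, host_id):
    """Hypothetical helper: try to release the host id, but treat EBUSY as
    expected at shutdown. Returns True if released now, False if sanlock
    will reclaim the lockspace on its own after the process exits."""
    try:
        release_fn(host_id)
        return True
    except SanlockException as e:
        if e.errno == errno.EBUSY:
            # Expected when the SPM lease is still held at shutdown;
            # sanlock removes the lockspace once vdsm has exited.
            log.debug("Host id %s for domain %s still busy at shutdown",
                      host_id, sd_uuid)
            return False
        raise
```

Any other sanlock error still propagates, so genuine release failures are not masked.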
Is this still relevant with the planned changes to the SPM?