Description of problem:
After a power outage, or some other event, the hosted-engine.lockspace file refuses to accept new connections and the manager VM fails to start.

vdsm reports:

libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire lock: No space left on device

agent.log reports:

MainThread::INFO::2014-07-04 15:01:47,250::hosted_engine::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/readynas.awayfar.org:_isis/4b566b6f-9051-4993-88b0-a2315d7d2c40/ha_agent/hosted-engine.lockspace)
MainThread::ERROR::2014-07-04 15:01:48,251::hosted_engine::482::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) cannot get lock on host id 1: (-223, 'Sanlock lockspace add failure', 'Sanlock exception')
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 458, in _initialize_sanlock
    host_id, lease_file)
SanlockException: (-223, 'Sanlock lockspace add failure', 'Sanlock exception')
MainThread::WARNING::2014-07-04 15:01:48,252::hosted_engine::333::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to initialize sanlock: cannot get lock on host id 1: (-223, 'Sanlock lockspace add failure', 'Sanlock exception')

This repeats three times and then the agent exits.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.0.0-9.el6ev.noarch
ovirt-hosted-engine-ha-1.0.0-3.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
This is not *how* the lockspace became corrupted in the field, but it provides a method to get into the same situation:
1. As the vdsm user, use dd to overwrite /rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>/ha_agent/hosted-engine.lockspace
2. Hard power cycle the host (unclean shutdown)

Actual results:
When the host comes back online, any attempt to run "hosted-engine --vm-start" results in the vdsmd log entries above.

Expected results:
If the lockspace is corrupt, detect that and recreate it.
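For reference, a minimal sketch of step 1 in Python (the lockspace path is a placeholder taken from the steps above; run it as the vdsm user, and note that step 2, the hard power cycle, still has to happen on the host):

import os

# Placeholder path from the steps above; substitute the real mount point
# and storage domain UUID.
LOCKSPACE = ("/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>"
             "/ha_agent/hosted-engine.lockspace")

# Rough equivalent of "dd if=/dev/zero of=<lockspace> bs=1M count=1":
# overwrite the first MiB of the lease file with zeros, corrupting it.
with open(LOCKSPACE, "r+b") as f:
    f.write(b"\x00" * (1024 * 1024))
    f.flush()
    os.fsync(f.fileno())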
I do not think automatic recovery is wise here, but adding a simple-to-use tool that would reinitialize the lockspace is a good idea.

Our setup tool takes care of those tasks, so I am reassigning there.

Meanwhile you might be able to use the following procedure (the same one we use in the setup):

Stop all hosted engine daemons on all hosts.

Execute (on any host):

sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>/ha_agent/hosted-engine.lockspace:0

(Re-)start all hosted engine daemons on all hosts afterwards; they should then be able to connect again.
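For convenience, a hedged sketch wrapping that exact command in Python (the lockspace path is a placeholder, the helper name is illustrative, and the command line is the one quoted above):

import subprocess

# Placeholder; substitute the real mount point and SD UUID.
LOCKSPACE = ("/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>"
             "/ha_agent/hosted-engine.lockspace")

def reinitialize_lockspace(path):
    # Runs the manual recovery command from the procedure above.
    # A host id of 0 in the -s spec asks sanlock to initialize the
    # whole lockspace rather than a single host slot.
    spec = "hosted-engine:0:%s:0" % path
    subprocess.check_call(["sanlock", "direct", "init", "-s", spec])

# Stop all hosted engine daemons on all hosts first, run this on any
# one host, then (re-)start the daemons everywhere.
reinitialize_lockspace(LOCKSPACE)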
(In reply to Martin Sivák from comment #3)
> I do not think automatic recovery is wise here, but adding a simple-to-use
> tool that would reinitialize the lockspace is a good idea.
>
> Our setup tool takes care of those tasks, so I am reassigning there.
>
> Meanwhile you might be able to use the following procedure (the same one we
> use in the setup):
>
> Stop all hosted engine daemons on all hosts.
>
> Execute (on any host):
>
> sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<INTERNAL HE
> SD>/<SDUUID>/ha_agent/hosted-engine.lockspace:0
>
> (Re-)start all hosted engine daemons on all hosts afterwards; they should
> then be able to connect again.

So you just need a tool that calls the above command line with the right UUIDs?
Sandro, yes, we would like to have a tool that can reinitialize the lockspace, but only after some sanity checks are done and the user is warned about what that means (if the hosted engine environment is healthy, he risks breaking the sync).
I disagree with the urgent severity of this bug, since a manual workaround exists as described in comment #3.

However, Martin, can you define the list of sanity checks to be performed by the new tool before it calls:

sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>/ha_agent/hosted-engine.lockspace:0
Well, we should make sure that no agent is running locally and that --vm-status does not list any host as active. I am not sure how reliable that check can be, so just a big warning should be enough, I think, followed by another one if we detect something running.
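A rough sketch of such a check (the pgrep match and the --vm-status output scan are heuristics; in particular the "EngineUp" marker is an assumption about the status output format, not a guaranteed interface):

import subprocess

def agent_running_locally():
    # pgrep exits 0 when at least one ovirt-ha-agent process matches.
    return subprocess.call(["pgrep", "-f", "ovirt-ha-agent"]) == 0

def some_host_looks_active():
    # Heuristic: scan "hosted-engine --vm-status" output for an
    # "up"-looking engine state; the marker string is an assumption.
    out = subprocess.check_output(["hosted-engine", "--vm-status"],
                                  universal_newlines=True)
    return "EngineUp" in out

if agent_running_locally() or some_host_looks_active():
    print("WARNING: the hosted engine environment looks active; "
          "reinitializing the lockspace now may break host sync.")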
We also have a similar/related issue on hosted-engine deploy.

If the deploy fails for any reason at an advanced step, the user can be tempted to fix the error, manually clear the shared storage by deleting everything, and simply restart using the previously generated answer file in order to avoid typing every single response again. The issue is that the answer file also contains the sdUUID and the spUUID from the previous attempt, so the deploy procedure tries to reuse those values. The sanlock lockspace area identifier and path are uniquely determined by them, so the deploy also tries to use the same lockspace area. The problem then is that the sanlock daemon is still active and thinks the required lockspace area is still there, while it is not, because the user cleaned the shared storage in order to deploy again from scratch. So the sanlock daemon cannot acquire a lock on that area, because it no longer exists.

In the hosted-engine logs we can find:

2015-03-19 10:31:36 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._createStoragePool:580 {'status': {'message': "Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))", 'code': 661}}
2015-03-19 10:31:36 DEBUG otopi.context context._executeMethod:155 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 145, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 992, in _misc
    self._createStoragePool()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 582, in _createStoragePool
    raise RuntimeError(status['status']['message'])
RuntimeError: Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2015-03-19 10:31:36 ERROR otopi.context context._executeMethod:164 Failed to execute stage 'Misc configuration': Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

While in the VDSM logs we can find:

Thread-20::ERROR::2015-03-19 10:31:36,202::task::863::Storage.TaskManager.Task::(_setError) Task=`7746bd58-286f-4e56-a6f6-a15c3712ee55`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 870, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1000, in createStoragePool
    leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 580, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 512, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 477, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 237, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
Thread-20::DEBUG::2015-03-19 10:31:36,214::task::882::Storage.TaskManager.Task::(_run) Task=`7746bd58-286f-4e56-a6f6-a15c3712ee55`::Task._run: 7746bd58-286f-4e56-a6f6-a15c3712ee55 (None, '2fbdffad-40ff-4c88-8f78-35d6472564c0',
'hosted_datacenter', '8249dd6a-2450-4f1d-acc6-e0f73c850cb6', ['8249dd6a-2450-4f1d-acc6-e0f73c850cb6'], 1, None, None, None, None, None) {} failed - stopping task

And after the failure, sanlock still reports:

[root@f20t10 ~]# sanlock client status
daemon 9ffe7f4e-1540-4fdb-b034-088a943fea22.f20t10.loc
p -1 helper
p -1 listener
p -1 status
s 8249dd6a-1340-4f1d-acc6-e0f73c850cb6:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/8249dd6a-1340-4f1d-acc6-e0f73c850cb6/dom_md/ids:0

even if I completely wipe /rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/. After a while, wdmd could also reboot the machine, since sanlock keeps failing.

A workaround is to simply use a different sdUUID, so the deploy looks for a different sanlock lockspace and then there is no issue.

Martin, could we also use that tool to make the sanlock daemon forget that lockspace area, in order to allow a new deploy from scratch?
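For the detection side, a hedged sketch that parses the `sanlock client status` output shown above (the "s " line prefix and the name:host_id:path:offset field layout are read off that sample and treated as assumptions):

import os
import subprocess

def stale_lockspaces():
    # Lockspace lines in the status output above start with "s " and
    # look like: s <name>:<host_id>:<path>:<offset>
    out = subprocess.check_output(["sanlock", "client", "status"],
                                  universal_newlines=True)
    stale = []
    for line in out.splitlines():
        if not line.startswith("s "):
            continue
        spec = line[2:].strip()
        # Paths may contain escaped colons, e.g. "192.168.1.115\:_...".
        fields = spec.replace("\\:", "\x00").split(":")
        path = fields[2].replace("\x00", ":")
        if not os.path.exists(path):
            stale.append(spec)
    return stale

for spec in stale_lockspaces():
    print("sanlock still holds a lockspace whose backing file is gone: %s"
          % spec)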
The current tooling supports two operations:

1) Reinitializing the lockspace - I think sanlock should drop all locks when this is done, but I do not remember if I tested that (it was implemented to deal with a corrupted lease file).

2) Removing a host from the hosted engine environment - this needs a proper hosted engine configuration and can (as a side effect) release the lock for a specified host id.

So the answer to your question is that I am not sure what will happen :) A broken setup is a good use case though, and I will look into it.
Martin, all referenced gerrit patches have been merged. Should this bug be in MODIFIED, or is something still missing before it can be closed?
We should make it easier to call the tools, so we need at least some documentation on how to call them.
Works for me with these components:

Host:
ovirt-vmconsole-host-1.0.1-0.0.master.20151105234454.git3e5d52e.el7.noarch
ovirt-release36-002-2.noarch
sanlock-3.2.4-1.el7.x86_64
ovirt-hosted-engine-ha-1.3.3.3-1.20151211131547.gitb84582e.el7.noarch
ovirt-setup-lib-1.0.1-0.0.master.20151126203321.git2da7763.el7.centos.noarch
ovirt-engine-sdk-python-3.6.1.1-0.1.20151127.git2400b22.el7.centos.noarch
ovirt-vmconsole-1.0.1-0.0.master.20151105234454.git3e5d52e.el7.noarch
ovirt-release36-snapshot-002-2.noarch
mom-0.5.1-2.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
ovirt-hosted-engine-setup-1.3.2-0.0.master.20151209094106.gitce16937.el7.centos.noarch
ovirt-host-deploy-1.4.2-0.0.master.20151122153544.gitfc808fc.el7.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
vdsm-4.17.13-28.git08ca1b0.el7.centos.noarch
Linux version 3.10.0-327.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 17:29:29 EDT 2015

Engine:
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.4-1.el6ev.noarch
rhevm-3.6.1.3-0.1.el6.noarch
Linux version 2.6.32-573.8.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Fri Sep 25 19:24:22 EDT 2015

I had 2 hosts with HE and performed the following from one of the hosts:

cd /rhev/data-center/mnt/<internal SD>/ha_agent/
su -s /bin/bash vdsm
mv hosted-engine.lockspace hosted-engine.lockspace.orig
dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
chmod 0777 hosted-engine.lockspace
reboot

Then I ran the "hosted-engine --reinitialize-lockspace --force" command while both ha-agent and ha-broker were running. No errors as described in comment #1 were observed.
(In reply to Nikolai Sednev from comment #41)
> Works for me with these components:
> ...
> I had 2 hosts with HE and performed the following from one of the hosts:
>
> cd /rhev/data-center/mnt/<internal SD>/ha_agent/
> su -s /bin/bash vdsm
> mv hosted-engine.lockspace hosted-engine.lockspace.orig
> dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
> chmod 0777 hosted-engine.lockspace
> reboot
>
> Then I ran the "hosted-engine --reinitialize-lockspace --force" command
> while both ha-agent and ha-broker were running.

I suggest adding another restart of ha-agent and vdsm, plus a shutdown and start of the engine VM, or just another reboot.

> No errors as described in comment #1 were observed.
(In reply to Roy Golan from comment #42)
> (In reply to Nikolai Sednev from comment #41)
> > Works for me with these components:
> > ...
> > I had 2 hosts with HE and performed the following from one of the hosts:
> >
> > cd /rhev/data-center/mnt/<internal SD>/ha_agent/
> > su -s /bin/bash vdsm
> > mv hosted-engine.lockspace hosted-engine.lockspace.orig
> > dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
> > chmod 0777 hosted-engine.lockspace
> > reboot
> >
> > Then I ran the "hosted-engine --reinitialize-lockspace --force" command
> > while both ha-agent and ha-broker were running.
>
> I suggest adding another restart of ha-agent and vdsm, plus a shutdown and
> start of the engine VM, or just another reboot.
>
> > No errors as described in comment #1 were observed.

I did this; I forgot to mention it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html