Description of problem:
After a power outage, or some other event, the hosted-engine.lockspace file refuses to accept new connections and the manager VM fails to start.

vdsm reports:

libvirtconnection::108::libvirtconnection::(wrapper) Unknown libvirterror: ecode: 38 edom: 42 level: 2 message: Failed to acquire lock: No space left on device

agent.log reports:

MainThread::INFO::2014-07-04 15:01:47,250::hosted_engine::454::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/readynas.awayfar.org:_isis/4b566b6f-9051-4993-88b0-a2315d7d2c40/ha_agent/hosted-engine.lockspace)
MainThread::ERROR::2014-07-04 15:01:48,251::hosted_engine::482::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) cannot get lock on host id 1: (-223, 'Sanlock lockspace add failure', 'Sanlock exception')
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 458, in _initialize_sanlock
    host_id, lease_file)
SanlockException: (-223, 'Sanlock lockspace add failure', 'Sanlock exception')
MainThread::WARNING::2014-07-04 15:01:48,252::hosted_engine::333::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Error while monitoring engine: Failed to initialize sanlock: cannot get lock on host id 1: (-223, 'Sanlock lockspace add failure', 'Sanlock exception')

This repeats three times and then the agent exits.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.0.0-9.el6ev.noarch
ovirt-hosted-engine-ha-1.0.0-3.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
This is not *how* the lockspace became corrupted in the field, but it provides a method to get into the same situation:
1. As the vdsm user, use dd to overwrite /rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>/ha_agent/hosted-engine.lockspace
2. Hard power cycle the host (unclean shutdown)

Actual results:
When the host comes back online, any attempt to run "hosted-engine --vm-start" results in the vdsmd log entries above.

Expected results:
If the lockspace is corrupt, detect that and recreate it.
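For reference, a minimal sketch of step 1 in Python (the lockspace path is a placeholder taken from the steps above; run it as the vdsm user, and note that step 2, the hard power cycle, still has to happen on the host):

import os

# Placeholder path from the steps above; substitute the real mount point
# and storage domain UUID.
LOCKSPACE = ("/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>"
             "/ha_agent/hosted-engine.lockspace")

# Rough equivalent of "dd if=/dev/zero of=<lockspace> bs=1M count=1":
# overwrite the first MiB of the lease file with zeros, corrupting it.
with open(LOCKSPACE, "r+b") as f:
    f.write(b"\x00" * (1024 * 1024))
    f.flush()
    os.fsync(f.fileno())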
I do not think automatic recovery is wise here, but adding a simple-to-use tool that would reinitialize the lockspace is a good idea.

Our setup tool takes care of those tasks, so I am reassigning there.

Meanwhile you might be able to use the following procedure (the same one we use in the setup):

Stop all hosted engine daemons on all hosts.

Execute (on any host):

sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>/ha_agent/hosted-engine.lockspace:0

(Re-)start all hosted engine daemons on all hosts afterwards; they should then be able to connect again.
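For convenience, a hedged sketch wrapping that exact command in Python (the lockspace path is a placeholder, the helper name is illustrative, and the command line is the one quoted above):

import subprocess

# Placeholder; substitute the real mount point and SD UUID.
LOCKSPACE = ("/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>"
             "/ha_agent/hosted-engine.lockspace")

def reinitialize_lockspace(path):
    # Runs the manual recovery command from the procedure above.
    # A host id of 0 in the -s spec asks sanlock to initialize the
    # whole lockspace rather than a single host slot.
    spec = "hosted-engine:0:%s:0" % path
    subprocess.check_call(["sanlock", "direct", "init", "-s", spec])

# Stop all hosted engine daemons on all hosts first, run this on any
# one host, then (re-)start the daemons everywhere.
reinitialize_lockspace(LOCKSPACE)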
(In reply to Martin Sivák from comment #3)
> I do not think automatic recovery is wise here, but adding a simple-to-use
> tool that would reinitialize the lockspace is a good idea.
>
> Our setup tool takes care of those tasks, so I am reassigning there.
>
> Meanwhile you might be able to use the following procedure (the same one we
> use in the setup):
>
> Stop all hosted engine daemons on all hosts.
>
> Execute (on any host):
>
> sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<INTERNAL HE
> SD>/<SDUUID>/ha_agent/hosted-engine.lockspace:0
>
> (Re-)start all hosted engine daemons on all hosts afterwards; they should
> then be able to connect again.

So you just need a tool that calls the above command line with the right UUIDs?
Sandro, yes, we would like to have a tool that can reinitialize the lockspace, but only after some sanity checks are done and the user is warned about what that means (if the hosted engine environment is healthy, he risks breaking the sync).
I disagree with the urgent severity of this bug, since a manual workaround exists as described in comment #3.

However, Martin, can you define the list of sanity checks to be performed by the new tool before it calls:

sanlock direct init -s hosted-engine:0:/rhev/data-center/mnt/<INTERNAL HE SD>/<SDUUID>/ha_agent/hosted-engine.lockspace:0
Well, we should make sure that no agent is running locally and that --vm-status does not list any host as active. I am not sure how reliable that check can be, so just a big warning should be enough, I think, followed by another one if we detect something running.
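A rough sketch of such a check (the pgrep match and the --vm-status output scan are heuristics; in particular the "EngineUp" marker is an assumption about the status output format, not a guaranteed interface):

import subprocess

def agent_running_locally():
    # pgrep exits 0 when at least one ovirt-ha-agent process matches.
    return subprocess.call(["pgrep", "-f", "ovirt-ha-agent"]) == 0

def some_host_looks_active():
    # Heuristic: scan "hosted-engine --vm-status" output for an
    # "up"-looking engine state; the marker string is an assumption.
    out = subprocess.check_output(["hosted-engine", "--vm-status"],
                                  universal_newlines=True)
    return "EngineUp" in out

if agent_running_locally() or some_host_looks_active():
    print("WARNING: the hosted engine environment looks active; "
          "reinitializing the lockspace now may break host sync.")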
We also have a similar/related issue on hosted-engine deploy.

If the deploy fails for any reason at an advanced step, the user can be tempted to fix the error, manually clear the shared storage by deleting everything, and simply restart using the previously generated answer file in order to avoid typing every single response again. The issue is that the answer file also contains the sdUUID and the spUUID from the previous attempt, so the deploy procedure tries to reuse those values. The sanlock lockspace area identifier and path are uniquely determined by them, so the deploy also tries to use the same lockspace area. The problem then is that the sanlock daemon is still active and thinks the required lockspace area is still there, while it is not, because the user cleaned the shared storage in order to deploy again from scratch. So the sanlock daemon cannot acquire a lock on that area, because it no longer exists.

In the hosted-engine logs we can find:

2015-03-19 10:31:36 DEBUG otopi.plugins.ovirt_hosted_engine_setup.storage.storage storage._createStoragePool:580 {'status': {'message': "Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))", 'code': 661}}
2015-03-19 10:31:36 DEBUG otopi.context context._executeMethod:155 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 145, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 992, in _misc
    self._createStoragePool()
  File "/usr/share/ovirt-hosted-engine-setup/scripts/../plugins/ovirt-hosted-engine-setup/storage/storage.py", line 582, in _createStoragePool
    raise RuntimeError(status['status']['message'])
RuntimeError: Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2015-03-19 10:31:36 ERROR otopi.context context._executeMethod:164 Failed to execute stage 'Misc configuration': Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

While in the VDSM logs we can find:

Thread-20::ERROR::2015-03-19 10:31:36,202::task::863::Storage.TaskManager.Task::(_setError) Task=`7746bd58-286f-4e56-a6f6-a15c3712ee55`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 870, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 49, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1000, in createStoragePool
    leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 580, in create
    self._acquireTemporaryClusterLock(msdUUID, leaseParams)
  File "/usr/share/vdsm/storage/sp.py", line 512, in _acquireTemporaryClusterLock
    msd.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 477, in acquireHostId
    self._clusterLock.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/clusterlock.py", line 237, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: ('8249dd6a-2450-4f1d-acc6-e0f73c850cb6', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
Thread-20::DEBUG::2015-03-19 10:31:36,214::task::882::Storage.TaskManager.Task::(_run) Task=`7746bd58-286f-4e56-a6f6-a15c3712ee55`::Task._run: 7746bd58-286f-4e56-a6f6-a15c3712ee55 (None, '2fbdffad-40ff-4c88-8f78-35d6472564c0',
'hosted_datacenter', '8249dd6a-2450-4f1d-acc6-e0f73c850cb6', ['8249dd6a-2450-4f1d-acc6-e0f73c850cb6'], 1, None, None, None, None, None) {} failed - stopping task

And after the failure, sanlock still reports:

[root@f20t10 ~]# sanlock client status
daemon 9ffe7f4e-1540-4fdb-b034-088a943fea22.f20t10.loc
p -1 helper
p -1 listener
p -1 status
s 8249dd6a-1340-4f1d-acc6-e0f73c850cb6:1:/rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/8249dd6a-1340-4f1d-acc6-e0f73c850cb6/dom_md/ids:0

even if I completely wipe /rhev/data-center/mnt/192.168.1.115\:_Virtual_exthe6/. After a while, wdmd could also reboot the machine, since sanlock keeps failing.

A workaround is to simply use a different sdUUID, so the deploy looks for a different sanlock lockspace and then there is no issue.

Martin, could we also use that tool to make the sanlock daemon forget that lockspace area, in order to allow a new deploy from scratch?
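For the detection side, a hedged sketch that parses the `sanlock client status` output shown above (the "s " line prefix and the name:host_id:path:offset field layout are read off that sample and treated as assumptions):

import os
import subprocess

def stale_lockspaces():
    # Lockspace lines in the status output above start with "s " and
    # look like: s <name>:<host_id>:<path>:<offset>
    out = subprocess.check_output(["sanlock", "client", "status"],
                                  universal_newlines=True)
    stale = []
    for line in out.splitlines():
        if not line.startswith("s "):
            continue
        spec = line[2:].strip()
        # Paths may contain escaped colons, e.g. "192.168.1.115\:_...".
        fields = spec.replace("\\:", "\x00").split(":")
        path = fields[2].replace("\x00", ":")
        if not os.path.exists(path):
            stale.append(spec)
    return stale

for spec in stale_lockspaces():
    print("sanlock still holds a lockspace whose backing file is gone: %s"
          % spec)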
The current tooling supports two operations:

1) Reinitializing the lockspace - I think sanlock should drop all locks when this is done, but I do not remember if I tested that (it was implemented to deal with a corrupted lease file).

2) Removing a host from the hosted engine environment - this needs a proper hosted engine configuration and can (as a side effect) release the lock for a specified host id.

So the answer to your question is that I am not sure what will happen :) A broken setup is a good use case though, and I will look into it.
Martin, all referenced gerrit patches have been merged. Should this bug be in MODIFIED, or is something still missing before it can be closed?
We should make it easier to call the tools, so we need at least some documentation on how to call them.
Works for me with these components:

Host:
ovirt-vmconsole-host-1.0.1-0.0.master.20151105234454.git3e5d52e.el7.noarch
ovirt-release36-002-2.noarch
sanlock-3.2.4-1.el7.x86_64
ovirt-hosted-engine-ha-1.3.3.3-1.20151211131547.gitb84582e.el7.noarch
ovirt-setup-lib-1.0.1-0.0.master.20151126203321.git2da7763.el7.centos.noarch
ovirt-engine-sdk-python-3.6.1.1-0.1.20151127.git2400b22.el7.centos.noarch
ovirt-vmconsole-1.0.1-0.0.master.20151105234454.git3e5d52e.el7.noarch
ovirt-release36-snapshot-002-2.noarch
mom-0.5.1-2.el7.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.5.x86_64
ovirt-hosted-engine-setup-1.3.2-0.0.master.20151209094106.gitce16937.el7.centos.noarch
ovirt-host-deploy-1.4.2-0.0.master.20151122153544.gitfc808fc.el7.noarch
libvirt-client-1.2.17-13.el7_2.2.x86_64
vdsm-4.17.13-28.git08ca1b0.el7.centos.noarch
Linux version 3.10.0-327.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Thu Oct 29 17:29:29 EDT 2015

Engine:
ovirt-host-deploy-java-1.4.1-1.el6ev.noarch
ovirt-vmconsole-1.0.0-1.el6ev.noarch
ovirt-host-deploy-1.4.1-1.el6ev.noarch
ovirt-vmconsole-proxy-1.0.0-1.el6ev.noarch
ovirt-engine-extension-aaa-jdbc-1.0.4-1.el6ev.noarch
rhevm-3.6.1.3-0.1.el6.noarch
Linux version 2.6.32-573.8.1.el6.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) ) #1 SMP Fri Sep 25 19:24:22 EDT 2015

I had 2 hosts with HE and performed the following from one of the hosts:

cd /rhev/data-center/mnt/<internal SD>/ha_agent/
su -s /bin/bash vdsm
mv hosted-engine.lockspace hosted-engine.lockspace.orig
dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
chmod 0777 hosted-engine.lockspace
reboot

Then I ran the "hosted-engine --reinitialize-lockspace --force" command while both ha-agent and ha-broker were running. No errors as described in comment #1 were observed.
(In reply to Nikolai Sednev from comment #41)
> Works for me with these components:
> ...
> I had 2 hosts with HE and performed the following from one of the hosts:
>
> cd /rhev/data-center/mnt/<internal SD>/ha_agent/
> su -s /bin/bash vdsm
> mv hosted-engine.lockspace hosted-engine.lockspace.orig
> dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
> chmod 0777 hosted-engine.lockspace
> reboot
>
> Then I ran the "hosted-engine --reinitialize-lockspace --force" command
> while both ha-agent and ha-broker were running.

I suggest adding another restart of ha-agent and vdsm, plus a shutdown and start of the engine VM, or just another reboot.

> No errors as described in comment #1 were observed.
(In reply to Roy Golan from comment #42)
> (In reply to Nikolai Sednev from comment #41)
> > Works for me with these components:
> > ...
> > I had 2 hosts with HE and performed the following from one of the hosts:
> >
> > cd /rhev/data-center/mnt/<internal SD>/ha_agent/
> > su -s /bin/bash vdsm
> > mv hosted-engine.lockspace hosted-engine.lockspace.orig
> > dd if=/dev/zero of=hosted-engine.lockspace bs=1M count=1
> > chmod 0777 hosted-engine.lockspace
> > reboot
> >
> > Then I ran the "hosted-engine --reinitialize-lockspace --force" command
> > while both ha-agent and ha-broker were running.
>
> I suggest adding another restart of ha-agent and vdsm, plus a shutdown and
> start of the engine VM, or just another reboot.
>
> > No errors as described in comment #1 were observed.

I did this; I forgot to mention it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0375.html