Description of problem:
Can't connect to the hosted engine (HE) after both hosts in the cluster have been rebooted; the ovirt-ha-agent service status reports "ovirt-ha-agent dead but subsys locked".

Version-Release number of selected component (if applicable):
On both hosts:
vdsm-4.14.7-2.el6ev.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.9.x86_64
libvirt-0.10.2-29.el6_5.7.x86_64
sanlock-2.8-1.el6.x86_64

On HE before the issue occurred: (AV9.1)
Red Hat Enterprise Virtualization Manager Version: 3.4.0-0.20.el6ev

How reproducible:
Unclear flow.

Steps to Reproduce:
1. Install HE on a RHEL 6.5 based host.
2. Add an additional host to the setup.
3. Migrate the HE VM to the second host.
4. Migrate the HE VM back to the initial host.
5. Reboot the host running the HE guest.
6. Reboot the second host 5-10 seconds later.

Actual results:
Can't connect to HE, and "ovirt-ha-agent dead but subsys locked" recurs on every start attempt:

[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent:                                   [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent:                                   [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent:                                   [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked

Expected results:
HE should be reachable even after both hosts have been rebooted; the ovirt-ha-agent service should start automatically and stay running.

Additional info:
Logs from both hosts attached.
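For context: on RHEL 6, the SysV init status check prints "dead but subsys locked" when the service's lock file under /var/lock/subsys still exists but the process itself is gone. The following Python snippet is a hypothetical re-implementation of that decision tree (not the actual /etc/init.d/functions code), just to illustrate where the message comes from:

import os

def pid_alive(pid):
    # Signal 0 probes for process existence without delivering anything.
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

def service_status(name, pid=None, subsys_dir="/var/lock/subsys"):
    if pid is not None and pid_alive(pid):
        return "%s (pid %d) is running..." % (name, pid)
    if os.path.exists(os.path.join(subsys_dir, name)):
        # A previous run left its subsys lock behind (e.g. after an
        # unclean reboot), which yields the status seen in this bug.
        return "%s dead but subsys locked" % name
    return "%s is stopped" % name

print(service_status("ovirt-ha-agent"))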
Created attachment 895502 [details] logs from both hosts (agent, sanlock, vdsm).
Created attachment 895504 [details] Broker.log from both hosts
This case manifests a double failure, which can be avoided if you simply add additional HA nodes, so the remaining nodes are still capable of running the engine VM. Having said that, we should fix it in 3.4.z.
(In reply to Doron Fediuck from comment #4)
> This case manifests a double failure, which can be avoided if you simply
> add additional HA nodes, so the remaining nodes are still capable of
> running the engine VM. Having said that, we should fix it in 3.4.z.

The problem existed even with a single host failure, hence its root cause has to be reviewed and fixed anyway.
Seems like the storage is not mounted properly. Can you please post the output of:

$ tree /rhev
(In reply to Jiri Moskovcak from comment #6)
> Seems like the storage is not mounted properly. Can you please post the
> output of:
>
> $ tree /rhev

From rose:
[root@rose05 ~]# tree /rhev
/rhev
`-- data-center
    |-- 5849b030-626e-47cb-ad90-3ce782d831b3
    |-- hsm-tasks
    `-- mnt
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine
        |   |-- b1cf00e3-982c-424a-827c-95984a7d7d2f
        |   |   |-- dom_md
        |   |   |   |-- ids
        |   |   |   |-- inbox
        |   |   |   |-- leases
        |   |   |   |-- metadata
        |   |   |   `-- outbox
        |   |   |-- ha_agent
        |   |   |   |-- hosted-engine.lockspace
        |   |   |   `-- hosted-engine.metadata
        |   |   |-- images
        |   |   |   `-- 42672899-f6a5-4309-aae3-bb309bdbc01a
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166.lease
        |   |   |       `-- cd755562-0063-46b6-bd4a-def730236166.meta
        |   |   `-- master
        |   |       |-- tasks
        |   |       `-- vms
        |   `-- __DIRECT_IO_TEST__
        `-- 10.35.160.108:_RHEV_SD4HEVMs

14 directories, 11 files

From master:
[root@master-vds10 ~]# tree /rhev
/rhev
`-- data-center
    |-- hsm-tasks
    `-- mnt
        |-- 10.35.160.108:_RHEV_artyom__export
        |-- 10.35.160.108:_RHEV_artyom__host__deploy
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine
        |   |-- b1cf00e3-982c-424a-827c-95984a7d7d2f
        |   |   |-- dom_md
        |   |   |   |-- ids
        |   |   |   |-- inbox
        |   |   |   |-- leases
        |   |   |   |-- metadata
        |   |   |   `-- outbox
        |   |   |-- ha_agent
        |   |   |   |-- hosted-engine.lockspace
        |   |   |   `-- hosted-engine.metadata
        |   |   |-- images
        |   |   |   `-- 42672899-f6a5-4309-aae3-bb309bdbc01a
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166.lease
        |   |   |       `-- cd755562-0063-46b6-bd4a-def730236166.meta
        |   |   `-- master
        |   |       |-- tasks
        |   |       `-- vms
        |   `-- __DIRECT_IO_TEST__
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__1
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__storage
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__storage1
        |-- 10.35.160.108:_RHEV_artyom__rhevm34
        |-- 10.35.160.108:_RHEV_nsednev
        |-- 10.35.160.108:_RHEV_SD4HEVMs
        |-- 10.35.160.108:_RHEV_upgrade__hosted__engine
        `-- lion.qa.lab.tlv.redhat.com:_export_alukiano_host__deploy

22 directories, 11 files
What does it mean: "s450 add_lockspace fail result -30"?

Taken from agent.log:
MainThread::INFO::2014-05-15 11:39:59,266::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/ha_agent/hosted-engine.lockspace)
MainThread::ERROR::2014-05-15 11:39:59,333::brokerlink::217::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed
MainThread::CRITICAL::2014-05-15 11:39:59,333::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 303, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 648, in collect_stats
    constants.SERVICE_TYPE)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 187, in _checked_communicate
    response = self._communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 215, in _communicate
    response = util.socket_readline(self._socket, self._log)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 66, in socket_readline
    raise DisconnectionError("Connection closed")
DisconnectionError: Connection closed

Taken from libvirt.log:
2014-02-13 15:11:39+0200 534418 [2757]: s450 lockspace 640a205b-ed14-40e2-ba0d-96a8e25ff645:4:/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids:0
2014-02-13 15:11:39+0200 534418 [24978]: 640a205b aio collect 1 0x7f45ac0008c0:0x7f45ac0008d0:0x7f45ac101000 result -30:0 match res
2014-02-13 15:11:39+0200 534418 [24978]: write_sectors delta_leader offset 1536 rv -30 /rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids
2014-02-13 15:11:40+0200 534419 [2757]: s450 add_lockspace fail result -30
2014-02-13 15:11:46+0200 534425 [2757]: s451 lockspace 640a205b-ed14-40e2-ba0d-96a8e25ff645:4:/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids:0
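For what it's worth, sanlock reports failures as negated errno values, so result -30 would correspond to errno 30, EROFS ("Read-only file system") on Linux; the write_sectors rv -30 above suggests the write to the dom_md/ids file was refused, e.g. because the NFS mount had gone read-only or stale. A small Python sketch to decode such codes, assuming plain negated Linux errno numbering:

import errno
import os

def decode_sanlock_result(rv):
    # Map e.g. -30 -> "EROFS: Read-only file system".
    code = -rv
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s: %s" % (name, os.strerror(code))

print(decode_sanlock_result(-30))  # EROFS: Read-only file system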
After removing the lock file /var/lock/ovirt-ha-agent, the agent starts normally. So the scenario is: we have a host with a running ovirt-ha-agent and we restart this host; for some reason /var/lock/ovirt-ha-agent is not removed properly, and when ovirt-ha-agent tries to start on RHEL boot it fails because it thinks the service is already running (the lock exists).
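For the record, a minimal sketch of the manual workaround described above, assuming the stale lock is the single path /var/lock/ovirt-ha-agent named in this comment (the wording leaves open whether it is a file or a directory containing one):

import os
import subprocess

LOCK = "/var/lock/ovirt-ha-agent"

# Clear the stale lock left behind by the unclean reboot, then restart.
if os.path.exists(LOCK):
    os.remove(LOCK)
subprocess.check_call(["service", "ovirt-ha-agent", "start"])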
(In reply to Artyom from comment #9)
> After removing the lock file /var/lock/ovirt-ha-agent, the agent starts
> normally. So the scenario is: we have a host with a running ovirt-ha-agent
> and we restart this host; for some reason /var/lock/ovirt-ha-agent is not
> removed properly, and when ovirt-ha-agent tries to start on RHEL boot it
> fails because it thinks the service is already running (the lock exists).

Artyom, are you sure? From the logs it seems like it has problems with acquiring the sanlock lock, which is something different than the lock in /var/lock/ovirt-ha-agent. Can you reproduce this 100% of the time - meaning, if you hit the problem with sanlock, is it fixed when you remove /var/lock/ovirt-ha-agent? My guess here is that the sanlock lease just expired between your retries.
(In reply to Jiri Moskovcak from comment #10)
> Artyom, are you sure? From the logs it seems like it has problems with
> acquiring the sanlock lock, which is something different than the lock in
> /var/lock/ovirt-ha-agent. Can you reproduce this 100% of the time -
> meaning, if you hit the problem with sanlock, is it fixed when you remove
> /var/lock/ovirt-ha-agent? My guess here is that the sanlock lease just
> expired between your retries.

If the lock is removed, it's fixed until the next reboot.
Have you tried the scenario flow steps on your environment?
(In reply to Nikolai Sednev from comment #11)
> If the lock is removed, it's fixed until the next reboot.
> Have you tried the scenario flow steps on your environment?

Actually I did that many times last month (rebooted both hosts) and never ran into this problem. Can I get access to your machines when they're in this state?
(In reply to Jiri Moskovcak from comment #12)
> Actually I did that many times last month (rebooted both hosts) and never
> ran into this problem. Can I get access to your machines when they're in
> this state?

Sure, I'll let you know when I have access to a setup, as I currently don't have one at all. Alternatively, if you do have a setup, I can show you whether it reproduces there.
Hi Jiri,
I need working repos to get the setup running. What are the exact repos?
Works for me on these components:
ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
libvirt-0.10.2-29.el6_5.10.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.415.el6_5.14.x86_64
qemu-img-rhev-0.12.1.2-2.415.el6_5.14.x86_64
sanlock-2.8-1.el6.x86_64
Hi Jiri,
Will this fix be backported to 3.4.3 as well?
It blocks bug 1147411: I saw this bug reproduced on 3.4.3, and it didn't allow me to add more hosts to the HE HA cluster, so there was no redundancy...
(In reply to Nikolai Sednev from comment #18)
> Hi Jiri,
> Will this fix be backported to 3.4.3 as well?
> It blocks bug 1147411: I saw this bug reproduced on 3.4.3, and it didn't
> allow me to add more hosts to the HE HA cluster, so there was no
> redundancy...

This should be fixed in ovirt-hosted-engine-ha-1.1.5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0194.html