Bug 1097767

Summary: ovirt-ha-agent dead but subsys locked
Product: Red Hat Enterprise Virtualization Manager
Reporter: Nikolai Sednev <nsednev>
Component: ovirt-hosted-engine-ha
Assignee: Doron Fediuck <dfediuck>
Status: CLOSED ERRATA
QA Contact: Nikolai Sednev <nsednev>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 3.4.0
CC: alukiano, dfediuck, iheim, mavital, nsednev, sbonazzo, sherold
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: 3.5.0
Hardware: x86_64
OS: Linux
Whiteboard: sla
Fixed In Version: ovirt-3.5.0-beta2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1119699 (view as bug list)
Environment:
Last Closed: 2015-02-11 21:08:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1147411
Bug Blocks: 1119699, 1142923, 1156165
Attachments (Description / Flags):
logs from both hosts (agent, sanlock, vdsm) / none
Broker.log from both hosts / none

Description Nikolai Sednev 2014-05-14 13:41:08 UTC
Description of problem:
Can't connect to the hosted engine (HE) after both hosts in the cluster have been rebooted; "service ovirt-ha-agent status" reports "ovirt-ha-agent dead but subsys locked".

Version-Release number of selected component (if applicable):
On both hosts:
vdsm-4.14.7-2.el6ev.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.9.x86_64
libvirt-0.10.2-29.el6_5.7.x86_64
sanlock-2.8-1.el6.x86_64

On HE before the issue occurred:
(AV9.1) Red Hat Enterprise Virtualization Manager Version: 3.4.0-0.20.el6ev

How reproducible:
Unclear flow.

Steps to Reproduce:
1. Install HE on a RHEL 6.5-based host.
2. Add an additional host to the setup.
3. Migrate the HE VM to the second host.
4. Try to migrate the HE VM back to the initial host.
5. Reboot the host running the HE guest.
6. Reboot the second host 5-10 seconds later (a rough shell sketch of this flow follows).
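
The sketch below is only illustrative: it assumes the hosted-engine CLI of the 3.4 era, where the HE VM is moved between hosts by toggling local maintenance, rather than the exact commands used in this reproduction.

# On the host currently running the HE VM: push it to the other host.
hosted-engine --vm-status                       # confirm where the HE VM runs
hosted-engine --set-maintenance --mode=local    # HA agents migrate the VM away
hosted-engine --set-maintenance --mode=none

# On the host that now runs the HE VM: try to push it back the same way.
hosted-engine --set-maintenance --mode=local
hosted-engine --set-maintenance --mode=none

# Reboot the host left running the HE guest, then the second host 5-10 seconds later.
reboot    # on the host with the HE VM
# ...and on the other host, shortly after:
reboot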

Actual results:
Can't connect to the HE, and "ovirt-ha-agent dead but subsys locked" is reported each time you try to bring the service up again:
Starting ovirt-ha-agent: ^[[A                              [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
[root@master-vds10 ~]# service ovirt-ha-agent start
[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent: ^[[A                              [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
[root@master-vds10 ~]# service ovirt-ha-agent start
[root@master-vds10 ~]# service ovirt-ha-agent start
^[[AStarting ovirt-ha-agent:                               [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
[root@master-vds10 ~]# service ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked


Expected results:
The HE should be reachable even after both hosts have been rebooted; the ovirt-ha-agent service should start automatically and stay running.

Additional info:
Logs from both hosts.

Comment 1 Nikolai Sednev 2014-05-14 13:52:36 UTC
Created attachment 895502 [details]
logs from both hosts (agent, sanlock, vdsm).

Comment 2 Nikolai Sednev 2014-05-14 13:57:52 UTC
Created attachment 895504 [details]
Broker.log from both hosts

Comment 4 Doron Fediuck 2014-05-14 18:52:11 UTC
This case is a double failure, which can be avoided simply by adding
additional HA nodes, so that the remaining nodes are still capable of running
the engine VM.
That said, we should fix it in 3.4.z.

Comment 5 Nikolai Sednev 2014-05-15 07:35:41 UTC
(In reply to Doron Fediuck from comment #4)
> This case manifests a double-failure which can be avoided if you simply
> add additional HA nodes, so the remaining nodes are still capable of running
> the engine VM.
> Having that said we should fix it 3.4.z.

The problem existed even with a single host failure, hence its root cause has to be reviewed and fixed anyway.

Comment 6 Jiri Moskovcak 2014-05-15 08:08:22 UTC
It seems like the storage is not mounted properly; can you please post the output of:

$ tree /rhev

Comment 7 Nikolai Sednev 2014-05-15 08:45:09 UTC
(In reply to Jiri Moskovcak from comment #6)
> Seems like the storage is not mounted properly, can you please post the
> output of:
> 
> $ tree /rhev

From rose:
[root@rose05 ~]# tree /rhev
/rhev
`-- data-center
    |-- 5849b030-626e-47cb-ad90-3ce782d831b3
    |-- hsm-tasks
    `-- mnt
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine
        |   |-- b1cf00e3-982c-424a-827c-95984a7d7d2f
        |   |   |-- dom_md
        |   |   |   |-- ids
        |   |   |   |-- inbox
        |   |   |   |-- leases
        |   |   |   |-- metadata
        |   |   |   `-- outbox
        |   |   |-- ha_agent
        |   |   |   |-- hosted-engine.lockspace
        |   |   |   `-- hosted-engine.metadata
        |   |   |-- images
        |   |   |   `-- 42672899-f6a5-4309-aae3-bb309bdbc01a
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166.lease
        |   |   |       `-- cd755562-0063-46b6-bd4a-def730236166.meta
        |   |   `-- master
        |   |       |-- tasks
        |   |       `-- vms
        |   `-- __DIRECT_IO_TEST__
        `-- 10.35.160.108:_RHEV_SD4HEVMs

14 directories, 11 files


from master:
[root@master-vds10 ~]#  tree /rhev
/rhev
`-- data-center
    |-- hsm-tasks
    `-- mnt
        |-- 10.35.160.108:_RHEV_artyom__export
        |-- 10.35.160.108:_RHEV_artyom__host__deploy
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine
        |   |-- b1cf00e3-982c-424a-827c-95984a7d7d2f
        |   |   |-- dom_md
        |   |   |   |-- ids
        |   |   |   |-- inbox
        |   |   |   |-- leases
        |   |   |   |-- metadata
        |   |   |   `-- outbox
        |   |   |-- ha_agent
        |   |   |   |-- hosted-engine.lockspace
        |   |   |   `-- hosted-engine.metadata
        |   |   |-- images
        |   |   |   `-- 42672899-f6a5-4309-aae3-bb309bdbc01a
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166.lease
        |   |   |       `-- cd755562-0063-46b6-bd4a-def730236166.meta
        |   |   `-- master
        |   |       |-- tasks
        |   |       `-- vms
        |   `-- __DIRECT_IO_TEST__
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__1
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__storage
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__storage1
        |-- 10.35.160.108:_RHEV_artyom__rhevm34
        |-- 10.35.160.108:_RHEV_nsednev
        |-- 10.35.160.108:_RHEV_SD4HEVMs
        |-- 10.35.160.108:_RHEV_upgrade__hosted__engine
        `-- lion.qa.lab.tlv.redhat.com:_export_alukiano_host__deploy

22 directories, 11 files

Comment 8 Nikolai Sednev 2014-05-15 09:02:37 UTC
What does "s450 add_lockspace fail result -30" mean?

Taken from agent.log:
MainThread::INFO::2014-05-15 11:39:59,266::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/ha_agent/hosted-engine.lockspace)
MainThread::ERROR::2014-05-15 11:39:59,333::brokerlink::217::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed
MainThread::CRITICAL::2014-05-15 11:39:59,333::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 303, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 648, in collect_stats
    constants.SERVICE_TYPE)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 187, in _checked_communicate
    response = self._communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 215, in _communicate
    response = util.socket_readline(self._socket, self._log)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 66, in socket_readline
    raise DisconnectionError("Connection closed")
DisconnectionError: Connection closed
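
The "Connection closed" DisconnectionError above just means the agent's local connection to ovirt-ha-broker dropped mid-request, i.e. the broker side went away (here, most likely after it ran into the sanlock/storage failures below). A hedged first check on such a host, assuming only the standard service and log names:

# Is the broker service itself still up?
service ovirt-ha-broker status

# Anything useful at the end of the broker's log?
tail -n 50 /var/log/ovirt-hosted-engine-ha/broker.log

# Is any process still listening on the broker's local (Unix) socket?
ss -xlp | grep -i broker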


Taken from libvirt.log:
2014-02-13 15:11:39+0200 534418 [2757]: s450 lockspace 640a205b-ed14-40e2-ba0d-96a8e25ff645:4:/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids:0
2014-02-13 15:11:39+0200 534418 [24978]: 640a205b aio collect 1 0x7f45ac0008c0:0x7f45ac0008d0:0x7f45ac101000 result -30:0 match res
2014-02-13 15:11:39+0200 534418 [24978]: write_sectors delta_leader offset 1536 rv -30 /rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids
2014-02-13 15:11:40+0200 534419 [2757]: s450 add_lockspace fail result -30
2014-02-13 15:11:46+0200 534425 [2757]: s451 lockspace 640a205b-ed14-40e2-ba0d-96a8e25ff645:4:/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids:0
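
As for the -30: sanlock results like this are usually negated errno values (an assumption about its logging convention), and errno 30 on Linux is EROFS, "Read-only file system", which would be consistent with the write to the ids file failing against storage that came back read-only around the reboot. A one-liner to decode such a value:

# Hypothetical decode of "result -30", assuming the -errno convention:
python -c 'import errno, os; print errno.errorcode[30], "-", os.strerror(30)'
# -> EROFS - Read-only file system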

Comment 9 Artyom 2014-05-15 12:02:00 UTC
After removing the lock file from /var/lock/ovirt-ha-agent, the agent starts normally.
So the scenario is: we have a host with ovirt-ha-agent running and we restart this host; for some reason /var/lock/ovirt-ha-agent is not removed properly, and when ovirt-ha-agent tries to start during RHEL boot it fails because it thinks the service is already running (the lock still exists).
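
For context, the "dead but subsys locked" wording comes from the stock RHEL 6 SysV status logic in /etc/init.d/functions: if the process is gone but the init script's lock file is still present, status prints exactly that. A minimal sketch of that check and of the workaround described in this comment (the lock paths follow this comment plus the usual /var/lock/subsys convention):

# Rough approximation of the status() logic in /etc/init.d/functions:
prog=ovirt-ha-agent
if pgrep -f "$prog" >/dev/null; then
    echo "$prog is running"
elif [ -e "/var/lock/subsys/$prog" ] || [ -e "/var/lock/$prog" ]; then
    # process gone, but the lock was never cleaned up on reboot/crash
    echo "$prog dead but subsys locked"
else
    echo "$prog is stopped"
fi

# Workaround from this comment: clear the stale lock, then start again.
rm -f "/var/lock/subsys/$prog" "/var/lock/$prog"
service "$prog" start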

Comment 10 Jiri Moskovcak 2014-06-30 12:25:20 UTC
(In reply to Artyom from comment #9)
> After remove lock file from /var/lock/ovirt-ha-agent agent start normally.
> So scenario, is we have host with running ovirt-ha-agent and we restart this
> host, from some reason /var/lock/ovirt-ha-agent not removed properly and now
> when ovirt-ha-agent try to start on rhel boot it failed because he thinks
> that service already run(lock exist)

Artyom, are you sure? From the logs it seems like it has problems with acquiring the sanlock lock, which is something different from the lock in /var/lock/ovirt-ha-agent. Can you reproduce this 100% of the time - meaning, if you hit the problem with sanlock, is it fixed when you remove /var/lock/ovirt-ha-agent? My guess here is that the sanlock lease just expired between your retries.
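
To tell the two locks apart on an affected host, a hedged sketch (sanlock client status is the standard lockspace query; the file paths follow this thread and the usual subsys convention):

# 1) sanlock side: which lockspaces / host-id leases does sanlock currently hold?
sanlock client status

# 2) init-script side: is a stale service lock left over from before the reboot?
ls -l /var/lock/ovirt-ha-agent /var/lock/subsys/ovirt-ha-agent 2>/dev/null

# A leftover file in (2) with no agent process gives "dead but subsys locked";
# sanlock/storage trouble shows up in (1) and in /var/log/sanlock.log instead.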

Comment 11 Nikolai Sednev 2014-07-01 05:43:24 UTC
(In reply to Jiri Moskovcak from comment #10)
> (In reply to Artyom from comment #9)
> > After remove lock file from /var/lock/ovirt-ha-agent agent start normally.
> > So scenario, is we have host with running ovirt-ha-agent and we restart this
> > host, from some reason /var/lock/ovirt-ha-agent not removed properly and now
> > when ovirt-ha-agent try to start on rhel boot it failed because he thinks
> > that service already run(lock exist)
> 
> Artyom, are you sure? Form the logs it seems like it has problems with
> acquireing the sanlock lock which is something different than the lock in
> /var/lock/ovirt-ha-agent. Can you reproduce this in 100% - meaning if you
> hit the problem with the sanlock, is it fixed when you remove the
> /var/lock/ovirt-ha-agent ?  My guess here is that the sanlock just expired
> between your re-tries.

If the lock file is removed, it's fixed until the next reboot.
Have you tried the reproduction steps from the scenario on your environment?

Comment 12 Jiri Moskovcak 2014-07-01 13:06:22 UTC
(In reply to Nikolai Sednev from comment #11)
> (In reply to Jiri Moskovcak from comment #10)
> > (In reply to Artyom from comment #9)
> > > After remove lock file from /var/lock/ovirt-ha-agent agent start normally.
> > > So scenario, is we have host with running ovirt-ha-agent and we restart this
> > > host, from some reason /var/lock/ovirt-ha-agent not removed properly and now
> > > when ovirt-ha-agent try to start on rhel boot it failed because he thinks
> > > that service already run(lock exist)
> > 
> > Artyom, are you sure? Form the logs it seems like it has problems with
> > acquireing the sanlock lock which is something different than the lock in
> > /var/lock/ovirt-ha-agent. Can you reproduce this in 100% - meaning if you
> > hit the problem with the sanlock, is it fixed when you remove the
> > /var/lock/ovirt-ha-agent ?  My guess here is that the sanlock just expired
> > between your re-tries.
> 
> If log removed it's fixed until next reboot.
> Have you tried the scenario flow steps on your environment?

Actually I did that many times last month (rebooted both hosts) and never ran into this problem. Can I get access to your machines when they're in this state?

Comment 13 Nikolai Sednev 2014-07-02 07:42:21 UTC
(In reply to Jiri Moskovcak from comment #12)
> (In reply to Nikolai Sednev from comment #11)
> > (In reply to Jiri Moskovcak from comment #10)
> > > (In reply to Artyom from comment #9)
> > > > After remove lock file from /var/lock/ovirt-ha-agent agent start normally.
> > > > So scenario, is we have host with running ovirt-ha-agent and we restart this
> > > > host, from some reason /var/lock/ovirt-ha-agent not removed properly and now
> > > > when ovirt-ha-agent try to start on rhel boot it failed because he thinks
> > > > that service already run(lock exist)
> > > 
> > > Artyom, are you sure? Form the logs it seems like it has problems with
> > > acquireing the sanlock lock which is something different than the lock in
> > > /var/lock/ovirt-ha-agent. Can you reproduce this in 100% - meaning if you
> > > hit the problem with the sanlock, is it fixed when you remove the
> > > /var/lock/ovirt-ha-agent ?  My guess here is that the sanlock just expired
> > > between your re-tries.
> > 
> > If log removed it's fixed until next reboot.
> > Have you tried the scenario flow steps on your environment?
> 
> Actually I did that many times last month (rebooted both hosts) and never
> ran into this problem. Can I get access to your machines when they're in
> this state?

Sure, I'll inform you when I have access to a setup, as currently I don't have one at all.
Alternatively, if you do have a setup, I can show you there whether it is reproducible.

Comment 16 Nikolai Sednev 2014-07-24 14:30:10 UTC
Hi Jiri,
I need working repos to get the setup running; what are the exact repos?

Comment 17 Nikolai Sednev 2014-08-12 13:36:40 UTC
Works for me on these components:
ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
libvirt-0.10.2-29.el6_5.10.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.415.el6_5.14.x86_64
qemu-img-rhev-0.12.1.2-2.415.el6_5.14.x86_64
sanlock-2.8-1.el6.x86_64

Comment 18 Nikolai Sednev 2014-10-21 12:57:24 UTC
Hi Jiri,
Will this fix be backported to 3.4.3 as well?
It blocks bug 1147411, as I saw this bug reproduced on 3.4.3 and it didn't allow me to add more hosts to the HE HA cluster, leaving no redundancy...

Comment 19 Jiri Moskovcak 2014-10-24 08:41:54 UTC
(In reply to Nikolai Sednev from comment #18)
> Hi Jiri,
> Will this fix be backported to 3.4.3 as well?
> It blocks 1147411, as I saw this bug reproduced on 3.4.3 and didn't allowed
> me adding more hosts to HE HA cluster, no redundancy...

This should be fixed in ovirt-hosted-engine-ha-1.1.5
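
A simple way to check whether a given host already carries that fix (just a package-version query; 1.1.5 is taken from the comment above):

# Expect ovirt-hosted-engine-ha >= 1.1.5 on a fixed host:
rpm -q ovirt-hosted-engine-ha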

Comment 22 errata-xmlrpc 2015-02-11 21:08:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0194.html