Description of problem:
Can't connect to the hosted engine (HE) after both hosts in the cluster have been rebooted; the ovirt-ha-agent service status reports "ovirt-ha-agent dead but subsys locked".

Version-Release number of selected component (if applicable):
On both hosts:
vdsm-4.14.7-2.el6ev.x86_64
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.9.x86_64
libvirt-0.10.2-29.el6_5.7.x86_64
sanlock-2.8-1.el6.x86_64

On HE before the issue occurred: (AV9.1)
Red Hat Enterprise Virtualization Manager Version: 3.4.0-0.20.el6ev

How reproducible:
Unclear flow.

Steps to Reproduce:
1. Install HE on a RHEL 6.5 based host.
2. Add an additional host to the setup.
3. Migrate the HE VM to the second host.
4. Migrate the HE VM back to the initial host.
5. Reboot the host running the HE guest.
6. Reboot the second host 5-10 seconds later.

Actual results:
Can't connect to HE, and "ovirt-ha-agent dead but subsys locked" recurs on every start attempt:

[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent:                                   [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent:                                   [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent start
Starting ovirt-ha-agent:                                   [  OK  ]
[root@master-vds10 ~]# service ovirt-ha-agent status
ovirt-ha-agent dead but subsys locked

Expected results:
HE should be reachable even after both hosts have been rebooted; the ovirt-ha-agent service should start automatically and stay running.

Additional info:
Logs from both hosts attached.
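For context: on RHEL 6, the SysV init status check prints "dead but subsys locked" when the service's lock file under /var/lock/subsys still exists but the process itself is gone. The following Python snippet is a hypothetical re-implementation of that decision tree (not the actual /etc/init.d/functions code), just to illustrate where the message comes from:

import os

def pid_alive(pid):
    # Signal 0 probes for process existence without delivering anything.
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

def service_status(name, pid=None, subsys_dir="/var/lock/subsys"):
    if pid is not None and pid_alive(pid):
        return "%s (pid %d) is running..." % (name, pid)
    if os.path.exists(os.path.join(subsys_dir, name)):
        # A previous run left its subsys lock behind (e.g. after an
        # unclean reboot), which yields the status seen in this bug.
        return "%s dead but subsys locked" % name
    return "%s is stopped" % name

print(service_status("ovirt-ha-agent"))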
Created attachment 895502 [details] logs from both hosts (agent, sanlock, vdsm).
Created attachment 895504 [details] Broker.log from both hosts
This case manifests a double failure, which can be avoided if you simply add additional HA nodes, so the remaining nodes are still capable of running the engine VM. Having said that, we should fix it in 3.4.z.
(In reply to Doron Fediuck from comment #4)
> This case manifests a double failure, which can be avoided if you simply
> add additional HA nodes, so the remaining nodes are still capable of
> running the engine VM. Having said that, we should fix it in 3.4.z.

The problem existed even with a single host failure, hence its root cause has to be reviewed and fixed anyway.
Seems like the storage is not mounted properly. Can you please post the output of:

$ tree /rhev
(In reply to Jiri Moskovcak from comment #6)
> Seems like the storage is not mounted properly. Can you please post the
> output of:
>
> $ tree /rhev

From rose:
[root@rose05 ~]# tree /rhev
/rhev
`-- data-center
    |-- 5849b030-626e-47cb-ad90-3ce782d831b3
    |-- hsm-tasks
    `-- mnt
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine
        |   |-- b1cf00e3-982c-424a-827c-95984a7d7d2f
        |   |   |-- dom_md
        |   |   |   |-- ids
        |   |   |   |-- inbox
        |   |   |   |-- leases
        |   |   |   |-- metadata
        |   |   |   `-- outbox
        |   |   |-- ha_agent
        |   |   |   |-- hosted-engine.lockspace
        |   |   |   `-- hosted-engine.metadata
        |   |   |-- images
        |   |   |   `-- 42672899-f6a5-4309-aae3-bb309bdbc01a
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166.lease
        |   |   |       `-- cd755562-0063-46b6-bd4a-def730236166.meta
        |   |   `-- master
        |   |       |-- tasks
        |   |       `-- vms
        |   `-- __DIRECT_IO_TEST__
        `-- 10.35.160.108:_RHEV_SD4HEVMs

14 directories, 11 files

From master:
[root@master-vds10 ~]# tree /rhev
/rhev
`-- data-center
    |-- hsm-tasks
    `-- mnt
        |-- 10.35.160.108:_RHEV_artyom__export
        |-- 10.35.160.108:_RHEV_artyom__host__deploy
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine
        |   |-- b1cf00e3-982c-424a-827c-95984a7d7d2f
        |   |   |-- dom_md
        |   |   |   |-- ids
        |   |   |   |-- inbox
        |   |   |   |-- leases
        |   |   |   |-- metadata
        |   |   |   `-- outbox
        |   |   |-- ha_agent
        |   |   |   |-- hosted-engine.lockspace
        |   |   |   `-- hosted-engine.metadata
        |   |   |-- images
        |   |   |   `-- 42672899-f6a5-4309-aae3-bb309bdbc01a
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166
        |   |   |       |-- cd755562-0063-46b6-bd4a-def730236166.lease
        |   |   |       `-- cd755562-0063-46b6-bd4a-def730236166.meta
        |   |   `-- master
        |   |       |-- tasks
        |   |       `-- vms
        |   `-- __DIRECT_IO_TEST__
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__1
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__storage
        |-- 10.35.160.108:_RHEV_artyom__hosted__engine__storage1
        |-- 10.35.160.108:_RHEV_artyom__rhevm34
        |-- 10.35.160.108:_RHEV_nsednev
        |-- 10.35.160.108:_RHEV_SD4HEVMs
        |-- 10.35.160.108:_RHEV_upgrade__hosted__engine
        `-- lion.qa.lab.tlv.redhat.com:_export_alukiano_host__deploy

22 directories, 11 files
What does it mean: "s450 add_lockspace fail result -30"?

Taken from agent.log:
MainThread::INFO::2014-05-15 11:39:59,266::hosted_engine::453::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Ensuring lease for lockspace hosted-engine, host id 1 is acquired (file: /rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__hosted__engine/b1cf00e3-982c-424a-827c-95984a7d7d2f/ha_agent/hosted-engine.lockspace)
MainThread::ERROR::2014-05-15 11:39:59,333::brokerlink::217::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed
MainThread::CRITICAL::2014-05-15 11:39:59,333::agent::103::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Could not start ha-agent
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 97, in run
    self._run_agent()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/agent.py", line 154, in _run_agent
    hosted_engine.HostedEngine(self.shutdown_requested).start_monitoring()
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 303, in start_monitoring
    for old_state, state, delay in self.fsm:
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/fsm/machine.py", line 125, in next
    new_data = self.refresh(self._state.data)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/state_machine.py", line 77, in refresh
    stats.update(self.hosted_engine.collect_stats())
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 648, in collect_stats
    constants.SERVICE_TYPE)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 171, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 187, in _checked_communicate
    response = self._communicate(request)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 215, in _communicate
    response = util.socket_readline(self._socket, self._log)
  File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 66, in socket_readline
    raise DisconnectionError("Connection closed")
DisconnectionError: Connection closed

Taken from libvirt.log:
2014-02-13 15:11:39+0200 534418 [2757]: s450 lockspace 640a205b-ed14-40e2-ba0d-96a8e25ff645:4:/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids:0
2014-02-13 15:11:39+0200 534418 [24978]: 640a205b aio collect 1 0x7f45ac0008c0:0x7f45ac0008d0:0x7f45ac101000 result -30:0 match res
2014-02-13 15:11:39+0200 534418 [24978]: write_sectors delta_leader offset 1536 rv -30 /rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids
2014-02-13 15:11:40+0200 534419 [2757]: s450 add_lockspace fail result -30
2014-02-13 15:11:46+0200 534425 [2757]: s451 lockspace 640a205b-ed14-40e2-ba0d-96a8e25ff645:4:/rhev/data-center/mnt/10.35.160.108:_RHEV_artyom__ovirt34/640a205b-ed14-40e2-ba0d-96a8e25ff645/dom_md/ids:0
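For what it's worth, sanlock reports failures as negated errno values, so result -30 would correspond to errno 30, EROFS ("Read-only file system") on Linux; the write_sectors rv -30 above suggests the write to the dom_md/ids file was refused, e.g. because the NFS mount had gone read-only or stale. A small Python sketch to decode such codes, assuming plain negated Linux errno numbering:

import errno
import os

def decode_sanlock_result(rv):
    # Map e.g. -30 -> "EROFS: Read-only file system".
    code = -rv
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s: %s" % (name, os.strerror(code))

print(decode_sanlock_result(-30))  # EROFS: Read-only file system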
After removing the lock file /var/lock/ovirt-ha-agent, the agent starts normally. So the scenario is: we have a host with a running ovirt-ha-agent and we restart this host; for some reason /var/lock/ovirt-ha-agent is not removed properly, and when ovirt-ha-agent tries to start on RHEL boot it fails because it thinks the service is already running (the lock exists).
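For the record, a minimal sketch of the manual workaround described above, assuming the stale lock is the single path /var/lock/ovirt-ha-agent named in this comment (the wording leaves open whether it is a file or a directory containing one):

import os
import subprocess

LOCK = "/var/lock/ovirt-ha-agent"

# Clear the stale lock left behind by the unclean reboot, then restart.
if os.path.exists(LOCK):
    os.remove(LOCK)
subprocess.check_call(["service", "ovirt-ha-agent", "start"])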
(In reply to Artyom from comment #9)
> After removing the lock file /var/lock/ovirt-ha-agent, the agent starts
> normally. So the scenario is: we have a host with a running ovirt-ha-agent
> and we restart this host; for some reason /var/lock/ovirt-ha-agent is not
> removed properly, and when ovirt-ha-agent tries to start on RHEL boot it
> fails because it thinks the service is already running (the lock exists).

Artyom, are you sure? From the logs it seems like it has problems with acquiring the sanlock lock, which is something different than the lock in /var/lock/ovirt-ha-agent. Can you reproduce this 100% of the time - meaning, if you hit the problem with sanlock, is it fixed when you remove /var/lock/ovirt-ha-agent? My guess here is that the sanlock lease just expired between your retries.
(In reply to Jiri Moskovcak from comment #10)
> Artyom, are you sure? From the logs it seems like it has problems with
> acquiring the sanlock lock, which is something different than the lock in
> /var/lock/ovirt-ha-agent. Can you reproduce this 100% of the time -
> meaning, if you hit the problem with sanlock, is it fixed when you remove
> /var/lock/ovirt-ha-agent? My guess here is that the sanlock lease just
> expired between your retries.

If the lock is removed, it's fixed until the next reboot.
Have you tried the scenario flow steps on your environment?
(In reply to Nikolai Sednev from comment #11)
> If the lock is removed, it's fixed until the next reboot.
> Have you tried the scenario flow steps on your environment?

Actually I did that many times last month (rebooted both hosts) and never ran into this problem. Can I get access to your machines when they're in this state?
(In reply to Jiri Moskovcak from comment #12)
> Actually I did that many times last month (rebooted both hosts) and never
> ran into this problem. Can I get access to your machines when they're in
> this state?

Sure, I'll let you know when I have access to a setup, as I currently don't have one at all. Alternatively, if you do have a setup, I can show you whether it reproduces there.
Hi Jiri,
I need working repos to get the setup running. What are the exact repos?
Works for me on these components:
ovirt-engine-setup-3.5.0-0.0.master.20140804172041.git23b558e.el6.noarch
qemu-kvm-rhev-0.12.1.2-2.415.el6_5.14.x86_64
libvirt-0.10.2-29.el6_5.10.x86_64
vdsm-4.16.1-6.gita4a4614.el6.x86_64
qemu-kvm-rhev-tools-0.12.1.2-2.415.el6_5.14.x86_64
qemu-img-rhev-0.12.1.2-2.415.el6_5.14.x86_64
sanlock-2.8-1.el6.x86_64
Hi Jiri,
Will this fix be backported to 3.4.3 as well?
It blocks bug 1147411: I saw this bug reproduced on 3.4.3, and it didn't allow me to add more hosts to the HE HA cluster, so there was no redundancy...
(In reply to Nikolai Sednev from comment #18)
> Hi Jiri,
> Will this fix be backported to 3.4.3 as well?
> It blocks bug 1147411: I saw this bug reproduced on 3.4.3, and it didn't
> allow me to add more hosts to the HE HA cluster, so there was no
> redundancy...

This should be fixed in ovirt-hosted-engine-ha-1.1.5
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0194.html