Bug 1318548 - Auto import hosted engine domain fails after engine DB restored on it from bare-metal engine deployment
Summary: Auto import hosted engine domain fails after engine DB restored on it from bare-metal engine deployment
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.HostedEngine
Version: 3.6.4
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.5
Target Release: ---
Assignee: Martin Sivák
QA Contact: Nikolai Sednev
URL: https://drive.google.com/a/redhat.com...
Whiteboard:
Depends On:
Blocks: 1336614
 
Reported: 2016-03-17 08:33 UTC by Nikolai Sednev
Modified: 2017-05-11 09:23 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-21 14:37:14 UTC
oVirt Team: SLA
Embargoed:
dfediuck: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack+
rule-engine: testing_ack+


Attachments
engine's sosreport (7.28 MB, application/x-xz)
2016-03-17 08:44 UTC, Nikolai Sednev
no flags Details
Screenshot from 2016-03-17 16:53:40.png (190.71 KB, image/png)
2016-03-17 14:54 UTC, Nikolai Sednev
no flags Details

Description Nikolai Sednev 2016-03-17 08:33:43 UTC
Description of problem:
Auto import of the hosted engine domain fails after the engine DB is restored on it from a bare-metal engine deployment.

I had a bare-metal regular engine installation with one host and 10 guest VMs, an ISO domain, an export domain and one NFS data SD. The 3.6 engine was cleanly installed on an el7.2 host named alma03.
I followed http://brq-setup.rhev.lab.eng.brq.redhat.com/ovirt-engine/docs/manual/en_US/html/Self-Hosted_Engine_Guide/chap-Migrating_from_Bare_Metal_to_a_RHEL-Based_Self-Hosted_Environment.html and http://brq-setup.rhev.lab.eng.brq.redhat.com/ovirt-engine/docs/manual/en_US/html/Self-Hosted_Engine_Guide/Restoring_the_Self-Hosted_Engine_Manager.html in order to take an engine DB backup and then restore it on the HE VM during its deployment on a host named seal10, over an NFS SD named "nsednev_3_6_he_backedup_from_alma_03".
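
Roughly, the backup and restore steps looked like this (file names below are placeholders, not the exact ones I used; since I also have DWH and reports, the restore additionally needs the corresponding --provision-dwh-db/--provision-reports-db options):

# On the bare-metal engine (alma03):
engine-backup --mode=backup --file=engine-backup.tar.gz --log=engine-backup.log

# On the engine appliance VM, before running engine-setup (see steps 7-8 below):
engine-backup --mode=restore --file=engine-backup.tar.gz --log=engine-restore.log --provision-db --restore-permissions
engine-setup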

Deployment finished successfully, and I also added an additional host named alma04 as a second HE host.
I see all the previously available guest VMs, the ISO domain and the data SD, but auto-import fails to import the HE SD into the engine.

I tried to destroy the "hosted_storage" domain in the "Storage" tab, but it did not help; "hosted_storage" returned to the same state each of the at least three times I tried this. My DC and host cluster are in 3.6 compatibility mode.

Version-Release number of selected component (if applicable):
Hosts:
vdsm-4.17.23-0.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.10.x86_64
libvirt-client-1.2.17-13.el7_2.4.x86_64
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
mom-0.5.2-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.4.3-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
Linux version 3.10.0-327.13.1.el7.x86_64 (mockbuild.eng.bos.redhat.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) ) #1 SMP Mon Feb 29 13:22:02 EST 2016


Engine:
rhevm-dependencies-3.6.0-1.el6ev.noarch
rhevm-branding-rhev-3.6.0-8.el6ev.noarch
rhevm-sdk-python-3.6.3.0-1.el6ev.noarch
rhevm-reports-3.6.3-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.3.4-0.1.el6.noarch
rhevm-dbscripts-3.6.3.4-0.1.el6.noarch
rhevm-spice-client-x86-cab-3.6-6.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.3.4-0.1.el6.noarch
rhevm-backend-3.6.3.4-0.1.el6.noarch
rhevm-spice-client-x86-msi-3.6-6.el6.noarch
rhevm-guest-agent-common-1.0.11-2.el6ev.noarch
rhevm-setup-base-3.6.3.4-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.3.4-0.1.el6.noarch
rhevm-vmconsole-proxy-helper-3.6.3.4-0.1.el6.noarch
rhevm-restapi-3.6.3.4-0.1.el6.noarch
rhevm-doc-3.6.0-4.el6eng.noarch
rhevm-spice-client-x64-cab-3.6-6.el6.noarch
rhevm-setup-plugins-3.6.3-1.el6ev.noarch
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-dwh-3.6.2-1.el6ev.noarch
rhevm-cli-3.6.2.0-1.el6ev.noarch
rhevm-lib-3.6.3.4-0.1.el6.noarch
rhevm-websocket-proxy-3.6.3.4-0.1.el6.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.3.4-0.1.el6.noarch
rhevm-userportal-3.6.3.4-0.1.el6.noarch
rhevm-3.6.3.4-0.1.el6.noarch
rhevm-spice-client-x64-msi-3.6-6.el6.noarch
rhevm-image-uploader-3.6.0-1.el6ev.noarch
rhevm-dwh-setup-3.6.2-1.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.6.3.4-0.1.el6.noarch
rhevm-tools-3.6.3.4-0.1.el6.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
rhevm-reports-setup-3.6.3-1.el6ev.noarch
rhevm-setup-3.6.3.4-0.1.el6.noarch
rhevm-webadmin-portal-3.6.3.4-0.1.el6.noarch


How reproducible:


Steps to Reproduce:
1. Deploy a 3.6 engine on a bare-metal el7.2 host (engine FQDN = host FQDN), with DWH, reports and serial console.
2. Add an ISO domain, an export domain and an NFS data SD.
3. Add a host on which you will run some guest VMs.
4. Create five el6.6 and five el7.2 guest VMs and start them.
5. Follow the engine backup procedure from http://brq-setup.rhev.lab.eng.brq.redhat.com/ovirt-engine/docs/manual/en_US/html/Self-Hosted_Engine_Guide/chap-Migrating_from_Bare_Metal_to_a_RHEL-Based_Self-Hosted_Environment.html.
6. On an additional el7.2 host, start deployment of HE, using the appliance with cloud-init (see the command sketch after this list).
7. During HE deployment, answer "No" to the question "Automatically execute engine-setup on the engine appliance on first boot (Yes, No)[Yes]? ".
8. Follow the instructions and restore the engine's DB from the files backed up from your bare-metal engine.
9. Finish the HE deployment.
10. Add an additional HE host to your environment to meet the minimum HA requirements for your HE VM.
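
A rough sketch of steps 6-10 (package names and the interactive answers below are assumptions based on my environment, not an exact transcript):

# On the first HE host (seal10), step 6:
yum install ovirt-hosted-engine-setup     # plus the engine appliance image
hosted-engine --deploy
#   - choose the appliance-based installation with cloud-init
#   - answer "No" to running engine-setup automatically on first boot (step 7)
#   - restore the engine DB inside the appliance and run engine-setup there (step 8)
#   - let the deployment finish (step 9)

# On the additional HE host (alma04), step 10:
yum install ovirt-hosted-engine-setup
hosted-engine --deploy                    # pointing at the same HE storage domain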


Actual results:
The HE storage domain is not auto-imported into the engine's web UI.

Expected results:
The HE hosted_storage domain should be auto-imported into the HE web UI.

Additional info:
sosreports from both hosts and HE are attached.

Comment 1 Nikolai Sednev 2016-03-17 08:44:32 UTC
Created attachment 1137330 [details]
engine's sosreport

Comment 2 Nikolai Sednev 2016-03-17 08:52:02 UTC
Attaching sosreports from hosts as external sources:
sosreport from additional hosted-engine host (alma04):
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88ejhfZ0YyWWVkMG8/view?usp=sharing

sosreport from first hosted-engine host (seal10):
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88X1UyeWpfSF8tc3M/view?usp=sharing

Comment 3 Nikolai Sednev 2016-03-17 08:58:08 UTC
Lowering the severity, as the HE is actually running and functioning properly, except for the storage domain not being auto-imported into the web UI.

Comment 4 Nikolai Sednev 2016-03-17 14:52:06 UTC
Looking into https://bugzilla.redhat.com/show_bug.cgi?id=1269768 and checking the current OVF_STORE location in my environment, I see that it is located in nsednev_3_6_p2v_he_1, which is my data SD for regular guest VMs.
Shouldn't the OVF_STORE be located in the HE SD, which is nsednev_3_6_he_backedup_from_alma_03?

Comment 5 Nikolai Sednev 2016-03-17 14:54:46 UTC
Created attachment 1137411 [details]
Screenshot from 2016-03-17 16:53:40.png

Comment 6 Roy Golan 2016-03-18 09:03:23 UTC
(In reply to Nikolai Sednev from comment #4)
> Looking into https://bugzilla.redhat.com/show_bug.cgi?id=1269768 and
> checking the current OVF_STORE location in my environment, I see that it is
> located in nsednev_3_6_p2v_he_1, which is my data SD for regular guest VMs.
> Shouldn't the OVF_STORE be located in the HE SD, which is
> nsednev_3_6_he_backedup_from_alma_03?

An OVF_STORE (two, actually) will be created by the engine for each domain with VM disks *if* the engine knows about it. Since the engine has not imported the HE VM yet, those special disks are not created yet.

Comment 7 Yedidyah Bar David 2016-03-20 11:06:55 UTC
Now spent some time looking at various attached logs, per Roy's request.

The flow seems to have been:

1. alma04 was a host managed by the engine prior to the migration
2. At some point it was removed from the engine.
3. Then, as described above, engine was migrated from (old, physical host) alma03 to a VM on hosted-engine host seal10
4. Also as described above, alma04 was added as an additional hosted-engine host.
5. It seems to me that the engine decided to use alma04 to import the hosted storage, tried to get a sanlock lock, and failed.

agent.log on alma04 has:

MainThread::INFO::2016-03-16 17:57:27,877::hosted_engine::757::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_sanlock) Acquired lock on host id 2

but later:

MainThread::ERROR::2016-03-16 18:01:36,566::hosted_engine::845::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain (sd_uuid=5e3a8253-7dd5-48a4-9070-9edb741b4383, host_id=2): timeout during domain acquisition

and similar.

sanlock.log on alma04 has:

2016-03-16 16:57:09+0200 184681 [121112]: s3 delta_acquire host_id 1 busy1 1 2 14210 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
2016-03-16 16:57:10+0200 184682 [58002]: s3 add_lockspace fail result -262

2016-03-16 17:37:33+0200 187106 [57992]: s4 host 1 2 187084 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.
2016-03-16 17:37:33+0200 187106 [57992]: s4 host 2 2 1794 c6604252-9fe5-47cd-99e0-ec8016d20abd.seal10.qa.
2016-03-16 17:37:33+0200 187106 [57992]: s4 host 250 1 0 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.

sanlock.log on seal10 has no 'fail', but does have:

2016-03-16 14:56:52+0200 7001 [7665]: s3:r3 resource dc4a1da7-e8ad-4ebf-bcb9-5c4342c62f52:SDM:/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmpktBTvH/dc4a1da7-e8ad-4ebf-bcb9-5c4342c62f52/dom_md/leases:1048576 for 3,11,7081
2016-03-16 14:56:52+0200 7001 [6166]: s4 host 1 1 6980 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
2016-03-16 14:56:52+0200 7001 [6166]: s4 host 250 1 0 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
2016-03-16 14:56:52+0200 7001 [6166]: s3 host 1 1 6980 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
2016-03-16 14:56:52+0200 7001 [6166]: s3 host 250 1 0 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.

and later:

2016-03-16 16:55:09+0200 14098 [7661]: add_lockspace 5e3a8253-7dd5-48a4-9070-9edb741b4383:2:/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__he__backedup__from__alma__03/5e3a8253-7dd5-48a4-9070-9edb741b4383/dom_md/ids:0 conflicts with name of list1 s5 5e3a8253-7dd5-48a4-9070-9edb741b4383:1:/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Virt_nsednev__3__6__he__backedup__from__alma__03/5e3a8253-7dd5-48a4-9070-9edb741b4383/dom_md/ids:0
2016-03-16 16:55:10+0200 14099 [6166]: s6 host 1 1 184548 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.
2016-03-16 16:55:10+0200 14099 [6166]: s6 host 2 1 14078 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
2016-03-16 16:55:10+0200 14099 [6166]: s6 host 250 1 0 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.

and many more 'conflicts'. I can't properly read sanlock.log files, but it does not seem ok to me.

I suggest to ask some storage people about this.
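
For reference, a quick way to inspect the live lockspace state on each host, rather than reading sanlock.log (I have not run these on the hosts in question; the lockspace name here would be the SD UUID 5e3a8253-7dd5-48a4-9070-9edb741b4383):

# lockspaces and resources currently held by this host
sanlock client status
# per-host state within a given lockspace
sanlock client host_status -s <lockspace_name>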

Comment 8 Roy Golan 2016-03-20 12:49:39 UTC
Didi thanks.

alma04 is RHEL 7.2 Beta. This shouldn't be supported. Please upgrade it and retry.

Comment 9 Nikolai Sednev 2016-03-21 06:12:09 UTC
(In reply to Roy Golan from comment #8)
> Didi thanks.
> 
> alma04 is RHEL 7.2 Beta. This shouldn't be supported. Please upgrade it and
> retry.

It did not help.
[root@alma04 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
[root@alma04 ~]# uname -a
Linux alma04.qa.lab.tlv.redhat.com 3.10.0-327.13.1.el7.x86_64 #1 SMP Mon Feb 29 13:22:02 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@alma04 ~]# rpm -qa libvirt-client sanlock qemu-kvm-rhev vdsm mom ovirt*
ovirt-vmconsole-1.0.0-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.10.x86_64
mom-0.5.2-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
vdsm-4.17.23.1-0.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch


[root@seal10 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)
[root@seal10 ~]# uname -a
Linux seal10.qa.lab.tlv.redhat.com 3.10.0-327.13.1.el7.x86_64 #1 SMP Mon Feb 29 13:22:02 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
[root@seal10 ~]# rpm -qa libvirt-client sanlock qemu-kvm-rhev vdsm mom ovirt*
qemu-kvm-rhev-2.3.0-31.el7_2.10.x86_64
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
libvirt-client-1.2.17-13.el7_2.4.x86_64
sanlock-3.2.4-2.el7_2.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
vdsm-4.17.23.1-0.el7ev.noarch
mom-0.5.2-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch


[root@alma04 ~]# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : seal10.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 924f2fc8
Host timestamp                     : 51040


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 15b51483
Host timestamp                     : 51166

Taken from the web UI Events:

Mar 21, 2016 8:00:40 AM
VDSM command failed: Cannot acquire host id: (u'5e3a8253-7dd5-48a4-9070-9edb741b4383', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

Mar 21, 2016 8:00:39 AM
Storage Pool Manager runs on Host hosted_engine_1 (Address: seal10.qa.lab.tlv.redhat.com).

Mar 21, 2016 8:00:32 AM
Data Center is being initialized, please wait for initialization to complete.

Mar 21, 2016 8:00:30 AM
VDSM command failed: Cannot acquire host id: (u'5e3a8253-7dd5-48a4-9070-9edb741b4383', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

Mar 21, 2016 8:00:28 AM
Storage Domain hosted_storage was added by SYSTEM

Mar 21, 2016 8:00:18 AM
Storage Domain hosted_storage was forcibly removed by admin@internal



See comment 30 from https://bugzilla.redhat.com/show_bug.cgi?id=1269768: "Dec 16, 2015 12:16:57 PM VDSM hosted_engine_1 command failed: Cannot acquire host id: (u'97f5a165-4df5-4bce-99cc-8a634753bc54', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))". It looks pretty much the same.

It might also be related to https://bugzilla.redhat.com/show_bug.cgi?id=1305768

Comment 10 Nikolai Sednev 2016-03-21 07:11:48 UTC
Attaching more logs from current state:
https://drive.google.com/a/redhat.com/file/d/0B85BEaDBcF88aWJxWDEwcDU2c1U/view?usp=sharing

Comment 11 Nikolai Sednev 2016-03-23 08:42:52 UTC
(In reply to Yedidyah Bar David from comment #7)
> Now spent some time looking at various attached logs, per Roy's request.
> 
> The flow seems to have been:
> 
> 1. alma04 was a host managed by the engine prior to the migration
> 2. At some point it was removed from the engine.
> 3. Then, as described above, engine was migrated from (old, physical host)
> alma03 to a VM on hosted-engine host seal10
> 4. Also as described above, alma04 was added as an additional hosted-engine
> host.
> 5. It seems to me that the engine decided to use alma04 to import the hosted
> storage, tried to get a sanlock lock, and failed.
> 
> agent.log on alma04 has:
> 
> MainThread::INFO::2016-03-16
> 17:57:27,877::hosted_engine::757::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(_initialize_sanlock) Acquired lock on host id 2
> 
> but later:
> 
> MainThread::ERROR::2016-03-16
> 18:01:36,566::hosted_engine::845::ovirt_hosted_engine_ha.agent.hosted_engine.
> HostedEngine::(_initialize_domain_monitor) Failed to start monitoring domain
> (sd_uuid=5e3a8253-7dd5-48a4-9070-9edb741b4383, host_id=2): timeout during
> domain acquisition
> 
> and similar.
> 
> sanlock.log on alma04 has:
> 
> 2016-03-16 16:57:09+0200 184681 [121112]: s3 delta_acquire host_id 1 busy1 1
> 2 14210 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
> 2016-03-16 16:57:10+0200 184682 [58002]: s3 add_lockspace fail result -262
> 
> 2016-03-16 17:37:33+0200 187106 [57992]: s4 host 1 2 187084
> 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.
> 2016-03-16 17:37:33+0200 187106 [57992]: s4 host 2 2 1794
> c6604252-9fe5-47cd-99e0-ec8016d20abd.seal10.qa.
> 2016-03-16 17:37:33+0200 187106 [57992]: s4 host 250 1 0
> 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.
> 
> sanlock.log on seal10 has no 'fail', but does have:
> 
> 2016-03-16 14:56:52+0200 7001 [7665]: s3:r3 resource
> dc4a1da7-e8ad-4ebf-bcb9-5c4342c62f52:SDM:/rhev/data-center/mnt/
> _var_lib_ovirt-hosted-engine-setup_tmpktBTvH/dc4a1da7-e8ad-4ebf-bcb9-
> 5c4342c62f52/dom_md/leases:1048576 for 3,11,7081
> 2016-03-16 14:56:52+0200 7001 [6166]: s4 host 1 1 6980
> 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
> 2016-03-16 14:56:52+0200 7001 [6166]: s4 host 250 1 0
> 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
> 2016-03-16 14:56:52+0200 7001 [6166]: s3 host 1 1 6980
> 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
> 2016-03-16 14:56:52+0200 7001 [6166]: s3 host 250 1 0
> 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
> 
> and later:
> 
> 2016-03-16 16:55:09+0200 14098 [7661]: add_lockspace
> 5e3a8253-7dd5-48a4-9070-9edb741b4383:2:/rhev/data-center/mnt/10.35.64.11:
> _vol_RHEV_Virt_nsednev__3__6__he__backedup__from__alma__03/5e3a8253-7dd5-
> 48a4-9070-9edb741b4383/dom_md/ids:0 conflicts with name of list1 s5
> 5e3a8253-7dd5-48a4-9070-9edb741b4383:1:/rhev/data-center/mnt/10.35.64.11:
> _vol_RHEV_Virt_nsednev__3__6__he__backedup__from__alma__03/5e3a8253-7dd5-
> 48a4-9070-9edb741b4383/dom_md/ids:0
> 2016-03-16 16:55:10+0200 14099 [6166]: s6 host 1 1 184548
> 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.
> 2016-03-16 16:55:10+0200 14099 [6166]: s6 host 2 1 14078
> 53300334-5c06-4f58-a562-d7d02afb67e2.seal10.qa.
> 2016-03-16 16:55:10+0200 14099 [6166]: s6 host 250 1 0
> 0525e867-a2b3-4a55-83d2-07838a5a06af.alma04.qa.
> 
> and many more 'conflicts'. I can't properly read sanlock.log files, but it
> does not seem ok to me.
> 
> I suggest to ask some storage people about this.

Not exactly:
1. alma04 was a host managed by the engine prior to the migration<-Yes, it was and with 10 guest VMs running on top of it.
2. At some point it was removed from the engine. <-I did not removed alma04 from the engine, it remained connected to the engine within it's DB, during bare-metal->HE migration of the engine and then was reconnected to the HE from the DB restore.
3. Then, as described above, engine was migrated from (old, physical host)
alma03 to a VM on hosted-engine host seal10 <-Exactly, and migration succeeded with full DB restore.
4. Also as described above, alma04 was added as an additional hosted-engine
host. <-All guest VMs were migrated from alma04 to seal10, then ovirt-hosted-engine-setup was installed on alma04 and then alma04 was added as additional HE-host to seal10.  
5. It seems to me that the engine decided to use alma04 to import the hosted
storage, tried to get a sanlock lock, and failed. <-Exactly.

Comment 12 Yedidyah Bar David 2016-03-23 09:22:36 UTC
(In reply to Nikolai Sednev from comment #11)
> (In reply to Yedidyah Bar David from comment #7)
> 
> Not exactly:
> > 1. alma04 was a host managed by the engine prior to the migration
> 
> Yes, it was and with 10 guest VMs running on top of it.
> 
> > 2. At some point it was removed from the engine.
> 
> I did not removed alma04 from the engine, it remained connected to the engine within it's DB, during
> bare-metal->HE migration of the engine and then was reconnected to the HE
> from the DB restore.
> 
> > 3. Then, as described above, engine was migrated from (old, physical host)
> > alma03 to a VM on hosted-engine host seal10
> 
> Exactly, and migration succeeded with full DB restore.
> 
> > 4. Also as described above, alma04 was added as an additional hosted-engine
> > host.
> 
> All guest VMs were migrated from alma04 to seal10, then
> ovirt-hosted-engine-setup was installed on alma04 and then alma04 was added
> as additional HE-host to seal10.  
> 
> > 5. It seems to me that the engine decided to use alma04 to import the hosted
> > storage, tried to get a sanlock lock, and failed.
> 
> Exactly.

(Please reply inline as above, using copy/paste; the '<-' style is much harder to read and is not less work for you.)

Bottom line - your step (10.) from comment 0 was done on an existing host, not a new one.

Not sure we really need to support this flow.

Please try again, but before step 10, reinstall the host from scratch (after moving it to maintenance and removing it from the engine).

For now, hosted-engine --deploy should only be run on new hosts.
In some cases it will work on existing ones, but success is not guaranteed. It's quite likely that we will not officially support this without solving bug 1001181 and using the tool created there.
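
For completeness, moving the existing host to maintenance and removing it can be done from the Admin Portal, or roughly via the REST API (FQDN, host id and password below are placeholders):

# move the host to maintenance
curl -k -u admin@internal:PASSWORD -H "Content-Type: application/xml" -d "<action/>" https://ENGINE_FQDN/ovirt-engine/api/hosts/HOST_ID/deactivate
# remove it once it is in maintenance
curl -k -u admin@internal:PASSWORD -X DELETE https://ENGINE_FQDN/ovirt-engine/api/hosts/HOST_ID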

Comment 16 Martin Sivák 2016-03-23 14:52:02 UTC
Nikolai, I have one question. Did you check that no hosted engine agent and no vdsm is running when you re-added the host to the new setup?

Because what might have happened is that the "new" VDSM setup tried to acquire a new ID using the same old lockspace. That indeed results in the SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')) error.
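
For example, something like this on the host would show it (standard service names on these builds):

systemctl status ovirt-ha-agent ovirt-ha-broker vdsmd supervdsmd
ps -ef | egrep 'ovirt-ha-agent|ovirt-ha-broker|vdsm' | grep -v grep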

Comment 17 Nikolai Sednev 2016-03-23 15:31:33 UTC
(In reply to Martin Sivák from comment #16)
> Nikolai, I have one question. Did you check that no hosted engine agent and
> no vdsm is running when you re-added the host to the new setup?
> 
> Because what might have happened is that the "new" VDSM setup tried to
> acquire a new ID using the same old lockspace. That indeed results in the
> SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
> error.

Yes, there was no HE agent or broker running on alma04.

Comment 18 Nikolai Sednev 2016-03-23 15:32:57 UTC
(In reply to Martin Sivák from comment #16)
> Nikolai, I have one question. Did you check that no hosted engine agent and
> no vdsm is running when you re-added the host to the new setup?
> 
> Because what might have happened is that the "new" VDSM setup tried to
> acquire a new ID using the same old lockspace. That indeed results in the
> SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
> error.

But yes, VDSM was running, as alma04 was the hypervisor for the 10 guest VMs.

Comment 19 Nikolai Sednev 2016-03-24 08:47:00 UTC
1) I cleanly reprovisioned both hosts, one at a time, and redeployed HE on each, while the other host was running the engine and the 10 guest VMs.
2) I destroyed the hosted_storage domain, so auto-import could be started again.
3) The redeployments succeeded and hosted_storage was auto-imported successfully.

Works for me with these components:

Hosts:
libvirt-client-1.2.17-13.el7_2.4.x86_64
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.10.x86_64
mom-0.5.2-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
vdsm-4.17.23.1-0.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
Linux seal10.qa.lab.tlv.redhat.com 3.10.0-327.13.1.el7.x86_64 #1 SMP Mon Feb 29 13:22:02 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.2 (Maipo)

Engine:
rhevm-setup-plugin-ovirt-engine-common-3.6.4-0.1.el6.noarch
rhevm-branding-rhev-3.6.0-9.el6ev.noarch
rhevm-webadmin-portal-3.6.4-0.1.el6.noarch
rhevm-dependencies-3.6.0-1.el6ev.noarch
rhevm-sdk-python-3.6.3.0-1.el6ev.noarch
rhevm-reports-3.6.3-1.el6ev.noarch
rhevm-vmconsole-proxy-helper-3.6.4-0.1.el6.noarch
rhevm-dbscripts-3.6.4-0.1.el6.noarch
rhevm-spice-client-x86-cab-3.6-6.el6.noarch
rhevm-lib-3.6.4-0.1.el6.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.4-0.1.el6.noarch
rhevm-setup-3.6.4-0.1.el6.noarch
rhevm-restapi-3.6.4-0.1.el6.noarch
rhevm-tools-3.6.4-0.1.el6.noarch
rhevm-spice-client-x86-msi-3.6-6.el6.noarch
rhevm-guest-agent-common-1.0.11-2.el6ev.noarch
rhevm-setup-base-3.6.4-0.1.el6.noarch
rhevm-setup-plugin-websocket-proxy-3.6.4-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.4-0.1.el6.noarch
rhevm-userportal-3.6.4-0.1.el6.noarch
rhevm-3.6.4-0.1.el6.noarch
rhevm-doc-3.6.0-4.el6eng.noarch                                                       
rhevm-spice-client-x64-cab-3.6-6.el6.noarch
rhevm-setup-plugins-3.6.3-1.el6ev.noarch
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-dwh-3.6.2-1.el6ev.noarch
rhevm-cli-3.6.2.0-1.el6ev.noarch
rhevm-websocket-proxy-3.6.4-0.1.el6.noarch
rhevm-spice-client-x64-msi-3.6-6.el6.noarch
rhevm-image-uploader-3.6.0-1.el6ev.noarch
rhevm-dwh-setup-3.6.2-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.4-0.1.el6.noarch
rhevm-backend-3.6.4-0.1.el6.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
rhevm-reports-setup-3.6.3-1.el6ev.noarch

The presence of VDSM/libvirt/sanlock/qemu-kvm-rhev/etc. on a host that was previously a non-hosted-engine host running guest VMs might have been a problem for a normal hosted-engine deployment on top of it. I took Martin's suggestion from comment #16 and it solved this issue.

Please consider closing this bug as works for me and I'm adding + to doc, so this could be documented properly.

Comment 20 Yedidyah Bar David 2016-03-24 11:12:33 UTC
(In reply to Nikolai Sednev from comment #19)
> Please consider closing this bug as works for me and I'm adding + to doc, so
> this could be documented properly.

Not sure what you mean here exactly.

Bottom line: Users that want to add an existing host to their hosted-engine cluster by running on it hosted-engine --deploy, have to reinstall the OS on it to make sure it's clean.

Luci - how should we continue? Perhaps add this to the main docs somewhere? Write a KB?

Comment 21 Nikolai Sednev 2016-03-24 12:01:29 UTC
(In reply to Yedidyah Bar David from comment #20)
> (In reply to Nikolai Sednev from comment #19)
> > Please consider closing this bug as works for me and I'm adding + to doc, so
> > this could be documented properly.
> 
> Not sure what you mean here exactly.
> 
> Bottom line: Users that want to add an existing host to their hosted-engine
> cluster by running on it hosted-engine --deploy, have to reinstall the OS on
> it to make sure it's clean.
> 
> Luci - how should we continue? Perhaps add this to the main docs somewhere?
> Write a KB?

I've meant that:

-----------------------------Bare-metal-setup--------------------------
1-engine installed on host1 as bare metal deployment.
2-host2 being used as hypervisor for guest-VMs.
3-backup engine's DB.
-----------------------------------------------------------------------
                                   |
                                   V
----------------------------Bare-metal-to-HE-setup---------------------
1-engine becomes HE with a restored DB, running on top of host1 or some other host-x.
2-all guest-VMs migrated from the regular non-HE host2 to HE-host1.
3-reprovision host2 to get a clean host and install ovirt-hosted-engine-setup on it.
4-deploy HE on the clean host2 and add it as an additional host.

Comment 22 Roy Golan 2016-03-27 08:06:54 UTC
Putting this on_qa, @Nikolai please make sure the doc is added. Thanks!

Comment 23 Martin Sivák 2016-03-29 07:38:07 UTC
Hi,

it might not be necessary to fully reinstall the host. Only the VDSM configuration has to be wiped out to make sure VDSM does not use any lockspace and won't try connecting to one after reboot.

Nir? Is there a procedure to accomplish that?
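
As a rough idea of what I have in mind (not a verified procedure, hence the question):

# stop everything that may hold or re-add a lockspace
systemctl stop ovirt-ha-agent ovirt-ha-broker vdsmd supervdsmd
# verify nothing is still registered with sanlock
sanlock client status
# reconfigure vdsm from scratch before re-adding the host
vdsm-tool configure --force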

Comment 24 Nikolai Sednev 2016-03-30 14:26:01 UTC
(In reply to Roy Golan from comment #22)
> Putting this on_qa, @Nikolai please make sure the doc is added. Thanks!

I've successfully redeployed both hosts, one at a time, and had no trouble adding them to the HE environment. Both hosts were cleanly reprovisioned, one at a time, so no previously existing information about the HE environment could remain on them.

Hosts:
ovirt-hosted-engine-setup-1.3.4.0-1.el7ev.noarch
sanlock-3.2.4-2.el7_2.x86_64
ovirt-vmconsole-host-1.0.0-1.el7ev.noarch
ovirt-host-deploy-1.4.1-1.el7ev.noarch
qemu-kvm-rhev-2.3.0-31.el7_2.10.x86_64
mom-0.5.2-1.el7ev.noarch
ovirt-setup-lib-1.0.1-1.el7ev.noarch
ovirt-hosted-engine-ha-1.3.5.1-1.el7ev.noarch
vdsm-4.17.23.2-1.el7ev.noarch
ovirt-vmconsole-1.0.0-1.el7ev.noarch
Red Hat Enterprise Linux Server release 7.2 (Maipo)
Linux seal10.qa.lab.tlv.redhat.com 3.10.0-327.13.1.el7.x86_64 #1 SMP Mon Feb 29 13:22:02 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

Engine:
rhevm-branding-rhev-3.6.0-9.el6ev.noarch
rhevm-dependencies-3.6.0-1.el6ev.noarch
rhevm-setup-plugin-vmconsole-proxy-helper-3.6.4.1-0.1.el6.noarch
rhevm-sdk-python-3.6.3.0-1.el6ev.noarch
rhevm-tools-3.6.4.1-0.1.el6.noarch
rhevm-reports-3.6.3-1.el6ev.noarch
rhevm-spice-client-x86-cab-3.6-6.el6.noarch
rhevm-setup-base-3.6.4.1-0.1.el6.noarch
rhevm-extensions-api-impl-3.6.4.1-0.1.el6.noarch
rhevm-spice-client-x86-msi-3.6-6.el6.noarch
rhevm-setup-plugin-ovirt-engine-common-3.6.4.1-0.1.el6.noarch
rhevm-websocket-proxy-3.6.4.1-0.1.el6.noarch
rhevm-backend-3.6.4.1-0.1.el6.noarch
rhevm-guest-agent-common-1.0.11-2.el6ev.noarch
rhevm-userportal-3.6.4.1-0.1.el6.noarch
rhevm-doc-3.6.0-4.el6eng.noarch
rhevm-spice-client-x64-cab-3.6-6.el6.noarch
rhevm-setup-plugins-3.6.3-1.el6ev.noarch
rhevm-setup-plugin-ovirt-engine-3.6.4.1-0.1.el6.noarch
rhevm-vmconsole-proxy-helper-3.6.4.1-0.1.el6.noarch
rhevm-dbscripts-3.6.4.1-0.1.el6.noarch
rhevm-3.6.4.1-0.1.el6.noarch
rhevm-iso-uploader-3.6.0-1.el6ev.noarch
rhevm-dwh-3.6.2-1.el6ev.noarch
rhevm-cli-3.6.2.0-1.el6ev.noarch
rhevm-spice-client-x64-msi-3.6-6.el6.noarch
rhevm-lib-3.6.4.1-0.1.el6.noarch
rhevm-setup-3.6.4.1-0.1.el6.noarch
rhevm-webadmin-portal-3.6.4.1-0.1.el6.noarch
rhevm-image-uploader-3.6.0-1.el6ev.noarch
rhevm-dwh-setup-3.6.2-1.el6ev.noarch
rhevm-log-collector-3.6.1-1.el6ev.noarch
rhevm-setup-plugin-websocket-proxy-3.6.4.1-0.1.el6.noarch
rhevm-restapi-3.6.4.1-0.1.el6.noarch
rhevm-reports-setup-3.6.3-1.el6ev.noarch
Red Hat Enterprise Linux Server release 6.7 (Santiago)
Linux alma03.qa.lab.tlv.redhat.com 2.6.32-573.22.1.el6.x86_64 #1 SMP Thu Mar 17 03:23:39 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

Luci - Still need your reply to comment #20.

Comment 25 Lucy Bopf 2016-03-31 03:00:36 UTC
(In reply to Yedidyah Bar David from comment #20)
> (In reply to Nikolai Sednev from comment #19)
> > Please consider closing this bug as works for me and I'm adding + to doc, so
> > this could be documented properly.
> 
> Not sure what you mean here exactly.
> 
> Bottom line: Users that want to add an existing host to their hosted-engine
> cluster by running on it hosted-engine --deploy, have to reinstall the OS on
> it to make sure it's clean.
> 
> Luci - how should we continue? Perhaps add this to the main docs somewhere?
> Write a KB?

Apologies for the delay, and thanks, Nikolai, for the reminder.

If I understand correctly, the documentation requirement here is to make clear that hosts that existed in the original environment must be reinstalled before they are used in the new self-hosted engine setup? If this is something that must be done for every migration, we should add it as a step in the documented procedure (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html-single/Self-Hosted_Engine_Guide/index.html#chap-Migrating_from_Bare_Metal_to_a_RHEL-Based_Self-Hosted_Environment). Can you (Didi or Nikolai) advise where this step should be added?

Comment 26 Lucy Bopf 2016-03-31 03:03:02 UTC
Reinstating needinfo on nsoffer, which was cleared by mistake.

Comment 27 Nikolai Sednev 2016-03-31 06:12:29 UTC
(In reply to Lucy Bopf from comment #25)
> (In reply to Yedidyah Bar David from comment #20)
> > (In reply to Nikolai Sednev from comment #19)
> > > Please consider closing this bug as works for me and I'm adding + to doc, so
> > > this could be documented properly.
> > 
> > Not sure what you mean here exactly.
> > 
> > Bottom line: Users that want to add an existing host to their hosted-engine
> > cluster by running on it hosted-engine --deploy, have to reinstall the OS on
> > it to make sure it's clean.
> > 
> > Luci - how should we continue? Perhaps add this to the main docs somewhere?
> > Write a KB?
> 
> Apologies for the delay, and thanks, Nikolai, for the reminder.
> 
> If I understand correctly, the documentation requirement here is to make
> clear that hosts that existed in the original environment must be
> reinstalled before they are used in the new self-hosted engine setup? If
> this is something that must be done for every migration, we should add it as
> a step in the documented procedure
> (https://access.redhat.com/documentation/en-US/
> Red_Hat_Enterprise_Virtualization/3.6/html-single/Self-Hosted_Engine_Guide/
> index.html#chap-Migrating_from_Bare_Metal_to_a_RHEL-Based_Self-
> Hosted_Environment). Can you (Didi or Nikolai) advise where this step should
> be added?

Any additional host being added should be a clean host. Hosts that were used as non-hosted-engine hosts, e.g. only for hosting guest VMs, and that have VDSM and other components installed, should be reprovisioned before being added as hosted-engine hosts (additional hosts) to the HE environment.

Clean reprovisioning and redeployment is not required for previously existing hosted-engine hosts. That is, if the HE VM was already running on some hosted-engine host, that host does not need to be reprovisioned and redeployed.

Comment 28 Nir Soffer 2016-05-16 16:47:59 UTC
Removing needinfo; I don't see anything needed now.

