Created attachment 895405 [details]
Description of problem:
This is a follow-on from an email thread that went off-list: "Hosted Engine started VM Multiple Times"
2x Physical Servers
Gluster replicated volume setup to export the NFS share for hosted-engine.
We deploy the first server with a successfully running hosted-engine, followed by the second host. The ha-agent seems to crash with a storage issue, causing the second host to start the hosted-engine again after its install has completed.
We were able to track it down to this (thanks Andrew for providing the testing setup):
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
response = "success " + self._dispatch(data)
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
d = self.get_raw_stats_for_service_type(storage_dir, service_type)
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
f = os.open(path, direct_flag | os.O_RDONLY)
OSError: [Errno 116] Stale file handle: '/rhev/data-center/mnt/localhost:_mnt_hosted-engine/c898fd2a-b686-4363-bb7e-dba99e5789b6/ha_agent/hosted-engine.metadata'
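The `os.open` in the traceback fails with errno 116 (ESTALE), meaning the NFS server no longer recognizes the file handle for the metadata file. A minimal sketch of how that open and the stale-handle error surface; the function name, retry message, and `use_direct` toggle are illustrative, not the actual broker code:

```python
import errno
import os

def read_metadata(path, use_direct=True, block_size=4096):
    # The broker opens the metadata file with O_DIRECT to bypass the page
    # cache; getattr keeps this sketch portable to platforms without it.
    # (Illustrative only -- not the hosted-engine broker's actual code.)
    direct_flag = getattr(os, "O_DIRECT", 0) if use_direct else 0
    try:
        fd = os.open(path, direct_flag | os.O_RDONLY)
    except OSError as e:
        if e.errno == errno.ESTALE:
            # Errno 116: the NFS server no longer recognises the handle.
            # On a replicated Gluster volume this can happen when self-heal
            # recreates the file on the backing brick.
            raise RuntimeError("stale NFS handle for %s; remount needed" % path)
        raise
    try:
        return os.read(fd, block_size)
    finally:
        os.close(fd)
```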
It's definitely connected to the storage, which points to Gluster. I'm not very familiar with Gluster, so I need to check this with our Gluster gurus.
This has only happened twice out of 10 installs.
Steps to Reproduce:
1. Setup gluster nfs on the two hosts
2. Install hosted-engine on first host
3. Install hosted-engine on the second host
Actual results: HostedEngine VM is started twice.
Expected results: HostedEngine VM is only running on one host.
I have a slight feeling it could be related to the Gluster self-heal process. In the most recent case, the Gluster volume was replicating its contents from host 1 to host 2 (a newly created brick). I can't recall whether this was also occurring the other time this double-HostedEngine issue happened.
Created attachment 895406 [details]
This is an automated message:
This bug has been re-targeted from 3.4.2 to 3.5.0 since neither priority nor severity were high or urgent. Please re-target to 3.4.3 if relevant.
Stale file handle means that Gluster moved the metadata file internally.
Using hosted engine with Gluster backed storage is currently something we really warn against.
I think this bug should be closed or re-targeted at documentation, because there is nothing we can do here. Hosted engine assumes that all writes are atomic and (immediately) available for all hosts in the cluster. Gluster violates those assumptions.
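For context, the atomicity hosted-engine relies on is the usual POSIX write-to-temp-then-rename pattern; a hedged sketch of it (the function name and layout are mine, not hosted-engine's):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, fsync, then rename.
    # On a single POSIX filesystem rename() is atomic, so readers see
    # either the old file or the new one, never a partial write.  A
    # replicated Gluster volume can break this expectation: the rename
    # reaches each brick at a different time, so a peer host may observe
    # a stale handle in between.
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp_path, path)
```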
This has been brought up many times on the mailing list, and there still hasn't been any official notice. Many people keep trying Gluster with hosted-engine.
I have tried this, and I know two people who are running the native kernel NFS server on top of Gluster. Does this meet the hosted-engine assumptions? All hosted-engine agents write to one NFS server, and Gluster just replicates the files to the other servers.
I went with this setup too, in large part because of this article, which is linked in the docs: http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
It's really annoying if it doesn't work, because I spent a lot of time trying to make it work and the deadline is close. What would be an alternative HA storage for the self-hosted engine, if not Gluster?
(In reply to Maël Lavault from comment #5)
> I went with this setup too, in large part due to this article :
> Which is linked in the docs.
> It's really annoying if it doesn't works, because I spent a lot of time
> trying to make it works and the deadline is close. What would be an
> alternative HA storage for self hosted engine if not gluster ?
> Thanks !
As a workaround, you can use a direct NFS export that is not on top of Gluster for just the HE storage domain, and then use NFS on top of Gluster for the rest.
But then I need some HA solution for NFS too, which adds a bit more complexity. We try to keep things quite simple, since there are not many of us maintaining the infrastructure.
What about using Gluster via the native POSIXFS storage domain? Does it work with hosted-engine? Does it work with CentOS 6.5?
Running native kernel NFS on top of glusterized filesystem might work, but all hosts have to communicate with the same NFS node. We never tried that though.
I tried to add comment to that article describing the issue, but it has not been approved yet.
oVirt 3.5 adds support for iSCSI backed storage which should help with HA setups as you can configure the HA directly in your NAS.
Would a DRBD volume + NFS + Pacemaker work for the self-hosted engine?
By the way, how would kernel NFS on top of Gluster work? From what I understand, the HE must only communicate with one NFS server at a time, so using kernel NFS on top of Gluster for HA requires Pacemaker with a virtual IP in active/passive mode, right?
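If one did go the kernel-NFS-plus-VIP route, the key invariant is that every host mounts the same (floating) server address. A small illustrative check that parses /proc/mounts-style text; the helper name is an assumption of this sketch, not part of hosted-engine:

```python
def nfs_server_for(mountpoint, mounts_text):
    # /proc/mounts lines look like:
    #   <source> <mountpoint> <fstype> <options> <dump> <pass>
    # For an NFS mount the source is "server:/export".  In a Pacemaker
    # active/passive setup, the server part should resolve to the same
    # virtual IP on every host in the cluster.
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] == mountpoint and parts[2].startswith("nfs"):
            return parts[0].split(":", 1)[0]
    return None
```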
We should look into adding Gluster support for hosted engine, based on replica 3 volumes and quite some testing.
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4.
If it's not relevant anymore, please close it (you may use EOL or CURRENT RELEASE resolution)
If it's an RFE please update the version to 4.0 if still relevant.
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4 and reopen if still relevant.