Created attachment 895405 [details]
Description of problem:
This is a follow-on from an email thread that went off-list: "Hosted Engine started VM Multiple Times"
2x Physical Servers
Gluster replicated volume setup to export the NFS share for hosted-engine.
We deploy the first server with a successfully running hosted-engine, followed by the second host. The ha-agent seems to crash with a storage issue, causing the second host to start the hosted-engine again after its install has completed.
We were able to track it down to this (thanks Andrew for providing the testing setup):
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 165, in handle
response = "success " + self._dispatch(data)
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/listener.py", line 261, in _dispatch
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 41, in get_all_stats_for_service_type
d = self.get_raw_stats_for_service_type(storage_dir, service_type)
File "/usr/lib/python2.6/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 74, in get_raw_stats_for_service_type
f = os.open(path, direct_flag | os.O_RDONLY)
OSError: [Errno 116] Stale file handle: '/rhev/data-center/mnt/localhost:_mnt_hosted-engine/c898fd2a-b686-4363-bb7e-dba99e5789b6/ha_agent/hosted-engine.metadata'
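The `os.open` in the traceback fails with errno 116 (ESTALE), meaning the NFS server no longer recognizes the file handle for the metadata file. A minimal sketch of how that open and the stale-handle error surface; the function name, retry message, and `use_direct` toggle are illustrative, not the actual broker code:

```python
import errno
import os

def read_metadata(path, use_direct=True, block_size=4096):
    # The broker opens the metadata file with O_DIRECT to bypass the page
    # cache; getattr keeps this sketch portable to platforms without it.
    # (Illustrative only -- not the hosted-engine broker's actual code.)
    direct_flag = getattr(os, "O_DIRECT", 0) if use_direct else 0
    try:
        fd = os.open(path, direct_flag | os.O_RDONLY)
    except OSError as e:
        if e.errno == errno.ESTALE:
            # Errno 116: the NFS server no longer recognises the handle.
            # On a replicated Gluster volume this can happen when self-heal
            # recreates the file on the backing brick.
            raise RuntimeError("stale NFS handle for %s; remount needed" % path)
        raise
    try:
        return os.read(fd, block_size)
    finally:
        os.close(fd)
```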
It's definitely connected to the storage, which points to Gluster. I'm not very familiar with Gluster, so I need to check this with our Gluster gurus.
This has only happened twice out of 10 installs.
Steps to Reproduce:
1. Setup gluster nfs on the two hosts
2. Install hosted-engine on first host
3. Install hosted-engine on the second host
Actual results: HostedEngine VM is started twice.
Expected results: HostedEngine VM is only running on one host.
I have a slight feeling it could be related to the Gluster self-heal process. In the most recent case, the Gluster volume was replicating its contents from host 1 to host 2 (a newly created brick). I can't recall whether this was also occurring the other time this double-HostedEngine issue happened.
Created attachment 895406 [details]
This is an automated message:
This bug has been re-targeted from 3.4.2 to 3.5.0 since neither priority nor severity were high or urgent. Please re-target to 3.4.3 if relevant.
Stale file handle means that Gluster moved the metadata file internally.
Using hosted engine with Gluster backed storage is currently something we really warn against.
I think this bug should be closed or re-targeted at documentation, because there is nothing we can do here. Hosted engine assumes that all writes are atomic and (immediately) available for all hosts in the cluster. Gluster violates those assumptions.
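For context, the atomicity hosted-engine relies on is the usual POSIX write-to-temp-then-rename pattern; a hedged sketch of it (the function name and layout are mine, not hosted-engine's):

```python
import os
import tempfile

def atomic_write(path, data):
    # Write to a temp file in the same directory, fsync, then rename.
    # On a single POSIX filesystem rename() is atomic, so readers see
    # either the old file or the new one, never a partial write.  A
    # replicated Gluster volume can break this expectation: the rename
    # reaches each brick at a different time, so a peer host may observe
    # a stale handle in between.
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.rename(tmp_path, path)
```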
This has been brought up many times on the mailing list, and there still hasn't been any official notice. Many people keep trying Gluster with hosted-engine.
I have tried this, and I know two people who are running the native kernel NFS server on top of Gluster. Does this meet the hosted-engine assumptions? All hosted-engine agents write to one NFS server, and Gluster just replicates the files to the other servers.
I went with this setup too, in large part because of this article, which is linked in the docs: http://community.redhat.com/blog/2014/05/ovirt-3-4-glusterized/
It's really annoying if it doesn't work, because I spent a lot of time trying to make it work and the deadline is close. What would be an alternative HA storage for the self-hosted engine, if not Gluster?
(In reply to Maël Lavault from comment #5)
> I went with this setup too, in large part due to this article :
> Which is linked in the docs.
> It's really annoying if it doesn't works, because I spent a lot of time
> trying to make it works and the deadline is close. What would be an
> alternative HA storage for self hosted engine if not gluster ?
> Thanks !
As a workaround, you can use a direct NFS export that is not on top of Gluster for just the HE storage domain, and then use NFS on top of Gluster for the rest.
But then I need some HA solution for NFS too, which adds a bit more complexity. We try to keep things quite simple, since there are not many of us maintaining the infrastructure.
What about using Gluster via the native POSIXFS storage domain? Does it work with hosted-engine? Does it work with CentOS 6.5?
Running native kernel NFS on top of glusterized filesystem might work, but all hosts have to communicate with the same NFS node. We never tried that though.
I tried to add comment to that article describing the issue, but it has not been approved yet.
oVirt 3.5 adds support for iSCSI backed storage which should help with HA setups as you can configure the HA directly in your NAS.
Would a DRBD volume + NFS + Pacemaker work for the self-hosted engine?
By the way, how would kernel NFS on top of Gluster work? From what I understand, the HE must only communicate with one NFS server at a time, so using kernel NFS on top of Gluster for HA requires Pacemaker with a virtual IP in active/passive mode, right?
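If one did go the kernel-NFS-plus-VIP route, the key invariant is that every host mounts the same (floating) server address. A small illustrative check that parses /proc/mounts-style text; the helper name is an assumption of this sketch, not part of hosted-engine:

```python
def nfs_server_for(mountpoint, mounts_text):
    # /proc/mounts lines look like:
    #   <source> <mountpoint> <fstype> <options> <dump> <pass>
    # For an NFS mount the source is "server:/export".  In a Pacemaker
    # active/passive setup, the server part should resolve to the same
    # virtual IP on every host in the cluster.
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[1] == mountpoint and parts[2].startswith("nfs"):
            return parts[0].split(":", 1)[0]
    return None
```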
We should look into adding Gluster support for hosted engine, based on replica 3 volumes and quite some testing.
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4.
If it's not relevant anymore, please close it (you may use EOL or CURRENT RELEASE resolution)
If it's an RFE please update the version to 4.0 if still relevant.
This is an automated message.
This Bugzilla report has been opened on a version which is not maintained anymore.
Please check if this bug is still relevant in oVirt 3.5.4 and reopen if still relevant.