Description of problem:
Following the procedure here - https://access.redhat.com/articles/2578391#manual - the RHV 4 hosted-engine install with Gluster fails on a sanlock error. Disabling SELinux and other options has not resolved the issue.

Version-Release number of selected component (if applicable):
rhevm-appliance-20160922.0-1.el7ev.ova
RHV 4

How reproducible:
100%

Steps to Reproduce:
1. Configure GlusterFS on 3 nodes with the recommended settings for RHV 4 and the HCI install (link in description)
2. Test for split-brain and ensure Gluster is replicating and functioning (see the check below)
3. Run setup

Actual results:
[ ERROR ] Failed to execute stage 'Misc configuration': Cannot acquire host id: (u'ee2bf992-ec7d-4e32-97b7-aa214aca587b', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

Expected results:
SUCCESS

Additional info:
Logs attached. SELinux was disabled and NetworkManager was disabled. IPv6 was disabled, since it has sometimes caused problems with node communication. Node names were entered in the hosts file (10.1.200.0/24 is the gluster network and 10.1.10.0/24 is the ovirtmgmt network).
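For step 2, the split-brain / replication check can be done with the gluster CLI (volume name "engine", as used on this setup):

  gluster volume heal engine info              # entries pending heal per brick; should be 0
  gluster volume heal engine info split-brain  # entries in split-brain; should be empty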
Created attachment 1211941 [details] gluster brick log
Created attachment 1211943 [details] gluster engine log
Created attachment 1211944 [details] sanlock
Created attachment 1211946 [details] vdsm
Created attachment 1211947 [details] tmp file in gluster
One note: I have FusionIO cards as the cache and for the virt01 engine volume. I doubt this matters, but I do have them excluded in multipath.conf, so if you see multipath warnings they are likely about that. I have had this work before using NFS (self-hosted engine).
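For reference, the exclusion is a standard multipath blacklist stanza; this is a minimal sketch, with the devnode pattern assumed to match how the cards show up on these hosts (/dev/fioa etc.):

  blacklist {
      devnode "^fio[a-z]"
  }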
The first error is here:

jsonrpc.Executor/7::DEBUG::2016-10-18 13:02:00,302::clusterlock::183::Storage.initSANLock::(initSANLock) Initializing SANLock for domain 661efc2c-0a42-4c5e-8a7d-e86e45347dbb
jsonrpc.Executor/7::ERROR::2016-10-18 13:02:01,596::clusterlock::191::Storage.initSANLock::(initSANLock) Cannot initialize SANLock for domain 661efc2c-0a42-4c5e-8a7d-e86e45347dbb
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 186, in initSANLock
    sanlock.init_lockspace(sdUUID, idsPath)
SanlockException: (22, 'Sanlock lockspace init failure', 'Invalid argument')
jsonrpc.Executor/7::WARNING::2016-10-18 13:02:01,597::sd::467::Storage.StorageDomainManifest::(initDomainLock) lease did not initialize successfully
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sd.py", line 462, in initDomainLock
    self._domainLock.initLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 222, in initLock
    initSANLock(self._sdUUID, self._idsPath, self._leasesPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 192, in initSANLock
    raise se.ClusterLockInitError()
ClusterLockInitError: Could not initialize cluster lock: ()
jsonrpc.Executor/7::DEBUG::2016-10-18 13:02:01,598::hsm::2598::Storage.HSM::(createStorageDomain) knownSDs: {661efc2c-0a42-4c5e-8a7d-e86e45347dbb: storage.glusterSD.findDomain}

This is a subsequent error:

Thread-29::INFO::2016-10-18 13:02:23,329::clientIF::186::vds::(contEIOVms) vmContainerLock acquired
jsonrpc.Executor/5::ERROR::2016-10-18 13:02:24,318::sp::608::Storage.StoragePool::(create) Create pool hosted_datacenter canceled
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 605, in create
    self.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 929, in attachSD
    dom.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 661, in acquireHostId
    self._manifest.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/sd.py", line 400, in acquireHostId
    self._domainLock.acquireHostId(hostId, async)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 251, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'661efc2c-0a42-4c5e-8a7d-e86e45347dbb', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
jsonrpc.Executor/5::DEBUG::2016-10-18 13:02:24,319::fileUtils::191::Storage.fileUtils::(cleanupdir) Removing directory: /rhev/data-center/59762b3c-0f39-4313-9c17-bedb3983d51c
OK, let's take the latest reproduction (around 2016-10-18 20:47 in VDSM logs and 2016-10-19 00:47 in gluster logs). In VDSM logs we see:

jsonrpc.Executor/4::ERROR::2016-10-18 20:47:33,623::clusterlock::191::Storage.initSANLock::(initSANLock) Cannot initialize SANLock for domain ee2bf992-ec7d-4e32-97b7-aa214aca587b
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 186, in initSANLock
    sanlock.init_lockspace(sdUUID, idsPath)
SanlockException: (22, 'Sanlock lockspace init failure', 'Invalid argument')
jsonrpc.Executor/4::WARNING::2016-10-18 20:47:33,624::sd::467::Storage.StorageDomainManifest::(initDomainLock) lease did not initialize successfully
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sd.py", line 462, in initDomainLock
    self._domainLock.initLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 222, in initLock
    initSANLock(self._sdUUID, self._idsPath, self._leasesPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 192, in initSANLock
    raise se.ClusterLockInitError()
ClusterLockInitError: Could not initialize cluster lock: ()

And in gluster volume logs:

[2016-10-19 00:47:32.397403] I [MSGID: 109036] [dht-common.c:8889:dht_log_new_layout_for_dir_selfheal] 0-engine-dht: Setting layout of /ee2bf992-ec7d-4e32-97b7-aa214aca587b/images with [Subvol_name: engine-replicate-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2016-10-19 00:47:32.485218] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2016-10-19 00:47:32.485363] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2016-10-19 00:47:32.485384] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2016-10-19 00:47:33.614436] W [fuse-bridge.c:2305:fuse_writev_cbk] 0-glusterfs-fuse: 114: WRITE => -1 (Invalid argument)

While in gluster brick logs:

[2016-10-19 00:47:31.649817] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-engine-server: accepted client from virt04.gamull.com-5132-2016/10/19-00:47:31:603846-engine-client-0-0-0 (version: 3.7.9)
[2016-10-19 00:47:32.485087] E [MSGID: 113072] [posix.c:3317:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
[2016-10-19 00:47:32.485149] E [MSGID: 115067] [server-rpc-fops.c:1436:server_writev_cbk] 0-engine-server: 310: WRITEV 0 (e8e871ba-17cf-4047-bdec-8cedaec8196b) ==> (Invalid argument) [Invalid argument]
By the way, although StorageDomain.create implicitly failed due to the ClusterLockInitError at comment 8,

jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:31,843::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'StorageDomain.create' in bridge with {u'name': u'hosted_storage', u'domainType': 7, u'domainClass': 1, u'typeArgs': u'10.1.200.84:/engine', u'version': 3, u'storagedomainID': u'ee2bf992-ec7d-4e32-97b7-aa214aca587b'}
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:31,844::task::597::Storage.TaskManager.Task::(_updateState) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::moving from state init -> state preparing
jsonrpc.Executor/4::INFO::2016-10-18 20:47:31,844::logUtils::49::dispatcher::(wrapper) Run and protect: createStorageDomain(storageType=7, sdUUID=u'ee2bf992-ec7d-4e32-97b7-aa214aca587b', domainName=u'hosted_storage', typeSpecificArg=u'10.1.200.84:/engine', domClass=1, domVersion=3, options=None)

it got reported as succeeded to the application (hosted-engine-setup), which subsequently tries to use it and hits other errors:

jsonrpc.Executor/4::ERROR::2016-10-18 20:47:33,623::clusterlock::191::Storage.initSANLock::(initSANLock) Cannot initialize SANLock for domain ee2bf992-ec7d-4e32-97b7-aa214aca587b
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 186, in initSANLock
    sanlock.init_lockspace(sdUUID, idsPath)
SanlockException: (22, 'Sanlock lockspace init failure', 'Invalid argument')
jsonrpc.Executor/4::WARNING::2016-10-18 20:47:33,624::sd::467::Storage.StorageDomainManifest::(initDomainLock) lease did not initialize successfully
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sd.py", line 462, in initDomainLock
    self._domainLock.initLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 222, in initLock
    initSANLock(self._sdUUID, self._idsPath, self._leasesPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 192, in initSANLock
    raise se.ClusterLockInitError()
ClusterLockInitError: Could not initialize cluster lock: ()
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,625::hsm::2598::Storage.HSM::(createStorageDomain) knownSDs: {ee2bf992-ec7d-4e32-97b7-aa214aca587b: storage.glusterSD.findDomain}
jsonrpc.Executor/4::INFO::2016-10-18 20:47:33,625::logUtils::52::dispatcher::(wrapper) Run and protect: createStorageDomain, Return response: None
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,625::task::1193::Storage.TaskManager.Task::(prepare) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::finished: None
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,625::task::597::Storage.TaskManager.Task::(_updateState) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::moving from state preparing -> state finished
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::task::995::Storage.TaskManager.Task::(_decref) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::ref 0 aborting False
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'StorageDomain.create' in bridge with True
jsonrpc.Executor/4::INFO::2016-10-18 20:47:33,626::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call StorageDomain.create succeeded in 1.78 seconds

I think that in this case VDSM should have reported a clear error when StorageDomain.create exits, so there is probably also a VDSM bug hidden here. Nir?
Let me know what I can do to help. I am tempted to retry without the FusionIO cards, but I can't see how they would make a difference.

Hardware is 3 Dell PowerEdge 1950s. They have standard Broadcom NICs, with a dedicated management NIC and a dedicated storage NIC. RHEL runs off an internal USB drive and the bricks are using a pair of 1 TB SATA drives.

Do you think this is more network and timeout related? I may also retry the whole install from scratch. Maybe a typo somewhere?
David, any idea what the sanlock issue is?
gluster returns EINVAL for sanlock I/O. "fusionIO cards as the cache and for the virt01 engine volume" -- I assume that plays no role in the I/O path for sanlock leases.
(In reply to David Teigland from comment #12)
> gluster returns EINVAL for sanlock i/o.

On other setups, we successfully used it over gluster.
Created attachment 1212196 [details] ovirt-hosted-engine-setup log
Yes, in hosted-engine-setup's eyes the storage domain was correctly created, since VDSM didn't raise the ClusterLockInitError. It's not the root cause, but it's worth fixing as well: I'm opening a separate bug to track this.

2016-10-18 20:47:31 INFO otopi.plugins.gr_he_setup.storage.storage storage._misc:1285 Creating Storage Domain
2016-10-18 20:47:31 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStorageDomain:801 createStorageDomain
2016-10-18 20:47:33 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStorageDomain:835 {'status': {'message': 'Done', 'code': 0}}
2016-10-18 20:47:33 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStorageDomain:837 {'status': {'message': 'Done', 'code': 0}, u'mdasize': 0, u'mdathreshold': True, u'mdavalid': True, u'diskfree': u'107279286272', u'disktotal': u'107321753600', u'mdafree': 0}
The following is from sanlock.log, same attempt, on virt05. I added the verbose flag.

2016-10-19 17:36:51-0400 519 [4727]: s1 delta_acquire done 250 1 499
2016-10-19 17:36:52-0400 520 [1106]: s1 add_lockspace done
2016-10-19 17:36:52-0400 520 [1106]: cmd_add_lockspace 2,8 done 0
2016-10-19 17:36:52-0400 520 [1101]: cmd_register ci 3 fd 11 pid 4405
2016-10-19 17:36:52-0400 520 [1107]: cmd_acquire 3,11,4405 ci_in 3 fd 11 count 1
2016-10-19 17:36:52-0400 520 [1107]: s1:r1 resource 893961e1-6fe1-476f-b593-74b9e4bef1e2:SDM:/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmp722kOG/893961e1-6fe1-476f-b593-74b9e4bef1e2/dom_md/leases:1048576 for 3,11,4405
2016-10-19 17:36:52-0400 520 [1107]: r1 paxos_acquire begin 0 0 0
2016-10-19 17:36:52-0400 520 [1107]: r1 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2016-10-19 17:36:52-0400 520 [1107]: r1 paxos_acquire leader 0 free
2016-10-19 17:36:52-0400 520 [1107]: r1 ballot 1 phase1 mbal 250
2016-10-19 17:36:52-0400 520 [1107]: r1 ballot 1 phase2 bal 250 inp 250 1 520 q_max -1
2016-10-19 17:36:52-0400 520 [1107]: r1 ballot 1 commit self owner 250 1 520
2016-10-19 17:36:52-0400 520 [1107]: r1 acquire_disk rv 1 lver 1 at 520
2016-10-19 17:36:52-0400 520 [1107]: cmd_acquire 3,11,4405 result 0 pid_dead 0
2016-10-19 17:36:52-0400 520 [1107]: cmd_add_lockspace 4,12 96860754-377e-4600-b9eb-cd73774c1f99:250:/rhev/data-center/mnt/glusterSD/10.1.200.85:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/ids:0 flags 0 timeout 0
2016-10-19 17:36:52-0400 520 [1107]: s2 lockspace 96860754-377e-4600-b9eb-cd73774c1f99:250:/rhev/data-center/mnt/glusterSD/10.1.200.85:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/ids:0
2016-10-19 17:36:52-0400 520 [4745]: s2 delta_acquire begin 96860754-377e-4600-b9eb-cd73774c1f99:250
2016-10-19 17:36:52-0400 520 [4745]: 96860754 aio collect 0 0x7f3a800008c0:0x7f3a800008d0:0x7f3a80101000 result -22:0 match res
2016-10-19 17:36:52-0400 520 [4745]: read_sectors delta_leader offset 127488 rv -22 /rhev/data-center/mnt/glusterSD/10.1.200.85:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/ids
2016-10-19 17:36:52-0400 520 [4745]: s2 delta_acquire leader_read1 error -22
2016-10-19 17:36:52-0400 521 [1101]: s1 host 250 1 499 8b7f00aa-2bcc-4ce1-a201-801975502ea1.virt05.gam
2016-10-19 17:36:53-0400 521 [1107]: s2 add_lockspace fail result -22
2016-10-19 17:36:53-0400 521 [1107]: cmd_add_lockspace 4,12 done -22
2016-10-19 17:36:53-0400 522 [1106]: cmd_release 3,11,4405 ci_in 3 fd 11 count 1 flags 0
2016-10-19 17:36:53-0400 522 [1106]: r1 paxos_release leader 1 owner 250 1 520
2016-10-19 17:36:53-0400 522 [1106]: r1 release_token done 1 r_flags 0 t_flags 0
2016-10-19 17:36:53-0400 522 [1106]: cmd_release 3,11,4405 result 0 pid_dead 0 count 1
2016-10-19 17:36:53-0400 522 [1107]: cmd_rem_lockspace 2,12 893961e1-6fe1-476f-b593-74b9e4bef1e2 flags 0
2016-10-19 17:36:53-0400 522 [1101]: s1 set killing_pids check 0 remove 1
2016-10-19 17:36:54-0400 522 [1101]: s1 all pids clear
2016-10-19 17:36:54-0400 522 [1101]: s1 set thread_stop
2016-10-19 17:36:54-0400 522 [1101]: s1 wdmd_test_live 0 0 to disable
2016-10-19 17:36:54-0400 523 [4727]: s1 delta_release begin 893961e1-6fe1-476f-b593-74b9e4bef1e2:250
2016-10-19 17:36:54-0400 523 [4727]: s1 delta_release done 250 1 0
2016-10-19 17:36:55-0400 523 [1101]: s1 free lockspace
2016-10-19 17:36:55-0400 524 [1107]: cmd_rem_lockspace 2,12 done 0
2016-10-19 17:36:55-0400 524 [1106]: cmd_rem_lockspace 4,9 893961e1-6fe1-476f-b593-74b9e4bef1e2 flags 0
2016-10-19 17:36:55-0400 524 [1106]: cmd_rem_lockspace 4,9 done -2
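For what it's worth, the failing read above ("read_sectors delta_leader offset 127488 rv -22") is at an offset that is 512-byte aligned but not 4K-aligned. O_DIRECT I/O must be aligned to the device's logical block size, so a device with 4K logical sectors would return EINVAL (-22) for exactly this access. A quick arithmetic check:

  echo $((127488 % 512)) $((127488 % 4096))   # prints: 0 512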
Directories get created and are visible on the other 2 nodes:

[root@virt05 ~]# ls -la /rhev/data-center/mnt/
glusterSD/  _var_lib_ovirt-hosted-engine-setup_tmp722kOG/
[root@virt05 ~]# ls -la /rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmp722kOG/
total 24
drwxr-xr-x. 4 vdsm kvm  4096 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm    73 Oct 19 17:36 ..
drwxr-xr-x. 4 vdsm kvm  4096 Oct 19 17:36 893961e1-6fe1-476f-b593-74b9e4bef1e2
-rwxr-xr-x. 1 vdsm kvm     0 Oct 19 17:36 __DIRECT_IO_TEST__
drwx------. 2 vdsm kvm 16384 Oct 19 17:36 lost+found
[root@virt05 ~]# ls -la /rhev/data-center/
2340891f-6886-4acf-bfe0-8a30db0f35a5/  mnt/
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/
total 12
drwxr-xr-x. 6 vdsm kvm  4096 Oct 19 17:36 .
drwxr-xr-x. 3 vdsm kvm    32 Oct 19 17:36 ..
drwxr-xr-x. 4 vdsm kvm  4096 Oct 19 17:36 96860754-377e-4600-b9eb-cd73774c1f99
-rwxr-xr-x. 1 vdsm kvm     0 Oct 19 17:36 __DIRECT_IO_TEST__
drwxr-xr-x. 3 root root 4096 Oct 19 17:11 .trashcan
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/96860754-377e-4600-b9eb-cd73774c1f99/
total 16
drwxr-xr-x. 4 vdsm kvm 4096 Oct 19 17:36 .
drwxr-xr-x. 6 vdsm kvm 4096 Oct 19 17:36 ..
drwxr-xr-x. 2 vdsm kvm 4096 Oct 19 17:36 dom_md
drwxr-xr-x. 2 vdsm kvm 4096 Oct 19 17:36 images
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/96860754-377e-4600-b9eb-cd73774c1f99/images/
total 8
drwxr-xr-x. 2 vdsm kvm 4096 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm 4096 Oct 19 17:36 ..
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/
total 1033
drwxr-xr-x. 2 vdsm kvm     4096 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm     4096 Oct 19 17:36 ..
-rw-rw----. 1 vdsm kvm  1048576 Oct 19 17:36 ids
-rw-rw----. 1 vdsm kvm 16777216 Oct 19 17:36 inbox
-rw-rw----. 1 vdsm kvm        0 Oct 19 17:36 leases
-rw-r--r--. 1 vdsm kvm      301 Oct 19 17:36 metadata
-rw-rw----. 1 vdsm kvm 16777216 Oct 19 17:36 outbox
[root@virt05 ~]#
Other node, direct brick view:

[root@virt04 engine]# ls -la /rhgs/engine/brick/
total 0
drwxr-xr-x.  6 vdsm kvm  120 Oct 19 17:36 .
drwxr-xr-x.  3 vdsm kvm   18 Oct 19 17:09 ..
drwxr-xr-x.  4 vdsm kvm   32 Oct 19 17:36 96860754-377e-4600-b9eb-cd73774c1f99
-rwxr-xr-x.  2 vdsm kvm    0 Oct 19 17:36 __DIRECT_IO_TEST__
drw-------. 28 root root 344 Oct 19 17:36 .glusterfs
drwxr-xr-x.  2 root root   6 Oct 19 17:20 .shard
drwxr-xr-x.  3 root root  24 Oct 19 17:11 .trashcan
[root@virt04 engine]# ls -la /rhgs/engine/brick/96860754-377e-4600-b9eb-cd73774c1f99/
total 0
drwxr-xr-x. 4 vdsm kvm  32 Oct 19 17:36 .
drwxr-xr-x. 6 vdsm kvm 120 Oct 19 17:36 ..
drwxr-xr-x. 2 vdsm kvm  69 Oct 19 17:36 dom_md
drwxr-xr-x. 2 vdsm kvm   6 Oct 19 17:36 images
[root@virt04 engine]# ls -la /rhgs/engine/brick/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/
total 1028
drwxr-xr-x. 2 vdsm kvm      69 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm      32 Oct 19 17:36 ..
-rw-rw----. 2 vdsm kvm 1048576 Oct 19 17:36 ids
-rw-rw----. 2 vdsm kvm       0 Oct 19 17:36 inbox
-rw-rw----. 2 vdsm kvm       0 Oct 19 17:36 leases
-rw-r--r--. 2 vdsm kvm     301 Oct 19 17:36 metadata
-rw-rw----. 2 vdsm kvm       0 Oct 19 17:36 outbox
[root@virt04 engine]#
Tom,
We are seeing the write error coming from the underlying filesystem (XFS) of gluster, based on the following log:

[2016-10-18 17:02:00.421938] E [MSGID: 113072] [posix.c:3317:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]

This can happen when a file opened with O_DIRECT is passed an offset/size which is not aligned. We see that the write begins at offset '0', which is aligned. Does anyone know the size of this particular write? Could we get an strace of the brick process when this issue is seen?

One way to figure it out is:

strace -ff -T -p <pid-of-brick> -o <path-to-the-file-where-you-want-the-output-saved>

This will create multiple strace files, one per thread. Please attach them when you do this and we can figure that out.
The other possibility is that the fd that the posix xlator in glusterfs is using is wrong. All these doubts will be clarified once we have the strace.
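For reference, the same class of failure can be reproduced outside gluster/sanlock with dd, which also opens the target with O_DIRECT when given oflag=direct. This is a sketch run directly against a brick filesystem (path taken from this setup):

  # 512-byte direct writes fail with "Invalid argument" if the underlying
  # device has 4K logical sectors, since O_DIRECT must be aligned to the
  # logical block size
  dd if=/dev/zero of=/rhgs/engine/brick/direct_io_probe bs=512 count=8 oflag=direct

  # 4096-byte direct writes should succeed either way
  dd if=/dev/zero of=/rhgs/engine/brick/direct_io_probe bs=4096 count=1 oflag=direct
  rm -f /rhgs/engine/brick/direct_io_probe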
Created attachment 1212511 [details] strace virt05.gamull.com node2
(In reply to Pranith Kumar K from comment #19)
> Tom,
> We are seeing the write error coming from underlying filesystem(XFS) of
> gluster based on the following log:
> [2016-10-18 17:02:00.421938] E [MSGID: 113072] [posix.c:3317:posix_writev]
> 0-engine-posix: write failed: offset 0, [Invalid argument]
>
> This can come when a file is opened with O_DIRECT passes offset/size which
> is not aligned. We are seeing that the write is beginning at the offset '0'
> which is aligned. Does anyone know the size of this particular write? Could
> we get the strace of the brick process when this issue is seen?
>
> One way to figure it out is:
> strace -ff -T -p <pid-of-brick> -o
> <path-to-the-file-where-you-want-the-output-saved>
>
> This will create multiple strace files per thread. Please attach them when
> you do this and we can figure that out.

Just uploaded
I can hop on IRC if that helps to give some live feedback; let me know.
Progress!!! I believe this ticket was due to an alignment issue. Here are my resolution steps.

Previously I had used two devices and wrote directly to the raw devices:
/dev/sda (RAID1 1TB)
/dev/fioa (FusionIO card)

pvcreate --dataalignment 1536K /dev/sda
pvcreate --dataalignment 1536K /dev/fioa

The above was wrong, I think. I removed all LVs, VGs and PVs, unmounted everything, and used fdisk to make a partition on each device, creating /dev/sda1 and /dev/fioa1.

To solve it, I ran the following to set up the storage correctly:

pvcreate --dataalignment 256K /dev/sda1
vgcreate RHS_vg1 /dev/sda1
lvcreate -n engine --size 100G RHS_vg1
mkfs.xfs -f -i size=512 -n size=8192 /dev/RHS_vg1/engine
lvcreate --thinpool RHS_vg1/vg1_pool --size 750G --chunksize 256K --poolmetadatasize 16G
pvcreate --dataalignment 256K /dev/fioa1
vgextend RHS_vg1 /dev/fioa1
lvchange --zero n RHS_vg1/vg1_pool
lvcreate -L 200G -n rhs_cache RHS_vg1 /dev/fioa1
lvcreate -L 2G -n rhs_cache_meta RHS_vg1 /dev/fioa1
lvconvert --type cache-pool --poolmetadata /dev/RHS_vg1/rhs_cache_meta --poolmetadataspare n --cachemode writethrough /dev/RHS_vg1/rhs_cache
lvconvert --type cache --cachepool /dev/RHS_vg1/rhs_cache /dev/RHS_vg1/vg1_pool
lvcreate --thin --name data --virtualsize 750G RHS_vg1/vg1_pool
mkfs.xfs -f -i size=512 -n size=8192 /dev/RHS_vg1/data
lvcreate --thin --name vmstore --virtualsize 750G RHS_vg1/vg1_pool
mkfs.xfs -f -i size=512 -n size=8192 /dev/RHS_vg1/vmstore

mkdir -p /rhgs/engine /rhgs/brick1 /rhgs/brick2
mount /dev/RHS_vg1/engine /rhgs/engine
mount /dev/RHS_vg1/vmstore /rhgs/brick1
mount /dev/RHS_vg1/data /rhgs/brick2

FSTAB:
/dev/RHS_vg1/engine /rhgs/engine xfs defaults 0 0
/dev/RHS_vg1/vmstore /rhgs/brick1 xfs defaults 0 0
/dev/RHS_vg1/data /rhgs/brick2 xfs defaults 0 0

mount -a
mkdir /rhgs/engine/enginebrick
mkdir /rhgs/brick1/vmstore
mkdir /rhgs/brick2/data
semanage fcontext -a -t glusterd_brick_t /rhgs/engine/enginebrick
semanage fcontext -a -t glusterd_brick_t /rhgs/brick1/vmstore
semanage fcontext -a -t glusterd_brick_t /rhgs/brick2/data
restorecon -Rv /rhgs

========== After doing this on all 3 nodes and verifying peer status ==========

gluster volume create engine replica 3 node1:/rhgs/engine/enginebrick node2:/rhgs/engine/enginebrick node3:/rhgs/engine/enginebrick
gluster volume set engine group virt
gluster volume set engine storage.owner-uid 36 && gluster volume set engine storage.owner-gid 36
gluster volume set engine features.shard on
gluster volume set engine features.shard-block-size 512MB
gluster volume set engine performance.low-prio-threads 32
gluster volume set engine cluster.data-self-heal-algorithm full
gluster volume set engine nfs.disable on
gluster volume set engine locking-scheme granular
gluster volume set engine cluster.shd-max-threads 8
gluster volume set engine cluster.shd-wait-qlength 10000
gluster volume set engine performance.strict-o-direct on
gluster volume set engine network.remote-dio off
gluster volume set engine network.ping-timeout 30
gluster volume set engine user.cifs off
gluster volume start engine

============ On node2/virt05 I ran the deploy and we have success!!! ============

[ INFO ] Stage: Transaction setup
[ INFO ] Stage: Misc configuration
[ INFO ] Stage: Package installation
[ INFO ] Stage: Misc configuration
[ INFO ] Configuring libvirt
[ INFO ] Configuring VDSM
[ INFO ] Starting vdsmd
[ INFO ] Creating Storage Domain
[ INFO ] Creating Storage Pool
[ INFO ] Connecting Storage Pool
[ INFO ] Verifying sanlock lockspace initialization
[ INFO ] Creating Image for 'hosted-engine.lockspace' ...
[ INFO ] Image for 'hosted-engine.lockspace' created successfully
[ INFO ] Creating Image for 'hosted-engine.metadata' ...
[ INFO ] Image for 'hosted-engine.metadata' created successfully
[ INFO ] Creating VM Image
[ INFO ] Extracting disk image from OVF archive (could take a few minutes depending on archive size)
[ INFO ] Validating pre-allocated volume size
[ INFO ] Uploading volume to data domain (could take a few minutes depending on archive size)
[ INFO ] Image successfully imported from OVF
[ INFO ] Destroying Storage Pool
[ INFO ] Start monitoring domain
[ INFO ] Configuring VM
[ INFO ] Updating hosted-engine configuration
[ INFO ] Stage: Transaction commit
[ INFO ] Stage: Closing up
The following network ports should be opened:
  tcp:5900
  tcp:5901
  tcp:9090
  udp:5900
  udp:5901
An example of the required configuration for iptables can be found at:
  /etc/ovirt-hosted-engine/iptables.example
In order to configure firewalld, copy the files from /etc/ovirt-hosted-engine/firewalld to /etc/firewalld/services and execute the following commands:
  firewall-cmd -service hosted-cockpit
  firewall-cmd -service hosted-console
[ INFO ] Creating VM
You can now connect to the VM with the following command:
  hosted-engine --console
You can also graphically connect to the VM from your system with the following command:
  remote-viewer vnc://virt05.gamull.com:5900
Use temporary password "*********" to connect to vnc console.
Please ensure that your Guest OS is properly configured to support serial console according to your distro documentation.
Follow http://www.ovirt.org/Serial_Console_Setup#I_need_to_access_the_console_the_old_way for more info.
If you need to reboot the VM you will need to start it manually using the command:
  hosted-engine --vm-start
You can then set a temporary password using the command:
  hosted-engine --add-console-password
[ INFO ] Running engine-setup on the appliance
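For anyone repeating the LVM steps above, one quick sanity check that the requested 256K data alignment actually took effect is to look at where the data area starts on each PV (pe_start should land on a 256K boundary):

  pvs -o +pe_start --units k /dev/sda1 /dev/fioa1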
This BZ was helpful - https://bugzilla.redhat.com/show_bug.cgi?id=958781
Although the install continues, I am unable to add the data domains, receiving the same error. This would be a new issue, I imagine, as this BZ is only for the install.

Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: WARNING:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Unable to find OVF_STORE
Oct 20 21:56:37 virt05.gamull.com vdsm[3588]: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
Oct 20 21:56:37 virt05.gamull.com sudo[23830]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib64/nagios/plugins/gluster/check_volume_status.py -v engine -t heal-info
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:39 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:39 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-broker[1043]: INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:42 virt05.gamull.com vdsm[3588]: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineStop (score: 3400)
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: WARNING:engine_health.CpuLoadNoEngine:bad health status: Hosted Engine is not up!
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:mem_free.MemFree:memFree: 27405
^C
[root@virt05 ~]# hosted-engine --vm-status
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
Is your hosted engine VM installation (engine-setup) complete? From the above logs, it looks like it is not:

"ERROR SSL error during reading data: unexpected eof"

Are there errors in the engine-setup logs? Could you attach those and the vdsm logs?
strace attached from trying to add the vmstore domain to the engine. The engine install appears to complete, too. Uploading the log.
Created attachment 1212934 [details] strace virt04 from failed vmstore data domain attach
Created attachment 1212935 [details] Hosted Engine setup log
Also noticing this in the engine logs. I have added the two hosts, but they had certificate enrollment issues: I noticed the serial.txt in the pki directory was empty, so I used the old value and this worked. I also added an NFS share as a data domain just to get the engine working. Right now I still can't add the data or vmstore data domains.

2016-10-21 17:18:48,346 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] Failed in 'AttachStorageDomainVDS' method
2016-10-21 17:18:48,392 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [74971c04] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Cannot acquire host id: (u'10e3c90c-5fe8-4a10-a1ee-58904d91449c', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2016-10-21 17:18:48,392 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] IrsBroker::Failed::AttachStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot acquire host id: (u'10e3c90c-5fe8-4a10-a1ee-58904d91449c', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')), code = 661
2016-10-21 17:18:48,392 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] FINISH, AttachStorageDomainVDSCommand, log id: 3fa27488
2016-10-21 17:18:48,392 ERROR [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] Command 'org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot acquire host id: (u'10e3c90c-5fe8-4a10-a1ee-58904d91449c', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')), code = 661 (Failed with error AcquireHostIdFailure and code 661)
2016-10-21 17:18:48,394 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] Command [id=1dc6c883-4ce1-4542-bfed-ea2f0d957bb0]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: StoragePoolIsoMapId:{storagePoolId='580a65da-02d1-0368-023f-0000000001af', storageId='10e3c90c-5fe8-4a10-a1ee-58904d91449c'}.
2016-10-21 17:18:48,408 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [74971c04] Correlation ID: 7722f8e1, Job ID: 337d0f50-3c1a-485f-a0d6-293c01d3d161, Call Stack: null, Custom Event ID: -1, Message: Failed to attach Storage Domain data to Data Center Default. (User: admin@internal-authz)
Created attachment 1212962 [details] ovirt-engine log

Engine log during the failed adds of the data and vmstore domains, showing the sanlock error. The engine domain seems to be there, but the two others (same gluster cluster) do not add. I can see the data under volumes and the bricks; I just can't add them to the storage data center.
Removing vg1_pool with data and vmstore and recreating them without the /dev/fioa1 (FusionIO) seems to have worked. I am unsure if it is the FusionIO card or some overprovisioning of the thin pool. I made vmstore and data 350G each. I plan to retest with cache again and the same numbers.
Reconfiguring the VG with cache and ensuring alignment on the FusionIO card still produces the error. This is likely a FusionIO driver issue (the iomemory-vsl driver). However, it is still problematic, as I would assume any PCIe storage device might experience a similar outcome. In short, I'm not sure if it's cache related or FusionIO related. I will try the following next:

1) Creating a data domain without cache on the FusionIO storage
2) Trying some combinations of cache on with FusionIO to narrow down the issue
I think I've found the issue, and it was completely due to 4K blocks on the FusionIO devices. I reformatted them (using fio-format) as 512K blocks, then added them in LVM and created a new PV, VG and LV. I formatted the LV without options:

mkfs.xfs -f /dev/iomemory_vg/iomemory

and then created the gluster volume. This worked, so I believe the issue for the FusionIO PCIe storage cards is that 4K blocks are not suitable for use with gluster replicas; you need to use 512K blocks.

In summary, this error is related to alignment issues, likely from using blocks too large for the device.

SOLUTION: reformat using 512K blocks for the device and for the filesystem. Do not use the dataalignment option.
512B blocks, not 512K?
Yes, 512-byte blocks, not 512K! 0.5K.
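A quick way to confirm what sector size the kernel actually reports for the card after the reformat (device name assumed to be /dev/fioa, as earlier in this thread):

  blockdev --getss /dev/fioa                    # logical sector size; should be 512 after the reformat
  blockdev --getpbsz /dev/fioa                  # physical sector size
  cat /sys/block/fioa/queue/logical_block_size  # same information via sysfs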
Data alignment values are configured via gdeploy while creating the PV, VG, and LV for gluster bricks. Closing this bug; please re-open against the gluster gdeploy component if issues are seen around the data alignment values chosen.
It seems that even when running with gdeploy the issue persists: the hosted-engine deploy fails with the same error. I am trying to deploy oVirt 4.1 using the latest CentOS 7. The servers have 4K-sector disks. Any idea how this can be overcome?