Description of problem:
Following the procedure here - https://access.redhat.com/articles/2578391#manual - the RHV 4 hosted-engine install with Gluster fails on a sanlock error. Disabling SELinux and other options has not resolved the issue.

Version-Release number of selected component (if applicable):
rhevm-appliance-20160922.0-1.el7ev.ova
RHV 4

How reproducible:
100%

Steps to Reproduce:
1. Configure GlusterFS on 3 nodes with the recommended settings for RHV 4 and the HCI install (link in description)
2. Test for split-brain and ensure Gluster is replicating and functioning (see the check below)
3. Run setup

Actual results:
[ ERROR ] Failed to execute stage 'Misc configuration': Cannot acquire host id: (u'ee2bf992-ec7d-4e32-97b7-aa214aca587b', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))

Expected results:
SUCCESS

Additional info:
Logs attached. SELinux was disabled and NetworkManager was disabled. IPv6 was disabled, since it has sometimes caused problems with node communication. Node names were entered in the hosts file (10.1.200.0/24 is the gluster network and 10.1.10.0/24 is the ovirtmgmt network).
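For step 2, the split-brain / replication check can be done with the gluster CLI (volume name "engine", as used on this setup):

  gluster volume heal engine info              # entries pending heal per brick; should be 0
  gluster volume heal engine info split-brain  # entries in split-brain; should be empty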
Created attachment 1211941 [details] gluster brick log
Created attachment 1211943 [details] gluster engine log
Created attachment 1211944 [details] sanlock
Created attachment 1211946 [details] vdsm
Created attachment 1211947 [details] tmp file in gluster
One note: I have FusionIO cards as the cache and for the virt01 engine volume. I doubt this matters, but I do have them excluded in multipath.conf, so if you see multipath warnings they are likely about that. I have had this work before using NFS (self-hosted engine).
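For reference, the exclusion is a standard multipath blacklist stanza; this is a minimal sketch, with the devnode pattern assumed to match how the cards show up on these hosts (/dev/fioa etc.):

  blacklist {
      devnode "^fio[a-z]"
  }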
The first error is here:

jsonrpc.Executor/7::DEBUG::2016-10-18 13:02:00,302::clusterlock::183::Storage.initSANLock::(initSANLock) Initializing SANLock for domain 661efc2c-0a42-4c5e-8a7d-e86e45347dbb
jsonrpc.Executor/7::ERROR::2016-10-18 13:02:01,596::clusterlock::191::Storage.initSANLock::(initSANLock) Cannot initialize SANLock for domain 661efc2c-0a42-4c5e-8a7d-e86e45347dbb
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 186, in initSANLock
    sanlock.init_lockspace(sdUUID, idsPath)
SanlockException: (22, 'Sanlock lockspace init failure', 'Invalid argument')
jsonrpc.Executor/7::WARNING::2016-10-18 13:02:01,597::sd::467::Storage.StorageDomainManifest::(initDomainLock) lease did not initialize successfully
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sd.py", line 462, in initDomainLock
    self._domainLock.initLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 222, in initLock
    initSANLock(self._sdUUID, self._idsPath, self._leasesPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 192, in initSANLock
    raise se.ClusterLockInitError()
ClusterLockInitError: Could not initialize cluster lock: ()
jsonrpc.Executor/7::DEBUG::2016-10-18 13:02:01,598::hsm::2598::Storage.HSM::(createStorageDomain) knownSDs: {661efc2c-0a42-4c5e-8a7d-e86e45347dbb: storage.glusterSD.findDomain}

This is a subsequent error:

Thread-29::INFO::2016-10-18 13:02:23,329::clientIF::186::vds::(contEIOVms) vmContainerLock acquired
jsonrpc.Executor/5::ERROR::2016-10-18 13:02:24,318::sp::608::Storage.StoragePool::(create) Create pool hosted_datacenter canceled
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sp.py", line 605, in create
    self.attachSD(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 78, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/share/vdsm/storage/sp.py", line 929, in attachSD
    dom.acquireHostId(self.id)
  File "/usr/share/vdsm/storage/sd.py", line 661, in acquireHostId
    self._manifest.acquireHostId(hostId, async)
  File "/usr/share/vdsm/storage/sd.py", line 400, in acquireHostId
    self._domainLock.acquireHostId(hostId, async)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 251, in acquireHostId
    raise se.AcquireHostIdFailure(self._sdUUID, e)
AcquireHostIdFailure: Cannot acquire host id: (u'661efc2c-0a42-4c5e-8a7d-e86e45347dbb', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
jsonrpc.Executor/5::DEBUG::2016-10-18 13:02:24,319::fileUtils::191::Storage.fileUtils::(cleanupdir) Removing directory: /rhev/data-center/59762b3c-0f39-4313-9c17-bedb3983d51c
OK, let's take the latest reproduction (around 2016-10-18 20:47 in VDSM logs and 2016-10-19 00:47 in gluster logs). In VDSM logs we see:

jsonrpc.Executor/4::ERROR::2016-10-18 20:47:33,623::clusterlock::191::Storage.initSANLock::(initSANLock) Cannot initialize SANLock for domain ee2bf992-ec7d-4e32-97b7-aa214aca587b
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 186, in initSANLock
    sanlock.init_lockspace(sdUUID, idsPath)
SanlockException: (22, 'Sanlock lockspace init failure', 'Invalid argument')
jsonrpc.Executor/4::WARNING::2016-10-18 20:47:33,624::sd::467::Storage.StorageDomainManifest::(initDomainLock) lease did not initialize successfully
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sd.py", line 462, in initDomainLock
    self._domainLock.initLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 222, in initLock
    initSANLock(self._sdUUID, self._idsPath, self._leasesPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 192, in initSANLock
    raise se.ClusterLockInitError()
ClusterLockInitError: Could not initialize cluster lock: ()

And in gluster volume logs:

[2016-10-19 00:47:32.397403] I [MSGID: 109036] [dht-common.c:8889:dht_log_new_layout_for_dir_selfheal] 0-engine-dht: Setting layout of /ee2bf992-ec7d-4e32-97b7-aa214aca587b/images with [Subvol_name: engine-replicate-0, Err: -1 , Start: 0 , Stop: 4294967295 , Hash: 1 ],
[2016-10-19 00:47:32.485218] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-engine-client-0: remote operation failed [Invalid argument]
[2016-10-19 00:47:32.485363] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-engine-client-2: remote operation failed [Invalid argument]
[2016-10-19 00:47:32.485384] W [MSGID: 114031] [client-rpc-fops.c:907:client3_3_writev_cbk] 0-engine-client-1: remote operation failed [Invalid argument]
[2016-10-19 00:47:33.614436] W [fuse-bridge.c:2305:fuse_writev_cbk] 0-glusterfs-fuse: 114: WRITE => -1 (Invalid argument)

While in gluster brick logs:

[2016-10-19 00:47:31.649817] I [MSGID: 115029] [server-handshake.c:690:server_setvolume] 0-engine-server: accepted client from virt04.gamull.com-5132-2016/10/19-00:47:31:603846-engine-client-0-0-0 (version: 3.7.9)
[2016-10-19 00:47:32.485087] E [MSGID: 113072] [posix.c:3317:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]
[2016-10-19 00:47:32.485149] E [MSGID: 115067] [server-rpc-fops.c:1436:server_writev_cbk] 0-engine-server: 310: WRITEV 0 (e8e871ba-17cf-4047-bdec-8cedaec8196b) ==> (Invalid argument) [Invalid argument]
By the way, although StorageDomain.create implicitly failed due to the ClusterLockInitError at comment 8,

jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:31,843::__init__::530::jsonrpc.JsonRpcServer::(_handle_request) Calling 'StorageDomain.create' in bridge with {u'name': u'hosted_storage', u'domainType': 7, u'domainClass': 1, u'typeArgs': u'10.1.200.84:/engine', u'version': 3, u'storagedomainID': u'ee2bf992-ec7d-4e32-97b7-aa214aca587b'}
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:31,844::task::597::Storage.TaskManager.Task::(_updateState) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::moving from state init -> state preparing
jsonrpc.Executor/4::INFO::2016-10-18 20:47:31,844::logUtils::49::dispatcher::(wrapper) Run and protect: createStorageDomain(storageType=7, sdUUID=u'ee2bf992-ec7d-4e32-97b7-aa214aca587b', domainName=u'hosted_storage', typeSpecificArg=u'10.1.200.84:/engine', domClass=1, domVersion=3, options=None)

it got reported as succeeded to the application (hosted-engine-setup), which subsequently tries to use it and hits other errors:

jsonrpc.Executor/4::ERROR::2016-10-18 20:47:33,623::clusterlock::191::Storage.initSANLock::(initSANLock) Cannot initialize SANLock for domain ee2bf992-ec7d-4e32-97b7-aa214aca587b
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 186, in initSANLock
    sanlock.init_lockspace(sdUUID, idsPath)
SanlockException: (22, 'Sanlock lockspace init failure', 'Invalid argument')
jsonrpc.Executor/4::WARNING::2016-10-18 20:47:33,624::sd::467::Storage.StorageDomainManifest::(initDomainLock) lease did not initialize successfully
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/sd.py", line 462, in initDomainLock
    self._domainLock.initLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 222, in initLock
    initSANLock(self._sdUUID, self._idsPath, self._leasesPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/clusterlock.py", line 192, in initSANLock
    raise se.ClusterLockInitError()
ClusterLockInitError: Could not initialize cluster lock: ()
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,625::hsm::2598::Storage.HSM::(createStorageDomain) knownSDs: {ee2bf992-ec7d-4e32-97b7-aa214aca587b: storage.glusterSD.findDomain}
jsonrpc.Executor/4::INFO::2016-10-18 20:47:33,625::logUtils::52::dispatcher::(wrapper) Run and protect: createStorageDomain, Return response: None
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,625::task::1193::Storage.TaskManager.Task::(prepare) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::finished: None
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,625::task::597::Storage.TaskManager.Task::(_updateState) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::moving from state preparing -> state finished
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::resourceManager::952::Storage.ResourceManager.Owner::(releaseAll) Owner.releaseAll requests {} resources {}
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::resourceManager::989::Storage.ResourceManager.Owner::(cancelAll) Owner.cancelAll requests {}
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::task::995::Storage.TaskManager.Task::(_decref) Task=`8d24a62c-c1c0-462c-9ccc-57552cae39ec`::ref 0 aborting False
jsonrpc.Executor/4::DEBUG::2016-10-18 20:47:33,626::__init__::555::jsonrpc.JsonRpcServer::(_handle_request) Return 'StorageDomain.create' in bridge with True
jsonrpc.Executor/4::INFO::2016-10-18 20:47:33,626::__init__::513::jsonrpc.JsonRpcServer::(_serveRequest) RPC call StorageDomain.create succeeded in 1.78 seconds

I think that in this case VDSM should have reported a clear error when StorageDomain.create exits, so there is probably also a VDSM bug hidden here. Nir?
Let me know what I can do to help. I am tempted to retry without the FusionIO cards, but I can't see how they would make a difference.

Hardware is 3 Dell PowerEdge 1950s. They have standard Broadcom NICs, with a dedicated management NIC and a dedicated storage NIC. RHEL runs off an internal USB drive and the bricks are using a pair of 1 TB SATA drives.

Do you think this is more network and timeout related? I may also retry the whole install from scratch. Maybe a typo somewhere?
David, any idea what the sanlock issue is?
gluster returns EINVAL for sanlock I/O. "fusionIO cards as the cache and for the virt01 engine volume" -- I assume that plays no role in the I/O path for sanlock leases.
(In reply to David Teigland from comment #12)
> gluster returns EINVAL for sanlock i/o.

On other setups, we successfully used it over gluster.
Created attachment 1212196 [details] ovirt-hosted-engine-setup log
Yes, in hosted-engine-setup's eyes the storage domain was correctly created, since VDSM didn't raise the ClusterLockInitError. It's not the root cause, but it's worth fixing as well: I'm opening a separate bug to track this.

2016-10-18 20:47:31 INFO otopi.plugins.gr_he_setup.storage.storage storage._misc:1285 Creating Storage Domain
2016-10-18 20:47:31 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStorageDomain:801 createStorageDomain
2016-10-18 20:47:33 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStorageDomain:835 {'status': {'message': 'Done', 'code': 0}}
2016-10-18 20:47:33 DEBUG otopi.plugins.gr_he_setup.storage.storage storage._createStorageDomain:837 {'status': {'message': 'Done', 'code': 0}, u'mdasize': 0, u'mdathreshold': True, u'mdavalid': True, u'diskfree': u'107279286272', u'disktotal': u'107321753600', u'mdafree': 0}
The following is from sanlock.log, same attempt, on virt05. I added the verbose flag.

2016-10-19 17:36:51-0400 519 [4727]: s1 delta_acquire done 250 1 499
2016-10-19 17:36:52-0400 520 [1106]: s1 add_lockspace done
2016-10-19 17:36:52-0400 520 [1106]: cmd_add_lockspace 2,8 done 0
2016-10-19 17:36:52-0400 520 [1101]: cmd_register ci 3 fd 11 pid 4405
2016-10-19 17:36:52-0400 520 [1107]: cmd_acquire 3,11,4405 ci_in 3 fd 11 count 1
2016-10-19 17:36:52-0400 520 [1107]: s1:r1 resource 893961e1-6fe1-476f-b593-74b9e4bef1e2:SDM:/rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmp722kOG/893961e1-6fe1-476f-b593-74b9e4bef1e2/dom_md/leases:1048576 for 3,11,4405
2016-10-19 17:36:52-0400 520 [1107]: r1 paxos_acquire begin 0 0 0
2016-10-19 17:36:52-0400 520 [1107]: r1 paxos_acquire leader 0 owner 0 0 0 max mbal[1999] 0 our_dblock 0 0 0 0 0 0
2016-10-19 17:36:52-0400 520 [1107]: r1 paxos_acquire leader 0 free
2016-10-19 17:36:52-0400 520 [1107]: r1 ballot 1 phase1 mbal 250
2016-10-19 17:36:52-0400 520 [1107]: r1 ballot 1 phase2 bal 250 inp 250 1 520 q_max -1
2016-10-19 17:36:52-0400 520 [1107]: r1 ballot 1 commit self owner 250 1 520
2016-10-19 17:36:52-0400 520 [1107]: r1 acquire_disk rv 1 lver 1 at 520
2016-10-19 17:36:52-0400 520 [1107]: cmd_acquire 3,11,4405 result 0 pid_dead 0
2016-10-19 17:36:52-0400 520 [1107]: cmd_add_lockspace 4,12 96860754-377e-4600-b9eb-cd73774c1f99:250:/rhev/data-center/mnt/glusterSD/10.1.200.85:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/ids:0 flags 0 timeout 0
2016-10-19 17:36:52-0400 520 [1107]: s2 lockspace 96860754-377e-4600-b9eb-cd73774c1f99:250:/rhev/data-center/mnt/glusterSD/10.1.200.85:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/ids:0
2016-10-19 17:36:52-0400 520 [4745]: s2 delta_acquire begin 96860754-377e-4600-b9eb-cd73774c1f99:250
2016-10-19 17:36:52-0400 520 [4745]: 96860754 aio collect 0 0x7f3a800008c0:0x7f3a800008d0:0x7f3a80101000 result -22:0 match res
2016-10-19 17:36:52-0400 520 [4745]: read_sectors delta_leader offset 127488 rv -22 /rhev/data-center/mnt/glusterSD/10.1.200.85:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/ids
2016-10-19 17:36:52-0400 520 [4745]: s2 delta_acquire leader_read1 error -22
2016-10-19 17:36:52-0400 521 [1101]: s1 host 250 1 499 8b7f00aa-2bcc-4ce1-a201-801975502ea1.virt05.gam
2016-10-19 17:36:53-0400 521 [1107]: s2 add_lockspace fail result -22
2016-10-19 17:36:53-0400 521 [1107]: cmd_add_lockspace 4,12 done -22
2016-10-19 17:36:53-0400 522 [1106]: cmd_release 3,11,4405 ci_in 3 fd 11 count 1 flags 0
2016-10-19 17:36:53-0400 522 [1106]: r1 paxos_release leader 1 owner 250 1 520
2016-10-19 17:36:53-0400 522 [1106]: r1 release_token done 1 r_flags 0 t_flags 0
2016-10-19 17:36:53-0400 522 [1106]: cmd_release 3,11,4405 result 0 pid_dead 0 count 1
2016-10-19 17:36:53-0400 522 [1107]: cmd_rem_lockspace 2,12 893961e1-6fe1-476f-b593-74b9e4bef1e2 flags 0
2016-10-19 17:36:53-0400 522 [1101]: s1 set killing_pids check 0 remove 1
2016-10-19 17:36:54-0400 522 [1101]: s1 all pids clear
2016-10-19 17:36:54-0400 522 [1101]: s1 set thread_stop
2016-10-19 17:36:54-0400 522 [1101]: s1 wdmd_test_live 0 0 to disable
2016-10-19 17:36:54-0400 523 [4727]: s1 delta_release begin 893961e1-6fe1-476f-b593-74b9e4bef1e2:250
2016-10-19 17:36:54-0400 523 [4727]: s1 delta_release done 250 1 0
2016-10-19 17:36:55-0400 523 [1101]: s1 free lockspace
2016-10-19 17:36:55-0400 524 [1107]: cmd_rem_lockspace 2,12 done 0
2016-10-19 17:36:55-0400 524 [1106]: cmd_rem_lockspace 4,9 893961e1-6fe1-476f-b593-74b9e4bef1e2 flags 0
2016-10-19 17:36:55-0400 524 [1106]: cmd_rem_lockspace 4,9 done -2
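For what it's worth, the failing read above ("read_sectors delta_leader offset 127488 rv -22") is at an offset that is 512-byte aligned but not 4K-aligned. O_DIRECT I/O must be aligned to the device's logical block size, so a device with 4K logical sectors would return EINVAL (-22) for exactly this access. A quick arithmetic check:

  echo $((127488 % 512)) $((127488 % 4096))   # prints: 0 512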
Directories get created and are visible on the other 2 nodes:

[root@virt05 ~]# ls -la /rhev/data-center/mnt/
glusterSD/  _var_lib_ovirt-hosted-engine-setup_tmp722kOG/
[root@virt05 ~]# ls -la /rhev/data-center/mnt/_var_lib_ovirt-hosted-engine-setup_tmp722kOG/
total 24
drwxr-xr-x. 4 vdsm kvm  4096 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm    73 Oct 19 17:36 ..
drwxr-xr-x. 4 vdsm kvm  4096 Oct 19 17:36 893961e1-6fe1-476f-b593-74b9e4bef1e2
-rwxr-xr-x. 1 vdsm kvm     0 Oct 19 17:36 __DIRECT_IO_TEST__
drwx------. 2 vdsm kvm 16384 Oct 19 17:36 lost+found
[root@virt05 ~]# ls -la /rhev/data-center/
2340891f-6886-4acf-bfe0-8a30db0f35a5/  mnt/
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/
total 12
drwxr-xr-x. 6 vdsm kvm  4096 Oct 19 17:36 .
drwxr-xr-x. 3 vdsm kvm    32 Oct 19 17:36 ..
drwxr-xr-x. 4 vdsm kvm  4096 Oct 19 17:36 96860754-377e-4600-b9eb-cd73774c1f99
-rwxr-xr-x. 1 vdsm kvm     0 Oct 19 17:36 __DIRECT_IO_TEST__
drwxr-xr-x. 3 root root 4096 Oct 19 17:11 .trashcan
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/96860754-377e-4600-b9eb-cd73774c1f99/
total 16
drwxr-xr-x. 4 vdsm kvm 4096 Oct 19 17:36 .
drwxr-xr-x. 6 vdsm kvm 4096 Oct 19 17:36 ..
drwxr-xr-x. 2 vdsm kvm 4096 Oct 19 17:36 dom_md
drwxr-xr-x. 2 vdsm kvm 4096 Oct 19 17:36 images
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/96860754-377e-4600-b9eb-cd73774c1f99/images/
total 8
drwxr-xr-x. 2 vdsm kvm 4096 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm 4096 Oct 19 17:36 ..
[root@virt05 ~]# ls -la /rhev/data-center/mnt/glusterSD/10.1.200.85\:_engine/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/
total 1033
drwxr-xr-x. 2 vdsm kvm     4096 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm     4096 Oct 19 17:36 ..
-rw-rw----. 1 vdsm kvm  1048576 Oct 19 17:36 ids
-rw-rw----. 1 vdsm kvm 16777216 Oct 19 17:36 inbox
-rw-rw----. 1 vdsm kvm        0 Oct 19 17:36 leases
-rw-r--r--. 1 vdsm kvm      301 Oct 19 17:36 metadata
-rw-rw----. 1 vdsm kvm 16777216 Oct 19 17:36 outbox
[root@virt05 ~]#
Other node, direct brick view:

[root@virt04 engine]# ls -la /rhgs/engine/brick/
total 0
drwxr-xr-x.  6 vdsm kvm  120 Oct 19 17:36 .
drwxr-xr-x.  3 vdsm kvm   18 Oct 19 17:09 ..
drwxr-xr-x.  4 vdsm kvm   32 Oct 19 17:36 96860754-377e-4600-b9eb-cd73774c1f99
-rwxr-xr-x.  2 vdsm kvm    0 Oct 19 17:36 __DIRECT_IO_TEST__
drw-------. 28 root root 344 Oct 19 17:36 .glusterfs
drwxr-xr-x.  2 root root   6 Oct 19 17:20 .shard
drwxr-xr-x.  3 root root  24 Oct 19 17:11 .trashcan
[root@virt04 engine]# ls -la /rhgs/engine/brick/96860754-377e-4600-b9eb-cd73774c1f99/
total 0
drwxr-xr-x. 4 vdsm kvm  32 Oct 19 17:36 .
drwxr-xr-x. 6 vdsm kvm 120 Oct 19 17:36 ..
drwxr-xr-x. 2 vdsm kvm  69 Oct 19 17:36 dom_md
drwxr-xr-x. 2 vdsm kvm   6 Oct 19 17:36 images
[root@virt04 engine]# ls -la /rhgs/engine/brick/96860754-377e-4600-b9eb-cd73774c1f99/dom_md/
total 1028
drwxr-xr-x. 2 vdsm kvm      69 Oct 19 17:36 .
drwxr-xr-x. 4 vdsm kvm      32 Oct 19 17:36 ..
-rw-rw----. 2 vdsm kvm 1048576 Oct 19 17:36 ids
-rw-rw----. 2 vdsm kvm       0 Oct 19 17:36 inbox
-rw-rw----. 2 vdsm kvm       0 Oct 19 17:36 leases
-rw-r--r--. 2 vdsm kvm     301 Oct 19 17:36 metadata
-rw-rw----. 2 vdsm kvm       0 Oct 19 17:36 outbox
[root@virt04 engine]#
Tom,
We are seeing the write error coming from the underlying filesystem (XFS) of gluster, based on the following log:

[2016-10-18 17:02:00.421938] E [MSGID: 113072] [posix.c:3317:posix_writev] 0-engine-posix: write failed: offset 0, [Invalid argument]

This can happen when a file opened with O_DIRECT is passed an offset/size which is not aligned. We see that the write begins at offset '0', which is aligned. Does anyone know the size of this particular write? Could we get an strace of the brick process when this issue is seen?

One way to figure it out is:

strace -ff -T -p <pid-of-brick> -o <path-to-the-file-where-you-want-the-output-saved>

This will create multiple strace files, one per thread. Please attach them when you do this and we can figure that out.
The other possibility is that the fd that the posix xlator in glusterfs is using is wrong. All these doubts will be clarified once we have the strace.
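For reference, the same class of failure can be reproduced outside gluster/sanlock with dd, which also opens the target with O_DIRECT when given oflag=direct. This is a sketch run directly against a brick filesystem (path taken from this setup):

  # 512-byte direct writes fail with "Invalid argument" if the underlying
  # device has 4K logical sectors, since O_DIRECT must be aligned to the
  # logical block size
  dd if=/dev/zero of=/rhgs/engine/brick/direct_io_probe bs=512 count=8 oflag=direct

  # 4096-byte direct writes should succeed either way
  dd if=/dev/zero of=/rhgs/engine/brick/direct_io_probe bs=4096 count=1 oflag=direct
  rm -f /rhgs/engine/brick/direct_io_probe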
Created attachment 1212511 [details] strace virt05.gamull.com node2
(In reply to Pranith Kumar K from comment #19)
> Tom,
> We are seeing the write error coming from underlying filesystem(XFS) of
> gluster based on the following log:
> [2016-10-18 17:02:00.421938] E [MSGID: 113072] [posix.c:3317:posix_writev]
> 0-engine-posix: write failed: offset 0, [Invalid argument]
>
> This can come when a file is opened with O_DIRECT passes offset/size which
> is not aligned. We are seeing that the write is beginning at the offset '0'
> which is aligned. Does anyone know the size of this particular write? Could
> we get the strace of the brick process when this issue is seen?
>
> One way to figure it out is:
> strace -ff -T -p <pid-of-brick> -o
> <path-to-the-file-where-you-want-the-output-saved>
>
> This will create multiple strace files per thread. Please attach them when
> you do this and we can figure that out.

Just uploaded
I can hop on IRC if that helps to give some live feedback; let me know.
Progress!!! I believe this ticket was due to an alignment issue. Here are my resolution steps.

Previously I had used two devices and wrote directly to the raw devices:
/dev/sda (RAID1 1TB)
/dev/fioa (FusionIO card)

pvcreate --dataalignment 1536K /dev/sda
pvcreate --dataalignment 1536K /dev/fioa

The above was wrong, I think. I removed all LVs, VGs and PVs, unmounted everything, and used fdisk to make a partition on each device, creating /dev/sda1 and /dev/fioa1.

To solve it, I ran the following to set up the storage correctly:

pvcreate --dataalignment 256K /dev/sda1
vgcreate RHS_vg1 /dev/sda1
lvcreate -n engine --size 100G RHS_vg1
mkfs.xfs -f -i size=512 -n size=8192 /dev/RHS_vg1/engine
lvcreate --thinpool RHS_vg1/vg1_pool --size 750G --chunksize 256K --poolmetadatasize 16G
pvcreate --dataalignment 256K /dev/fioa1
vgextend RHS_vg1 /dev/fioa1
lvchange --zero n RHS_vg1/vg1_pool
lvcreate -L 200G -n rhs_cache RHS_vg1 /dev/fioa1
lvcreate -L 2G -n rhs_cache_meta RHS_vg1 /dev/fioa1
lvconvert --type cache-pool --poolmetadata /dev/RHS_vg1/rhs_cache_meta --poolmetadataspare n --cachemode writethrough /dev/RHS_vg1/rhs_cache
lvconvert --type cache --cachepool /dev/RHS_vg1/rhs_cache /dev/RHS_vg1/vg1_pool
lvcreate --thin --name data --virtualsize 750G RHS_vg1/vg1_pool
mkfs.xfs -f -i size=512 -n size=8192 /dev/RHS_vg1/data
lvcreate --thin --name vmstore --virtualsize 750G RHS_vg1/vg1_pool
mkfs.xfs -f -i size=512 -n size=8192 /dev/RHS_vg1/vmstore

mkdir -p /rhgs/engine /rhgs/brick1 /rhgs/brick2
mount /dev/RHS_vg1/engine /rhgs/engine
mount /dev/RHS_vg1/vmstore /rhgs/brick1
mount /dev/RHS_vg1/data /rhgs/brick2

FSTAB:
/dev/RHS_vg1/engine /rhgs/engine xfs defaults 0 0
/dev/RHS_vg1/vmstore /rhgs/brick1 xfs defaults 0 0
/dev/RHS_vg1/data /rhgs/brick2 xfs defaults 0 0

mount -a
mkdir /rhgs/engine/enginebrick
mkdir /rhgs/brick1/vmstore
mkdir /rhgs/brick2/data
semanage fcontext -a -t glusterd_brick_t /rhgs/engine/enginebrick
semanage fcontext -a -t glusterd_brick_t /rhgs/brick1/vmstore
semanage fcontext -a -t glusterd_brick_t /rhgs/brick2/data
restorecon -Rv /rhgs

========== After doing this on all 3 nodes and verifying peer status ==========

gluster volume create engine replica 3 node1:/rhgs/engine/enginebrick node2:/rhgs/engine/enginebrick node3:/rhgs/engine/enginebrick
gluster volume set engine group virt
gluster volume set engine storage.owner-uid 36 && gluster volume set engine storage.owner-gid 36
gluster volume set engine features.shard on
gluster volume set engine features.shard-block-size 512MB
gluster volume set engine performance.low-prio-threads 32
gluster volume set engine cluster.data-self-heal-algorithm full
gluster volume set engine nfs.disable on
gluster volume set engine locking-scheme granular
gluster volume set engine cluster.shd-max-threads 8
gluster volume set engine cluster.shd-wait-qlength 10000
gluster volume set engine performance.strict-o-direct on
gluster volume set engine network.remote-dio off
gluster volume set engine network.ping-timeout 30
gluster volume set engine user.cifs off
gluster volume start engine

============ On node2/virt05 I ran the deploy and we have success!!! ============

[ INFO ] Stage: Transaction setup
[ INFO ] Stage: Misc configuration
[ INFO ] Stage: Package installation
[ INFO ] Stage: Misc configuration
[ INFO ] Configuring libvirt
[ INFO ] Configuring VDSM
[ INFO ] Starting vdsmd
[ INFO ] Creating Storage Domain
[ INFO ] Creating Storage Pool
[ INFO ] Connecting Storage Pool
[ INFO ] Verifying sanlock lockspace initialization
[ INFO ] Creating Image for 'hosted-engine.lockspace' ...
[ INFO ] Image for 'hosted-engine.lockspace' created successfully
[ INFO ] Creating Image for 'hosted-engine.metadata' ...
[ INFO ] Image for 'hosted-engine.metadata' created successfully
[ INFO ] Creating VM Image
[ INFO ] Extracting disk image from OVF archive (could take a few minutes depending on archive size)
[ INFO ] Validating pre-allocated volume size
[ INFO ] Uploading volume to data domain (could take a few minutes depending on archive size)
[ INFO ] Image successfully imported from OVF
[ INFO ] Destroying Storage Pool
[ INFO ] Start monitoring domain
[ INFO ] Configuring VM
[ INFO ] Updating hosted-engine configuration
[ INFO ] Stage: Transaction commit
[ INFO ] Stage: Closing up
The following network ports should be opened:
  tcp:5900
  tcp:5901
  tcp:9090
  udp:5900
  udp:5901
An example of the required configuration for iptables can be found at:
  /etc/ovirt-hosted-engine/iptables.example
In order to configure firewalld, copy the files from /etc/ovirt-hosted-engine/firewalld to /etc/firewalld/services and execute the following commands:
  firewall-cmd -service hosted-cockpit
  firewall-cmd -service hosted-console
[ INFO ] Creating VM
You can now connect to the VM with the following command:
  hosted-engine --console
You can also graphically connect to the VM from your system with the following command:
  remote-viewer vnc://virt05.gamull.com:5900
Use temporary password "*********" to connect to vnc console.
Please ensure that your Guest OS is properly configured to support serial console according to your distro documentation.
Follow http://www.ovirt.org/Serial_Console_Setup#I_need_to_access_the_console_the_old_way for more info.
If you need to reboot the VM you will need to start it manually using the command:
  hosted-engine --vm-start
You can then set a temporary password using the command:
  hosted-engine --add-console-password
[ INFO ] Running engine-setup on the appliance
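For anyone repeating the LVM steps above, one quick sanity check that the requested 256K data alignment actually took effect is to look at where the data area starts on each PV (pe_start should land on a 256K boundary):

  pvs -o +pe_start --units k /dev/sda1 /dev/fioa1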
This BZ was helpful - https://bugzilla.redhat.com/show_bug.cgi?id=958781
Although the install continues, I am unable to add the data domains, receiving the same error. This would be a new issue, I imagine, as this BZ is only for the install.

Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: WARNING:ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore:Unable to find OVF_STORE
Oct 20 21:56:37 virt05.gamull.com vdsm[3588]: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config ERROR Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config:Unable to get vm.conf from OVF_STORE, falling back to initial vm.conf
Oct 20 21:56:37 virt05.gamull.com sudo[23830]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib64/nagios/plugins/gluster/check_volume_status.py -v engine -t heal-info
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:37 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:39 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:39 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:41 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-broker[1043]: INFO:mgmt_bridge.MgmtBridge:Found bridge ovirtmgmt with ports
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:42 virt05.gamull.com vdsm[3588]: vdsm vds.dispatcher ERROR SSL error during reading data: unexpected eof
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Current state EngineStop (score: 3400)
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: /usr/lib/python2.7/site-packages/yajsonrpc/stomp.py:352: DeprecationWarning: Dispatcher.pending is deprecated. Use Dispatcher.socket.pending instead.
Oct 20 21:56:42 virt05.gamull.com ovirt-ha-agent[3590]: pending = getattr(dispatcher, 'pending', lambda: 0)
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: WARNING:engine_health.CpuLoadNoEngine:bad health status: Hosted Engine is not up!
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection established
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:ovirt_hosted_engine_ha.broker.listener.ConnectionHandler:Connection closed
Oct 20 21:56:43 virt05.gamull.com ovirt-ha-broker[1043]: INFO:mem_free.MemFree:memFree: 27405
^C
[root@virt05 ~]# hosted-engine --vm-status
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
Is your hosted engine VM installation (engine-setup) complete? From the above logs, it looks like it is not:

"ERROR SSL error during reading data: unexpected eof"

Are there errors in the engine-setup logs? Could you attach those and the vdsm logs?
strace attached from trying to add the vmstore domain to the engine. The engine install appears to complete, too. Uploading the log.
Created attachment 1212934 [details] strace virt04 from failed vmstore data domain attach
Created attachment 1212935 [details] Hosted Engine setup log
Also noticing this in the engine logs. I have added the two hosts, but they had certificate enrollment issues: I noticed the serial.txt in the pki directory was empty, so I used the old value and this worked. I also added an NFS share as a data domain just to get the engine working. Right now I still can't add the data or vmstore data domains.

2016-10-21 17:18:48,346 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] Failed in 'AttachStorageDomainVDS' method
2016-10-21 17:18:48,392 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [74971c04] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Cannot acquire host id: (u'10e3c90c-5fe8-4a10-a1ee-58904d91449c', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument'))
2016-10-21 17:18:48,392 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] IrsBroker::Failed::AttachStorageDomainVDS: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot acquire host id: (u'10e3c90c-5fe8-4a10-a1ee-58904d91449c', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')), code = 661
2016-10-21 17:18:48,392 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.AttachStorageDomainVDSCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] FINISH, AttachStorageDomainVDSCommand, log id: 3fa27488
2016-10-21 17:18:48,392 ERROR [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] Command 'org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IRSErrorException: IRSGenericException: IRSErrorException: Failed to AttachStorageDomainVDS, error = Cannot acquire host id: (u'10e3c90c-5fe8-4a10-a1ee-58904d91449c', SanlockException(22, 'Sanlock lockspace add failure', 'Invalid argument')), code = 661 (Failed with error AcquireHostIdFailure and code 661)
2016-10-21 17:18:48,394 INFO [org.ovirt.engine.core.bll.storage.domain.AttachStorageDomainToPoolCommand] (org.ovirt.thread.pool-6-thread-3) [74971c04] Command [id=1dc6c883-4ce1-4542-bfed-ea2f0d957bb0]: Compensating NEW_ENTITY_ID of org.ovirt.engine.core.common.businessentities.StoragePoolIsoMap; snapshot: StoragePoolIsoMapId:{storagePoolId='580a65da-02d1-0368-023f-0000000001af', storageId='10e3c90c-5fe8-4a10-a1ee-58904d91449c'}.
2016-10-21 17:18:48,408 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [74971c04] Correlation ID: 7722f8e1, Job ID: 337d0f50-3c1a-485f-a0d6-293c01d3d161, Call Stack: null, Custom Event ID: -1, Message: Failed to attach Storage Domain data to Data Center Default. (User: admin@internal-authz)
Created attachment 1212962 [details] ovirt-engine log

Engine log during the failed adds of the data and vmstore domains, showing the sanlock error. The engine domain seems to be there, but the two others (same gluster cluster) do not add. I can see the data under volumes and the bricks; I just can't add them to the storage data center.
Removing vg1_pool with data and vmstore and recreating them without the /dev/fioa1 (FusionIO) seems to have worked. I am unsure if it is the FusionIO card or some overprovisioning of the thin pool. I made vmstore and data 350G each. I plan to retest with cache again and the same numbers.
Reconfiguring the VG with cache and ensuring alignment on the FusionIO card still produces the error. This is likely a FusionIO driver issue (the iomemory-vsl driver). However, it is still problematic, as I would assume any PCIe storage device might experience a similar outcome. In short, I'm not sure if it's cache related or FusionIO related. I will try the following next:

1) Creating a data domain without cache on the FusionIO storage
2) Trying some combinations of cache on with FusionIO to narrow down the issue
I think I've found the issue, and it was completely due to 4K blocks on the FusionIO devices. I reformatted them (using fio-format) as 512K blocks, then added them in LVM and created a new PV, VG and LV. I formatted the LV without options:

mkfs.xfs -f /dev/iomemory_vg/iomemory

and then created the gluster volume. This worked, so I believe the issue for the FusionIO PCIe storage cards is that 4K blocks are not suitable for use with gluster replicas; you need to use 512K blocks.

In summary, this error is related to alignment issues, likely from using blocks too large for the device.

SOLUTION: reformat using 512K blocks for the device and for the filesystem. Do not use the dataalignment option.
512B blocks, not 512K?
Yes, 512-byte blocks, not 512K! 0.5K.
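A quick way to confirm what sector size the kernel actually reports for the card after the reformat (device name assumed to be /dev/fioa, as earlier in this thread):

  blockdev --getss /dev/fioa                    # logical sector size; should be 512 after the reformat
  blockdev --getpbsz /dev/fioa                  # physical sector size
  cat /sys/block/fioa/queue/logical_block_size  # same information via sysfs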
Data alignment values are configured via gdeploy while creating the PV, VG, and LV for gluster bricks. Closing this bug; please re-open against the gluster gdeploy component if issues are seen around the data alignment values chosen.
It seems that even when running with gdeploy the issue persists: the hosted-engine deploy fails with the same error. I am trying to deploy oVirt 4.1 using the latest CentOS 7. The servers have 4K-sector disks. Any idea how this can be overcome?