Bug 1624586
| Summary: | Gluster over infiniband causing host timeouts and failures | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | housemouse |
| Component: | Gluster | Assignee: | Gobinda Das <godas> |
| Status: | CLOSED WONTFIX | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.20.31 | CC: | bugs, housemouse, kdhananj, rkavunga, sabose |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Other | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-19 06:57:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Gluster | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description
housemouse
2018-09-01 21:36:22 UTC
Can you provide vdsm and gluster logs?

Created attachment 1486018 [details]
ovirt-11_vdsm.log

Created attachment 1486019 [details]
ovirt-11_glusterd.log

Created attachment 1486020 [details]
ovirt-12_vdsm.log

Created attachment 1486021 [details]
ovirt-12_glusterd.log

Created attachment 1486022 [details]
ovirt-13_vdsm.log

Created attachment 1486023 [details]
ovirt-13_glusterd.log
Is there any update on this, as I am still trying to get gluster and InfiniBand working reliably? The vdsm logs contain the following, which indicates the storage domain is not functioning as expected:

MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/glusterSD/10.0.14.11:_gluster__vmstore/68ad64a6-5d6a-4691-b0a3-929803674a54/dom_md/metadata', 1, 'Read timeout')

Are there corresponding errors in the gluster mount logs? (/var/log/glusterfs/rhev-data-center-mnt-glusterSD-10.0.14.11:_gluster__vmstore.log) Please provide the tar of /var/log/glusterfs from ovirt-11.

Created attachment 1503545 [details]
Folder with tar logs from 3 hosts
I had removed the gluster volume as I hadn't heard a response back. I just tried re-creating the gluster volume and could not connect to it via TCP or RDMA. It also caused issues with one of my hosts that was connected via iSER over InfiniBand to a separate SAN device.
I found an issue where SELinux was blocking gluster's access to InfiniBand. I had to create an exemption for it:
```
module infiniband_pkey 1.0;

require {
        type unlabeled_t;
        type glusterd_t;
        class infiniband_pkey access;
}

#============= glusterd_t ==============
#!!!! WARNING: 'unlabeled_t' is a base type.
allow glusterd_t unlabeled_t:infiniband_pkey access;
```
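For anyone reproducing this: a type-enforcement module like the one above is compiled and loaded with the standard policycoreutils workflow. A sketch follows; the file name is illustrative, and the build/install steps need root plus the checkpolicy and policycoreutils packages, so they are shown commented:

```shell
# Write the type-enforcement source to a file.
cat > infiniband_pkey.te <<'EOF'
module infiniband_pkey 1.0;

require {
        type unlabeled_t;
        type glusterd_t;
        class infiniband_pkey access;
}

allow glusterd_t unlabeled_t:infiniband_pkey access;
EOF

echo "wrote infiniband_pkey.te"

# Build and load (root-only, hence commented):
#   checkmodule -M -m -o infiniband_pkey.mod infiniband_pkey.te
#   semodule_package -o infiniband_pkey.pp -m infiniband_pkey.mod
#   semodule -i infiniband_pkey.pp
# Verify afterwards with: semodule -l | grep infiniband_pkey
```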
I noticed the following items in /var/log/messages while copying data to the gluster volume:
Nov 18 15:20:17 ovirt-11 sanlock[10869]: 2018-11-18 15:20:17 149684 [1516]: s3 delta_renew long write time 18 sec
Nov 18 15:20:30 ovirt-11 vdsm[24659]: WARN Worker blocked: <Worker name=jsonrpc/7 running <Task <JsonRpcTask {'params': {u'connectionParams': [{u'mnt_options': u'backup-volfile-servers=10.0.14.12:10.0.14.13,transport=rdma', u'id': u'00000000-0000-0000-0000-000000000000', u'connection': u'10.0.14.11:/gluster_hdd', u'iqn': u'', u'user': u'', u'tpgt': u'1', u'vfs_type': u'glusterfs', u'password': '********', u'port': u''}], u'storagepoolID': u'00000000-0000-0000-0000-000000000000', u'domainType': 7}, 'jsonrpc': '2.0', 'method': u'StoragePool.connectStorageServer', 'id': u'30254c00-159e-4227-9336-7e95d2a04436'} at 0x7fedbc425050> timeout=60, duration=10020 at 0x7fee1815b690> task#=8469 at 0x7fedbef15dd0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012 self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012 self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012 self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012 self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012 task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012 self._callable()#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 523, in __call__#012 self._handler(self._ctx, self._req)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 566, in _serveRequest#012 response = self._handle_request(req, ctx)#012File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 606, in _handle_request#012 res = method(**params)#012File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 193, in _dynamicMethod#012 result = fn(*methodArgs)#012File: 
"/usr/lib/python2.7/site-packages/vdsm/API.py", line 1068, in connectStorageServer#012 connectionParams)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/dispatcher.py", line 73, in wrapper#012 result = ctask.prepare(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 108, in wrapper#012 return m(self, *a, **kw)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 1179, in prepare#012 result = self._run(func, *args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run#012 return fn(*args, **kargs)#012File: "<string>", line 2, in connectStorageServer#012File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method#012 ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2412, in connectStorageServer#012 conObj.connect()#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/storageServer.py", line 171, in connect#012 self._mount.mount(self.options, self._vfsType, cgroup=self.CGROUP)#012File: "/usr/lib/python2.7/site-packages/vdsm/storage/mount.py", line 207, in mount#012 cgroup=cgroup)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 55, in __call__#012 return callMethod()#012File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 53, in <lambda>#012 **kwargs)#012File: "<string>", line 2, in mount#012File: "/usr/lib64/python2.7/multiprocessing/managers.py", line 759, in _callmethod#012 kind, result = conn.recv()
Nov 18 15:20:32 ovirt-11 sanlock[10869]: 2018-11-18 15:20:32 149699 [1516]: s3 delta_renew long write time 12 sec
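One way to take vdsm out of the picture when chasing these timeouts is to mount the volume manually with the same options that the blocked connectStorageServer call above is using. A sketch (the mount point is illustrative; the mount itself needs root, so it is only printed here):

```shell
# Reconstruct the mount command from the connectionParams in the vdsm log:
# connection 10.0.14.11:/gluster_hdd with
# backup-volfile-servers and transport=rdma mount options.
build_mount_cmd() {
    printf 'mount -t glusterfs -o %s %s %s\n' \
        'backup-volfile-servers=10.0.14.12:10.0.14.13,transport=rdma' \
        '10.0.14.11:/gluster_hdd' \
        '/mnt/gluster-test'
}

build_mount_cmd

# After running the printed command as root, timing a read of the storage
# domain metadata (the file named in the MiscFileReadException) shows
# whether the slowness is in gluster itself rather than in vdsm.
```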
Sorry, not enough info to understand what's going wrong here. Any errors in gluster mount logs or vdsm logs related to this? We do not actively test the InfiniBand access, hence there could be unknown issues here.

I have updated the link for the folder with the tar logs of the 3 hosts. When I was watching some of the logs during transfers, I was seeing possible sanlock issues with delays.

[2018-11-18 21:40:20.457735] I [MSGID: 133017] [shard.c:4941:shard_seek] 0-gluster_hdd-shard: seek called on d10cfb25-7948-4cec-8273-88e7467acd38. [Operation not supported]
[2018-11-18 21:40:23.627200] I [MSGID: 133017] [shard.c:4941:shard_seek] 0-gluster_hdd-shard: seek called on 9e519c74-e8ef-41d5-af01-58aa03f3f951. [Operation not supported]
[2018-11-18 21:40:23.627505] I [MSGID: 133017] [shard.c:4941:shard_seek] 0-gluster_hdd-shard: seek called on d10cfb25-7948-4cec-8273-88e7467acd38. [Operation not supported]

The mount logs have the above errors; is it related?

(In reply to Sahina Bose from comment #14)
> [2018-11-18 21:40:20.457735] I [MSGID: 133017] [shard.c:4941:shard_seek]
> 0-gluster_hdd-shard: seek called on d10cfb25-7948-4cec-8273-88e7467acd38.
> [Operation not supported]
> [2018-11-18 21:40:23.627200] I [MSGID: 133017] [shard.c:4941:shard_seek]
> 0-gluster_hdd-shard: seek called on 9e519c74-e8ef-41d5-af01-58aa03f3f951.
> [Operation not supported]
> [2018-11-18 21:40:23.627505] I [MSGID: 133017] [shard.c:4941:shard_seek]
> 0-gluster_hdd-shard: seek called on d10cfb25-7948-4cec-8273-88e7467acd38.
> [Operation not supported]
>
> The mount logs have above errors, is it related?

Also glusterd logs have errors:

[2018-11-21 15:36:28.381085] W [socket.c:3272:socket_connect] 0-glustershd: Ignore failed connection attempt on , (No such file or directory)
[2018-11-21 15:36:35.383659] W [socket.c:3272:socket_connect] 0-glustershd: Ignore failed connection attempt on , (No such file or directory)

Since we do not test RDMA for oVirt HC, I'm not very sure how far we can debug this.

(In reply to Sahina Bose from comment #14)
> [2018-11-18 21:40:20.457735] I [MSGID: 133017] [shard.c:4941:shard_seek]
> 0-gluster_hdd-shard: seek called on d10cfb25-7948-4cec-8273-88e7467acd38.
> [Operation not supported]
> [2018-11-18 21:40:23.627200] I [MSGID: 133017] [shard.c:4941:shard_seek]
> 0-gluster_hdd-shard: seek called on 9e519c74-e8ef-41d5-af01-58aa03f3f951.
> [Operation not supported]
> [2018-11-18 21:40:23.627505] I [MSGID: 133017] [shard.c:4941:shard_seek]
> 0-gluster_hdd-shard: seek called on d10cfb25-7948-4cec-8273-88e7467acd38.
> [Operation not supported]
>
> The mount logs have above errors, is it related?

Unlikely, given fuse didn't log the failure. Rafi, any idea what could be going wrong?

(In reply to Sahina Bose from comment #17)
> Rafi, any idea what could be going wrong?
I'm not seeing any errors in the mount log (vm-store, hdd), but I can see a lot of errors related to the network (SELinux, network configuration), like:

On ovirt-12 and ovirt-13:

[2018-09-23 01:32:54.240670] W [socket.c:600:__socket_rwv] 0-glusterfs: readv on 10.0.14.11:24007 failed (No data available)
[2018-09-23 01:32:54.240730] I [glusterfsd-mgmt.c:2281:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.0.14.11
[2018-09-23 01:32:54.245129] I [glusterfsd-mgmt.c:2323:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.0.14.12
[2018-09-23 01:33:04.771205] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2018-09-23 01:33:49.647469] W [MSGID: 103070] [rdma.c:4317:gf_rdma_handle_failed_send_completion] 0-rpc-transport/rdma: send work request on `qib0' returned error wc.status = 12, wc.vendor_err = 0, post->buf = 0x7fd7f5f3f000, wc.byte_len = 196, post->reused = 23803
[2018-09-23 01:33:49.647541] E [MSGID: 102012] [rdma.c:4326:gf_rdma_handle_failed_send_completion] 0-rdma: connection between client and server not working. check by running 'ibv_srq_pingpong'. also make sure subnet manager is running (eg: 'opensm'), or check if rdma port is valid (or active) by running 'ibv_devinfo'. contact Gluster Support Team if the problem persists.
[2018-09-23 01:34:15.769914] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gluster_vmstore-client-0: server 10.0.14.11:49152 has not responded in the last 30 seconds, disconnecting.
[2018-09-23 01:34:33.375215] W [MSGID: 103070] [rdma.c:4317:gf_rdma_handle_failed_send_completion] 0-rpc-transport/rdma: send work request on `qib0' returned error wc.status = 5, wc.vendor_err = 0, post->buf = 0x7fd7f5f3f000, wc.byte_len = 128, post->reused = 23806

On ovirt-11:

[2018-11-09 04:15:00.073853] C [MSGID: 103021] [rdma.c:3263:gf_rdma_create_qp] 0-gluster_hdd-client-0: gluster_hdd-client-0: could not create QP [Permission denied]
[2018-11-09 04:15:00.073960] W [MSGID: 103021] [rdma.c:1101:gf_rdma_cm_handle_route_resolved] 0-gluster_hdd-client-0: could not create QP (peer:10.0.14.11:24008 me:10.0.14.11:49151)
[2018-11-09 04:18:50.167359] W [socket.c:600:__socket_rwv] 0-glusterfs: readv on 10.0.14.11:24007 failed (No data available)
[2018-11-09 04:18:50.167408] I [glusterfsd-mgmt.c:2277:mgmt_rpc_notify] 0-glusterfsd-mgmt: disconnected from remote-host: 10.0.14.11
[2018-11-09 04:18:50.167440] I [glusterfsd-mgmt.c:2319:mgmt_rpc_notify] 0-glusterfsd-mgmt: connecting to next volfile server 10.0.14.12

Closing as this RDMA support is not a priority.
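For anyone who lands here with the same failed-send-completion errors: the gluster log message itself names the standard InfiniBand diagnostics. A sketch of checking for them on a host; the checks degrade gracefully when the tools are absent, and the real hardware-level checks (which need IB hardware and root) are listed as comments:

```shell
# Report which of the RDMA diagnostics mentioned in the gluster error
# message are installed on this host.
check_rdma_tools() {
    for tool in ibv_devinfo ibv_srq_pingpong opensm; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: available"
        else
            echo "$tool: not installed"
        fi
    done
}

check_rdma_tools

# With InfiniBand hardware present, the actual checks would be:
#   ibv_devinfo            # port state should show PORT_ACTIVE
#   ibv_srq_pingpong       # run on two hosts to exercise basic RDMA traffic
#   pgrep opensm           # a subnet manager must run somewhere on the fabric
```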