Description of problem:
-----------------------
After replacing the defunct brick of a replica 3 sharded volume, the hosted engine VM running with its image on that volume went into a paused state. The FUSE mount logs showed EIO.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3
RHV 4.0.2

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 sharded volume optimized for VM store
2. Create a hosted engine VM with this volume as a 'data domain'
3. After the hosted engine is up and operational, kill one of the bricks of the volume
4. Add a new node to the cluster
5. Replace the old brick with the new brick from the newly added node

Actual results:
---------------
The hosted-engine VM went into a paused state; the FUSE mount showed EIO with the error 'Lookup on shard 3 failed.'

Expected results:
-----------------
There should not be any error messages after performing the replace-brick operation.
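The steps above can be sketched with the gluster CLI. This is a minimal illustration, not the exact commands used in the run: the hostnames (host1..host4) and brick paths are hypothetical, and 'enginevol' is taken from the volume name visible in the logs below.

```shell
# 1. Create a replica 3 volume and apply the 'virt' (VM store) profile,
#    which enables sharding among other options:
gluster volume create enginevol replica 3 \
    host1:/rhgs/brick1/enginevol \
    host2:/rhgs/brick1/enginevol \
    host3:/rhgs/brick1/enginevol
gluster volume set enginevol group virt
gluster volume start enginevol

# 3. Kill one of the brick processes (say, the one on host3):
gluster volume status enginevol   # note the PID of host3's brick
kill -9 <brick-pid>

# 4-5. Probe the new node and replace the dead brick with one on it:
gluster peer probe host4
gluster volume replace-brick enginevol \
    host3:/rhgs/brick1/enginevol \
    host4:/rhgs/brick1/enginevol commit force
```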
[2016-08-25 08:16:44.373211] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373283] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373343] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.374685] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:16:44.374734] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 3986689: READ => -1 (Input/output error)
[2016-08-25 08:37:22.183269] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183355] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183414] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.184586] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:37:22.184625] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 4023987: READ => -1 (Input/output error)
Comment 3 is the snippet from the FUSE mount log.
I couldn't reproduce the issue a second time, when Krutika asked for debug-enabled logs.

Later, Krutika also confirmed that a community user was seeing this problem too.
(In reply to SATHEESARAN from comment #5)
> I couldn't hit back for the second time, when Krutika asked for debug
> enabled logs.
>
> Later, Krutika also confirmed that the community user too seeing this problem

I stand corrected. I figured out later that the community user's case was with granular-entry-heal enabled, and one where the same brick was wiped and then healed. The issue there was a combination of outdated documentation and the lack of reset-brick functionality.

I did see logs of the kind you have pasted in comment #3, of lookups failing with EINVAL, but no input/output errors.

-Krutika
Sas,

Do you have the logs from this run, of the bricks, shds and the clients?

-Krutika
(In reply to Krutika Dhananjay from comment #7)
> Sas,
>
> Do you have the logs from this run, of the bricks, shds and the clients?
>
> -Krutika

Hi Krutika,

I missed collecting the logs; I will try to reproduce this issue. In the meantime, can you infer any problem from the error messages in comment #3?
Not quite, Sas. The EINVAL seems to be propagated by the brick(s), since protocol/client, which is the lowest layer in the client stack, is receiving EINVAL over the network.

-Krutika
Patch posted upstream for review - http://review.gluster.org/#/c/15788/

Moving this bug to POST state.
https://code.engineering.redhat.com/gerrit/#/c/89412/
Tested with RHGS 3.2.0 interim build (glusterfs-3.8.4-9.el7rhgs) with the following steps:

1. Created an HC setup with a self-hosted engine backed by a replica 3 sharded volume
2. Added the new host to the cluster and prepared bricks on the host
3. Replaced one of the bricks of the replica volume with the new brick using the 'replace-brick' utility from the RHV UI, which triggers 'gluster volume replace-brick <vol-name> <source-brick> <dest-brick> commit force'

After replacing the brick, self-heal was triggered and completed successfully. The engine VM remained healthy.
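For reference, the self-heal progress after the replace-brick can be watched from any node with the standard heal CLI. A minimal sketch, assuming the volume name 'enginevol' from the logs above:

```shell
# Entries still pending heal, per brick; an empty list on all bricks
# means the replaced brick has been fully resynced:
gluster volume heal enginevol info

# Per-brick count of entries pending heal:
gluster volume heal enginevol statistics heal-count
```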
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html