Bug 1370350

Summary: Hosted Engine VM paused post replace-brick operation
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: sharding
Assignee: Krutika Dhananjay <kdhananj>
Status: CLOSED ERRATA
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: pkarampu, rhinduja, rhs-bugs, sasundar, storage-qa-internal
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-3.8.4-4
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1392445 (view as bug list)
Environment: RHEV-RHGS Hyperconvergence (HCI)
Last Closed: 2017-03-23 05:45:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1351528, 1392445, 1392844, 1392846, 1392853

Description SATHEESARAN 2016-08-26 02:30:51 UTC
Description of problem:
-----------------------
After replacing the defunct brick of replica 3 sharded volume, the hosted engine VM running with its image on that volume went to paused state.

FUSE mount logs showed EIO.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3
RHV 4.0.2

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 sharded volume optimized for VM store
2. Create a hosted engine VM with this volume as a 'data domain'
3. After hosted-engine up and operational, kill one of the bricks of the volume
4. Add a new node to the cluster
5. Replace the old-brick with the new-brick from the newly added node to the cluster
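The steps above can be sketched with the gluster CLI as follows. This is a minimal illustration, not the exact commands from the test run: the volume name 'enginevol' is taken from the mount logs in comment 3, but the hostnames and brick paths are assumptions, and step 1 is approximated with the 'virt' group profile, which applies the VM-store optimizations (including sharding).

```shell
# Step 1: create a replica 3 volume and apply the VM-store profile
# (hostnames and brick paths below are hypothetical)
gluster volume create enginevol replica 3 \
    host1:/rhgs/brick1 host2:/rhgs/brick1 host3:/rhgs/brick1
gluster volume set enginevol group virt   # enables sharding among other options
gluster volume start enginevol

# Step 3: kill one brick process of the volume
# (find the brick PID via 'gluster volume status enginevol')
kill -9 <brick-pid>

# Step 5: after peer-probing the new node, replace the dead brick
gluster volume replace-brick enginevol \
    host1:/rhgs/brick1 host4:/rhgs/brick1 commit force
```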

Actual results:
---------------
The hosted engine VM went to paused state; the FUSE mount showed EIO with the error 'Lookup on shard 3 failed.'

Expected results:
-----------------
There shouldn't be any error messages after performing the replace-brick operation.

Comment 3 SATHEESARAN 2016-08-26 02:45:40 UTC
[2016-08-25 08:16:44.373211] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373283] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373343] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.374685] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:16:44.374734] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 3986689: READ => -1 (Input/output error)
[2016-08-25 08:37:22.183269] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183355] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183414] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.184586] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:37:22.184625] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 4023987: READ => -1 (Input/output error)

Comment 4 SATHEESARAN 2016-08-26 02:46:06 UTC
Comment 3 is a snippet from the FUSE mount log.

Comment 5 SATHEESARAN 2016-08-31 03:05:51 UTC
I couldn't hit the issue a second time when Krutika asked for debug-enabled logs.

Later, Krutika also confirmed that a community user was seeing this problem.

Comment 6 Krutika Dhananjay 2016-09-13 13:58:51 UTC
(In reply to SATHEESARAN from comment #5)
> I couldn't hit the issue a second time when Krutika asked for
> debug-enabled logs.
> 
> Later, Krutika also confirmed that a community user was seeing this problem.

I stand corrected. I figured later that it was with granular-entry-heal enabled and a case where the same brick was wiped off and healed. The issue there was a combination of dated documentation and the lack of reset-brick functionality.

I did see logs of the kind you have pasted in comment #3, of lookups failing with EINVAL but no input/output errors.

-Krutika

Comment 7 Krutika Dhananjay 2016-09-13 14:06:52 UTC
Sas,

Do you have the logs from this run, of the bricks, shds and the clients?

-Krutika

Comment 8 SATHEESARAN 2016-10-14 10:55:56 UTC
(In reply to Krutika Dhananjay from comment #7)
> Sas,
> 
> Do you have the logs from this run, of the bricks, shds and the clients?
> 
> -Krutika

Hi Krutika,

I have missed the logs.
I will try to reproduce this issue.

But could you guess at the problem from the error messages in comment #3?

Comment 9 Krutika Dhananjay 2016-10-24 07:08:04 UTC
Not quite, Sas. The EINVAL seems to be getting propagated by the brick(s), since protocol/client, which is the lowest layer in the client stack, is receiving EINVAL over the network.

-Krutika

Comment 11 Krutika Dhananjay 2016-11-07 14:24:29 UTC
Patch posted in upstream for review - http://review.gluster.org/#/c/15788/
Moving this bug to POST state.

Comment 13 Krutika Dhananjay 2016-11-08 11:33:36 UTC
https://code.engineering.redhat.com/gerrit/#/c/89412/

Comment 15 SATHEESARAN 2016-12-26 11:00:36 UTC
Tested with RHGS 3.2.0 interim build ( glusterfs-3.8.4-9.el7rhgs ) with the following steps:

1. Created HC setup with self-hosted engine backed with replica 3 sharded volume
2. Added the new host to the cluster and prepared bricks on the host
3. Replaced one of the bricks of the replica volume with the new brick using the 'replace-brick' utility in the RHV UI, which triggers 'gluster volume replace-brick <vol-name> <source-brick> <dest-brick> commit force'

Post replacing the brick, self-heal triggered and completed successfully. Engine VM was healthy.
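The post-replace health check described above can be sketched as follows. Again the volume name 'enginevol' comes from the logs in comment 3; these are standard gluster commands, but the exact checks used in this verification run are not recorded in the bug, so this is an assumed workflow.

```shell
# Confirm self-heal has completed after the replace-brick:
# every brick should eventually report zero pending entries
gluster volume heal enginevol info

# Confirm all bricks (including the replacement) are online
gluster volume status enginevol
```

If heal entries remain stuck, a manual 'gluster volume heal enginevol' can be issued to retrigger the self-heal daemon.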

Comment 17 errata-xmlrpc 2017-03-23 05:45:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html