Bug 1370350 - Hosted Engine VM paused post replace-brick operation
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: sharding
Version: 3.1
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assigned To: Krutika Dhananjay
QA Contact: SATHEESARAN
Depends On:
Blocks: 1351528 1392445 1392844 1392846 1392853
 
Reported: 2016-08-25 22:30 EDT by SATHEESARAN
Modified: 2017-03-23 01:45 EDT
CC List: 5 users

See Also:
Fixed In Version: glusterfs-3.8.4-4
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1392445 (view as bug list)
Environment:
RHEV-RHGS Hyperconvergence (HCI)
Last Closed: 2017-03-23 01:45:56 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---




External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 05:18:45 EDT

Description SATHEESARAN 2016-08-25 22:30:51 EDT
Description of problem:
-----------------------
After replacing the defunct brick of a replica 3 sharded volume, the hosted engine VM running with its image on that volume went into a paused state.

The FUSE mount logs showed EIO.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3
RHV 4.0.2

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 sharded volume optimized for VM store
2. Create a hosted engine VM with this volume as a 'data domain'
3. After the hosted engine is up and operational, kill one of the bricks of the volume
4. Add a new node to the cluster
5. Replace the old brick with the new brick from the newly added node (see the CLI sketch after these steps)
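
For reference, a minimal shell sketch of steps 1, 3 and 5. The hostnames (host1-host4), brick paths and PID placeholder are hypothetical; only the volume name 'enginevol' is taken from the mount logs in comment 3.

# Step 1: create a replica 3 volume and apply the VM-store profile (enables features.shard)
gluster volume create enginevol replica 3 \
    host1:/rhgs/brick1/enginevol \
    host2:/rhgs/brick1/enginevol \
    host3:/rhgs/brick1/enginevol
gluster volume set enginevol group virt
gluster volume start enginevol

# Step 3: kill one brick process, e.g. the one on host3
gluster volume status enginevol        # note the PID of host3's brick
kill -15 <pid-of-host3-brick>

# Step 5: replace the dead brick with a brick on the newly added node host4
gluster volume replace-brick enginevol \
    host3:/rhgs/brick1/enginevol \
    host4:/rhgs/brick1/enginevol \
    commit force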

Actual results:
---------------
The hosted engine VM went into a paused state; the FUSE mount showed EIO with the error 'Lookup on shard 3 failed.'

Expected results:
-----------------
There should not be any error messages after performing the replace-brick operation.
Comment 3 SATHEESARAN 2016-08-25 22:45:40 EDT
[2016-08-25 08:16:44.373211] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373283] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373343] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.374685] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:16:44.374734] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 3986689: READ => -1 (Input/output error)
[2016-08-25 08:37:22.183269] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183355] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183414] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.184586] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:37:22.184625] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 4023987: READ => -1 (Input/output error)
Comment 4 SATHEESARAN 2016-08-25 22:46:06 EDT
Comment 3 is a snippet from the FUSE mount log.
Comment 5 SATHEESARAN 2016-08-30 23:05:51 EDT
I couldn't hit the issue a second time when Krutika asked for debug-enabled logs.

Later, Krutika also confirmed that a community user was seeing this problem.
Comment 6 Krutika Dhananjay 2016-09-13 09:58:51 EDT
(In reply to SATHEESARAN from comment #5)
> I couldn't hit the issue a second time when Krutika asked for debug-enabled
> logs.
> 
> Later, Krutika also confirmed that a community user was seeing this problem

I stand corrected. I figured out later that it was a case with granular-entry-heal enabled where the same brick was wiped and healed. The issue there was a combination of dated documentation and the lack of reset-brick functionality.

I did see logs of the kind you pasted in comment #3, with lookups failing with EINVAL, but no input/output errors.

-Krutika
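
For context, a hedged sketch of the two knobs mentioned above, using syntax from later Gluster releases where reset-brick is available; the volume name and brick path are hypothetical.

# Granular entry self-heal: track individual entries needing heal instead of full-directory scans
gluster volume set enginevol cluster.granular-entry-heal on

# Reset-brick: wipe and re-add the *same* brick in place, instead of replace-brick
gluster volume reset-brick enginevol host3:/rhgs/brick1/enginevol start
# ... recreate or clean the brick directory on host3 ...
gluster volume reset-brick enginevol \
    host3:/rhgs/brick1/enginevol host3:/rhgs/brick1/enginevol \
    commit force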
Comment 7 Krutika Dhananjay 2016-09-13 10:06:52 EDT
Sas,

Do you have the logs from this run, of the bricks, shds and the clients?

-Krutika
Comment 8 SATHEESARAN 2016-10-14 06:55:56 EDT
(In reply to Krutika Dhananjay from comment #7)
> Sas,
> 
> Do you have the logs from this run, of the bricks, shds and the clients?
> 
> -Krutika

Hi Krutika,

I have missed the logs.
I will try to reproduce this issue.

But could you infer any problems from the error messages in comment 3?
Comment 9 Krutika Dhananjay 2016-10-24 03:08:04 EDT
Not quite, Sas. The EINVAL seems to be propagated by the brick(s), since protocol/client, which is the lowest layer in the client stack, is receiving EINVAL over the network.

-Krutika
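
A hedged way to confirm where the EINVAL originates would be to correlate the brick logs with the client log around the failure timestamp; the paths below assume default log locations, and the client log name pattern is only illustrative since it depends on the mount path.

# On each storage server: look for EINVAL around 2016-08-25 08:16:44
grep -n 'Invalid argument' /var/log/glusterfs/bricks/*.log

# On the hypervisor: correlate with the FUSE mount log (name pattern illustrative)
grep -nE 'Invalid argument|Input/output error' /var/log/glusterfs/*glusterSD*.log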
Comment 11 Krutika Dhananjay 2016-11-07 09:24:29 EST
Patch posted upstream for review: http://review.gluster.org/#/c/15788/
Moving this bug to POST state.
Comment 13 Krutika Dhananjay 2016-11-08 06:33:36 EST
https://code.engineering.redhat.com/gerrit/#/c/89412/
Comment 15 SATHEESARAN 2016-12-26 06:00:36 EST
Tested with the RHGS 3.2.0 interim build (glusterfs-3.8.4-9.el7rhgs) with the following steps:

1. Created an HC setup with a self-hosted engine backed by a replica 3 sharded volume
2. Added the new host to the cluster and prepared bricks on the host
3. Replaced one of the bricks of the replica volume with the new brick using the 'replace-brick' utility in the RHV UI, which triggers 'gluster volume replace-brick <vol-name> <source-brick> <dest-brick> commit force'

After replacing the brick, self-heal was triggered and completed successfully. The engine VM remained healthy.
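
A hedged sketch of how the same checks look from the CLI, with the volume and brick names as hypothetical placeholders:

# Replace-brick as triggered by the RHV UI
gluster volume replace-brick enginevol \
    host3:/rhgs/brick1/enginevol \
    host4:/rhgs/brick1/enginevol \
    commit force

# Confirm self-heal has no pending entries and all bricks (including the new one) are online
gluster volume heal enginevol info
gluster volume status enginevol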
Comment 17 errata-xmlrpc 2017-03-23 01:45:56 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
