Bug 1370350 - Hosted Engine VM paused post replace-brick operation
Summary: Hosted Engine VM paused post replace-brick operation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: sharding
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Krutika Dhananjay
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks: 1351528 1392445 1392844 1392846 1392853
 
Reported: 2016-08-26 02:30 UTC by SATHEESARAN
Modified: 2017-03-23 05:45 UTC (History)
5 users (show)

Fixed In Version: glusterfs-3.8.4-4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1392445
Environment:
RHEV-RHGS Hyperconvergence (HCI)
Last Closed: 2017-03-23 05:45:56 UTC
Embargoed:


Links
System ID: Red Hat Product Errata RHSA-2017:0486 (Private: 0, Priority: normal, Status: SHIPPED_LIVE)
Summary: Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update
Last Updated: 2017-03-23 09:18:45 UTC

Description SATHEESARAN 2016-08-26 02:30:51 UTC
Description of problem:
-----------------------
After replacing the defunct brick of a replica 3 sharded volume, the hosted engine VM, whose image resides on that volume, went into a paused state.

The FUSE mount logs showed EIO.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHGS 3.1.3
RHV 4.0.2

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create a replica 3 sharded volume optimized for VM store
2. Create a hosted engine VM with this volume as its data domain
3. Once the hosted engine is up and operational, kill one of the bricks of the volume
4. Add a new node to the cluster
5. Replace the dead brick with a new brick from the newly added node (see the command sketch below)
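
For reference, a minimal command-line sketch of this flow. The volume name (enginevol), hostnames (host1-host4), and brick paths are hypothetical, and the 'group virt' profile is assumed to apply the VM-store tunings:

# Step 1: create and tune a replica 3 sharded volume (hypothetical hosts/paths)
gluster volume create enginevol replica 3 \
    host1:/gluster/brick1/enginevol \
    host2:/gluster/brick1/enginevol \
    host3:/gluster/brick1/enginevol
gluster volume set enginevol group virt          # VM-store option group
gluster volume set enginevol features.shard on   # explicit, in case the group omits it
gluster volume start enginevol

# Step 3: kill one brick process
gluster volume status enginevol                  # note the PID of the target brick
kill -9 <brick-pid>

# Step 5: replace the dead brick with one on the newly added node (host4)
gluster volume replace-brick enginevol \
    host1:/gluster/brick1/enginevol \
    host4:/gluster/brick1/enginevol commit force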

Actual results:
---------------
The hosted engine VM went into a paused state, and the FUSE mount showed EIO with the error 'Lookup on shard 3 failed.'

Expected results:
-----------------
There should be no error messages after the replace-brick operation, and the hosted engine VM should remain running.

Comment 3 SATHEESARAN 2016-08-26 02:45:40 UTC
[2016-08-25 08:16:44.373211] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373283] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.373343] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:16:44.374685] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:16:44.374734] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 3986689: READ => -1 (Input/output error)
[2016-08-25 08:37:22.183269] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-0: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183355] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.183414] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 2-enginevol-client-2: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [Invalid argument]
[2016-08-25 08:37:22.184586] E [MSGID: 133010] [shard.c:1582:shard_common_lookup_shards_cbk] 2-enginevol-shard: Lookup on shard 3 failed. Base file gfid = 853758b3-f79d-4114-86b4-c9e4fe1f97db [Input/output error]
[2016-08-25 08:37:22.184625] W [fuse-bridge.c:2224:fuse_readv_cbk] 0-glusterfs-fuse: 4023987: READ => -1 (Input/output error)

Comment 4 SATHEESARAN 2016-08-26 02:46:06 UTC
Comment 3 is a snippet from the FUSE mount log.

Comment 5 SATHEESARAN 2016-08-31 03:05:51 UTC
I couldn't hit the issue a second time when Krutika asked for debug-enabled logs.

Later, Krutika also confirmed that a community user was seeing this problem.

Comment 6 Krutika Dhananjay 2016-09-13 13:58:51 UTC
(In reply to SATHEESARAN from comment #5)
> I couldn't hit the issue a second time when Krutika asked for
> debug-enabled logs.
> 
> Later, Krutika also confirmed that a community user was seeing this problem.

I stand corrected. I realized later that that case involved granular-entry-heal being enabled and the same brick being wiped and healed in place. The issue there was a combination of dated documentation and the lack of reset-brick functionality (a quick sketch of both follows).
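
A quick sketch of the two features mentioned above, assuming the hypothetical volume 'enginevol' and brick path from earlier; note that reset-brick was not yet available at the time of this comment and arrived only in later releases:

# granular entry self-heal is a volume option
gluster volume set enginevol cluster.granular-entry-heal on

# reset-brick (later releases) reuses the same brick path after a wipe
gluster volume reset-brick enginevol host1:/gluster/brick1/enginevol start
# ... wipe/rebuild the brick filesystem ...
gluster volume reset-brick enginevol host1:/gluster/brick1/enginevol \
    host1:/gluster/brick1/enginevol commit force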

I did see logs of the kind you pasted in comment #3, with lookups failing with EINVAL, but no input/output errors.

-Krutika

Comment 7 Krutika Dhananjay 2016-09-13 14:06:52 UTC
Sas,

Do you have the logs from this run, of the bricks, shds and the clients?

-Krutika

Comment 8 SATHEESARAN 2016-10-14 10:55:56 UTC
(In reply to Krutika Dhananjay from comment #7)
> Sas,
> 
> Do you have the logs from this run, of the bricks, shds and the clients?
> 
> -Krutika

Hi Krutika,

I missed collecting the logs.
I will try to reproduce this issue.

But can you infer any problems from the error messages in comment 3?

Comment 9 Krutika Dhananjay 2016-10-24 07:08:04 UTC
Not quite, Sas. The EINVAL appears to be propagated by the brick(s), since protocol/client, which is the lowest layer in the client stack, is receiving EINVAL over the network.

-Krutika

Comment 11 Krutika Dhananjay 2016-11-07 14:24:29 UTC
Patch posted upstream for review: http://review.gluster.org/#/c/15788/
Moving this bug to POST state.

Comment 13 Krutika Dhananjay 2016-11-08 11:33:36 UTC
https://code.engineering.redhat.com/gerrit/#/c/89412/

Comment 15 SATHEESARAN 2016-12-26 11:00:36 UTC
Tested with an RHGS 3.2.0 interim build (glusterfs-3.8.4-9.el7rhgs) using the following steps:

1. Created an HC setup with the self-hosted engine backed by a replica 3 sharded volume
2. Added a new host to the cluster and prepared bricks on that host
3. Replaced one of the bricks of the replica volume with the new brick using the 'replace-brick' utility in the RHV UI, which triggers 'gluster volume replace-brick <vol-name> <source-brick> <dest-brick> commit force' (see the sketch below)

After the brick was replaced, self-heal was triggered and completed successfully. The engine VM remained healthy.
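
A sketch of the equivalent CLI verification, assuming the hypothetical volume 'enginevol' and brick paths used above; the replace-brick form is the one quoted in step 3:

# replace the brick (what the RHV UI triggers under the hood)
gluster volume replace-brick enginevol \
    host1:/gluster/brick1/enginevol \
    host4:/gluster/brick1/enginevol commit force

# monitor self-heal until no entries remain, then check the VM status
gluster volume heal enginevol info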

Comment 17 errata-xmlrpc 2017-03-23 05:45:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

