Bug 1409102 - [Arbiter] IO Failure and mount point inaccessible after killing a brick
Summary: [Arbiter] IO Failure and mount point inaccessible after killing a brick
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Milind Changire
QA Contact: Karan Sandha
URL:
Whiteboard: rebase
Depends On:
Blocks: 1503134
 
Reported: 2016-12-29 14:06 UTC by Karan Sandha
Modified: 2018-09-21 08:33 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:29:55 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:32:05 UTC

Description Karan Sandha 2016-12-29 14:06:04 UTC
Description of problem:
I/O hung and the mount point became inaccessible after killing and then starting a brick. Judging from the logs, this bug is quite similar to bug 1385605.

Version-Release number of selected component (if applicable):
3.8.4-10
Logs are placed at 
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>
How reproducible:
Tried once

Steps to Reproduce:
1. Create a 3 x (2+1) arbiter volume.
2. Mount the volume over the gNFS and FUSE protocols.
3. Create small files on the gNFS mount using the smallfile tool (multi-client).
4. Kill a brick, then start a small-file cleanup.
5. Force-start the volume to bring the killed brick back up.
6. Start a large-file workload with the FIO tool.
7. Trigger heal info on the server.
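
A rough command-line sketch of the steps above; the volume name, brick paths, server names, and mount points are illustrative, and the smallfile and fio invocations are indicative only:

# 1. Create and start a 3 x (2+1) arbiter volume (every third brick in each set is the arbiter)
gluster volume create arbvol replica 3 arbiter 1 \
    server{1..3}:/bricks/b1 server{1..3}:/bricks/b2 server{1..3}:/bricks/b3
gluster volume start arbvol

# 2. Mount the volume over FUSE and gNFS (NFSv3)
mount -t glusterfs server1:/arbvol /mnt/fuse
mount -t nfs -o vers=3 server1:/arbvol /mnt/gnfs

# 3. Create small files on the gNFS mount with the smallfile tool
python smallfile_cli.py --operation create --top /mnt/gnfs --files 10000 --threads 8

# 4. Kill one brick process (PID taken from 'gluster volume status'), then start the cleanup
gluster volume status arbvol        # note the PID of the brick to kill
kill -9 <brick-pid>
python smallfile_cli.py --operation cleanup --top /mnt/gnfs --files 10000 --threads 8

# 5. Force-start the volume to bring the killed brick back up
gluster volume start arbvol force

# 6. Start a large-file workload with fio on the FUSE mount
fio --name=largefile --directory=/mnt/fuse --size=10G --rw=write --bs=1M

# 7. Trigger heal info on a server node
gluster volume heal arbvol info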

Actual results:
Heal info hung
Mount point not accessible
IO tool reports I/O error

Expected results:
I/O should run smoothly
No errors should be reported

Additional info:

Comment 3 Raghavendra G 2017-01-02 05:12:28 UTC
My gut feeling is that it's the same as bug [1]. [1] was hit when protocol/client received events in the order:

CONNECT
DISCONNECT
DISCONNECT
CONNECT

However, in this bz I think protocol/client received events in the order:

DISCONNECT
CONNECT
CONNECT
DISCONNECT

Though we need to think about whether such an ordering is possible (there can be only one outstanding event per socket due to EPOLL_ONESHOT, but the events can be on different sockets, since transport/socket uses a new socket for every new connection). Another point to note is that [2] fixes [1] by making the following two steps atomic in rpc-client (a rough sketch of this pattern follows the references below):

1. setting priv->connected=0
2. notifying higher layers of a DISCONNECT event

However, if there really are racing events, what about a CONNECT and a DISCONNECT racing between transport/socket and rpc-client and changing the order? Is that possible? Something to ponder.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1385605
[2] http://review.gluster.org/15916
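
A minimal, self-contained C sketch of the atomicity pattern described above. The structure and function names are illustrative, not the actual glusterfs rpc-clnt code: the point is that flipping the connected flag and delivering the DISCONNECT notification happen under the same lock, so a racing event handled on another thread cannot interleave between them.

/* Illustrative sketch only -- not the actual glusterfs rpc-clnt code.
 * It shows the pattern of making "set connected = 0" and "notify
 * higher layers of DISCONNECT" a single atomic step under one lock. */
#include <pthread.h>
#include <stdio.h>

typedef enum { EVT_CONNECT, EVT_DISCONNECT } conn_event_t;

typedef struct {
    pthread_mutex_t lock;
    int             connected;              /* plays the role of priv->connected */
    void          (*notify)(conn_event_t);  /* "higher layer" callback */
} fake_rpc_clnt_t;

/* Higher-layer notification (e.g. what protocol/client would receive). */
static void higher_layer_notify(conn_event_t ev)
{
    printf("higher layer saw %s\n",
           ev == EVT_CONNECT ? "CONNECT" : "DISCONNECT");
}

/* Racy version: flag update and notification are separate steps, so an
 * event handled on another thread can interleave between them. */
static void handle_disconnect_racy(fake_rpc_clnt_t *clnt)
{
    pthread_mutex_lock(&clnt->lock);
    clnt->connected = 0;
    pthread_mutex_unlock(&clnt->lock);
    /* <-- a CONNECT handled here would reorder what upper layers see */
    clnt->notify(EVT_DISCONNECT);
}

/* Fixed version: both steps happen atomically under the same lock. */
static void handle_disconnect_atomic(fake_rpc_clnt_t *clnt)
{
    pthread_mutex_lock(&clnt->lock);
    clnt->connected = 0;
    clnt->notify(EVT_DISCONNECT);
    pthread_mutex_unlock(&clnt->lock);
}

int main(void)
{
    fake_rpc_clnt_t clnt = {
        .lock      = PTHREAD_MUTEX_INITIALIZER,
        .connected = 1,
        .notify    = higher_layer_notify,
    };

    handle_disconnect_racy(&clnt);   /* window for reordering exists   */
    clnt.connected = 1;
    handle_disconnect_atomic(&clnt); /* flag + notification are atomic */
    return 0;
}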

Comment 6 Karan Sandha 2017-01-03 12:01:59 UTC
rjosoph,

This is only intermittently reproducible, but when the issue is hit it leaves the whole system in a hung state. I have the statedump taken at the time the issue was hit; it is placed at the same location. pstack output was not taken.

Thanks & regards
Karan Sandha

Comment 10 Raghavendra G 2017-09-01 09:29:58 UTC
Patches [1][2] are merged in rhgs-3.3.0. Should we close this bug as fixed?

[1] https://code.engineering.redhat.com/gerrit/#/c/99220/
[2] http://review.gluster.org/15916

regards,
Raghavendra

Comment 16 errata-xmlrpc 2018-09-04 06:29:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

