Bug 1409102 - [Arbiter] IO Failure and mount point inaccessible after killing a brick
Summary: [Arbiter] IO Failure and mount point inaccessible after killing a brick
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.2
Hardware: All
OS: Linux
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Milind Changire
QA Contact: Karan Sandha
Whiteboard: rebase
Depends On:
Blocks: 1503134
Reported: 2016-12-29 14:06 UTC by Karan Sandha
Modified: 2018-09-21 08:33 UTC
CC: 8 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-09-04 06:29:55 UTC
Target Upstream Version:


System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:32:05 UTC

Description Karan Sandha 2016-12-29 14:06:04 UTC
Description of problem:
The I/Os hung and the mount point became inaccessible after killing and then restarting a brick. Judging by the logs, this bug is quite similar to bug 1385605.

Version-Release number of selected component (if applicable):
Logs are placed at 
How reproducible:
Tried once

Steps to Reproduce:
1. Create a 3 x (2+1) arbiter volume.
2. Mount the volume over both the gNFS and FUSE protocols.
3. Create small files on the gNFS mount using the smallfile tool (multi-client).
4. Kill a brick, then start a small-file cleanup.
5. Force-start the volume to bring the killed brick back up.
6. Start a large-file workload with the FIO tool.
7. Trigger heal info on the server (a shell sketch of these steps follows).
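
A minimal shell sketch of the steps above, assuming a volume named "arbvol", servers server1..server3 with bricks under /bricks, and mount points /mnt/fuse and /mnt/gnfs; the smallfile and fio invocations are illustrative placeholders, not the exact commands used in the test:

    # 1. Create and start a 3 x (2+1) arbiter volume (every third brick in each set is the arbiter)
    gluster volume create arbvol replica 3 arbiter 1 \
        server{1..3}:/bricks/b1 server{1..3}:/bricks/b2 server{1..3}:/bricks/b3
    gluster volume start arbvol

    # 2. Mount the volume over FUSE and gNFS
    mount -t glusterfs server1:/arbvol /mnt/fuse
    mount -t nfs -o vers=3 server1:/arbvol /mnt/gnfs

    # 3. Create small files on the gNFS mount with the smallfile tool (run from multiple clients)
    python smallfile_cli.py --operation create --threads 8 --files 10000 --file-size 64 --top /mnt/gnfs

    # 4. Kill one brick process, then start the small-file cleanup
    gluster volume status arbvol        # note the PID of one brick process
    kill -9 <brick-pid>
    python smallfile_cli.py --operation cleanup --threads 8 --files 10000 --top /mnt/gnfs

    # 5. Force-start the volume to bring the killed brick back up
    gluster volume start arbvol force

    # 6. Start a large-file workload with fio on the FUSE mount
    fio --name=largefile --directory=/mnt/fuse --rw=write --bs=1M --size=10g

    # 7. Trigger heal info on a server
    gluster volume heal arbvol info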

Actual results:
Heal info hung
Mount point not accessible
IO tool reports I/O error

Expected results:
I/Os should run smoothly
No errors should be reported

Additional info:

Comment 3 Raghavendra G 2017-01-02 05:12:28 UTC
My gut feeling is that it's the same as bug [1]. [1] was hit when protocol/client received events in the order,


However, in this bz I think protocol/client received events in the order,


Though we need to think about whether such an ordering is even possible (there can be only one outstanding event per socket due to EPOLL_ONESHOT, but the events can be on different sockets, since transport/socket uses a new socket for every new connection). Another point to note is that [2] fixes [1] by making:

1. setting priv->connected = 0
2. notifying higher layers of the DISCONNECT event

atomic in rpc-client. However, if there are indeed racing events, what about a CONNECT and a DISCONNECT racing between transport/socket and rpc-client and changing the order? Is that possible? Something to ponder.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1385605
[2] http://review.gluster.org/15916

Comment 6 Karan Sandha 2017-01-03 12:01:59 UTC

This is reproducible only very intermittently, but when the issue is hit it leaves the whole system in a hung state. I have a statedump taken at the time the issue was hit; it is placed at the same location. pstack output was not taken.

Thanks & regards
Karan Sandha
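
For reference, a sketch of how this kind of debug data is typically collected on a Gluster setup, assuming the volume is named "arbvol"; the PIDs are placeholders taken from "gluster volume status" or ps:

    # Server side: statedumps of all brick processes (written to /var/run/gluster by default)
    gluster volume statedump arbvol

    # Client side: send SIGUSR1 to the glusterfs FUSE client process to make it dump its state
    kill -USR1 <glusterfs-client-pid>

    # Stack traces of a hung glusterfs/glusterfsd process
    pstack <glusterfs-pid>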

Comment 10 Raghavendra G 2017-09-01 09:29:58 UTC
Patches [1] and [2] are merged in rhgs-3.3.0. Should we close this bug as fixed?

[1] https://code.engineering.redhat.com/gerrit/#/c/99220/
[2] http://review.gluster.org/15916


Comment 16 errata-xmlrpc 2018-09-04 06:29:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

