Bug 1409102 - [Arbiter] IO Failure and mount point inaccessible after killing a brick
Summary: [Arbiter] IO Failure and mount point inaccessible after killing a brick
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: rhgs-3.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Milind Changire
QA Contact: Karan Sandha
URL:
Whiteboard: rebase
Depends On:
Blocks: 1503134
 
Reported: 2016-12-29 14:06 UTC by Karan Sandha
Modified: 2018-09-21 08:33 UTC
CC List: 8 users

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:29:55 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:32:05 UTC

Description Karan Sandha 2016-12-29 14:06:04 UTC
Description of problem:
I/O hung and the mount point became inaccessible after killing and then starting a brick. Judging from the logs, this bug is quite similar to bug 1385605.

Version-Release number of selected component (if applicable):
3.8.4-10
Logs are placed at 
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>
How reproducible:
Tried once

Steps to Reproduce:
1. Create a 3 x (2+1) arbiter volume.
2. Mount the volume over the gNFS and FUSE protocols.
3. Create small files on the gNFS mount using the smallfile tool (multi-client).
4. Kill a brick, then start a small-file cleanup.
5. Force-start the volume to bring the killed brick back up.
6. Start a large-file workload with the FIO tool.
7. Trigger heal info on the server.
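
A rough command-line sketch of the steps above; the volume name, brick paths, server names, and mount points are illustrative, and the smallfile and fio invocations are indicative only:

# 1. Create and start a 3 x (2+1) arbiter volume (every third brick in each set is the arbiter)
gluster volume create arbvol replica 3 arbiter 1 \
    server{1..3}:/bricks/b1 server{1..3}:/bricks/b2 server{1..3}:/bricks/b3
gluster volume start arbvol

# 2. Mount the volume over FUSE and gNFS (NFSv3)
mount -t glusterfs server1:/arbvol /mnt/fuse
mount -t nfs -o vers=3 server1:/arbvol /mnt/gnfs

# 3. Create small files on the gNFS mount with the smallfile tool
python smallfile_cli.py --operation create --top /mnt/gnfs --files 10000 --threads 8

# 4. Kill one brick process (PID taken from 'gluster volume status'), then start the cleanup
gluster volume status arbvol        # note the PID of the brick to kill
kill -9 <brick-pid>
python smallfile_cli.py --operation cleanup --top /mnt/gnfs --files 10000 --threads 8

# 5. Force-start the volume to bring the killed brick back up
gluster volume start arbvol force

# 6. Start a large-file workload with fio on the FUSE mount
fio --name=largefile --directory=/mnt/fuse --size=10G --rw=write --bs=1M

# 7. Trigger heal info on a server node
gluster volume heal arbvol info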

Actual results:
Heal info hung
Mount point not accessible
IO tool reports I/O error

Expected results:
I/O should run smoothly
No errors should be reported

Additional info:

Comment 3 Raghavendra G 2017-01-02 05:12:28 UTC
My gut feeling is that it's the same as bug [1]. [1] was hit when protocol/client received events in the order:

CONNECT
DISCONNECT
DISCONNECT
CONNECT

However, in this bz I think protocol/client received events in the order:

DISCONNECT
CONNECT
CONNECT
DISCONNECT

Though we need to think about whether such an ordering is possible (there can be only one outstanding event per socket due to EPOLL_ONESHOT, but the events can be on different sockets, since transport/socket uses a new socket for every new connection). Another point to note is that [2] fixes [1] by making the following two steps atomic in rpc-client (a rough sketch of this pattern follows the references below):

1. setting priv->connected=0
2. notifying higher layers of a DISCONNECT event

However, if there really are racing events, what about a CONNECT and a DISCONNECT racing between transport/socket and rpc-client and changing the order? Is that possible? Something to ponder.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1385605
[2] http://review.gluster.org/15916
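
A minimal, self-contained C sketch of the atomicity pattern described above. The structure and function names are illustrative, not the actual glusterfs rpc-clnt code: the point is that flipping the connected flag and delivering the DISCONNECT notification happen under the same lock, so a racing event handled on another thread cannot interleave between them.

/* Illustrative sketch only -- not the actual glusterfs rpc-clnt code.
 * It shows the pattern of making "set connected = 0" and "notify
 * higher layers of DISCONNECT" a single atomic step under one lock. */
#include <pthread.h>
#include <stdio.h>

typedef enum { EVT_CONNECT, EVT_DISCONNECT } conn_event_t;

typedef struct {
    pthread_mutex_t lock;
    int             connected;              /* plays the role of priv->connected */
    void          (*notify)(conn_event_t);  /* "higher layer" callback */
} fake_rpc_clnt_t;

/* Higher-layer notification (e.g. what protocol/client would receive). */
static void higher_layer_notify(conn_event_t ev)
{
    printf("higher layer saw %s\n",
           ev == EVT_CONNECT ? "CONNECT" : "DISCONNECT");
}

/* Racy version: flag update and notification are separate steps, so an
 * event handled on another thread can interleave between them. */
static void handle_disconnect_racy(fake_rpc_clnt_t *clnt)
{
    pthread_mutex_lock(&clnt->lock);
    clnt->connected = 0;
    pthread_mutex_unlock(&clnt->lock);
    /* <-- a CONNECT handled here would reorder what upper layers see */
    clnt->notify(EVT_DISCONNECT);
}

/* Fixed version: both steps happen atomically under the same lock. */
static void handle_disconnect_atomic(fake_rpc_clnt_t *clnt)
{
    pthread_mutex_lock(&clnt->lock);
    clnt->connected = 0;
    clnt->notify(EVT_DISCONNECT);
    pthread_mutex_unlock(&clnt->lock);
}

int main(void)
{
    fake_rpc_clnt_t clnt = {
        .lock      = PTHREAD_MUTEX_INITIALIZER,
        .connected = 1,
        .notify    = higher_layer_notify,
    };

    handle_disconnect_racy(&clnt);   /* window for reordering exists   */
    clnt.connected = 1;
    handle_disconnect_atomic(&clnt); /* flag + notification are atomic */
    return 0;
}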

Comment 6 Karan Sandha 2017-01-03 12:01:59 UTC
rjosoph,

This is only intermittently reproducible, but when the issue is hit it leaves the whole system in a hung state. I have the statedump taken at the time the issue was hit; it is placed at the same location. pstack output was not taken.

Thanks & regards
Karan Sandha

Comment 10 Raghavendra G 2017-09-01 09:29:58 UTC
Patches [1][2] are merged in rhgs-3.3.0. Should we close this bug as fixed?

[1] https://code.engineering.redhat.com/gerrit/#/c/99220/
[2] http://review.gluster.org/15916

regards,
Raghavendra

Comment 16 errata-xmlrpc 2018-09-04 06:29:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

