Bug 1409102 - [Arbiter] IO Failure and mount point inaccessible after killing a brick
Status: CLOSED ERRATA
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rpc
Version: 3.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assigned To: Milind Changire
QA Contact: Karan Sandha
Whiteboard: rebase
Keywords: ZStream
Depends On:
Blocks: 1503134
Reported: 2016-12-29 09:06 EST by Karan Sandha
Modified: 2018-09-21 04:33 EDT

Fixed In Version: glusterfs-3.12.2-1
Last Closed: 2018-09-04 02:29:55 EDT
Type: Bug




External Trackers:
Red Hat Product Errata RHSA-2018:2607 (last updated 2018-09-04 02:32 EDT)

Description Karan Sandha 2016-12-29 09:06:04 EST
Description of problem:
I/O hung and the mount point became inaccessible after killing and then restarting a brick. Going by the logs, this bug is quite similar to bug 1385605.

Version-Release number of selected component (if applicable):
3.8.4-10
Logs are placed at:
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

How reproducible:
Tried once

Steps to Reproduce:
1. Create a 3 x (2+1) arbiter volume.
2. Mount the volume over both gNFS and FUSE.
3. Create small files with the smallfile tool (multi-client) on the gNFS mount.
4. Kill a brick, then start the small-file cleanup.
5. Force-start the volume to bring the killed brick back online.
6. Start a large-file workload with the FIO tool.
7. Trigger heal info on the server. (A rough script for these steps follows the list.)
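
For reference, here is a rough reproduction sketch (Python driving the CLI through subprocess). It is not the exact commands used for this report: the volume name, server names, brick paths, mount points and tool options are illustrative, and in practice the mount/smallfile/fio steps run on client machines while the volume and brick operations run on the servers.

#!/usr/bin/env python
# Rough reproduction sketch; not the exact commands used for this report.
# Assumes an existing trusted storage pool, the smallfile tool and fio being
# available, and illustrative volume/server/brick/mount names.
import subprocess

VOL = "arbvol"
SERVERS = ["server1", "server2", "server3"]

def sh(cmd):
    print("+ " + cmd)
    return subprocess.call(cmd, shell=True)

# 1. Create a 3 x (2+1) arbiter volume: three replica sets of 2 data bricks + 1 arbiter.
bricks = " ".join("%s:/bricks/%s/sub%d" % (srv, VOL, sub)
                  for sub in range(3) for srv in SERVERS)
sh("gluster volume create %s replica 3 arbiter 1 %s" % (VOL, bricks))
sh("gluster volume start %s" % VOL)

# 2. Mount the volume over gNFS and FUSE (on the clients).
sh("mkdir -p /mnt/gnfs /mnt/fuse")
sh("mount -t nfs -o vers=3 %s:/%s /mnt/gnfs" % (SERVERS[0], VOL))
sh("mount -t glusterfs %s:/%s /mnt/fuse" % (SERVERS[0], VOL))

# 3. Small-file create workload on the gNFS mount (the original run used
#    multiple gNFS clients).
sh("mkdir -p /mnt/gnfs/smf")
sh("python smallfile_cli.py --operation create --threads 8 --files 10000 --top /mnt/gnfs/smf")

# 4. Kill one brick process (on the server that hosts it; the brick PID is the
#    last column of 'gluster volume status'), then run the small-file cleanup.
sh("kill -9 $(gluster volume status %s | awk '/sub0/{print $NF; exit}')" % VOL)
sh("python smallfile_cli.py --operation cleanup --threads 8 --files 10000 --top /mnt/gnfs/smf")

# 5. Force-start the volume to bring the killed brick back online.
sh("gluster volume start %s force" % VOL)

# 6. Large-file write from fio on the FUSE mount.
sh("fio --name=large --directory=/mnt/fuse --rw=write --bs=1M --size=10g")

# 7. Trigger heal info on a server and watch whether it hangs.
sh("gluster volume heal %s info" % VOL)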

Actual results:
Heal info hung
Mount point not accessible
IO tool reports I/O error

Expected results:
I/O should run smoothly
No errors should be reported

Additional info:
Comment 3 Raghavendra G 2017-01-02 00:12:28 EST
My gut feeling is that it's the same as bug [1]. [1] was hit when protocol/client received events in the order:

CONNECT
DISCONNECT
DISCONNECT
CONNECT

However, in this bz I think protocol/client received events in the order:

DISCONNECT
CONNECT
CONNECT
DISCONNECT

Though we need to think about whether such an ordering is possible (there can be only one outstanding event per socket due to EPOLL_ONESHOT, but the events can be on different sockets, since transport/socket uses a new socket for every new connection). Another point to note is that [2] fixes [1] by making:

1. setting priv->connected = 0
2. notifying higher layers of the DISCONNECT event

atomic in rpc-client. However, if there are indeed racing events, what about a CONNECT and a DISCONNECT racing between transport/socket and rpc-client and changing the order? Is it possible? Something to ponder. (A toy model of this ordering question follows the links below.)

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1385605
[2] http://review.gluster.org/15916
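
As a toy model of the ordering question, the sketch below uses Python threads to stand in for epoll threads delivering notifications from two different sockets; it is illustrative only and is not the actual rpc-client C code. Holding one lock across "update the connected flag" and "notify the upper layer" is what makes the two steps atomic, which is the essence of what [2] does on the DISCONNECT path.

import threading

class ToyRpcClient(object):
    def __init__(self):
        self.lock = threading.Lock()
        self.connected = 0
        self.seen = []                    # (event, connected-flag) pairs seen upstairs

    def notify_upper(self, event):
        # Stand-in for notifying protocol/client (and AFR above it).
        self.seen.append((event, self.connected))

    def on_connect(self):
        # Delivered by the epoll thread that owns the new socket.
        with self.lock:                   # flag update + upcall as one unit
            self.connected = 1
            self.notify_upper("CONNECT")

    def on_disconnect(self):
        # Delivered by the epoll thread that owns the old socket.
        with self.lock:
            self.connected = 0
            self.notify_upper("DISCONNECT")

if __name__ == "__main__":
    rpc = ToyRpcClient()
    t1 = threading.Thread(target=rpc.on_disconnect)   # old socket
    t2 = threading.Thread(target=rpc.on_connect)      # new socket
    t1.start(); t2.start()
    t1.join(); t2.join()
    # Either ordering of the two events is still possible, but with the lock
    # the flag value the upper layer observes always matches the event it was
    # just told about; without the lock the two steps could interleave.
    print(rpc.seen)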
Comment 6 Karan Sandha 2017-01-03 07:01:59 EST
rjosoph,

This is only intermittently reproducible, but when the issue is hit it leaves the whole system in a hung state. I have a statedump taken at the time the issue was hit; it is placed at the same location. pstack output was not taken.

Thanks & regards
Karan Sandha
Comment 10 Raghavendra G 2017-09-01 05:29:58 EDT
Patches [1] and [2] have been merged in rhgs-3.3.0. Should we close this bug as fixed?

[1] https://code.engineering.redhat.com/gerrit/#/c/99220/
[2] http://review.gluster.org/15916

regards,
Raghavendra
Comment 16 errata-xmlrpc 2018-09-04 02:29:55 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607
