Bug 1409102

Summary: [Arbiter] IO Failure and mount point inaccessible after killing a brick
Product: Red Hat Gluster Storage
Component: rpc
Version: rhgs-3.2
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Karan Sandha <ksandha>
Assignee: Milind Changire <mchangir>
QA Contact: Karan Sandha <ksandha>
CC: amukherj, ksandha, mchangir, rcyriac, rgowdapp, rhs-bugs, sheggodu, smali
Keywords: ZStream
Whiteboard: rebase
Target Release: RHGS 3.4.0
Fixed In Version: glusterfs-3.12.2-1
Type: Bug
Bug Blocks: 1503134
Last Closed: 2018-09-04 06:29:55 UTC

Description Karan Sandha 2016-12-29 14:06:04 UTC
Description of problem:
I/O hung and the mount point became inaccessible after killing and then restarting a brick. Judging from the logs, this bug looks quite similar to bug 1385605.

Version-Release number of selected component (if applicable):
3.8.4-10
Logs are placed at:
rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

How reproducible:
Tried once

Steps to Reproduce:
1. Create a 3 x (2+1) arbiter volume.
2. Mount the volume over both the gNFS and FUSE protocols.
3. Create small files on the gNFS mount using the small-file tool (multi-client).
4. Kill one brick process, then start a small-file cleanup.
5. Force-start the volume to bring the killed brick back online.
6. Start a large-file workload with the FIO tool.
7. Trigger heal info on a server node (see the shell sketch after this list).
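
For reference, a minimal shell sketch of the steps above. The volume name, server names, brick paths, and mount points are hypothetical, and the smallfile/fio invocations are only placeholders for the actual workload commands:

    # Hypothetical names/paths; adjust to the actual test bed.
    gluster volume create testvol replica 3 arbiter 1 \
        server{1..3}:/bricks/b1 server{1..3}:/bricks/b2 server{1..3}:/bricks/b3
    gluster volume start testvol

    # Mount over FUSE and over gNFS (NFSv3).
    mount -t glusterfs server1:/testvol /mnt/fuse
    mount -t nfs -o vers=3 server1:/testvol /mnt/gnfs

    # ... run the multi-client smallfile create workload against /mnt/gnfs ...

    # Kill one brick process (pid taken from 'gluster volume status testvol'),
    # then start the small-file cleanup workload.
    kill -9 <brick-pid>

    # Bring the killed brick back up and start the large-file fio workload.
    gluster volume start testvol force
    # ... run fio against the mount ...

    # Trigger heal info on a server node.
    gluster volume heal testvol info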

Actual results:
Heal info hung
Mount point not accessible
IO tool reports I/O error

Expected results:
I/O should run smoothly
No errors should be reported

Additional info:

Comment 3 Raghavendra G 2017-01-02 05:12:28 UTC
My gut feeling is that it's the same as bug [1]. [1] was hit when protocol/client received events in the order,

CONNECT
DISCONNECT
DISCONNECT
CONNECT

However, in this bz I think protocol/client received events in the order,

DISCONNECT
CONNECT
CONNECT
DISCONNECT

Though we need to think about whether such an ordering is possible (there can be only one outstanding event per socket due to EPOLL_ONESHOT, but the events can be on different sockets, since transport/socket uses a new socket for every new connection). Another point to note: [2] fixes [1] by making the following two steps atomic in rpc-client:

1. setting priv->connected = 0
2. notifying higher layers of a DISCONNECT event

However, if there are indeed racing events, could a CONNECT and a DISCONNECT race between transport/socket and rpc-client and change the order? Is that possible? Something to ponder.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1385605
[2] http://review.gluster.org/15916
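
To make the atomicity point concrete, here is a minimal, hypothetical C sketch (not the actual rpc-clnt code; the structure and function names are made up) of why updating the connected flag and delivering the notification under one lock keeps the pair from being interleaved with a racing event on another socket:

    /* Hypothetical simplification; not gluster's real rpc-clnt structures. */
    #include <pthread.h>

    struct clnt_conn {
        pthread_mutex_t lock;
        int             connected;
        void          (*notify)(int event);  /* delivers CONNECT/DISCONNECT upward */
    };

    enum { EV_CONNECT = 1, EV_DISCONNECT = 2 };

    /* If the flag update and the notification were not done under one lock,
     * a CONNECT arriving on a new socket could be observed between them, and
     * higher layers would see CONNECT followed by a stale DISCONNECT,
     * marking a healthy connection as down.  Holding the lock across both
     * steps makes the pair atomic with respect to events on other sockets. */
    void
    conn_disconnect_cbk(struct clnt_conn *conn)
    {
        pthread_mutex_lock(&conn->lock);
        conn->connected = 0;           /* step 1: mark the transport down         */
        conn->notify(EV_DISCONNECT);   /* step 2: notify upward, still under lock */
        pthread_mutex_unlock(&conn->lock);
    }

    void
    conn_connect_cbk(struct clnt_conn *conn)
    {
        pthread_mutex_lock(&conn->lock);
        conn->connected = 1;
        conn->notify(EV_CONNECT);
        pthread_mutex_unlock(&conn->lock);
    }

Whether a similar window exists one layer up, between transport/socket and rpc-client, is the open question raised above.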

Comment 6 Karan Sandha 2017-01-03 12:01:59 UTC
rjosoph,

This is only intermittently reproducible, but when the issue is hit it leaves the whole system in a hung state. I have the statedump taken at the time the issue was hit; it is placed at the same location. pstack output was not taken.

Thanks & regards
Karan Sandha

Comment 10 Raghavendra G 2017-09-01 09:29:58 UTC
Patches [1][2] are merged in rhgs-3.3.0. Should we close this bug as fixed?

[1] https://code.engineering.redhat.com/gerrit/#/c/99220/
[2] http://review.gluster.org/15916

regards,
Raghavendra

Comment 16 errata-xmlrpc 2018-09-04 06:29:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607