Description of problem:
The I/Os hung and the mount point became inaccessible after killing and then starting a brick. Going by the logs, this bug is quite similar to bug 1385605.

Version-Release number of selected component (if applicable):
3.8.4-10

Logs are placed at rhsqe-repo.lab.eng.blr.redhat.com:/var/www/html/sosreports/<bug>

How reproducible:
Tried once

Steps to Reproduce:
1. Create a 3 x (2+1) arbiter volume.
2. Mount the volume over both the gNFS and FUSE protocols.
3. Create small files on the gNFS mount using the small-file tool (multi-client).
4. Kill a brick, then start a small-file cleanup.
5. Force start the volume to bring the killed brick back up.
6. Start large-file I/O using the FIO tool.
7. Trigger heal info on the server.

Actual results:
Heal info hung.
Mount point not accessible.
The I/O tool reports an I/O error.

Expected results:
I/O should run smoothly and no errors should be reported.

Additional info:
My gut feeling is that this is the same as bug [1]. [1] was hit when protocol/client received events in the order:

CONNECT, DISCONNECT, DISCONNECT, CONNECT

However, in this bz I think protocol/client received events in the order:

DISCONNECT, CONNECT, CONNECT, DISCONNECT

Though we need to think about whether such an ordering is possible (there can be only one outstanding event per socket due to EPOLL_ONESHOT, but the events can be on different sockets, since transport/socket uses a new socket for every new connection). Another point to note is that [2] fixes [1] by making the following atomic in rpc-client:
1. setting priv->connected=0
2. notifying higher layers of a DISCONNECT event

However, if there are indeed racing events, what about a CONNECT and a DISCONNECT racing between transport/socket and rpc-client and changing the order? Is that possible? Something to ponder.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1385605
[2] http://review.gluster.org/15916
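To make the hypothesized ordering concrete, here is a minimal sketch (hypothetical names, not gluster's actual code) of a client whose connection state follows the last event delivered, with the state change and the notification to higher layers applied as one atomic step, as [2] does in rpc-client. Replaying the event order suspected in this bug shows why the client ends up looking disconnected even though the brick came back up:

```python
import threading

CONNECT, DISCONNECT = "CONNECT", "DISCONNECT"

class RpcClient:
    """Toy model of an rpc-client tracking connection state.

    A sketch of the atomicity idea from [2]; all names here are
    illustrative, not GlusterFS's real data structures.
    """
    def __init__(self):
        self.lock = threading.Lock()
        self.connected = False
        self.notifications = []  # events forwarded to higher layers

    def handle_event(self, event):
        # Update the flag and notify higher layers as one atomic step,
        # so a racing event on another socket cannot interleave between
        # the state change and the notification.
        with self.lock:
            self.connected = (event == CONNECT)
            self.notifications.append(event)

def replay(events):
    """Deliver a sequence of events to a fresh client, in order."""
    client = RpcClient()
    for ev in events:
        client.handle_event(ev)
    return client

# The ordering hypothesized in this bug: events from two different
# sockets delivered as DISCONNECT, CONNECT, CONNECT, DISCONNECT.
client = replay([DISCONNECT, CONNECT, CONNECT, DISCONNECT])
print(client.connected)  # False: the stale trailing DISCONNECT wins
```

The point of the sketch: even with per-event atomicity, a stale DISCONNECT delivered last leaves `connected` false while the brick is actually up, which is consistent with heal info and I/O hanging. Atomicity protects the flag/notification pair; it does not by itself fix cross-socket reordering.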
rjosoph,
This is very intermittently reproducible, but when the issue is hit it leaves the whole system in a hung state. I have the statedump taken at the time the issue was hit; it is placed at the same location. pstack output was not taken.

Thanks & regards,
Karan Sandha
Patches [1][2] are merged in rhgs-3.3.0. Should we close this bug as fixed?

[1] https://code.engineering.redhat.com/gerrit/#/c/99220/
[2] http://review.gluster.org/15916

regards,
Raghavendra
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607