Description of problem:
=======================
Created a sample Distributed volume using a one-node cluster, rebooted the node, and checked the volume status; the bricks were in offline state. The same issue was observed with an RHSC setup.

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.5-11

How reproducible:
=================
Reproducible multiple times.

Steps to Reproduce:
===================
Scenario-I:
1. Create a sample volume using a one-node cluster
2. Reboot the node
3. Check the volume status  // Bricks are in offline state
(see the command sketch below)

Scenario-II: [Using RHSC setup]
1. Have a two-node cluster (node-1 & node-2) with a Distributed-Replicate volume
2. Move one of the nodes to maintenance state  // say node-1
3. Check the volume status on the other, active node

Actual results:
===============
Scenario-I: Bricks are in offline state after the node reboot
Scenario-II: Local bricks of the active node show as offline

Expected results:
=================
In both scenarios above, the bricks should be running.

Additional info:
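For reference, a minimal command sketch of Scenario-I; the volume name (distvol) and brick path (/bricks/brick1) are only examples, not the ones from the original setup:

  # gluster volume create distvol <node-1>:/bricks/brick1/distvol
  # gluster volume start distvol
  # gluster volume status distvol     <-- bricks show "Y" under Online
  # reboot
  ... after the node comes back up ...
  # gluster volume status distvol     <-- bricks now show "N" under Online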
We actually debugged this issue on a setup where we identified that the moment glusterd started the brick(s), it immediately received a disconnect event from the brick. After that, rpc_reconnect() fails every 3 seconds because the underlying socket is already connected; this looks weird, but that is what the state of these layers indicates. During this, we also identified a place in the glusterd code where the rpc connection is not handled correctly (in terms of (un)setting the connected flag). I have posted a patch [1] upstream; however, that patch does not solve the entire problem, so the RCA is still unknown.

[1] http://review.gluster.org/#/c/12908/

Byreddy, as communicated, can you try your luck at getting a reproducer?
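To illustrate the kind of flag handling the upstream patch touches, here is a minimal, simplified sketch in C of an rpc notify callback that sets/clears a connected flag on CONNECT/DISCONNECT events. This is NOT the actual glusterd source; the struct and function names (brick_rpc_ctx_t, brick_rpc_notify) are hypothetical and only show the pattern.

/* Hypothetical sketch: keep a "connected" flag in sync with rpc
 * CONNECT/DISCONNECT notifications. Not the actual glusterd code. */

#include <stdbool.h>

typedef enum {
    RPC_EVENT_CONNECT,
    RPC_EVENT_DISCONNECT
} rpc_event_t;

typedef struct {
    bool connected;   /* must mirror the real socket state */
} brick_rpc_ctx_t;

/* Notify callback invoked by the rpc layer for connection events. */
static int
brick_rpc_notify (brick_rpc_ctx_t *ctx, rpc_event_t event)
{
    switch (event) {
    case RPC_EVENT_CONNECT:
        /* Mark the brick connected so "volume status" reports it online. */
        ctx->connected = true;
        break;
    case RPC_EVENT_DISCONNECT:
        /* If this flag is not cleared here, reconnect logic can keep
         * believing the socket is still connected and fail its retry
         * every few seconds, matching the behaviour described above. */
        ctx->connected = false;
        break;
    }
    return 0;
}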
I think this is now fixed through BZ 1385605.
BUILD: 3.8.4-35

After a node reboot, the bricks are online. In the RHSC setup, the bricks stay online when one of the nodes is in maintenance state, as expected.

Hence marking the bug as verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774