Description of problem:
Problem reported in detail by Brent Kolasinski here:
http://supercolony.gluster.org/pipermail/gluster-users/2014-June/040541.html

Version-Release number of selected component (if applicable):
glusterfs-3.5 and master branch

How reproducible:
Always

Steps to Reproduce:
1. Create a 1x2 replica volume using 2 nodes
2. NFS mount the volume from a client
3. `pkill gluster` on node 2 (node 1 still serves the files from the volume)
4. `pkill gluster` on node 1
5. Restart glusterd on node 1
6. Try I/O from the NFS mount

Actual results:
I/O fails because glusterd fails to start nfs, shd and sometimes the brick processes, as seen from `gluster volume status`.

Expected results:
glusterd should spawn them.

Additional info:
Issue:
----------------------------------
glusterd_friend_sm ()
{
        quorum_action = _gf_false;
        while (!list_empty (&gd_friend_sm_queue)) {
                /* process friend state-machine events */
                quorum_action = _gf_true;
        }
        if (quorum_action)
                glusterd_spawn_daemons ();
}
----------------------------------
As long as node 2 is down, gd_friend_sm_queue stays empty, and hence glusterd_spawn_daemons never gets called.

While discussing with KP, I was given to understand that the above code was intentionally written so that each glusterd does not start the glusterfsd processes until its friends are also up and running and in sync. We need to come up with a solution which covers the use case given in the bug description.

A workaround is to run `gluster volume start <volname> force` on the node which is up.
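For reference, a minimal sketch of the direction the review below takes: spawn the daemons unconditionally when the peer count is less than 2, since a glusterd with no peers has no friends to sync with and would otherwise wait forever. This is a standalone illustration, not the actual glusterd code; the variables and the helper structure are hypothetical stand-ins.

----------------------------------
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for glusterd internals; names are illustrative. */
static int  peer_count         = 1;    /* this node only; friend is down */
static bool friend_queue_empty = true; /* no friend events to process    */

static void glusterd_spawn_daemons (void)
{
        printf ("spawning nfs, shd and brick processes\n");
}

/* Sketch of the amended state-machine exit path: with fewer than two
 * peers there is nobody to sync with, so spawn the daemons anyway
 * instead of waiting on friend events that will never arrive. */
static void glusterd_friend_sm_sketch (void)
{
        bool quorum_action = false;

        while (!friend_queue_empty) {
                /* process friend events (elided) */
                quorum_action = true;
                friend_queue_empty = true;
        }

        if (quorum_action || peer_count < 2)
                glusterd_spawn_daemons ();
}

int main (void)
{
        glusterd_friend_sm_sketch ();
        return 0;
}
----------------------------------

With only this node in the cluster, the second condition fires and the daemons come up even though the friend queue never produced an event.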
REVIEW: http://review.gluster.org/8034 (glusterd: spawn daemons/processes when peer count less than 2) posted (#1) for review on master by Ravishankar N (ravishankar)
Have a user-configurable timeout. In fact, that is what I was expecting; only after waiting a long time did I realize that wasn't how it worked. I think a timeout value is a good compromise. Maybe something like 5 minutes as a default?
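One way such a timeout could look, sketched in plain C rather than against glusterd's actual timer API. The names, the 5-minute default, and the polling structure are all assumptions for illustration: start a clock when glusterd comes up, and if no friend has connected by the deadline, spawn the daemons anyway.

----------------------------------
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical knob: how long to wait for friends before giving up.
 * A user-configurable volume/glusterd option could override this. */
#define FRIEND_WAIT_TIMEOUT_SECS (5 * 60)

static time_t startup_time;

static void glusterd_spawn_daemons (void)
{
        printf ("spawning nfs, shd and brick processes\n");
}

/* Called periodically, e.g. from an event-loop tick.
 * Returns true once the daemons have been spawned. */
static bool maybe_spawn_on_timeout (bool any_friend_connected)
{
        if (any_friend_connected)
                return false; /* the normal friend-sm path handles it */

        if (time (NULL) - startup_time >= FRIEND_WAIT_TIMEOUT_SECS) {
                glusterd_spawn_daemons ();
                return true;
        }
        return false;
}

int main (void)
{
        startup_time = time (NULL);
        /* ... event loop would call maybe_spawn_on_timeout() each tick ... */
        return 0;
}
----------------------------------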
Closing this, as currently there is no way of determining whether the one node that came up after both nodes went down is the pristine one after all.