Bug 1107649

Summary:	glusterd fails to spawn brick, nfs and self-heald processes
Product:	[Community] GlusterFS	Reporter:	Ravishankar N <ravishankar>
Component:	glusterd	Assignee:	krishnan parthasarathi <kparthas>
Status:	CLOSED WONTFIX	QA Contact:
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	mainline	CC:	alexeyzilber, bkolasinski, gluster-bugs, nsathyan
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:
Clones:	1112515		Environment:
Last Closed:	2014-07-14 10:30:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1112515

Description Ravishankar N 2014-06-10 11:48:01 UTC

Description of problem:
Problem reported in detail by Brent Kolasinski here:
http://supercolony.gluster.org/pipermail/gluster-users/2014-June/040541.html

Version-Release number of selected component (if applicable):
glusterfs-3.5 and master branch

How reproducible:
Always

Steps to Reproduce:
1. Create a 1x2 replica volume using 2 nodes
2. NFS mount the volume from a client
3. `pkill gluster` on node 2 (node 1 still serves the files from the volume)
4. `pkill gluster` on node 1
5. Restart glusterd on node 1
6. Try I/O from the NFS mount.

Actual results:
I/O fails because glusterd does not start the nfs and shd processes, and sometimes not the brick processes either, as seen in the output of `gluster volume status`.

Expected results:
glusterd should spawn the brick, nfs and shd processes when it is restarted.

Additional info:

Comment 1 Ravishankar N 2014-06-10 15:41:08 UTC

Issue:
----------------------------------
glusterd_friend_sm ()
{
        quorum_action = _gf_false;

        while (!list_empty (&gd_friend_sm_queue)) {
                /* ... process one queued friend event (elided) ... */
                quorum_action = _gf_true;
        }
        if (quorum_action)
                glusterd_spawn_daemons (...);  /* pseudocode: arguments elided */
}
----------------------------------

As long as node 2 is down, gd_friend_sm_queue stays empty, so the loop body never runs, quorum_action stays false, and glusterd_spawn_daemons is never called.

While discussing this with KP, I was given to understand that the code above was written this way intentionally, so that each glusterd does not start the glusterfsd processes until its friends are also up, running and in sync. We need to come up with a solution that covers the use case given in the bug description. A workaround is to run `gluster volume start <volname> force` on the node which is up.
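
The control flow in the pseudocode above can be modelled in a few lines of standalone C. This is only a toy sketch under stated assumptions: toy_glusterd, toy_friend_sm and toy_restart are hypothetical names invented for illustration, the queue is reduced to a counter, and none of it is taken from the glusterd sources. It only shows why a restart with no pending peer events never reaches the spawn step.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, not glusterd's real data
 * structures or function signatures. */
struct toy_glusterd {
    size_t queued_friend_events; /* stands in for gd_friend_sm_queue */
    bool   daemons_spawned;      /* set when the spawn step runs */
};

/* Mirrors the quoted control flow: the spawn step is reachable only if
 * at least one queued friend event was processed in this pass. */
void toy_friend_sm(struct toy_glusterd *gd)
{
    bool quorum_action = false;

    while (gd->queued_friend_events > 0) {
        gd->queued_friend_events--;  /* handle one friend event */
        quorum_action = true;
    }
    if (quorum_action)
        gd->daemons_spawned = true;  /* stands in for glusterd_spawn_daemons */
}

/* Runs one state-machine pass after a restart and reports whether the
 * daemons were spawned, given the number of pending peer events. */
bool toy_restart(size_t pending_peer_events)
{
    struct toy_glusterd gd = { pending_peer_events, false };

    toy_friend_sm(&gd);
    return gd.daemons_spawned;
}
```

In this model toy_restart(0) returns false, which corresponds to node 1 restarting while node 2 is still down, and toy_restart(1) returns true.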

Comment 2 Anand Avati 2014-06-11 10:28:40 UTC

REVIEW: http://review.gluster.org/8034 (glusterd: spawn daemons/processes when peer count less than 2) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 3 Alexey Zilber 2014-06-17 10:28:15 UTC

Please add a user-configurable timeout. In fact, that was what I was expecting, but after waiting for a long time I realized that is not how it works. I think a timeout value is a good compromise. Maybe something like 5 minutes as a default?
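
A rough sketch of what such a timeout check might look like, in standalone C. Everything here is an assumption made for illustration: toy_should_spawn and TOY_DEFAULT_PEER_WAIT_SECS are hypothetical names, no such option exists in glusterd as far as this report shows, and the sketch says nothing about whether the local node's data is current.

```c
#include <stdbool.h>

/* Hypothetical default: the 5 minutes suggested in this comment. */
#define TOY_DEFAULT_PEER_WAIT_SECS (5 * 60)

/* Decide whether to spawn the brick/nfs/shd processes.
 * peer_event_seen:  a friend event has been processed since startup
 * secs_since_start: seconds elapsed since glusterd started
 * timeout_secs:     user-configurable wait before giving up on peers */
bool toy_should_spawn(bool peer_event_seen, long secs_since_start,
                      long timeout_secs)
{
    if (peer_event_seen)
        return true;                        /* current behaviour: a peer showed up */
    return secs_since_start >= timeout_secs; /* proposed: stop waiting after the timeout */
}
```

With the default above, a node that has seen no peer events would keep waiting at 10 seconds and would spawn its processes once 300 seconds have elapsed.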

Comment 4 Ravishankar N 2014-07-14 10:30:06 UTC

Closing this, as there is currently no way of determining whether the one node that came up after both nodes went down is actually the pristine one.