Bug 1417097

Summary:	glusterd: shared storage volume didn't get mounted after node reboot.
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Anil Shah <ashah>
Component:	glusterd	Assignee:	Sunny Kumar <sunkumar>
Status:	CLOSED WONTFIX	QA Contact:	Bala Konda Reddy M <bmekala>
Severity:	urgent	Docs Contact:
Priority:	medium
Version:	rhgs-3.2	CC:	amukherj, asriram, atumball, bmohanra, rcyriac, rhs-bugs, storage-qa-internal, vbellur
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:	shared-storage
Fixed In Version:		Doc Type:	Known Issue
Doc Text:	glusterd takes time to initialize if the setup is slow. As a result, by the time /etc/fstab entries are mounted, glusterd on the node is not ready to serve that mount, and the glusterd mount fails. Due to this, shared storage may not get mounted after node reboots. Workaround: If shared storage is not mounted after the node reboots, check if glusterd is up and mount the shared storage volume manually.	Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-30 16:25:16 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Anil Shah 2017-01-27 08:49:24 UTC

Description of problem:

After a node reboot, the shared storage brick was not mounted.

Note: Snapshots were scheduled using the scheduler.
There were 214 snapshots present in the system, of which 100 were activated.


Version-Release number of selected component (if applicable):

glusterfs-3.8.4-12.el7rhgs.x86_64

How reproducible:

2/2

Steps to Reproduce:
1. Created a 3x2 distributed-replicated volume.
2. Enabled shared storage.
3. Scheduled snapshots using the scheduler.
4. Restarted one of the server nodes.
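
For reference, the first three steps correspond roughly to the commands below, sketched as a Python helper that only builds the command lines and executes nothing. The host names, brick path, volume name, job name, and cron schedule are hypothetical placeholders; the report does not state the values actually used.

```python
import shlex
from typing import List

# Hypothetical placeholders; the report does not give the real values.
HOSTS = ["server1", "server2", "server3", "server4", "server5", "server6"]
BRICK_PATH = "/bricks/brick1/testvol"
VOLUME = "testvol"


def reproduction_commands() -> List[List[str]]:
    """Build the command lines for steps 1-3 (nothing is executed here)."""
    bricks = [f"{host}:{BRICK_PATH}" for host in HOSTS]
    return [
        # Step 1: six bricks with replica 2 give a 3x2 distributed-replicated volume.
        ["gluster", "volume", "create", VOLUME, "replica", "2", *bricks],
        ["gluster", "volume", "start", VOLUME],
        # Step 2: enable shared storage cluster-wide.
        ["gluster", "volume", "set", "all", "cluster.enable-shared-storage", "enable"],
        # Step 3: initialise and enable the snapshot scheduler, then add a job
        # (job name and schedule are made up for illustration).
        ["snap_scheduler.py", "init"],
        ["snap_scheduler.py", "enable"],
        ["snap_scheduler.py", "add", "job1", "*/30 * * * *", VOLUME],
    ]


for command in reproduction_commands():
    # shlex.join quotes arguments such as the cron schedule for safe copy-paste.
    print(shlex.join(command))
```

Step 4 is a plain reboot of one server node and is left out of the sketch.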

Actual results:

After reboot, the shared storage brick was not mounted.


Expected results:

The shared storage brick should be mounted after node reboot.

Additional info:

Comment 3 Atin Mukherjee 2017-01-27 10:53:06 UTC

Although this issue is consistently reproducible on one particular node of this setup, I have some details about it. Shared storage didn't mount automatically because glusterd had not come up by the time the mount request was sent. To understand why glusterd takes ~4-5 minutes to initialize every time on this particular node, here is what I did:

1. restart glusterd and tail /var/log/glusterd.log

After a few seconds, the tail output paused after dumping the following:

[2017-01-27 10:38:01.591263] D [MSGID: 0] [glusterd-locks.c:446:glusterd_multiple_mgmt_v3_unlock] 0-management: Returning 0
[2017-01-27 10:38:01.591309] D [MSGID: 0] [glusterd-mgmt-handler.c:789:glusterd_mgmt_v3_unlock_send_resp] 0-management: Responded to mgmt_v3 unlock, ret: 0

and then resumed logging with

[2017-01-27 10:41:48.171773] D [logging.c:1829:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present

So it seems that logging was stuck for 3 minutes 47 seconds before the log timer timed out, which suggests to me that something is wrong in the underlying file system.
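
The length of the stall can be checked directly from the two log timestamps quoted above. A minimal sketch in Python, assuming the timestamp format shown in those log lines; the helper name `log_gap` is illustrative and not part of any gluster tooling:

```python
from datetime import datetime, timedelta

# Timestamp format used at the start of each glusterd log line quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def log_gap(before: str, after: str) -> timedelta:
    """Return the time elapsed between two glusterd log timestamps."""
    start = datetime.strptime(before, LOG_TS_FORMAT)
    end = datetime.strptime(after, LOG_TS_FORMAT)
    return end - start


# Last line before the pause and first line after it, from the log excerpt.
gap = log_gap("2017-01-27 10:38:01.591263", "2017-01-27 10:41:48.171773")
print(gap)  # 0:03:46.580510
```

That is about 3 minutes 47 seconds with no log output at all.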

I have a couple of questions for you here:

1. Was there any xfs related issue observed in the same node?
2. Is this issue seen on a different setup?

If the answer to 1 is yes and the answer to 2 is no, then I am inclined to close this issue as not a bug.

Comment 4 Atin Mukherjee 2017-01-27 11:51:21 UTC

To add to the above, when the glusterd process was attached to gdb during the interval of 3 mins 40 secs mentioned in the above comment, I didn't see any evidence of threads being stuck or processing any events.

Comment 6 Avra Sengupta 2017-01-30 09:00:35 UTC

The problem with doing so isn't in the implementation, but in the user behaviour. A user unmounting shared storage is outside glusterd's purview and scope. In such a situation we should not be remounting shared storage, because the user has explicitly unmounted it.

To implement such a move would mean adding unnecessary complexity to glusterd, and confusion for the user.

Comment 7 Atin Mukherjee 2017-01-30 09:11:21 UTC

(In reply to Avra Sengupta from comment #6)
> The problem with doing so isn't in the implementation, but in the user
> behaviour. A user unmounting shared storage is outside glusterd's purview
> and scope. In such a situation we should not be remounting shared storage,
> because the user has explicitly unmounted it.
> 
> To implement such a move would mean adding unnecessary complexity to
> glusterd, and confusion for the user.

I didn't mean that GlusterD has to remount the shared storage; what I am looking for is a dependency chain in which the mount attempt is made *only* once GlusterD has finished its initialization and an active pid is available.
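
One way to picture such a dependency is a small wait-then-mount helper, sketched below in Python. This is not something glusterd ships. It assumes glusterd runs as the systemd unit `glusterd`, and that the shared storage volume is `gluster_shared_storage` mounted at `/var/run/gluster/shared_storage`; the function names, timeout, and polling interval are hypothetical. Treating "unit is active" as "glusterd has finished initializing" is also an approximation.

```python
import subprocess
import time
from typing import Callable, Sequence

# Assumed volume and mount point; adjust to the actual setup.
SHARED_VOLUME = "localhost:/gluster_shared_storage"
MOUNT_POINT = "/var/run/gluster/shared_storage"

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(cmd, check=False).returncode


def wait_for_glusterd(run: Runner = run_command, timeout: float = 600.0,
                      interval: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the glusterd unit is active or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # `systemctl is-active --quiet UNIT` exits 0 only when the unit is active.
        if run(["systemctl", "is-active", "--quiet", "glusterd"]) == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def mount_shared_storage(run: Runner = run_command, **wait_args) -> bool:
    """Mount shared storage only after glusterd is up; skip if already mounted."""
    if run(["mountpoint", "-q", MOUNT_POINT]) == 0:
        return True  # already mounted, nothing to do
    if not wait_for_glusterd(run, **wait_args):
        return False  # glusterd never came up within the timeout
    return run(["mount", "-t", "glusterfs", SHARED_VOLUME, MOUNT_POINT]) == 0
```

Run as root on the rebooted node, `mount_shared_storage()` returns `True` once the volume is mounted and `False` if glusterd does not become active within the timeout. The same ordering could instead be expressed in the mount configuration itself; the helper only makes the intended sequence explicit.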

Comment 10 Bhavana 2017-03-13 16:15:24 UTC

Updated the doc text slightly for the release notes.