Description of problem:
After a node reboot, the shared storage brick was not mounted.
Note: Snapshots were scheduled using the scheduler.
There were 214 snapshots present in the system, out of which 100 were activated.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Created 3*2 distributed replicated volume
2. Enabled shared storage
3. Scheduled snapshot using scheduler
4. Restarted one of the server nodes
After reboot, the shared storage brick was not mounted.
The shared storage brick should be mounted after a node reboot.
This issue is consistently reproducible on one particular node of this setup, and I have some details about it. Shared storage wasn't mounted automatically because glusterd hadn't come up by the time the mount request was sent. To understand why glusterd takes ~4-5 minutes to initialize every time on this particular node, here is what I did:
1. Restarted glusterd and tailed /var/log/glusterd.log
After a few seconds, the tail output paused after dumping the following:
[2017-01-27 10:38:01.591263] D [MSGID: 0] [glusterd-locks.c:446:glusterd_multiple_mgmt_v3_unlock] 0-management: Returning 0
[2017-01-27 10:38:01.591309] D [MSGID: 0] [glusterd-mgmt-handler.c:789:glusterd_mgmt_v3_unlock_send_resp] 0-management: Responded to mgmt_v3 unlock, ret: 0
and then resumed logging with
[2017-01-27 10:41:48.171773] D [logging.c:1829:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
So it seems that logging was stuck for 3 minutes 47 seconds until the log timer timed out, which suggests that something is wrong in the underlying file system.
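For reference, the pause can be computed directly from the log timestamps rather than read off by eye. The following is a minimal Python sketch, not part of glusterd or any gluster tooling; the helper names parse_ts and largest_gap are made up for illustration, and it assumes every log line starts with a bracketed timestamp in the format shown above.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading [YYYY-MM-DD HH:MM:SS.ffffff] timestamp of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")


def largest_gap(lines):
    """Return (gap, earlier_line, later_line) for the longest pause
    between two consecutive timestamped lines, or None if fewer than two."""
    stamped = [(parse_ts(l), l) for l in lines if l.startswith("[")]
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best


# The three log lines quoted in this comment.
sample = [
    "[2017-01-27 10:38:01.591263] D [MSGID: 0] [glusterd-locks.c:446:glusterd_multiple_mgmt_v3_unlock] 0-management: Returning 0",
    "[2017-01-27 10:38:01.591309] D [MSGID: 0] [glusterd-mgmt-handler.c:789:glusterd_mgmt_v3_unlock_send_resp] 0-management: Responded to mgmt_v3 unlock, ret: 0",
    "[2017-01-27 10:41:48.171773] D [logging.c:1829:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present",
]

gap, before, after = largest_gap(sample)
print(gap)  # 0:03:46.580464
```

The result, about 226.6 seconds, matches the 3 minutes 47 seconds quoted above. The same function could be run over a whole log file to find other stalls.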
I have a couple of questions for you here:
1. Was there any xfs related issue observed in the same node?
2. Is this issue seen on a different setup?
If the answer to 1 is yes and to 2 is no, then I am inclined to close this issue as not a bug.
To add to the above: when the glusterd process was attached to gdb during the interval of about 3 mins 40 secs mentioned in the above comment, I saw no evidence of threads being stuck or processing any events.
The problem with doing so isn't in the implementation, but the user behaviour. A user unmounting shared storage is outside glusterd's purview and scope. In such a situation we should not be remounting shared storage, because the user has explicitly unmounted it.
To implement such a move would mean adding unnecessary complexity to glusterd, and confusion for the user.
(In reply to Avra Sengupta from comment #6)
> The problem with doing so isn't in the implementation, but the user
> behaviour. A user unmounting shared storage, is outside glusterd's purview
> and scope. In such a situation we should not be remounting shared storage,
> because the user has explicitly unmounted it.
> To implement such a move would mean adding unnecessary complexity to
> glusterd, and confusion for the user.
I didn't mean that GlusterD has to remount the shared storage. What I am looking for is a dependency chain where the mount attempt will *only* be made once GlusterD has finished its initialization and an active pid is available.
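To illustrate the kind of ordering meant here, below is a rough Python sketch of a wrapper that delays the mount until a glusterd pid is live. It is only an illustration of the idea, not a proposed fix: the function names are hypothetical, the pidfile path and mount point in the usage comment are assumptions about this setup, and a live pid is only a proxy for "initialization finished". In practice the init system's own dependency ordering would be the more natural place for this.

```python
import os
import subprocess
import time


def pid_is_alive(pidfile):
    """Return True if pidfile names a process that currently exists."""
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks the pid
    except PermissionError:
        return True  # process exists but belongs to another user
    except (OSError, ValueError):
        return False  # no pidfile, bad contents, or no such process
    return True


def wait_for_pid(pidfile, timeout=600.0, interval=2.0):
    """Poll until the pidfile points at a live process or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if pid_is_alive(pidfile):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def mount_when_ready(pidfile, mount_cmd, timeout=600.0):
    """Run mount_cmd only after the daemon's pid is live."""
    if not wait_for_pid(pidfile, timeout):
        raise TimeoutError(f"no live pid in {pidfile} after {timeout}s")
    subprocess.run(mount_cmd, check=True)


# Hypothetical usage; both paths are assumptions about this setup:
# mount_when_ready("/var/run/glusterd.pid",
#                  ["mount", "/run/gluster/shared_storage"])
```

The default timeout of 600 seconds is chosen only so that it exceeds the ~4-5 minute initialization seen on this node.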
Updated the doc text slightly for the release notes.