Bug 1452527 - Shared volume doesn't get mounted on a few nodes after rebooting all nodes in the cluster.
Summary: Shared volume doesn't get mounted on a few nodes after rebooting all nodes in the cluster.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: scripts
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jiffin
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1335090 1451981
 
Reported: 2017-05-19 07:11 UTC by Jiffin
Modified: 2017-09-05 17:30 UTC
CC List: 13 users

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1335090
Environment:
Last Closed: 2017-09-05 17:30:44 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Jiffin 2017-05-19 07:11:07 UTC
+++ This bug was initially created as a clone of Bug #1335090 +++

Description of problem:

The shared volume doesn't get mounted on one (sometimes two) of the nodes after rebooting all nodes in the cluster, resulting in a missing symlink (/var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs).

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Create a 4 node ganesha cluster.
2. Make sure the shared volume is created and mounted on all the nodes of the cluster, and that the symlink is created, as shown below.

[root@dhcp42-20 ~]# gluster volume status gluster_shared_storage
Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp42-239.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49155     0          Y       2293 
Brick dhcp43-175.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49155     0          Y       2281 
Brick dhcp42-20.lab.eng.blr.redhat.com:/var
/lib/glusterd/ss_brick                      49155     0          Y       2266 
Self-heal Daemon on localhost               N/A       N/A        Y       2257 
Self-heal Daemon on dhcp42-239.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2287 
Self-heal Daemon on dhcp43-175.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2253 
Self-heal Daemon on dhcp42-196.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2258 
 
Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928 1697152  26043776   7% /run/gluster/shared_storage

dhcp42-239.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928 1697152  26043776   7% /run/gluster/shared_storage

dhcp43-175.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928 1697152  26043776   7% /run/gluster/shared_storage

dhcp42-196.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928 1697152  26043776   7% /run/gluster/shared_storage

[root@dhcp42-20 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 80 May 11 21:26 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-20.lab.eng.blr.redhat.com/nfs

[root@dhcp42-239 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs

[root@dhcp43-175 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp43-175.lab.eng.blr.redhat.com/nfs

[root@dhcp42-196 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:19 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-196.lab.eng.blr.redhat.com/nfs

3. Reboot all the nodes of the cluster.
4. Observe that on 2 of the 4 nodes the shared storage is not mounted (most of the time it fails to mount on just one node).
5. As a result, the /var/lib/nfs symlink does not get created on these 2 nodes.
6. Both of these nodes have the corresponding entries in /etc/fstab, and manually mounting the shared storage on them works (a typical entry and the manual mount command are shown below).
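
For reference, the /etc/fstab entry for the shared storage and the manual mount that works once the bricks are up look roughly like the following. This is a hedged example based on the mount points seen above; the exact server hostname and mount options on a given node may differ:

    # /etc/fstab entry for the shared storage volume (illustrative; options may differ)
    dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage  /run/gluster/shared_storage  glusterfs  defaults  0 0

    # Manual mount that succeeds once the brick processes are back up
    mount -t glusterfs dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage /run/gluster/shared_storage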


Actual results:

Shared volume doesn't get mounted on a few nodes after rebooting all nodes in the cluster.

Expected results:

Shared volume should get mounted on all the nodes after a reboot.

Additional info:


--- Additional comment from Soumya Koduri on 2016-05-11 07:38:20 EDT ---

I see the below errors in the node4 logs:

[2016-05-11 15:56:04.984079] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gluster_shared_storage-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-05-11 15:56:04.984357] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-1: disconnected from gluster_shared_storage-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:04.984374] W [MSGID: 108001] [afr-common.c:4210:afr_notify] 0-gluster_shared_storage-replicate-0: Client-quorum is not met
[2016-05-11 15:56:05.291773] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gluster_shared_storage-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-05-11 15:56:05.292104] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-2: disconnected from gluster_shared_storage-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:05.292165] E [MSGID: 108006] [afr-common.c:4152:afr_notify] 0-gluster_shared_storage-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-05-11 15:56:05.295895] I [fuse-bridge.c:5166:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-05-11 15:56:05.296679] I [fuse-bridge.c:4077:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-05-11 15:56:05.296828] I [MSGID: 108006] [afr-common.c:4261:afr_local_init] 0-gluster_shared_storage-replicate-0: no subvolumes up
[2016-05-11 15:56:05.297606] E [dht-helper.c:1602:dht_inode_ctx_time_update] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca) [0x7fbcaef8ad6a] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379) [0x7fbcaecf51d9] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210) [0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode [Invalid argument]
[2016-05-11 15:56:05.298786] E [dht-helper.c:1602:dht_inode_ctx_time_update] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca) [0x7fbcaef8ad6a] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379) [0x7fbcaecf51d9] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210) [0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode [Invalid argument]
[2016-05-11 15:56:05.298818] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2016-05-11 15:56:05.305894] E [dht-helper.c:1602:dht_inode_ctx_time_update] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca) [0x7fbcaef8ad6a] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379) [0x7fbcaecf51d9] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210) [0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode [Invalid argument]
[2016-05-11 15:56:05.307751] I [fuse-bridge.c:5007:fuse_thread_proc] 0-fuse: unmounting /run/gluster/shared_storage

Since this seems to be an issue with the gluster_shared_storage mount being lost, I am adjusting the component accordingly and requesting Avra to take a look.

--- Additional comment from Avra Sengupta on 2016-05-13 01:30:09 EDT ---

This is expected behaviour. We need to understand that the shared volume itself is hosted on these nodes, and all nodes mount it via one particular node (the one named in /etc/fstab). When all nodes are down, the shared storage volume is essentially down as well. When the nodes come back up, none of them can connect to the shared storage until the node whose entry is mentioned in /etc/fstab is up and serving the volume. That node itself will never mount the shared storage on reboot, because by the time its /etc/fstab entry is replayed, the volume is not yet being served.
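
In other words, the boot-time replay of the /etc/fstab entry races with glusterd bringing the shared-storage bricks online, so the mount attempt fails once at boot and is not retried. The state after boot can be confirmed with standard commands (nothing here is specific to any fix):

    # Is the shared storage mounted on this node?
    grep /run/gluster/shared_storage /proc/mounts || echo "shared storage not mounted"

    # Are the shared-storage brick processes up yet?
    gluster volume status gluster_shared_storage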

Comment 1 Worker Ant 2017-05-19 07:11:38 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#1) for review on master by jiffin tony Thottan (jthottan)

Comment 2 Worker Ant 2017-05-24 12:22:48 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#2) for review on master by jiffin tony Thottan (jthottan)

Comment 3 Worker Ant 2017-05-24 12:40:38 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#3) for review on master by jiffin tony Thottan (jthottan)

Comment 4 Worker Ant 2017-05-24 13:11:32 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#4) for review on master by jiffin tony Thottan (jthottan)

Comment 5 Worker Ant 2017-05-24 16:06:19 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#5) for review on master by jiffin tony Thottan (jthottan)

Comment 6 Worker Ant 2017-05-24 17:11:17 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#6) for review on master by jiffin tony Thottan (jthottan)

Comment 7 Worker Ant 2017-06-16 13:22:16 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#7) for review on master by jiffin tony Thottan (jthottan)

Comment 8 Worker Ant 2017-06-19 13:32:43 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#8) for review on master by jiffin tony Thottan (jthottan)

Comment 9 Worker Ant 2017-06-19 19:05:19 UTC
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#9) for review on master by jiffin tony Thottan (jthottan)

Comment 10 Worker Ant 2017-06-20 12:42:05 UTC
COMMIT: https://review.gluster.org/17339 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit 3183ca1bdee9cb0af22c017e3c610add8ff2b405
Author: Hendrik Visage <hvjunk>
Date:   Fri May 19 12:21:37 2017 +0530

    scripts/shared_storage : systemd helper scripts to mount shared storage post reboot
    
    Reported-by: Hendrik Visage <hvjunk>
    Change-Id: Ibcff56b00f45c8af54c1ae04974267c2180f5f63
    BUG: 1452527
    Signed-off-by: Jiffin Tony Thottan <jthottan>
    Reviewed-on: https://review.gluster.org/17339
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
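
The actual change is at the review link above. As a rough sketch of the approach it describes (a systemd service, ordered after glusterd, that keeps retrying the shared-storage mount instead of relying on the one-shot /etc/fstab replay), something along the following lines would do; the unit name, helper path, and retry counts here are illustrative, not the literal contents of the patch:

    # glusterfssharedstorage.service -- illustrative sketch, not the patched unit
    [Unit]
    Description=Mount glusterfs shared storage after reboot
    Requires=glusterd.service
    After=glusterd.service

    [Service]
    Type=oneshot
    ExecStart=/usr/libexec/glusterfs/mount-shared-storage.sh

    [Install]
    WantedBy=multi-user.target

    # --- mount-shared-storage.sh (illustrative helper path) ---
    #!/bin/bash
    # Retry the fstab mount until the shared-storage volume is actually being served.
    for attempt in $(seq 1 20); do
        mountpoint -q /run/gluster/shared_storage && exit 0
        mount /run/gluster/shared_storage && exit 0
        sleep 3
    done
    exit 1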

Comment 11 Worker Ant 2017-06-30 11:56:00 UTC
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#1) for review on master by jiffin tony Thottan (jthottan)

Comment 12 Worker Ant 2017-06-30 12:57:48 UTC
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#2) for review on master by jiffin tony Thottan (jthottan)

Comment 13 Worker Ant 2017-07-03 12:19:23 UTC
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#3) for review on master by jiffin tony Thottan (jthottan)

Comment 14 Worker Ant 2017-07-12 13:18:15 UTC
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#4) for review on master by jiffin tony Thottan (jthottan)

Comment 15 Worker Ant 2017-07-17 11:41:04 UTC
COMMIT: https://review.gluster.org/17658 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit 4c410a46ef58512ba751db8750910a6d09ec3696
Author: Jiffin Tony Thottan <jthottan>
Date:   Fri Jun 30 17:11:46 2017 +0530

    systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage
    
    Currently the script used by glusterfssharedstorage has a dependency on
    var-run-gluster-shared_storage. But this unit will be present only if the
    node has rebooted. Also, in the reboot scenario, there is a chance that this
    service can be executed before var-run-gluster-shared_storage is created.
    In that case glusterfssharedstorage will succeed even without mounting
    the shared storage.
    
    Also, the type of glusterfssharedstorage is changed to "forking" so that it
    stays active (instead of dead) after a successful start.
    
    Change-Id: I1c02cc64946e534d845aa7ec7b72644bbe4d26f9
    BUG: 1452527
    Signed-off-by: Jiffin Tony Thottan <jthottan>
    Reviewed-on: https://review.gluster.org/17658
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: soumya k <skoduri>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
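
In systemd terms, the change described above amounts to roughly the following (a hedged sketch of the relevant unit options, not the literal diff): drop the dependency on the auto-generated var-run-gluster-shared_storage.mount unit, which only exists once the mount has already happened, and declare the service as Type=forking so the unit stays active after a successful start:

    # Illustrative sketch of the adjusted unit, not the literal patched file
    [Unit]
    Description=Mount glusterfs shared storage
    After=glusterd.service
    # removed: ordering/requirement on var-run-gluster-shared_storage.mount

    [Service]
    Type=forking        # keeps the unit active (instead of dead) after a successful start;
                        # with this type the helper is expected to background its retry loop
    ExecStart=/usr/libexec/glusterfs/mount-shared-storage.sh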

Comment 16 Shyamsundar 2017-09-05 17:30:44 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

