+++ This bug was initially created as a clone of Bug #1335090 +++

Description of problem:
The shared volume does not get mounted on one (sometimes two) of the nodes after rebooting all nodes in the cluster, resulting in a missing symlink (/var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs).

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Make sure the shared volume is created and mounted on all nodes of the cluster, and that the symlink is created, as below:

[root@dhcp42-20 ~]# gluster volume status gluster_shared_storage
Status of volume: gluster_shared_storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick dhcp42-239.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49155     0          Y       2293
Brick dhcp43-175.lab.eng.blr.redhat.com:/va
r/lib/glusterd/ss_brick                     49155     0          Y       2281
Brick dhcp42-20.lab.eng.blr.redhat.com:/var
/lib/glusterd/ss_brick                      49155     0          Y       2266
Self-heal Daemon on localhost               N/A       N/A        Y       2257
Self-heal Daemon on dhcp42-239.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2287
Self-heal Daemon on dhcp43-175.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2253
Self-heal Daemon on dhcp42-196.lab.eng.blr.
redhat.com                                  N/A       N/A        Y       2258

Task Status of Volume gluster_shared_storage
------------------------------------------------------------------------------
There are no active volume tasks

dhcp42-20.lab.eng.blr.redhat.com:/gluster_shared_storage   27740928  1697152  26043776  7%  /run/gluster/shared_storage
dhcp42-239.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928  1697152  26043776  7%  /run/gluster/shared_storage
dhcp43-175.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928  1697152  26043776  7%  /run/gluster/shared_storage
dhcp42-196.lab.eng.blr.redhat.com:/gluster_shared_storage  27740928  1697152  26043776  7%  /run/gluster/shared_storage

[root@dhcp42-20 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 80 May 11 21:26 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-20.lab.eng.blr.redhat.com/nfs
[root@dhcp42-239 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-239.lab.eng.blr.redhat.com/nfs
[root@dhcp43-175 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:26 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp43-175.lab.eng.blr.redhat.com/nfs
[root@dhcp42-196 ~]# ls -ld /var/lib/nfs
lrwxrwxrwx. 1 root root 81 May 11 21:19 /var/lib/nfs -> /var/run/gluster/shared_storage/nfs-ganesha/dhcp42-196.lab.eng.blr.redhat.com/nfs

3. Reboot all the nodes of the cluster.
4. Observe that on 2 of the 4 nodes the shared storage is not mounted (most of the time it fails to mount on at least one node).
5. As a result, the symlink from /var/lib/nfs does not get created on these 2 nodes.
6. Both of these nodes have the entries in /etc/fstab, and manually mounting the shared storage on them works.

Actual results:
The shared volume does not get mounted on some nodes after rebooting all nodes in the cluster.

Expected results:
The shared volume should get mounted on all nodes after reboot.

Additional info:

--- Additional comment from Soumya Koduri on 2016-05-11 07:38:20 EDT ---

I see the below errors in the node4 logs:

[2016-05-11 15:56:04.984079] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gluster_shared_storage-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-05-11 15:56:04.984357] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-1: disconnected from gluster_shared_storage-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:04.984374] W [MSGID: 108001] [afr-common.c:4210:afr_notify] 0-gluster_shared_storage-replicate-0: Client-quorum is not met
[2016-05-11 15:56:05.291773] E [MSGID: 114058] [client-handshake.c:1524:client_query_portmap_cbk] 0-gluster_shared_storage-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2016-05-11 15:56:05.292104] I [MSGID: 114018] [client.c:2030:client_rpc_notify] 0-gluster_shared_storage-client-2: disconnected from gluster_shared_storage-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2016-05-11 15:56:05.292165] E [MSGID: 108006] [afr-common.c:4152:afr_notify] 0-gluster_shared_storage-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-05-11 15:56:05.295895] I [fuse-bridge.c:5166:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-05-11 15:56:05.296679] I [fuse-bridge.c:4077:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-05-11 15:56:05.296828] I [MSGID: 108006] [afr-common.c:4261:afr_local_init] 0-gluster_shared_storage-replicate-0: no subvolumes up
[2016-05-11 15:56:05.297606] E [dht-helper.c:1602:dht_inode_ctx_time_update] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca) [0x7fbcaef8ad6a] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379) [0x7fbcaecf51d9] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210) [0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode [Invalid argument]
[2016-05-11 15:56:05.298786] E [dht-helper.c:1602:dht_inode_ctx_time_update] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca) [0x7fbcaef8ad6a] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379) [0x7fbcaecf51d9] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210) [0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode [Invalid argument]
[2016-05-11 15:56:05.298818] W [fuse-bridge.c:766:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2016-05-11 15:56:05.305894] E [dht-helper.c:1602:dht_inode_ctx_time_update] (-->/usr/lib64/glusterfs/3.7.9/xlator/cluster/replicate.so(afr_discover+0x1ca) [0x7fbcaef8ad6a] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x379) [0x7fbcaecf51d9] -->/usr/lib64/glusterfs/3.7.9/xlator/cluster/distribute.so(dht_inode_ctx_time_update+0x210) [0x7fbcaeccd2d0] ) 0-gluster_shared_storage-dht: invalid argument: inode [Invalid argument]
[2016-05-11 15:56:05.307751] I [fuse-bridge.c:5007:fuse_thread_proc] 0-fuse: unmounting /run/gluster/shared_storage

Since this seems to be an issue with the gluster_shared_storage mount being lost, I am adjusting the components accordingly and requesting Avra to take a look.

--- Additional comment from Avra Sengupta on 2016-05-13 01:30:09 EDT ---

This is expected behaviour. Note that the shared volume itself is hosted on these nodes, and every node mounts it through one particular node. When all nodes are down, the shared storage volume is effectively down as well. When the nodes come back up, none of them can connect to the shared storage until the node whose entry is mentioned in /etc/fstab is up and serving the volume. That node itself will never connect to the shared storage on reboot, because by the time its /etc/fstab entry is replayed, the volume is not yet being served.
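The race described above (the fstab replay firing before glusterd serves the volume) is the kind of problem a retry loop around the mount solves. A minimal sketch, assuming nothing about the actual fix; the function name, attempt counts, and the commented mount command are illustrative only:

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds, so that an early fstab
# replay racing against glusterd startup does not permanently leave the
# node without the shared storage mount.

# retry_until ATTEMPTS DELAY CMD...
# Run CMD until it succeeds, sleeping DELAY seconds between tries;
# give up (return 1) after ATTEMPTS failed tries.
retry_until() {
    attempts=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical usage once glusterd is running:
#   retry_until 20 3 mount /run/gluster/shared_storage
```

Manually running the fstab mount later (as noted in step 6 above) works for the same reason: by then glusterd is up and the volume is being served.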
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#1) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#2) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#3) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#4) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#5) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#6) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#7) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#8) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17339 (scripts/shared_storage : systemd helper scripts to mount shared storage post reboot) posted (#9) for review on master by jiffin tony Thottan (jthottan)
COMMIT: https://review.gluster.org/17339 committed in master by Kaleb KEITHLEY (kkeithle)

------

commit 3183ca1bdee9cb0af22c017e3c610add8ff2b405
Author: Hendrik Visage <hvjunk>
Date:   Fri May 19 12:21:37 2017 +0530

    scripts/shared_storage : systemd helper scripts to mount shared storage post reboot

    Reported-by: Hendrik Visage <hvjunk>
    Change-Id: Ibcff56b00f45c8af54c1ae04974267c2180f5f63
    BUG: 1452527
    Signed-off-by: Jiffin Tony Thottan <jthottan>
    Reviewed-on: https://review.gluster.org/17339
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
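The actual unit and helper script are in the review linked above; the following is only an illustrative sketch of the shape such a systemd service takes. The unit name, script path, and timeout are assumptions, not the merged files:

```
# Sketch only -- see review.gluster.org/17339 for the real unit and script.
[Unit]
Description=Mount glusterfs shared storage after reboot
# Mounting the shared volume can only succeed once the local glusterd is up.
Requires=glusterd.service
After=glusterd.service

[Service]
# Hypothetical helper path; the script would retry the fstab mount of
# /run/gluster/shared_storage until the volume is being served.
ExecStart=/usr/local/libexec/mount-shared-storage.sh
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target
```

Ordering the service after glusterd.service is what breaks the deadlock Avra describes: the mount is no longer attempted only at fstab-replay time, when the volume cannot yet be served.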
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#1) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#2) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#3) for review on master by jiffin tony Thottan (jthottan)
REVIEW: https://review.gluster.org/17658 (systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage) posted (#4) for review on master by jiffin tony Thottan (jthottan)
COMMIT: https://review.gluster.org/17658 committed in master by Kaleb KEITHLEY (kkeithle)

------

commit 4c410a46ef58512ba751db8750910a6d09ec3696
Author: Jiffin Tony Thottan <jthottan>
Date:   Fri Jun 30 17:11:46 2017 +0530

    systemd/glusterfssharedstorage : remove dependency for var-run-gluster-shared_storage

    Currently the script used by glusterfssharedstorage has a dependency on
    var-run-gluster-shared_storage. But that unit will be present only after
    the node has rebooted. Also, in the reboot scenario there is a chance
    that this service is executed before var-run-gluster-shared_storage is
    created; in that case glusterfssharedstorage would succeed even without
    mounting the shared storage.

    Also, the type of glusterfssharedstorage is changed to "forking" so that
    it remains active (instead of dead) after a successful start.

    Change-Id: I1c02cc64946e534d845aa7ec7b72644bbe4d26f9
    BUG: 1452527
    Signed-off-by: Jiffin Tony Thottan <jthottan>
    Reviewed-on: https://review.gluster.org/17658
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: soumya k <skoduri>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
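In unit-file terms, the two changes this commit describes look roughly like the following. This is a sketch of the kind of edit, not the actual diff; the script path is hypothetical:

```
[Unit]
Requires=glusterd.service
After=glusterd.service
# Removed: a Requires=/After= on var-run-gluster-shared_storage.mount.
# That mount unit may not exist yet at service start, and depending on it
# let the service "succeed" without the shared storage actually mounted.

[Service]
# Type=forking: the helper forks off the mount work and the parent exits;
# per the commit message, the service then shows as active rather than
# dead after a successful start.
Type=forking
ExecStart=/usr/local/libexec/mount-shared-storage.sh
```

With Type=forking, systemd considers startup complete when the initial process exits, so the service's success or failure now reflects whether the helper itself ran, not whether an unrelated mount unit happened to exist.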
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/