Description of problem:
gluster_shared_storage volume doesn't get mounted after disabling and re-enabling shared storage.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
Not always

Steps to Reproduce:
1. Create and enable the shared volume with the following command on the cluster nodes:
   gluster volume set all cluster.enable-shared-storage enable
2. Observe that the volume gets created and mounted at /run/gluster/shared_storage.
3. Disable the shared volume:
   gluster volume set all cluster.enable-shared-storage disable
4. Observe that the volume gets deleted and unmounted from /run/gluster/shared_storage.
5. Enable the volume again: the volume is created but is not mounted.
6. Even mounting the volume manually fails, with the following messages in the logs:

[2016-03-08 21:46:19.668458] I [rpc-clnt.c:1851:rpc_clnt_reconfig] 0-gluster_shared_storage-client-2: changing port to 49152 (from 0)
[2016-03-08 21:46:19.672503] E [socket.c:2278:socket_connect_finish] 0-gluster_shared_storage-client-2: connection to 10.70.37.52:49152 failed (Connection refused)
[2016-03-08 21:46:19.672534] E [MSGID: 108006] [afr-common.c:4015:afr_notify] 0-gluster_shared_storage-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-03-08 21:46:19.676102] I [fuse-bridge.c:5123:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-03-08 21:46:19.676430] I [fuse-bridge.c:4040:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-03-08 21:46:19.676636] I [MSGID: 108006] [afr-common.c:4143:afr_local_init] 0-gluster_shared_storage-replicate-0: no subvolumes up
[2016-03-08 21:46:19.677045] W [fuse-bridge.c:760:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2016-03-08 21:46:19.682185] I [fuse-bridge.c:4965:fuse_thread_proc] 0-fuse: unmounting /run/gluster/shared_storage
The message "I [MSGID: 108006] [afr-common.c:4143:afr_local_init] 0-gluster_shared_storage-replicate-0: no subvolumes up" repeated 2 times between [2016-03-08 21:46:19.676636] and [2016-03-08 21:46:19.680038]
[2016-03-08 21:46:19.683164] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f090d12bdc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f090e796905] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f090e796789] ) 0-: received signum (15), shutting down
[2016-03-08 21:46:19.683194] I [fuse-bridge.c:5669:fini] 0-fuse: Unmounting '/run/gluster/shared_storage'.
[root@dhcp37-52 ~]# gluster volume set all c^Cster.enable-shared-storage enable

7. gluster volume status shows the bricks of the shared storage volume as online; however, they use a different TCP port than the one seen in the gluster shared volume log.

Actual results:
gluster_shared_storage volume doesn't get mounted after disabling and re-enabling shared storage.

Expected results:
Disabling and then enabling the shared volume should also mount the volume at /run/gluster/shared_storage, and the mount should not fail.

Additional info:
It seems that while disabling the cluster.enable-shared-storage option, the disconnect event did not happen correctly; there is some problem in RPC. From the brick logs:

[2016-03-08 00:36:19.055944] I [MSGID: 106005] [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management: Brick dhcp37-52.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick has disconnected from glusterd.
[2016-03-08 00:36:19.056289] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.70.36.49:65482
[2016-03-08 00:36:19.056316] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

Because of this disconnect-event problem, the entry for the brick's port was not removed.
Since this issue is not reproducible every time, and the frequency of hitting it is very low, marking the severity and priority of this issue as low.
The disconnect did happen, but pmap_signout didn't; that's why pmap_registry_remove() for port 49152 was not called and pmap.ports[49152].brickname was not NULLed out. This caused pmap_registry_search() to pick up the same port, assume it was still mapped to the currently running brick process, and communicate it back to the client, resulting in a mount failure.

We need to check two things:
1. Was there a signout event initiated from the brick?
2. If a signout event was initiated, did it reach the program layer? If so, why couldn't it be processed?

It will probably all boil down to finding the reason for the following logs:

[2016-03-08 00:36:19.056289] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.70.36.49:65482
[2016-03-08 00:36:19.056316] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
The signout event didn't happen: we don't see any log entry like "removing brick <brick name> on port <port number>" in the glusterd logs. And when the first shared_storage disable happened, it didn't free port 49152, as we can see in the following log messages:

[2016-03-08 23:19:35.637523] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.70.36.49:65499
[2016-03-08 23:19:35.637571] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

This log window is from when shared_storage was being disabled. Because of the above RPC error, rpcsvc could not execute the actor_fn(), i.e. __gluster_pmap_signout.

Since this issue was reproduced only once and is not reproducible anymore, I am closing it. Feel free to re-open this issue if it reproduces again.