Bug 1316000 - gluster_shared_storage volume doesn't get mounted after disabling and enabling of shared storage.
Summary: gluster_shared_storage volume doesn't get mounted after disabling and enablin...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterd
Version: rhgs-3.1
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
: ---
Assignee: Gaurav Kumar Garg
QA Contact: storage-qa-internal@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-03-09 08:41 UTC by Shashank Raj
Modified: 2016-11-08 03:52 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-15 11:05:23 UTC
Embargoed:



Description Shashank Raj 2016-03-09 08:41:43 UTC
Description of problem:
gluster_shared_storage volume doesn't get mounted after disabling and enabling of shared storage.

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-19

How reproducible:
Not always

Steps to Reproduce:
1. Create and enable the shared volume using the command below on the cluster nodes.

gluster volume set all cluster.enable-shared-storage enable

2. Observe that the volume gets created and mounted at /run/gluster/shared_storage.
3. Disable the shared volume using the command below.

gluster volume set all cluster.enable-shared-storage disable

4. Observe that the volume gets deleted and unmounted from /run/gluster/shared_storage.

5. Enabling the volume again creates the volume but doesn't mount it.
6. Even manually mounting the volume fails with the messages below in the logs.

[2016-03-08 21:46:19.668458] I [rpc-clnt.c:1851:rpc_clnt_reconfig] 0-gluster_shared_storage-client-2: changing port to 49152 (from 0)
[2016-03-08 21:46:19.672503] E [socket.c:2278:socket_connect_finish] 0-gluster_shared_storage-client-2: connection to 10.70.37.52:49152 failed (Connection refused)
[2016-03-08 21:46:19.672534] E [MSGID: 108006] [afr-common.c:4015:afr_notify] 0-gluster_shared_storage-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2016-03-08 21:46:19.676102] I [fuse-bridge.c:5123:fuse_graph_setup] 0-fuse: switched to graph 0
[2016-03-08 21:46:19.676430] I [fuse-bridge.c:4040:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel 7.22
[2016-03-08 21:46:19.676636] I [MSGID: 108006] [afr-common.c:4143:afr_local_init] 0-gluster_shared_storage-replicate-0: no subvolumes up
[2016-03-08 21:46:19.677045] W [fuse-bridge.c:760:fuse_attr_cbk] 0-glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)
[2016-03-08 21:46:19.682185] I [fuse-bridge.c:4965:fuse_thread_proc] 0-fuse: unmounting /run/gluster/shared_storage
The message "I [MSGID: 108006] [afr-common.c:4143:afr_local_init] 0-gluster_shared_storage-replicate-0: no subvolumes up" repeated 2 times between [2016-03-08 21:46:19.676636] and [2016-03-08 21:46:19.680038]
[2016-03-08 21:46:19.683164] W [glusterfsd.c:1236:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7dc5) [0x7f090d12bdc5] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7f090e796905] -->/usr/sbin/glusterfs(cleanup_and_exit+0x69) [0x7f090e796789] ) 0-: received signum (15), shutting down
[2016-03-08 21:46:19.683194] I [fuse-bridge.c:5669:fini] 0-fuse: Unmounting '/run/gluster/shared_storage'.
[root@dhcp37-52 ~]# gluster volume set all c^Cster.enable-shared-storage enable

7. gluster volume status shows that the bricks of the shared storage volume are online; however, they use a different TCP port than the one seen in the gluster shared volume log.
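
To make the comparison in step 7 easier to repeat, here is a minimal sketch that pulls the port the client tried to reach out of the mount log. It only parses text in the format of the log lines quoted in step 6. The function name and the regular expressions are illustrative assumptions, not part of any gluster tool, and the port reported by gluster volume status still has to be read off separately.

```python
import re

# Patterns written against the two client log lines quoted in step 6.
# They are assumptions about the log format, not an official gluster schema.
PORT_CHANGE = re.compile(r"rpc_clnt_reconfig\] \S+: changing port to (\d+)")
CONNECT_FAIL = re.compile(r"connection to ([\d.]+):(\d+) failed \((.+?)\)")


def ports_from_mount_log(log_text):
    """Return (ports the client switched to, failed connection attempts)."""
    attempted = [int(port) for port in PORT_CHANGE.findall(log_text)]
    failed = [
        (host, int(port), reason)
        for host, port, reason in CONNECT_FAIL.findall(log_text)
    ]
    return attempted, failed


SAMPLE = (
    "[2016-03-08 21:46:19.668458] I [rpc-clnt.c:1851:rpc_clnt_reconfig] "
    "0-gluster_shared_storage-client-2: changing port to 49152 (from 0)\n"
    "[2016-03-08 21:46:19.672503] E [socket.c:2278:socket_connect_finish] "
    "0-gluster_shared_storage-client-2: connection to 10.70.37.52:49152 "
    "failed (Connection refused)\n"
)

if __name__ == "__main__":
    attempted, failed = ports_from_mount_log(SAMPLE)
    print("ports advertised to the client:", attempted)
    print("failed connections:", failed)
```

If the port printed here differs from the brick port shown by gluster volume status, the client was handed a port on which no brick is listening.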


Actual results:
gluster_shared_storage volume doesn't get mounted after disabling and enabling of shared storage.

Expected results:
Disabling and then re-enabling the shared volume should also mount the volume on /run/gluster/shared_storage and should not fail.

Additional info:

Comment 2 Gaurav Kumar Garg 2016-03-09 13:54:33 UTC
It seems that the disconnect event did not happen while disabling the cluster.enable-shared-storage option. There is some problem in RPC.


from brick logs:

[2016-03-08 00:36:19.055944] I [MSGID: 106005] [glusterd-handler.c:4908:__glusterd_brick_rpc_notify] 0-management: Brick dhcp37-52.lab.eng.blr.redhat.com:/var/lib/glusterd/ss_brick has disconnected from glusterd.
[2016-03-08 00:36:19.056289] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.70.36.49:65482
[2016-03-08 00:36:19.056316] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully


Due to the disconnect event, it was not able to remove the brick's port entry.

Comment 3 Gaurav Kumar Garg 2016-03-09 16:26:49 UTC
Since this issue is not reproducible every time and reproduces only very rarely, I am marking the severity and priority of this issue as low.

Comment 4 Atin Mukherjee 2016-03-10 05:31:15 UTC
Disconnect did happen but pmap_signout didn't, and that's why pmap_registry_remove () for port 49152 was not called and pmap.ports[49152].brickname was not NULLed out. As a result, pmap_registry_search picked up the same port, assuming it was mapped to the currently running brick process, and communicated it back to the client, resulting in a mount failure.
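
The sequence described above can be illustrated with a toy model. This is a sketch under stated assumptions, not glusterd's actual implementation: the class and method names only echo the pmap functions named in this comment, the lowest-free-port allocation and lowest-port-first search are simplifications chosen for the example, and the port 49153 it produces is illustrative (the report only says the brick came up on a different port).

```python
class PortMap:
    """Toy stand-in for glusterd's port-map registry (NOT the real code)."""

    def __init__(self, base_port=49152):
        self.base_port = base_port
        self.ports = {}  # port number -> brick name

    def registry_bind(self, brickname):
        # Assumption: hand out the lowest port with no recorded brick.
        port = self.base_port
        while port in self.ports:
            port += 1
        self.ports[port] = brickname
        return port

    def registry_remove(self, port):
        # What a processed signout would do: clear the brick name.
        self.ports.pop(port, None)

    def registry_search(self, brickname):
        # Assumption: return the first (lowest) port recorded for the brick.
        for port in sorted(self.ports):
            if self.ports[port] == brickname:
                return port
        return None


def disable_then_enable(signout_processed):
    """Return (port the new brick listens on, port advertised to clients)."""
    pmap = PortMap()
    brick = "/var/lib/glusterd/ss_brick"
    old_port = pmap.registry_bind(brick)        # first enable
    if signout_processed:                       # disable, signout handled
        pmap.registry_remove(old_port)
    listening_port = pmap.registry_bind(brick)  # re-enable, new brick process
    advertised_port = pmap.registry_search(brick)
    return listening_port, advertised_port


if __name__ == "__main__":
    print("signout processed:", disable_then_enable(True))
    print("signout lost:     ", disable_then_enable(False))
```

When the signout is lost, the stale entry for the old port survives, the search returns it, and the client is sent to a port where nothing is listening, which matches the "Connection refused" in the mount log.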

We'd need to check two things:

1. Was there a signout event initiated from the brick?
2. If the signout event was initiated, did it reach the program layer? If so, why couldn't it be processed?

Probably it will all boil down to finding the reason for the following logs:

[2016-03-08 00:36:19.056289] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.70.36.49:65482
[2016-03-08 00:36:19.056316] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully

Comment 5 Gaurav Kumar Garg 2016-03-15 11:05:23 UTC
The signout event didn't happen, because we didn't see any log entry like "removing brick <brick name> on port <port number>" in the glusterd logs.

And when shared_storage was first disabled, it didn't free port 49152, as we can see in the following log messages:

[2016-03-08 23:19:35.637523] W [rpcsvc.c:270:rpcsvc_program_actor] 0-rpc-service: RPC program not available (req 1298437 330) for 10.70.36.49:65499
[2016-03-08 23:19:35.637571] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully


In this log window he tried to disable shared_storage. Because of the above RPC error, glusterd was not able to execute actor_fn (), i.e. "__gluster_pmap_signout".
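
The check described in this comment (look for a signout entry, and for the RPC warning in the same window) can be sketched as a small log scan. The pattern for the signout entry follows the wording quoted above, "removing brick <brick name> on port <port number>", and is an assumption about the exact log text. The function name is hypothetical, and the two numbers in the "req" field are returned as-is without interpreting them.

```python
import re

# Wording taken from the comment above; the exact glusterd message may differ.
SIGNOUT = re.compile(r"removing brick (\S+) on port (\d+)")
# Matches the rpcsvc warning quoted in this report.
RPC_UNAVAILABLE = re.compile(r"RPC program not available \(req (\d+) (\d+)\)")


def summarize_glusterd_log(log_text):
    """Collect signout entries and 'RPC program not available' warnings."""
    signouts = [(brick, int(port)) for brick, port in SIGNOUT.findall(log_text)]
    rpc_unavailable = [
        (int(first), int(second))
        for first, second in RPC_UNAVAILABLE.findall(log_text)
    ]
    return {"signouts": signouts, "rpc_unavailable": rpc_unavailable}


SAMPLE = (
    "[2016-03-08 23:19:35.637523] W [rpcsvc.c:270:rpcsvc_program_actor] "
    "0-rpc-service: RPC program not available (req 1298437 330) "
    "for 10.70.36.49:65499\n"
    "[2016-03-08 23:19:35.637571] E [rpcsvc.c:565:rpcsvc_check_and_reply_error] "
    "0-rpcsvc: rpc actor failed to complete successfully\n"
)

if __name__ == "__main__":
    summary = summarize_glusterd_log(SAMPLE)
    if summary["rpc_unavailable"] and not summary["signouts"]:
        print("RPC warning present and no signout entry:", summary)
```

An RPC warning with no signout entry in the same window is the combination this comment describes.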

*Since this issue was reproduced only once and is not reproducible anymore, I am closing this issue.*

Feel free to re-open this issue if it reproduces again.

