Bug 763637 (GLUSTER-1905)

Summary: Mounting volume is not working when any one server is down
Product: [Community] GlusterFS
Component: glusterd
Reporter: Dhandapani <dgopal>
Assignee: Amar Tumballi <amarts>
Status: CLOSED DUPLICATE
Severity: medium
Priority: low
Version: pre-2.0
CC: amarts, gluster-bugs, lakshmipathi, vijay, vraman
Hardware: All
OS: Linux
Doc Type: Bug Fix

Description Dhandapani 2010-10-11 07:07:54 EDT
Steps to reproduce:

Create a plain distribute volume across 3 servers. Stop the volume. Bring down any one of the servers. If I then start the volume, mounting the volume hangs.
Comment 1 shishir gowda 2010-10-14 02:30:10 EDT
The client seems to retry the portmap query against the downed server every few seconds, which causes the mount operation to hang.

(gdb) bt
#0  client_query_portmap_cbk (req=0x7faecf6d9280, iov=0x7faecf6d92c0, count=1, myframe=0x7faed210f0c4) at client-handshake.c:746
#1  0x00007faed377b6f8 in rpc_clnt_handle_reply (clnt=0xccca58, pollin=0xcb8238) at rpc-clnt.c:752
#2  0x00007faed377ba57 in rpc_clnt_notify (trans=0xcccc78, mydata=0xccca88, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xcb8238) at rpc-clnt.c:865
#3  0x00007faed3778e68 in rpc_transport_notify (this=0xcccc78, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xcb8238) at rpc-transport.c:1142
#4  0x00007faed11fae90 in socket_event_poll_in (this=0xcccc78) at socket.c:1619
#5  0x00007faed11fb243 in socket_event_handler (fd=10, idx=3, data=0xcccc78, poll_in=1, poll_out=0, poll_err=0) at socket.c:1733
#6  0x00007faed39ccc2b in event_dispatch_epoll_handler (event_pool=0xcb2b48, events=0xcb7538, i=0) at event.c:812
#7  0x00007faed39cce3b in event_dispatch_epoll (event_pool=0xcb2b48) at event.c:876
#8  0x00007faed39cd1a3 in event_dispatch (event_pool=0xcb2b48) at event.c:984
#9  0x00000000004066fc in main (argc=5, argv=0x7fff11118db8) at glusterfsd.c:1410

The error messages:

[2010-10-14 14:47:48.563727] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:47:52.556309] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:47:52.556855] I [client-handshake.c:699:select_server_supported_programs] new-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-10-14 14:47:52.557519] I [client-handshake.c:535:client_setvolume_cbk] new-client-1: Connected to 127.0.1.1:24018, attached to remote volume '/export/dir2'.
[2010-10-14 14:47:52.557772] I [client-handshake.c:699:select_server_supported_programs] new-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-10-14 14:47:52.558290] I [client-handshake.c:535:client_setvolume_cbk] new-client-0: Connected to 127.0.1.1:24017, attached to remote volume '/export/dir1'.
[2010-10-14 14:47:55.558037] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:47:58.559514] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:02.560978] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:06.562368] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:10.563552] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:14.565056] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:18.566797] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:22.568069] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:26.569469] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:30.570912] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
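
The "every few seconds" retry can be checked directly from the client log. The following is a minimal sketch, not part of GlusterFS: the function name `portmap_failure_gaps` and the regular expression are assumptions made for illustration, and the sample lines are copied from the log above. It measures the time between consecutive portmap failures per client translator.

```python
import re
from datetime import datetime

# Sample lines copied from the client log above (one info line included
# to show that non-matching lines are skipped).
LOG_LINES = [
    "[2010-10-14 14:47:48.563727] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume",
    "[2010-10-14 14:47:52.556309] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume",
    "[2010-10-14 14:47:52.556855] I [client-handshake.c:699:select_server_supported_programs] new-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)",
    "[2010-10-14 14:47:55.558037] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume",
    "[2010-10-14 14:47:58.559514] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume",
]

# Hypothetical pattern written for this log format; adjust if the format differs.
FAILURE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] E "
    r"\[[^\]]*client_query_portmap_cbk\] (\S+): failed to get the port number"
)


def portmap_failure_gaps(lines):
    """Return {client name: [seconds between consecutive portmap failures]}."""
    last_seen = {}
    gaps = {}
    for line in lines:
        match = FAILURE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        name = match.group(2)
        if name in last_seen:
            delta = (stamp - last_seen[name]).total_seconds()
            gaps.setdefault(name, []).append(delta)
        last_seen[name] = stamp
    return gaps


if __name__ == "__main__":
    for name, deltas in portmap_failure_gaps(LOG_LINES).items():
        print(name, [round(d, 1) for d in deltas])
```

On the sample lines the gaps for new-client-2 come out at roughly 3 to 4 seconds, while new-client-0 and new-client-1 never appear because they connect successfully.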
Comment 2 Anand Avati 2010-11-18 05:55:46 EST
PATCH: http://patches.gluster.com/patch/5713 in master (protocol/client: skip notify if query portmap is successful)
Comment 3 Lakshmipathi G 2010-11-21 01:06:56 EST
Testing with 3.1.1qa9: created a DHT volume with 4 bricks and mounted it on a client.
Then stopped the volume and rebooted brick2.

Starting the volume again does not work:
brick1# gluster volume start qa9

brick1#ps aux | grep  gluste
root     21893  0.1  0.2  68252 17344 ?        Ssl  03:57   0:00 glusterd

#gluster volume info

Volume Name: qa9
Type: Distribute
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.209.59.112:/mnt/310
Brick2: 10.208.47.224:/mnt/310
Brick3: 10.209.163.191:/mnt/310
Brick4: 10.208.186.47:/mnt/310

#gluster volume start qa9 force
Starting volume qa9 has been unsuccessful


log file 
------
[2010-11-21 03:58:39.164973] I [glusterd-handler.c:936:glusterd_handle_cli_start_volume] glusterd: Received start vol reqfor volume qa9
[2010-11-21 03:58:39.165028] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by e8f5aa99-84d8-4bdb-be10-f79ce4e2734c
[2010-11-21 03:58:39.165044] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
[2010-11-21 03:58:39.165149] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers
[2010-11-21 03:58:39.165678] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.165703] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.165743] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.165755] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.165776] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.165788] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.166768] I [glusterd-utils.c:2101:glusterd_friend_find_by_hostname] glusterd: Friend 10.208.47.224 found.. state: 3
[2010-11-21 03:58:39.166787] I [glusterd-utils.c:2101:glusterd_friend_find_by_hostname] glusterd: Friend 10.209.163.191 found.. state: 3
[2010-11-21 03:58:39.166800] I [glusterd-utils.c:2101:glusterd_friend_find_by_hostname] glusterd: Friend 10.208.186.47 found.. state: 3
[2010-11-21 03:58:39.166855] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:58:39.168007] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.168025] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.168057] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.168072] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.168234] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.168250] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.168323] I [glusterd-utils.c:971:glusterd_volume_start_glusterfs] : About to start glusterfs for brick 10.209.59.112:/mnt/310
[2010-11-21 03:58:39.286443] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:58:39.288953] I [glusterd-pmap.c:237:pmap_registry_bind] pmap: adding brick /mnt/310 on port 24010
[2010-11-21 03:58:39.417515] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.417539] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.426714] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.426736] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.443323] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.443367] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.443433] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 3 peers
[2010-11-21 03:58:39.443712] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.443730] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.444085] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.444103] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.444132] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.444148] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.444167] I [glusterd-op-sm.c:4736:glusterd_op_txn_complete] glusterd: Cleared local lock
[2010-11-21 03:59:55.232350] I [glusterd-handler.c:965:glusterd_handle_cli_stop_volume] glusterd: Received stop vol reqfor volume qa9
[2010-11-21 03:59:55.232411] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by e8f5aa99-84d8-4bdb-be10-f79ce4e2734c
[2010-11-21 03:59:55.232425] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
[2010-11-21 03:59:55.232522] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers
[2010-11-21 03:59:55.233061] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:55.233084] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233217] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:55.233239] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233261] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:55.233273] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233343] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:59:55.233706] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:55.233724] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233755] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:55.233771] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233813] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:55.233825] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233842] I [glusterd-utils.c:2219:glusterd_brick_stop] : About to stop glusterfs for brick 10.209.59.112:/mnt/310
[2010-11-21 03:59:55.233912] I [glusterd-utils.c:854:glusterd_service_stop] : Stopping gluster brick running in pid: 21937
[2010-11-21 03:59:55.242369] I [glusterd-utils.c:854:glusterd_service_stop] : Stopping gluster nfsd running in pid: 21942
[2010-11-21 03:59:56.248626] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:59:56.248780] I [glusterd-pmap.c:281:pmap_registry_remove] pmap: removing brick /mnt/310 on port 24010
[2010-11-21 03:59:57.257197] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:57.257227] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.258344] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:57.258362] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.262791] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:57.262812] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.262876] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 3 peers
[2010-11-21 03:59:57.263214] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:57.263232] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.263302] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:57.263315] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.263402] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:57.263418] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.263437] I [glusterd-op-sm.c:4736:glusterd_op_txn_complete] glusterd: Cleared local lock
[2010-11-21 04:00:11.54797] I [glusterd-handler.c:716:glusterd_handle_cli_get_volume] glusterd: Received get vol req
[2010-11-21 04:00:11.55503] I [glusterd-handler.c:716:glusterd_handle_cli_get_volume] glusterd: Received get vol req
[2010-11-21 04:00:20.628679] I [glusterd-handler.c:936:glusterd_handle_cli_start_volume] glusterd: Received start vol reqfor volume qa9
[2010-11-21 04:00:20.628732] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by e8f5aa99-84d8-4bdb-be10-f79ce4e2734c
[2010-11-21 04:00:20.628746] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
[2010-11-21 04:00:20.628845] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers
[2010-11-21 04:00:20.629769] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 04:00:20.629788] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 04:00:20.629812] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 04:00:20.629824] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 04:00:48.569291] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2af151fd8579] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2af151fd7d2e] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2af151fd7c9e]))) rpc-clnt: forced unwinding frame type(Mgmt 3.1) op(--(3)) called at 2010-11-21 04:00:20.628829
[2010-11-21 04:00:52.268747] E [socket.c:1657:socket_connect_finish] management: connection to 10.208.186.47:24007 failed (Connection refused)
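
In the last transaction above, the lock request goes to 3 peers but only two ACC replies are logged before the frame is forcibly unwound. The following is a minimal sketch for spotting that from a glusterd log; the helper names `lock_rounds` and `silent_peers` and both regular expressions are assumptions made for illustration, not GlusterFS tooling. The sample lines are copied from the log above. Note that the log does not map UUIDs to hosts, so the result only narrows down which peer to look at.

```python
import re

# Lock-phase lines copied from the glusterd log above: the first start
# (all peers answer) and the start attempted after brick2 was rebooted.
LOG_LINES = [
    "[2010-11-21 03:58:39.165149] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers",
    "[2010-11-21 03:58:39.165678] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84",
    "[2010-11-21 03:58:39.165743] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122",
    "[2010-11-21 03:58:39.165776] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41",
    "[2010-11-21 04:00:20.628845] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers",
    "[2010-11-21 04:00:20.629769] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122",
    "[2010-11-21 04:00:20.629812] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41",
]

# Hypothetical patterns written for this log format.
SENT = re.compile(r"glusterd3_1_cluster_lock\] glusterd: Sent lock req to (\d+) peers")
ACK = re.compile(
    r"glusterd3_1_cluster_lock_cbk\] glusterd: Received ACC from uuid: ([0-9a-f-]+)"
)


def lock_rounds(lines):
    """Group lock ACKs under the 'Sent lock req' line that precedes them."""
    rounds = []
    for line in lines:
        sent = SENT.search(line)
        if sent:
            rounds.append({"expected": int(sent.group(1)), "acked": []})
            continue
        ack = ACK.search(line)
        if ack and rounds:
            rounds[-1]["acked"].append(ack.group(1))
    return rounds


def silent_peers(rounds):
    """For each incomplete round, list UUIDs that answered elsewhere but not here."""
    known = set()
    for entry in rounds:
        known.update(entry["acked"])
    return [
        sorted(known - set(entry["acked"]))
        for entry in rounds
        if len(entry["acked"]) < entry["expected"]
    ]


if __name__ == "__main__":
    print(silent_peers(lock_rounds(LOG_LINES)))
```

On the sample lines, the only incomplete round is the one at 04:00:20, and the peer that did not answer is 3588f8a7-244e-4b2e-890f-ddfed8a9bf84.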
Comment 4 Lakshmipathi G 2010-11-21 01:16:24 EST
However, after running killall glusterd on brick1, brick3 and brick4 and starting glusterd again, the volume start worked.

#gluster volume info

Volume Name: qa9
Type: Distribute
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.209.59.112:/mnt/310
Brick2: 10.208.47.224:/mnt/310
Brick3: 10.209.163.191:/mnt/310
Brick4: 10.208.186.47:/mnt/310
[root@epsilon ~] gluster volume start qa9
Starting volume qa9 has been successful
[root@epsilon ~] gluster volume stop qa9
Stopping volume will make its data inaccessible. Do you want to Continue? (y/n) y
Stopping volume qa9 has been successful

here--> rebooted brick2

[root@epsilon ~] gluster volume start qa9
[root@epsilon ~] gluster volume start qa9 force
Starting volume qa9 has been unsuccessful


Kill glusterd on all remaining bricks (1, 3, 4):
[root@epsilon ~] killall glusterd

and start it again:
[root@epsilon ~] glusterd
[root@epsilon ~] gluster volume start qa9
Starting volume qa9 has been successful
[root@epsilon ~] gluster volume info

Volume Name: qa9
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.209.59.112:/mnt/310
Brick2: 10.208.47.224:/mnt/310
Brick3: 10.209.163.191:/mnt/310
Brick4: 10.208.186.47:/mnt/310
Comment 5 Amar Tumballi 2011-01-21 02:41:36 EST

*** This bug has been marked as a duplicate of bug 2005 ***