Reproduce scenario: create a volume of plain distribute type with 3 servers. Stop the volume, then bring down any one of the servers. If the volume is then started, mounting the volume hangs.
The client keeps retrying the portmap query against the downed server every few seconds, which causes the mount operation to hang.

(gdb) bt
#0  client_query_portmap_cbk (req=0x7faecf6d9280, iov=0x7faecf6d92c0, count=1, myframe=0x7faed210f0c4) at client-handshake.c:746
#1  0x00007faed377b6f8 in rpc_clnt_handle_reply (clnt=0xccca58, pollin=0xcb8238) at rpc-clnt.c:752
#2  0x00007faed377ba57 in rpc_clnt_notify (trans=0xcccc78, mydata=0xccca88, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xcb8238) at rpc-clnt.c:865
#3  0x00007faed3778e68 in rpc_transport_notify (this=0xcccc78, event=RPC_TRANSPORT_MSG_RECEIVED, data=0xcb8238) at rpc-transport.c:1142
#4  0x00007faed11fae90 in socket_event_poll_in (this=0xcccc78) at socket.c:1619
#5  0x00007faed11fb243 in socket_event_handler (fd=10, idx=3, data=0xcccc78, poll_in=1, poll_out=0, poll_err=0) at socket.c:1733
#6  0x00007faed39ccc2b in event_dispatch_epoll_handler (event_pool=0xcb2b48, events=0xcb7538, i=0) at event.c:812
#7  0x00007faed39cce3b in event_dispatch_epoll (event_pool=0xcb2b48) at event.c:876
#8  0x00007faed39cd1a3 in event_dispatch (event_pool=0xcb2b48) at event.c:984
#9  0x00000000004066fc in main (argc=5, argv=0x7fff11118db8) at glusterfsd.c:1410

The error messages:
[2010-10-14 14:47:48.563727] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:47:52.556309] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:47:52.556855] I [client-handshake.c:699:select_server_supported_programs] new-client-1: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-10-14 14:47:52.557519] I [client-handshake.c:535:client_setvolume_cbk] new-client-1: Connected to 127.0.1.1:24018, attached to remote volume '/export/dir2'.
[2010-10-14 14:47:52.557772] I [client-handshake.c:699:select_server_supported_programs] new-client-0: Using Program GlusterFS-3.1.0, Num (1298437), Version (310)
[2010-10-14 14:47:52.558290] I [client-handshake.c:535:client_setvolume_cbk] new-client-0: Connected to 127.0.1.1:24017, attached to remote volume '/export/dir1'.
[2010-10-14 14:47:55.558037] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:47:58.559514] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:02.560978] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:06.562368] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:10.563552] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:14.565056] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:18.566797] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:22.568069] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:26.569469] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
[2010-10-14 14:48:30.570912] E [client-handshake.c:773:client_query_portmap_cbk] new-client-2: failed to get the port number for remote subvolume
PATCH: http://patches.gluster.com/patch/5713 in master (protocol/client: skip notify if query portmap is successful)
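As a rough model of what the patch's summary line describes (a hypothetical simplified sketch, not the actual protocol/client source; all names here are illustrative): the portmap-query callback should notify the parent translator of the failure when the query fails, so the mount can proceed with the remaining subvolumes, and skip that notify when the query succeeds, because a successful query continues into the normal handshake.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the portmap-query callback behaviour.
 * Before the fix, a mount over a subvolume whose server was down
 * retried the portmap query forever without telling the parent,
 * so the mount hung. Hypothetical sketch: names and structure are
 * illustrative, not the actual GlusterFS code. */

typedef struct {
    bool parent_notified_down; /* did we report the child as down? */
    bool handshake_started;    /* did we proceed to the handshake? */
} client_state_t;

static void query_portmap_cbk(client_state_t *st, int ret)
{
    if (ret != 0) {
        /* Query failed (server down): propagate the failure so
         * parents such as DHT can continue with the remaining
         * subvolumes instead of hanging the mount. */
        st->parent_notified_down = true;
        return;
    }
    /* Query succeeded: skip the notify and continue with the
     * GlusterFS handshake against the returned brick port. */
    st->handshake_started = true;
}
```
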
Testing with 3.1.1qa9: created a DHT volume with 4 bricks and mounted it on a client, then stopped the volume and rebooted brick2. Trying to start the volume again does not work.

brick1# gluster volume start qa9
brick1# ps aux | grep gluste
root 21893 0.1 0.2 68252 17344 ? Ssl 03:57 0:00 glusterd

# gluster volume info
Volume Name: qa9
Type: Distribute
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.209.59.112:/mnt/310
Brick2: 10.208.47.224:/mnt/310
Brick3: 10.209.163.191:/mnt/310
Brick4: 10.208.186.47:/mnt/310

# gluster volume start qa9 force
Starting volume qa9 has been unsuccessful

log file
------
[2010-11-21 03:58:39.164973] I [glusterd-handler.c:936:glusterd_handle_cli_start_volume] glusterd: Received start vol reqfor volume qa9
[2010-11-21 03:58:39.165028] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by e8f5aa99-84d8-4bdb-be10-f79ce4e2734c
[2010-11-21 03:58:39.165044] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
[2010-11-21 03:58:39.165149] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers
[2010-11-21 03:58:39.165678] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.165703] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.165743] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.165755] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.165776] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.165788] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.166768] I [glusterd-utils.c:2101:glusterd_friend_find_by_hostname] glusterd: Friend 10.208.47.224 found.. state: 3
[2010-11-21 03:58:39.166787] I [glusterd-utils.c:2101:glusterd_friend_find_by_hostname] glusterd: Friend 10.209.163.191 found.. state: 3
[2010-11-21 03:58:39.166800] I [glusterd-utils.c:2101:glusterd_friend_find_by_hostname] glusterd: Friend 10.208.186.47 found.. state: 3
[2010-11-21 03:58:39.166855] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:58:39.168007] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.168025] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.168057] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.168072] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.168234] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.168250] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.168323] I [glusterd-utils.c:971:glusterd_volume_start_glusterfs] : About to start glusterfs for brick 10.209.59.112:/mnt/310
[2010-11-21 03:58:39.286443] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:58:39.288953] I [glusterd-pmap.c:237:pmap_registry_bind] pmap: adding brick /mnt/310 on port 24010
[2010-11-21 03:58:39.417515] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.417539] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.426714] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.426736] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.443323] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.443367] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.443433] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 3 peers
[2010-11-21 03:58:39.443712] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:58:39.443730] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.444085] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:58:39.444103] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.444132] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:58:39.444148] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:58:39.444167] I [glusterd-op-sm.c:4736:glusterd_op_txn_complete] glusterd: Cleared local lock
[2010-11-21 03:59:55.232350] I [glusterd-handler.c:965:glusterd_handle_cli_stop_volume] glusterd: Received stop vol reqfor volume qa9
[2010-11-21 03:59:55.232411] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by e8f5aa99-84d8-4bdb-be10-f79ce4e2734c
[2010-11-21 03:59:55.232425] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
[2010-11-21 03:59:55.232522] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers
[2010-11-21 03:59:55.233061] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:55.233084] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233217] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:55.233239] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233261] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:55.233273] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233343] I [glusterd3_1-mops.c:1233:glusterd3_1_stage_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:59:55.233706] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:55.233724] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233755] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:55.233771] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233813] I [glusterd3_1-mops.c:594:glusterd3_1_stage_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:55.233825] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:55.233842] I [glusterd-utils.c:2219:glusterd_brick_stop] : About to stop glusterfs for brick 10.209.59.112:/mnt/310
[2010-11-21 03:59:55.233912] I [glusterd-utils.c:854:glusterd_service_stop] : Stopping gluster brick running in pid: 21937
[2010-11-21 03:59:55.242369] I [glusterd-utils.c:854:glusterd_service_stop] : Stopping gluster nfsd running in pid: 21942
[2010-11-21 03:59:56.248626] I [glusterd3_1-mops.c:1323:glusterd3_1_commit_op] glusterd: Sent op req to 3 peers
[2010-11-21 03:59:56.248780] I [glusterd-pmap.c:281:pmap_registry_remove] pmap: removing brick /mnt/310 on port 24010
[2010-11-21 03:59:57.257197] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:57.257227] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.258344] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:57.258362] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.262791] I [glusterd3_1-mops.c:717:glusterd3_1_commit_op_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:57.262812] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.262876] I [glusterd3_1-mops.c:1145:glusterd3_1_cluster_unlock] glusterd: Sent unlock req to 3 peers
[2010-11-21 03:59:57.263214] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 03:59:57.263232] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.263302] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3588f8a7-244e-4b2e-890f-ddfed8a9bf84
[2010-11-21 03:59:57.263315] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.263402] I [glusterd3_1-mops.c:456:glusterd3_1_cluster_unlock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 03:59:57.263418] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 03:59:57.263437] I [glusterd-op-sm.c:4736:glusterd_op_txn_complete] glusterd: Cleared local lock
[2010-11-21 04:00:11.54797] I [glusterd-handler.c:716:glusterd_handle_cli_get_volume] glusterd: Received get vol req
[2010-11-21 04:00:11.55503] I [glusterd-handler.c:716:glusterd_handle_cli_get_volume] glusterd: Received get vol req
[2010-11-21 04:00:20.628679] I [glusterd-handler.c:936:glusterd_handle_cli_start_volume] glusterd: Received start vol reqfor volume qa9
[2010-11-21 04:00:20.628732] I [glusterd-utils.c:232:glusterd_lock] glusterd: Cluster lock held by e8f5aa99-84d8-4bdb-be10-f79ce4e2734c
[2010-11-21 04:00:20.628746] I [glusterd-handler.c:2835:glusterd_op_txn_begin] glusterd: Acquired local lock
[2010-11-21 04:00:20.628845] I [glusterd3_1-mops.c:1091:glusterd3_1_cluster_lock] glusterd: Sent lock req to 3 peers
[2010-11-21 04:00:20.629769] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 049df669-aca0-4182-9de7-10afc1bc1122
[2010-11-21 04:00:20.629788] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 04:00:20.629812] I [glusterd3_1-mops.c:395:glusterd3_1_cluster_lock_cbk] glusterd: Received ACC from uuid: 3f484f92-f435-4bda-9179-be1f1d68ea41
[2010-11-21 04:00:20.629824] I [glusterd-utils.c:2062:glusterd_friend_find_by_uuid] glusterd: Friend found.. state: Peer in Cluster
[2010-11-21 04:00:48.569291] E [rpc-clnt.c:338:saved_frames_unwind] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_notify+0xb9) [0x2af151fd8579] (-->/usr/local/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) [0x2af151fd7d2e] (-->/usr/local/lib/libgfrpc.so.0(saved_frames_destroy+0xe) [0x2af151fd7c9e]))) rpc-clnt: forced unwinding frame type(Mgmt 3.1) op(--(3)) called at 2010-11-21 04:00:20.628829
[2010-11-21 04:00:52.268747] E [socket.c:1657:socket_connect_finish] management: connection to 10.208.186.47:24007 failed (Connection refused)
But if glusterd is killed (killall glusterd) on bricks 1, 3 and 4 and started again, it works:

# gluster volume info
Volume Name: qa9
Type: Distribute
Status: Stopped
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.209.59.112:/mnt/310
Brick2: 10.208.47.224:/mnt/310
Brick3: 10.209.163.191:/mnt/310
Brick4: 10.208.186.47:/mnt/310

[root@epsilon ~] gluster volume start qa9
Starting volume qa9 has been successful
[root@epsilon ~] gluster volume stop qa9
Stopping volume will make its data inaccessible. Do you want to Continue? (y/n) y
Stopping volume qa9 has been successful

here--> rebooted brick2

[root@epsilon ~] gluster volume start qa9
[root@epsilon ~] gluster volume start qa9 force
Starting volume qa9 has been unsuccessful

Kill glusterd on all remaining bricks 1, 3, 4:
[root@epsilon ~] killall glusterd

and start again:
[root@epsilon ~] glusterd
[root@epsilon ~] gluster volume start qa9
Starting volume qa9 has been successful
[root@epsilon ~] gluster volume info
Volume Name: qa9
Type: Distribute
Status: Started
Number of Bricks: 4
Transport-type: tcp
Bricks:
Brick1: 10.209.59.112:/mnt/310
Brick2: 10.208.47.224:/mnt/310
Brick3: 10.209.163.191:/mnt/310
Brick4: 10.208.186.47:/mnt/310
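One reading of the log above (a hypothetical model, not the glusterd source; all names are illustrative): the start transaction (lock -> stage -> commit -> unlock) sent a lock request to 3 peers but only 2 ACCs ever arrived; the rebooted peer's management port refuses connections, so its reply never comes and the pending frame is eventually force-unwound, leaving the transaction incomplete.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a lock transaction that waits on an ACC from every
 * peer it contacted. An unreachable peer never ACCs, so the
 * transaction cannot complete; the caller only finds out via a
 * forced frame unwind (as in the 04:00:48 log line). Illustrative
 * model only, not the actual glusterd state machine. */

typedef struct {
    int peers;          /* peers the lock request was sent to */
    int accs_received;  /* ACCs collected so far */
} lock_txn_t;

/* The cluster lock is held only once every peer has acknowledged. */
static bool lock_txn_complete(const lock_txn_t *txn)
{
    return txn->accs_received == txn->peers;
}

/* Called per peer: reachable peers ACC, a rebooted one does not. */
static void peer_reply(lock_txn_t *txn, bool peer_up)
{
    if (peer_up)
        txn->accs_received++;
}
```
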
*** This bug has been marked as a duplicate of bug 2005 ***