Created attachment 926773 [details]
Logs

Description of problem:
The Gluster CLI command "volume rebalance status" intermittently returns the following error, which it should not:
"failed: Another transaction is in progress for rebalvol. Please try again after sometime"

Version-Release number of selected component (if applicable):
glusterfs-server-3.6.0.27-1.el6rhs.x86_64

How reproducible:
Intermittent. It occurred once in the last 7 to 8 runs.

Steps to Reproduce:
Three nodes are used: NODE1, NODE2 and NODE3.
1. Create a DHT volume with 2 bricks using NODE1 and NODE2.
2. Start it and mount it on a client using the FUSE protocol.
3. Create data on it.
4. Add a spare brick (from NODE1 or NODE2).
5. Start rebalance.
6. Peer probe NODE3.
7. Run the CLI command to get the rebalance status, i.e. gluster volume rebalance <VOL Name> status -> it returns a failure here.

Actual results:
$ gluster volume rebalance rebalvol status
volume rebalance: rebalvol: failed: Another transaction is in progress for rebalvol. Please try again after sometime

Expected results:
The rebalance status command should not fail.
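The steps above can be sketched as a shell session. This is an illustrative sketch only: the NODE1/NODE2/NODE3 placeholders, brick paths, mount point, and file sizes are assumptions, and it presumes glusterd is already running on all nodes with NODE1 and NODE2 already peered.

```shell
# Hedged reproduction sketch, run from NODE1. Hostnames, paths, and
# data sizes are placeholders, not taken verbatim from the report.
gluster volume create rebalvol NODE1:/bricks/rebalvol_brick0 NODE2:/bricks/rebalvol_brick1
gluster volume start rebalvol

# On the client: FUSE mount and create some data
mount -t glusterfs NODE1:/rebalvol /mnt/rebalvol
mkdir /mnt/rebalvol/small
for i in $(seq 1 50); do
    dd if=/dev/zero of=/mnt/rebalvol/small/$i.small bs=1K count=10
done

# Back on NODE1: add a spare brick, start rebalance, then probe NODE3
gluster volume add-brick rebalvol NODE1:/bricks/rebalvol-sparebrick
gluster volume rebalance rebalvol start
gluster peer probe NODE3

# This is the step that intermittently fails with the lock error
gluster volume rebalance rebalvol status
```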
Additional info:
All GlusterFS CLIs were executed on Node1.

GlusterD logs from Node1
++++++++++++++++++++++++
[2014-08-13 17:33:10.990849] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2014-08-13 17:33:11.004812] E [glusterd-brick-ops.c:1263:glusterd_op_stage_add_brick] 0-management: Unable to get replica count
[2014-08-13 17:33:11.786017] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_op_commit_perform+0x5a) [0x7fb6b118bcea] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(+0x512f5) [0x7fb6b11892f5] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x5a6) [0x7fb6b12076f6]))) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=rebalvol --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
[2014-08-13 17:33:12.214302] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /bricks/rebalvol-sparebrick on port 49153
[2014-08-13 17:33:12.217168] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:13.066543] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_nodesvc_disconnect+0x4c) [0x7fb6b11989dc] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fb6b11988b5] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x31c540d623]))) 0-rpc_transport: invalid argument: this
[2014-08-13 17:33:13.070817] I [MSGID: 106006] [glusterd-handler.c:4280:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2014-08-13 17:33:14.123785] I [glusterd-utils.c:6300:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2014-08-13 17:33:14.124206] I [glusterd-utils.c:6305:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2014-08-13 17:33:14.124603] I [glusterd-utils.c:6310:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2014-08-13 17:33:14.124973] I [glusterd-utils.c:6315:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2014-08-13 17:33:14.125338] I [glusterd-utils.c:6320:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2014-08-13 17:33:14.125662] I [glusterd-utils.c:6325:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2014-08-13 17:33:14.140964] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:14.141370] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-08-13 17:33:18.169921] I [glusterd-utils.c:11658:glusterd_generate_and_set_task_id] 0-management: Generated task-id 24d06cf5-acf7-445c-96a7-3075b0556091 for key rebalance-id
[2014-08-13 17:33:23.632053] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:28.949165] E [glusterd-utils.c:10429:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-08-13 17:33:34.317150] I [glusterd-handler.c:1109:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req rhsauto022.lab.eng.blr.redhat.com 24007
[2014-08-13 17:33:34.324759] I [glusterd-handler.c:3199:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: rhsauto022.lab.eng.blr.redhat.com (24007)
[2014-08-13 17:33:34.728047] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:34.731659] I [glusterd-handler.c:3180:glusterd_friend_add] 0-management: connect returned 0
[2014-08-13 17:33:35.047512] I [glusterd-rpc-ops.c:237:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:35.366233] I [glusterd-rpc-ops.c:309:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2014-08-13 17:33:35.366556] I [glusterd-rpc-ops.c:359:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com, port: 0
[2014-08-13 17:33:35.573717] I [glusterd-handshake.c:1011:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-08-13 17:33:35.754419] I [glusterd-handler.c:2611:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:35.754512] I [glusterd-handler.c:2663:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.36.252, op_ret: 0, op_errno: 0, ret: 0
[2014-08-13 17:33:35.756517] I [glusterd-handler.c:2315:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:35.937837] I [glusterd-handler.c:3367:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto022.lab.eng.blr.redhat.com (0), ret: 0
[2014-08-13 17:33:36.140254] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.140286] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.428622] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:36.428662] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 17:33:36.428673] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 17:33:36.428903] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:36.574742] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 17:33:36.762227] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:36.762313] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 17:33:36.762331] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 17:33:36.762348] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, hostname:rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 17:33:37.710327] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 17:33:37.710376] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 17:33:37.710387] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 17:33:37.710397] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, hostname:rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:14.141536] I [MSGID: 106006] [glusterd-handler.c:4280:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2014-08-13 17:34:58.966309] I [glusterd-handshake.c:812:__server_event_notify] 0-: received defrag status updated
[2014-08-13 17:34:58.971107] W [socket.c:529:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-dc09a802-985a-4755-a805-6b4bb5083d70.sock failed (No data available)
[2014-08-13 17:34:59.136440] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(__glusterd_defrag_notify+0x301) [0x7fb6b11da8d1] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fb6b11988b5] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x31c540d623]))) 0-rpc_transport: invalid argument: this
[2014-08-13 17:34:59.136475] I [MSGID: 106007] [glusterd-rebalance.c:173:__glusterd_defrag_notify] 0-management: Rebalance process for volume rebalvol has disconnected.
[2014-08-13 17:35:36.068702] W [glusterd-locks.c:547:glusterd_mgmt_v3_lock] 0-management: Lock for rebalvol held by 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 17:35:36.068742] E [glusterd-handler.c:687:glusterd_op_txn_begin] 0-management: Unable to acquire lock for rebalvol
[2014-08-13 17:36:06.571213] W [glusterd-locks.c:547:glusterd_mgmt_v3_lock] 0-management: Lock for rebalvol held by 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 17:36:06.571280] E [glusterd-handler.c:687:glusterd_op_txn_begin] 0-management: Unable to acquire lock for rebalvol

Rebalance log from Node1
++++++++++++++++++++++++
[2014-08-13 17:33:23.972375] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-rebalvol-client-0: changing port to 49152 (from 0)
[2014-08-13 17:33:23.975925] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-rebalvol-client-2: changing port to 49153 (from 0)
[2014-08-13 17:33:23.979665] I [client-handshake.c:1415:select_server_supported_programs] 0-rebalvol-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-08-13 17:33:23.979902] I [client-handshake.c:1415:select_server_supported_programs] 0-rebalvol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-08-13 17:33:23.980019] I [client-handshake.c:1200:client_setvolume_cbk] 0-rebalvol-client-2: Connected to rebalvol-client-2, attached to remote volume '/bricks/rebalvol-sparebrick'.
[2014-08-13 17:33:23.980052] I [client-handshake.c:1212:client_setvolume_cbk] 0-rebalvol-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2014-08-13 17:33:23.980433] I [client-handshake.c:188:client_set_lk_version_cbk] 0-rebalvol-client-2: Server lk version = 1
[2014-08-13 17:33:23.980491] I [client-handshake.c:1200:client_setvolume_cbk] 0-rebalvol-client-0: Connected to rebalvol-client-0, attached to remote volume '/bricks/rebalvol_brick0'.
[2014-08-13 17:33:23.980505] I [client-handshake.c:1212:client_setvolume_cbk] 0-rebalvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-08-13 17:33:23.980667] I [client-handshake.c:188:client_set_lk_version_cbk] 0-rebalvol-client-0: Server lk version = 1
[2014-08-13 17:33:28.949399] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-rebalvol-client-1: changing port to 49152 (from 0)
[2014-08-13 17:33:28.955578] I [client-handshake.c:1415:select_server_supported_programs] 0-rebalvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-08-13 17:33:28.956078] I [client-handshake.c:1200:client_setvolume_cbk] 0-rebalvol-client-1: Connected to rebalvol-client-1, attached to remote volume '/bricks/rebalvol_brick1'.
[2014-08-13 17:33:28.956112] I [client-handshake.c:1212:client_setvolume_cbk] 0-rebalvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2014-08-13 17:33:28.956492] I [client-handshake.c:188:client_set_lk_version_cbk] 0-rebalvol-client-1: Server lk version = 1
[2014-08-13 17:33:28.968263] I [dht-common.c:2990:dht_setxattr] 0-rebalvol-dht: fixing the layout of /
[2014-08-13 17:33:29.035714] I [dht-rebalance.c:1356:gf_defrag_migrate_data] 0-rebalvol-dht: migrate data called on /
[2014-08-13 17:33:29.154975] I [dht-rebalance.c:1596:gf_defrag_migrate_data] 0-rebalvol-dht: Migration operation on dir / took 0.12 secs
[2014-08-13 17:33:29.161667] I [dht-common.c:2990:dht_setxattr] 0-rebalvol-dht: fixing the layout of /small
[2014-08-13 17:33:29.163608] I [dht-rebalance.c:1356:gf_defrag_migrate_data] 0-rebalvol-dht: migrate data called on /small
[2014-08-13 17:33:29.218968] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/1.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.224349] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/1.small)
[2014-08-13 17:33:29.232294] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/6.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.236714] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/6.small)
[2014-08-13 17:33:29.240828] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/7.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.245096] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/7.small)
[2014-08-13 17:33:29.253847] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/17.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.258661] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/17.small)
[2014-08-13 17:33:29.263650] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/23.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.267536] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/23.small)
[2014-08-13 17:33:29.272298] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/27.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.275211] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/27.small)
[2014-08-13 17:33:29.280690] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/35.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.283448] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/35.small)
[2014-08-13 17:33:29.288682] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/40.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.290961] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/40.small)
[2014-08-13 17:33:29.294069] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/43.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.296778] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/43.small)
[2014-08-13 17:33:29.301337] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/47.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.303833] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol

GlusterD logs from Node3
++++++++++++++++++++++++
[2014-08-13 10:51:57.522171] I [glusterd-handshake.c:1011:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-08-13 10:51:57.522276] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-08-13 10:51:57.522322] I [glusterd.c:176:glusterd_uuid_generate_save] 0-management: generated UUID: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 10:51:57.527903] I [glusterd-handler.c:2611:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:57.530351] I [glusterd-handler.c:2639:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 10.70.36.249 (24007)
[2014-08-13 10:51:57.532428] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 10:51:57.535231] I [glusterd-handler.c:3180:glusterd_friend_add] 0-management: connect returned 0
[2014-08-13 10:51:57.535354] I [glusterd-handler.c:2663:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.36.249, op_ret: 0, op_errno: 0, ret: 0
[2014-08-13 10:51:57.840143] I [glusterd-handler.c:2315:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:57.849774] E [glusterd-utils.c:6844:glusterd_brick_start] 0-management: Could not find peer on which brick rhsauto008.lab.eng.blr.redhat.com:/bricks/rebalvol_brick1 resides
[2014-08-13 10:51:57.849819] E [glusterd-op-sm.c:1764:glusterd_start_bricks] 0-management: Failed to start rhsauto008.lab.eng.blr.redhat.com:/bricks/rebalvol_brick1 for rebalvol
[2014-08-13 10:51:57.899227] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 10:51:57.903021] I [glusterd-handler.c:3367:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.36.249 (0), ret: 0
[2014-08-13 10:51:57.907609] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-08-13 10:51:57.907662] I [MSGID: 106006] [glusterd-handler.c:4280:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2014-08-13 10:51:58.544595] I [glusterd-rpc-ops.c:237:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249
[2014-08-13 10:51:58.546883] I [glusterd-rpc-ops.c:309:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2014-08-13 10:51:58.726575] I [glusterd-rpc-ops.c:359:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249, port: 0
[2014-08-13 10:51:58.726638] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249
[2014-08-13 10:51:58.929257] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:58.929320] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, hostname:rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 10:51:58.935047] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 10:51:58.938214] I [glusterd-handler.c:3180:glusterd_friend_add] 0-management: connect returned 0
[2014-08-13 10:51:58.938241] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, hostname:rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 10:51:58.938254] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 10:51:59.217750] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:59.219346] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 36dd9790-9ddb-4b75-859b-581a2ff7be44
[2014-08-13 10:51:59.219363] E [glusterd-op-sm.c:6508:glusterd_op_sm] 0-management: Unable to get transaction's opinfo
[2014-08-13 10:51:59.365567] I [glusterd-handshake.c:1011:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-08-13 10:51:59.546936] I [glusterd-handler.c:2315:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 10:51:59.548459] I [glusterd-handler.c:3367:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto008.lab.eng.blr.redhat.com (0), ret: 0
[2014-08-13 10:51:59.550319] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249
[2014-08-13 10:51:59.550344] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 10:51:59.553374] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:52:00.386229] I [glusterd-rpc-ops.c:359:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com, port: 0
[2014-08-13 10:52:00.499070] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 10:52:00.499136] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 10:52:00.499148] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, hostname:rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 10:52:00.499156] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 10:52:00.689968] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 12:55:50.573785] W [socket.c:529:__socket_rwv] 0-management: readv on 10.70.36.249:24007 failed (No data available)
[2014-08-13 12:55:50.574120] I [MSGID: 106004] [glusterd-handler.c:4388:__glusterd_peer_rpc_notify] 0-management: Peer 7be04f41-1414-4157-a732-b274e29b212d, in Peer in Cluster state, has disconnected from glusterd.
[2014-08-13 12:55:50.574180] W [glusterd-locks.c:632:glusterd_mgmt_v3_unlock] 0-management: Lock for vol rebalvol not held
[2014-08-13 12:55:51.234271] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-08-13 12:56:13.725432] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.6.0.27 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2014-08-13 12:56:13.732630] I [glusterd.c:1215:init] 0-management: Using /var/lib/glusterd as working directory
[2014-08-13 12:56:13.736328] W [rdma.c:4194:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed (No such device)
[2014-08-13 12:56:13.736355] E [rdma.c:4482:init] 0-rdma.management: Failed to initialize IB Device
[2014-08-13 12:56:13.736367] E [rpc-transport.c:333:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2014-08-13 12:56:13.736425] W [rpcsvc.c:1524:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2014-08-13 12:56:17.437364] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-08-13 12:56:17.437445] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-08-13 12:56:17.437475] I [glusterd-store.c:1960:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 30000
[2014-08-13 12:56:17.437708] I [glusterd-store.c:3287:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
[2014-08-13 12:56:17.437845] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/options, returned error: (No such file or directory)
Lala, Is sosreports or gluterd logs on the nodes available ?
Sas, I have already attached the glusterd and rebalance logs to the bug. Please check attachment 926773 [details].
Lala,
Is it possible to add the following things to the BVT logs when you reproduce this bug again?
1. peer status
2. volume info
3. /var/lib/glusterd/ files
Can you also add a sos report if available? The current logs are not enough to find the root cause of the problem.
Regards,
Gaurav
Here is the Beaker Job link: https://beaker.engineering.redhat.com/jobs/720498

I can see the volume info in the logs:

Volume Name: rebalvol
Type: Distribute
Volume ID: dc09a802-985a-4755-a805-6b4bb5083d70
Status: Started
Snap Volume: no
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhsauto019.lab.eng.blr.redhat.com:/bricks/rebalvol_brick0
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/rebalvol_brick1
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

However, the automation needs to be updated to capture the /var/lib/glusterd/ files, peer status, and sos reports. I will work on that.
Lala,
You are creating the volume with two nodes, NODE1 and NODE2, which have the hostnames "rhsauto019.lab.eng.blr.redhat.com" and "rhsauto008.lab.eng.blr.redhat.com". So why do other uuids appear in the logs after probing NODE3? The glusterd log entries on Node1 that show the other uuids are below:

[2014-08-13 17:33:35.047512] I [glusterd-rpc-ops.c:237:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.140254] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.140286] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.428662] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249

This means that another node may have issued some other task; that is why a lock is being held for that transaction and the following error is shown when the rebalance status command is issued:

"volume rebalance: rebalvol: failed: Another transaction is in progress for rebalvol. Please try again after sometime"

Could you give us the "peer status" output so that we can find out how many nodes are in the cluster?
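Since the error message itself suggests trying again, a test harness could tolerate the transient lock by retrying the status command and treating only this specific message as retryable. This is a hedged sketch of a workaround, not part of the fix; the function name, retry count, and the GLUSTER_CMD override (used so the logic can be exercised without a live cluster) are illustrative assumptions.

```shell
# Illustrative retry wrapper for the transient mgmt_v3 lock error.
# GLUSTER_CMD defaults to "gluster" but can be overridden for testing.
retry_rebalance_status() {
    vol=$1
    cmd=${GLUSTER_CMD:-gluster}
    attempts=0
    while [ "$attempts" -lt 5 ]; do
        # Capture both stdout and stderr; succeed on a zero exit status.
        if out=$("$cmd" volume rebalance "$vol" status 2>&1); then
            printf '%s\n' "$out"
            return 0
        fi
        case $out in
            *"Another transaction is in progress"*)
                # Transient cluster-wide lock: back off and retry.
                attempts=$((attempts + 1))
                sleep 1 ;;
            *)
                # Any other failure is real: report it and stop.
                printf '%s\n' "$out" >&2
                return 1 ;;
        esac
    done
    printf '%s\n' "$out" >&2
    return 1
}
```

Usage would be `retry_rebalance_status rebalvol` in place of the bare CLI call; non-lock failures are still surfaced immediately rather than retried.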
Gaurav, I have changed the automation to print "gluster peer status" for this testcase, as you asked. So hopefully the next time the issue occurs, we will have answers. https://code.engineering.redhat.com/gerrit/#/c/31785/
Lala, we will definitely have answers the next time the issue occurs.
Downstream patch link: https://code.engineering.redhat.com/gerrit/#/c/36252/
I haven't seen this issue on glusterfs-server-3.6.0.33-1.el6rhs. Hence marking this as verified.
Atin, Kindly review the edited doc text and sign-off.
Doc text looks okay.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html