Created attachment 926773 [details]
Logs

Description of problem:
The Gluster CLI command "volume rebalance status" intermittently returns the following error, which it should not:
"failed: Another transaction is in progress for rebalvol. Please try again after sometime"

Version-Release number of selected component (if applicable):
glusterfs-server-3.6.0.27-1.el6rhs.x86_64

How reproducible:
Intermittent. It occurred once in the last 7 to 8 runs.

Steps to Reproduce:
Three nodes are used: NODE1, NODE2 and NODE3.
1. Create a DHT volume with 2 bricks using NODE1 and NODE2.
2. Start it and mount it on a client using the FUSE protocol.
3. Create data on it.
4. Add a spare brick (from NODE1 or NODE2).
5. Start rebalance.
6. Peer probe NODE3.
7. Run the CLI command to get the rebalance status, i.e. gluster volume rebalance <VOL Name> status -> it returns a failure here.

Actual results:
$ gluster volume rebalance rebalvol status
volume rebalance: rebalvol: failed: Another transaction is in progress for rebalvol. Please try again after sometime

Expected results:
The rebalance status command should not fail.
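The steps above can be sketched as a shell session. This is an illustrative sketch only: the NODE1/NODE2/NODE3 placeholders, brick paths, mount point, and file sizes are assumptions, and it presumes glusterd is already running on all nodes with NODE1 and NODE2 already peered.

```shell
# Hedged reproduction sketch, run from NODE1. Hostnames, paths, and
# data sizes are placeholders, not taken verbatim from the report.
gluster volume create rebalvol NODE1:/bricks/rebalvol_brick0 NODE2:/bricks/rebalvol_brick1
gluster volume start rebalvol

# On the client: FUSE mount and create some data
mount -t glusterfs NODE1:/rebalvol /mnt/rebalvol
mkdir /mnt/rebalvol/small
for i in $(seq 1 50); do
    dd if=/dev/zero of=/mnt/rebalvol/small/$i.small bs=1K count=10
done

# Back on NODE1: add a spare brick, start rebalance, then probe NODE3
gluster volume add-brick rebalvol NODE1:/bricks/rebalvol-sparebrick
gluster volume rebalance rebalvol start
gluster peer probe NODE3

# This is the step that intermittently fails with the lock error
gluster volume rebalance rebalvol status
```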
Additional info:
All GlusterFS CLIs were executed on Node1.

GlusterD logs from Node1
++++++++++++++++++++++++
[2014-08-13 17:33:10.990849] I [glusterd-brick-ops.c:370:__glusterd_handle_add_brick] 0-management: Received add brick req
[2014-08-13 17:33:11.004812] E [glusterd-brick-ops.c:1263:glusterd_op_stage_add_brick] 0-management: Unable to get replica count
[2014-08-13 17:33:11.786017] I [run.c:190:runner_log] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_op_commit_perform+0x5a) [0x7fb6b118bcea] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(+0x512f5) [0x7fb6b11892f5] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_hooks_run_hooks+0x5a6) [0x7fb6b12076f6]))) 0-management: Ran script: /var/lib/glusterd/hooks/1/add-brick/pre/S28Quota-enable-root-xattr-heal.sh --volname=rebalvol --version=1 --volume-op=add-brick --gd-workdir=/var/lib/glusterd
[2014-08-13 17:33:12.214302] I [glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /bricks/rebalvol-sparebrick on port 49153
[2014-08-13 17:33:12.217168] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:13.066543] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_nodesvc_disconnect+0x4c) [0x7fb6b11989dc] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fb6b11988b5] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x31c540d623]))) 0-rpc_transport: invalid argument: this
[2014-08-13 17:33:13.070817] I [MSGID: 106006] [glusterd-handler.c:4280:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2014-08-13 17:33:14.123785] I [glusterd-utils.c:6300:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV3 successfully
[2014-08-13 17:33:14.124206] I [glusterd-utils.c:6305:glusterd_nfs_pmap_deregister] 0-: De-registered MOUNTV1 successfully
[2014-08-13 17:33:14.124603] I [glusterd-utils.c:6310:glusterd_nfs_pmap_deregister] 0-: De-registered NFSV3 successfully
[2014-08-13 17:33:14.124973] I [glusterd-utils.c:6315:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v4 successfully
[2014-08-13 17:33:14.125338] I [glusterd-utils.c:6320:glusterd_nfs_pmap_deregister] 0-: De-registered NLM v1 successfully
[2014-08-13 17:33:14.125662] I [glusterd-utils.c:6325:glusterd_nfs_pmap_deregister] 0-: De-registered ACL v3 successfully
[2014-08-13 17:33:14.140964] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:14.141370] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-08-13 17:33:18.169921] I [glusterd-utils.c:11658:glusterd_generate_and_set_task_id] 0-management: Generated task-id 24d06cf5-acf7-445c-96a7-3075b0556091 for key rebalance-id
[2014-08-13 17:33:23.632053] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:28.949165] E [glusterd-utils.c:10429:glusterd_volume_rebalance_use_rsp_dict] 0-: failed to get index
[2014-08-13 17:33:34.317150] I [glusterd-handler.c:1109:__glusterd_handle_cli_probe] 0-glusterd: Received CLI probe req rhsauto022.lab.eng.blr.redhat.com 24007
[2014-08-13 17:33:34.324759] I [glusterd-handler.c:3199:glusterd_probe_begin] 0-glusterd: Unable to find peerinfo for host: rhsauto022.lab.eng.blr.redhat.com (24007)
[2014-08-13 17:33:34.728047] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 17:33:34.731659] I [glusterd-handler.c:3180:glusterd_friend_add] 0-management: connect returned 0
[2014-08-13 17:33:35.047512] I [glusterd-rpc-ops.c:237:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:35.366233] I [glusterd-rpc-ops.c:309:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2014-08-13 17:33:35.366556] I [glusterd-rpc-ops.c:359:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com, port: 0
[2014-08-13 17:33:35.573717] I [glusterd-handshake.c:1011:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-08-13 17:33:35.754419] I [glusterd-handler.c:2611:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:35.754512] I [glusterd-handler.c:2663:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.36.252, op_ret: 0, op_errno: 0, ret: 0
[2014-08-13 17:33:35.756517] I [glusterd-handler.c:2315:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:35.937837] I [glusterd-handler.c:3367:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto022.lab.eng.blr.redhat.com (0), ret: 0
[2014-08-13 17:33:36.140254] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.140286] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.428622] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:36.428662] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 17:33:36.428673] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 17:33:36.428903] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:36.574742] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 17:33:36.762227] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 17:33:36.762313] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 17:33:36.762331] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 17:33:36.762348] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, hostname:rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 17:33:37.710327] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 17:33:37.710376] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 17:33:37.710387] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 17:33:37.710397] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, hostname:rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:14.141536] I [MSGID: 106006] [glusterd-handler.c:4280:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2014-08-13 17:34:58.966309] I [glusterd-handshake.c:812:__server_event_notify] 0-: received defrag status updated
[2014-08-13 17:34:58.971107] W [socket.c:529:__socket_rwv] 0-management: readv on /var/run/gluster/gluster-rebalance-dc09a802-985a-4755-a805-6b4bb5083d70.sock failed (No data available)
[2014-08-13 17:34:59.136440] E [rpc-transport.c:481:rpc_transport_unref] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(__glusterd_defrag_notify+0x301) [0x7fb6b11da8d1] (-->/usr/lib64/glusterfs/3.6.0.27/xlator/mgmt/glusterd.so(glusterd_rpc_clnt_unref+0x35) [0x7fb6b11988b5] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_unref+0x63) [0x31c540d623]))) 0-rpc_transport: invalid argument: this
[2014-08-13 17:34:59.136475] I [MSGID: 106007] [glusterd-rebalance.c:173:__glusterd_defrag_notify] 0-management: Rebalance process for volume rebalvol has disconnected.
[2014-08-13 17:35:36.068702] W [glusterd-locks.c:547:glusterd_mgmt_v3_lock] 0-management: Lock for rebalvol held by 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 17:35:36.068742] E [glusterd-handler.c:687:glusterd_op_txn_begin] 0-management: Unable to acquire lock for rebalvol
[2014-08-13 17:36:06.571213] W [glusterd-locks.c:547:glusterd_mgmt_v3_lock] 0-management: Lock for rebalvol held by 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 17:36:06.571280] E [glusterd-handler.c:687:glusterd_op_txn_begin] 0-management: Unable to acquire lock for rebalvol

Rebalance log from Node1
++++++++++++++++++++++++
[2014-08-13 17:33:23.972375] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-rebalvol-client-0: changing port to 49152 (from 0)
[2014-08-13 17:33:23.975925] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-rebalvol-client-2: changing port to 49153 (from 0)
[2014-08-13 17:33:23.979665] I [client-handshake.c:1415:select_server_supported_programs] 0-rebalvol-client-2: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-08-13 17:33:23.979902] I [client-handshake.c:1415:select_server_supported_programs] 0-rebalvol-client-0: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-08-13 17:33:23.980019] I [client-handshake.c:1200:client_setvolume_cbk] 0-rebalvol-client-2: Connected to rebalvol-client-2, attached to remote volume '/bricks/rebalvol-sparebrick'.
[2014-08-13 17:33:23.980052] I [client-handshake.c:1212:client_setvolume_cbk] 0-rebalvol-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2014-08-13 17:33:23.980433] I [client-handshake.c:188:client_set_lk_version_cbk] 0-rebalvol-client-2: Server lk version = 1
[2014-08-13 17:33:23.980491] I [client-handshake.c:1200:client_setvolume_cbk] 0-rebalvol-client-0: Connected to rebalvol-client-0, attached to remote volume '/bricks/rebalvol_brick0'.
[2014-08-13 17:33:23.980505] I [client-handshake.c:1212:client_setvolume_cbk] 0-rebalvol-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2014-08-13 17:33:23.980667] I [client-handshake.c:188:client_set_lk_version_cbk] 0-rebalvol-client-0: Server lk version = 1
[2014-08-13 17:33:28.949399] I [rpc-clnt.c:1759:rpc_clnt_reconfig] 0-rebalvol-client-1: changing port to 49152 (from 0)
[2014-08-13 17:33:28.955578] I [client-handshake.c:1415:select_server_supported_programs] 0-rebalvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2014-08-13 17:33:28.956078] I [client-handshake.c:1200:client_setvolume_cbk] 0-rebalvol-client-1: Connected to rebalvol-client-1, attached to remote volume '/bricks/rebalvol_brick1'.
[2014-08-13 17:33:28.956112] I [client-handshake.c:1212:client_setvolume_cbk] 0-rebalvol-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2014-08-13 17:33:28.956492] I [client-handshake.c:188:client_set_lk_version_cbk] 0-rebalvol-client-1: Server lk version = 1
[2014-08-13 17:33:28.968263] I [dht-common.c:2990:dht_setxattr] 0-rebalvol-dht: fixing the layout of /
[2014-08-13 17:33:29.035714] I [dht-rebalance.c:1356:gf_defrag_migrate_data] 0-rebalvol-dht: migrate data called on /
[2014-08-13 17:33:29.154975] I [dht-rebalance.c:1596:gf_defrag_migrate_data] 0-rebalvol-dht: Migration operation on dir / took 0.12 secs
[2014-08-13 17:33:29.161667] I [dht-common.c:2990:dht_setxattr] 0-rebalvol-dht: fixing the layout of /small
[2014-08-13 17:33:29.163608] I [dht-rebalance.c:1356:gf_defrag_migrate_data] 0-rebalvol-dht: migrate data called on /small
[2014-08-13 17:33:29.218968] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/1.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.224349] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/1.small)
[2014-08-13 17:33:29.232294] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/6.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.236714] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/6.small)
[2014-08-13 17:33:29.240828] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/7.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.245096] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/7.small)
[2014-08-13 17:33:29.253847] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/17.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.258661] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/17.small)
[2014-08-13 17:33:29.263650] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/23.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.267536] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/23.small)
[2014-08-13 17:33:29.272298] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/27.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.275211] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/27.small)
[2014-08-13 17:33:29.280690] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/35.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.283448] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/35.small)
[2014-08-13 17:33:29.288682] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/40.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.290961] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/40.small)
[2014-08-13 17:33:29.294069] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/43.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.296778] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol-client-2) with lesser disk space (/small/43.small)
[2014-08-13 17:33:29.301337] I [dht-rebalance.c:862:dht_migrate_file] 0-rebalvol-dht: /small/47.small: attempting to move from rebalvol-client-0 to rebalvol-client-2
[2014-08-13 17:33:29.303833] W [MSGID: 109023] [dht-rebalance.c:534:__dht_check_free_space] 0-rebalvol-dht: data movement attempted from node (rebalvol-client-0) with higher disk space to a node (rebalvol

GlusterD logs from Node3
++++++++++++++++++++++++
[2014-08-13 10:51:57.522171] I [glusterd-handshake.c:1011:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-08-13 10:51:57.522276] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-08-13 10:51:57.522322] I [glusterd.c:176:glusterd_uuid_generate_save] 0-management: generated UUID: 5a36b5ef-c0c3-4043-b964-825a9d4641d7
[2014-08-13 10:51:57.527903] I [glusterd-handler.c:2611:__glusterd_handle_probe_query] 0-glusterd: Received probe from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:57.530351] I [glusterd-handler.c:2639:__glusterd_handle_probe_query] 0-glusterd: Unable to find peerinfo for host: 10.70.36.249 (24007)
[2014-08-13 10:51:57.532428] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 10:51:57.535231] I [glusterd-handler.c:3180:glusterd_friend_add] 0-management: connect returned 0
[2014-08-13 10:51:57.535354] I [glusterd-handler.c:2663:__glusterd_handle_probe_query] 0-glusterd: Responded to 10.70.36.249, op_ret: 0, op_errno: 0, ret: 0
[2014-08-13 10:51:57.840143] I [glusterd-handler.c:2315:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:57.849774] E [glusterd-utils.c:6844:glusterd_brick_start] 0-management: Could not find peer on which brick rhsauto008.lab.eng.blr.redhat.com:/bricks/rebalvol_brick1 resides
[2014-08-13 10:51:57.849819] E [glusterd-op-sm.c:1764:glusterd_start_bricks] 0-management: Failed to start rhsauto008.lab.eng.blr.redhat.com:/bricks/rebalvol_brick1 for rebalvol
[2014-08-13 10:51:57.899227] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 10:51:57.903021] I [glusterd-handler.c:3367:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to 10.70.36.249 (0), ret: 0
[2014-08-13 10:51:57.907609] I [socket.c:2246:socket_event_handler] 0-transport: disconnecting now
[2014-08-13 10:51:57.907662] I [MSGID: 106006] [glusterd-handler.c:4280:__glusterd_nodesvc_rpc_notify] 0-management: nfs has disconnected from glusterd.
[2014-08-13 10:51:58.544595] I [glusterd-rpc-ops.c:237:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249
[2014-08-13 10:51:58.546883] I [glusterd-rpc-ops.c:309:__glusterd_probe_cbk] 0-glusterd: Received resp to probe req
[2014-08-13 10:51:58.726575] I [glusterd-rpc-ops.c:359:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249, port: 0
[2014-08-13 10:51:58.726638] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249
[2014-08-13 10:51:58.929257] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:58.929320] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, hostname:rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 10:51:58.935047] I [rpc-clnt.c:969:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-08-13 10:51:58.938214] I [glusterd-handler.c:3180:glusterd_friend_add] 0-management: connect returned 0
[2014-08-13 10:51:58.938241] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, hostname:rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 10:51:58.938254] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 10:51:59.217750] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:51:59.219346] E [glusterd-op-sm.c:207:glusterd_get_txn_opinfo] 0-: Unable to get transaction opinfo for transaction ID : 36dd9790-9ddb-4b75-859b-581a2ff7be44
[2014-08-13 10:51:59.219363] E [glusterd-op-sm.c:6508:glusterd_op_sm] 0-management: Unable to get transaction's opinfo
[2014-08-13 10:51:59.365567] I [glusterd-handshake.c:1011:__glusterd_mgmt_hndsk_versions_ack] 0-management: using the op-version 30000
[2014-08-13 10:51:59.546936] I [glusterd-handler.c:2315:__glusterd_handle_incoming_friend_req] 0-glusterd: Received probe from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 10:51:59.548459] I [glusterd-handler.c:3367:glusterd_xfer_friend_add_resp] 0-glusterd: Responded to rhsauto008.lab.eng.blr.redhat.com (0), ret: 0
[2014-08-13 10:51:59.550319] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 7be04f41-1414-4157-a732-b274e29b212d, host: 10.70.36.249
[2014-08-13 10:51:59.550344] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 10:51:59.553374] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: 7be04f41-1414-4157-a732-b274e29b212d
[2014-08-13 10:52:00.386229] I [glusterd-rpc-ops.c:359:__glusterd_friend_add_cbk] 0-glusterd: Received ACC from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com, port: 0
[2014-08-13 10:52:00.499070] I [glusterd-handler.c:2477:__glusterd_handle_friend_update] 0-glusterd: Received friend update from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 10:52:00.499136] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249
[2014-08-13 10:52:00.499148] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, hostname:rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 10:52:00.499156] I [glusterd-handler.c:2531:__glusterd_handle_friend_update] 0-: Received my uuid as Friend
[2014-08-13 10:52:00.689968] I [glusterd-rpc-ops.c:556:__glusterd_friend_update_cbk] 0-management: Received ACC from uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e
[2014-08-13 12:55:50.573785] W [socket.c:529:__socket_rwv] 0-management: readv on 10.70.36.249:24007 failed (No data available)
[2014-08-13 12:55:50.574120] I [MSGID: 106004] [glusterd-handler.c:4388:__glusterd_peer_rpc_notify] 0-management: Peer 7be04f41-1414-4157-a732-b274e29b212d, in Peer in Cluster state, has disconnected from glusterd.
[2014-08-13 12:55:50.574180] W [glusterd-locks.c:632:glusterd_mgmt_v3_unlock] 0-management: Lock for vol rebalvol not held
[2014-08-13 12:55:51.234271] W [glusterfsd.c:1182:cleanup_and_exit] (--> 0-: received signum (15), shutting down
[2014-08-13 12:56:13.725432] I [MSGID: 100030] [glusterfsd.c:1998:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.6.0.27 (args: /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2014-08-13 12:56:13.732630] I [glusterd.c:1215:init] 0-management: Using /var/lib/glusterd as working directory
[2014-08-13 12:56:13.736328] W [rdma.c:4194:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed (No such device)
[2014-08-13 12:56:13.736355] E [rdma.c:4482:init] 0-rdma.management: Failed to initialize IB Device
[2014-08-13 12:56:13.736367] E [rpc-transport.c:333:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2014-08-13 12:56:13.736425] W [rpcsvc.c:1524:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2014-08-13 12:56:17.437364] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-08-13 12:56:17.437445] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info, returned error: (No such file or directory)
[2014-08-13 12:56:17.437475] I [glusterd-store.c:1960:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 30000
[2014-08-13 12:56:17.437708] I [glusterd-store.c:3287:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
[2014-08-13 12:56:17.437845] E [store.c:432:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/options, returned error: (No such file or directory)
Lala, Is sosreports or gluterd logs on the nodes available ?
Sas, I have already attached the glusterd and rebalance logs to the bug. Please check attachment 926773 [details].
Lala,
Is it possible to add the following things to the BVT logs when you reproduce this bug again?
1. peer status
2. volume info
3. /var/lib/glusterd/ files
Can you also add a sos report if available? The current logs are not enough to find the root cause of the problem.
Regards,
Gaurav
Here is the Beaker Job link: https://beaker.engineering.redhat.com/jobs/720498

I can see the volume info in the logs:

Volume Name: rebalvol
Type: Distribute
Volume ID: dc09a802-985a-4755-a805-6b4bb5083d70
Status: Started
Snap Volume: no
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhsauto019.lab.eng.blr.redhat.com:/bricks/rebalvol_brick0
Brick2: rhsauto008.lab.eng.blr.redhat.com:/bricks/rebalvol_brick1
Options Reconfigured:
performance.readdir-ahead: on
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

However, the automation needs to be updated to capture the /var/lib/glusterd/ files, peer status, and sos reports. I will work on that.
Lala,
You are creating the volume with two nodes, NODE1 and NODE2, which have the hostnames "rhsauto019.lab.eng.blr.redhat.com" and "rhsauto008.lab.eng.blr.redhat.com". So why do other uuids appear in the logs after probing NODE3? The glusterd log entries on Node1 that show the other uuids are below:

[2014-08-13 17:33:35.047512] I [glusterd-rpc-ops.c:237:__glusterd_probe_cbk] 0-glusterd: Received probe resp from uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.140254] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: a52ccee9-45e9-4463-9ca7-aab5090df70e, host: rhsauto008.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.140286] I [glusterd-sm.c:496:glusterd_ac_send_friend_update] 0-: Added uuid: 5a36b5ef-c0c3-4043-b964-825a9d4641d7, host: rhsauto022.lab.eng.blr.redhat.com
[2014-08-13 17:33:36.428662] I [glusterd-handler.c:2522:__glusterd_handle_friend_update] 0-: Received uuid: 7be04f41-1414-4157-a732-b274e29b212d, hostname:10.70.36.249

This means that another node may have issued some other task; that is why a lock is being held for that transaction and the following error is shown when the rebalance status command is issued:

"volume rebalance: rebalvol: failed: Another transaction is in progress for rebalvol. Please try again after sometime"

Could you give us the "peer status" output so that we can find out how many nodes are in the cluster?
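Since the error message itself suggests trying again, a test harness could tolerate the transient lock by retrying the status command and treating only this specific message as retryable. This is a hedged sketch of a workaround, not part of the fix; the function name, retry count, and the GLUSTER_CMD override (used so the logic can be exercised without a live cluster) are illustrative assumptions.

```shell
# Illustrative retry wrapper for the transient mgmt_v3 lock error.
# GLUSTER_CMD defaults to "gluster" but can be overridden for testing.
retry_rebalance_status() {
    vol=$1
    cmd=${GLUSTER_CMD:-gluster}
    attempts=0
    while [ "$attempts" -lt 5 ]; do
        # Capture both stdout and stderr; succeed on a zero exit status.
        if out=$("$cmd" volume rebalance "$vol" status 2>&1); then
            printf '%s\n' "$out"
            return 0
        fi
        case $out in
            *"Another transaction is in progress"*)
                # Transient cluster-wide lock: back off and retry.
                attempts=$((attempts + 1))
                sleep 1 ;;
            *)
                # Any other failure is real: report it and stop.
                printf '%s\n' "$out" >&2
                return 1 ;;
        esac
    done
    printf '%s\n' "$out" >&2
    return 1
}
```

Usage would be `retry_rebalance_status rebalvol` in place of the bare CLI call; non-lock failures are still surfaced immediately rather than retried.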
Gaurav, I have changed the automation to print "gluster peer status" for this testcase, as you asked. So hopefully the next time the issue occurs, we will have answers. https://code.engineering.redhat.com/gerrit/#/c/31785/
Lala, we will definitely have answers the next time the issue occurs.
Downstream patch link: https://code.engineering.redhat.com/gerrit/#/c/36252/
I haven't seen this issue on glusterfs-server-3.6.0.33-1.el6rhs. Hence marking this as verified.
Atin, Kindly review the edited doc text and sign-off.
Doc text looks okay.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html