Description of problem:
**************************************************
Created a 2x2 distributed-replicate volume, mounted it via CIFS, and created a few directories and files on the mount point. Performed the volume set operations required for Samba shares to be mounted via CIFS, then restarted glusterd on all the nodes and checked volume status. After executing volume status, the following errors appear in the glusterd log:

volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from node/brick

Version-Release number of selected component (if applicable):
[root@dhcp159-210 glusterfs]# rpm -qa | grep glusterfs
glusterfs-geo-replication-3.6.0.25-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.25-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.25-1.el6rhs.x86_64
glusterfs-cli-3.6.0.25-1.el6rhs.x86_64
glusterfs-libs-3.6.0.25-1.el6rhs.x86_64
glusterfs-3.6.0.25-1.el6rhs.x86_64
glusterfs-devel-3.6.0.25-1.el6rhs.x86_64
glusterfs-server-3.6.0.25-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.25-1.el6rhs.x86_64
samba-glusterfs-3.6.9-168.4.el6rhs.x86_64
glusterfs-api-3.6.0.25-1.el6rhs.x86_64
glusterfs-api-devel-3.6.0.25-1.el6rhs.x86_64

How reproducible:
Tried once.

Steps to Reproduce:
1. Create a 2x2 distributed-replicate volume.
2. Mount it via CIFS.
3. Create a few directories/files on the mount point.
4. Run an arequal checksum.
5. Do a volume set operation on the mounted volume.
6. Restart glusterd (service glusterd restart).
7. Execute 'gluster vol status'.
8. Check the glusterd logs.

Actual results:
*************************************
The glusterd log shows the aggregation errors quoted above ("Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status." followed by "Failed to aggregate response from node/brick").

Expected results:
There should be no such errors on execution of 'gluster vol status'.

Additional info:
*********************************
Volume Name: newafr
Type: Distributed-Replicate
Volume ID: bd60f186-4bb0-49fa-bdd8-521e07e1b728
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.16.159.197:/rhs/brick1/newafr/b1
Brick2: 10.16.159.210:/rhs/brick1/newafr/b2
Brick3: 10.16.159.197:/rhs/brick1/newafr/b3
Brick4: 10.16.159.210:/rhs/brick1/newafr/b4
Options Reconfigured:
performance.readdir-ahead: on
storage.batch-fsync-delay-usec: 0
server.allow-insecure: on
performance.stat-prefetch: off
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@dhcp159-210 glusterfs]# gluster vol status newafr
Status of volume: newafr
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick 10.16.159.197:/rhs/brick1/newafr/b1       49167   Y       2829
Brick 10.16.159.210:/rhs/brick1/newafr/b2       49165   Y       5690
Brick 10.16.159.197:/rhs/brick1/newafr/b3       49168   Y       2834
Brick 10.16.159.210:/rhs/brick1/newafr/b4       49166   Y       5746
NFS Server on localhost                         2049    Y       24418
Self-heal Daemon on localhost                   N/A     Y       24425
NFS Server on 10.16.159.236                     2049    Y       21568
Self-heal Daemon on 10.16.159.236               N/A     Y       21575
NFS Server on 10.16.159.239                     2049    Y       16899
Self-heal Daemon on 10.16.159.239               N/A     Y       16906
NFS Server on 10.16.159.197                     2049    Y       17658
Self-heal Daemon on 10.16.159.197               N/A     Y       17665
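The reproduction steps above can be sketched as the following command sequence. This is an illustrative sketch only: the hostnames, share name, mount point, and the particular volume set option are placeholders, not the exact values used in this report, and it must be run against a live gluster cluster with Samba configured.

```shell
# Sketch of the reported reproduction steps (server1/server2, the CIFS
# share name, and the chosen volume option are placeholders).
gluster volume create newafr replica 2 \
    server1:/rhs/brick1/newafr/b1 server2:/rhs/brick1/newafr/b2 \
    server1:/rhs/brick1/newafr/b3 server2:/rhs/brick1/newafr/b4
gluster volume start newafr

# Mount via CIFS and create some data on the mount point.
mount -t cifs //server1/gluster-newafr /mnt/cifs -o user=root
mkdir -p /mnt/cifs/dir{1..5}
touch /mnt/cifs/dir1/file{1..10}

# Volume set, glusterd restart, then check status and the glusterd log.
gluster volume set newafr server.allow-insecure on
service glusterd restart
gluster volume status newafr
grep 'Not aggregating tasks status' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
```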
Surabhi,
Can you please attach the sosreports of all the nodes? Have you executed remove-brick/rebalance or replace-brick in between? This mismatch can be seen when you execute any of these operations.
--Atin
For this particular test, remove-brick and rebalance were not executed when these errors were observed, but several earlier tests did include remove-brick/rebalance operations.
This issue is caused by peers that do not participate in a rebalance failing to persist the rebalance task. When a rebalance task is started, the task details are stored in the node_state.info file, but this store was performed only on the nodes on which a rebalance process is started. On the non-participating nodes, the task information was never stored and was present only in memory. This meant the information was lost when glusterd was restarted, which leads to the error logs above.

A simple reproducer for this is:
1. Create a 3-node cluster.
2. Create a distribute volume with bricks on only 2 of the peers.
3. Start rebalance on the volume.
4. Restart glusterd on the 3rd peer.
5. Run 'volume status' from either of the first 2 peers.

This is not really a serious issue as it doesn't affect any operations, but I will fix it.
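The simple reproducer described above, as a command sketch. Hostnames (node1/node2/node3), the volume name, and the brick paths are placeholders; the commands assume they are run from node1 on a fresh 3-node cluster.

```shell
# Build a 3-node cluster, but put bricks on only 2 of the peers, so
# node3 holds the rebalance task only in memory (pre-fix behaviour).
gluster peer probe node2
gluster peer probe node3
gluster volume create testvol node1:/bricks/b1 node2:/bricks/b2
gluster volume start testvol
gluster volume rebalance testvol start

# On node3: restarting glusterd drops the in-memory task (pre-fix).
service glusterd restart

# From node1 or node2: pre-fix, this logs "Local tasks count (1) and
# remote tasks count (0) do not match" in the glusterd log.
gluster volume status testvol
```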
Downstream patch : https://code.engineering.redhat.com/gerrit/#/c/35725/
Tested the issue with the steps in comment5:
1. Created a 3-node cluster.
2. Created a distribute volume with bricks on the first 2 nodes and started the volume.
3. Started rebalance on the volume.
4. Restarted glusterd on the third node (node3).

The rebalance status is now persisted in the node_state.info file, and there are no "Failed to aggregate response" error messages in the glusterd logs on any of the nodes in the cluster.
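One way to spot-check the persistence described above, run on the non-participating node after the glusterd restart. The volume name is a placeholder, and the exact keys written to node_state.info may vary between glusterfs versions; the point is only that a rebalance entry should now survive the restart.

```shell
# On the node with no bricks, after 'service glusterd restart':
# the rebalance task should now be on disk, not just in memory.
cat /var/lib/glusterd/vols/testvol/node_state.info

# After running 'gluster volume status' from another peer, the glusterd
# log should show no new aggregation failures.
grep 'Failed to aggregate response' /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
```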
Kaushal, Please review the edited doc text and sign-off.
Divya, doc text looks fine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html