Bug 1123732 - Executing volume status for 2X2 dis-rep volume leads to "Failed to aggregate response from node/brick " errors in logs
Summary: Executing volume status for 2X2 dis-rep volume leads to "Failed to aggregate ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: glusterd
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Kaushal
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1157979
Blocks: 1162694 1182807 1183309
 
Reported: 2014-07-28 07:04 UTC by surabhi
Modified: 2015-12-10 10:35 UTC (History)
CC: 10 users

Fixed In Version: glusterfs-3.6.0.31-1
Doc Type: Bug Fix
Doc Text:
Previously, the rebalance state of a volume was not saved on peers where rebalance was not started, that is, peers that do not contain bricks belonging to the volume. Hence, if the glusterd process was restarted on these peers, running a volume status command led to error logs in the glusterd log files. With this fix, these error logs no longer appear in the glusterd logs.
Clone Of:
: 1157979 (view as bug list)
Environment:
Last Closed: 2015-01-15 13:39:02 UTC


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0038 normal SHIPPED_LIVE Red Hat Storage 3.0 enhancement and bug fix update #3 2015-01-15 18:35:28 UTC

Description surabhi 2014-07-28 07:04:04 UTC
Description of problem:
**************************************************
Created a 2x2 distributed-replicate volume. Mounted it via CIFS and created a few directories and files on the mount point. Performed the volume set operations required for Samba shares to be mounted via CIFS, then restarted glusterd on all the nodes and checked volume status.
After executing volume status, the following errors appear in the logs:
***********************************************
volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick


Version-Release number of selected component (if applicable):
[root@dhcp159-210 glusterfs]# rpm -qa | grep glusterfs
glusterfs-geo-replication-3.6.0.25-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.25-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.25-1.el6rhs.x86_64
glusterfs-cli-3.6.0.25-1.el6rhs.x86_64
glusterfs-libs-3.6.0.25-1.el6rhs.x86_64
glusterfs-3.6.0.25-1.el6rhs.x86_64
glusterfs-devel-3.6.0.25-1.el6rhs.x86_64
glusterfs-server-3.6.0.25-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.25-1.el6rhs.x86_64
samba-glusterfs-3.6.9-168.4.el6rhs.x86_64
glusterfs-api-3.6.0.25-1.el6rhs.x86_64
glusterfs-api-devel-3.6.0.25-1.el6rhs.x86_64


How reproducible:
Tried once.

Steps to Reproduce:
1. Create a 2x2 dis-rep volume
2. Mount it via CIFS
3. Create a few directories/files on the mount point
4. Run an arequal checksum
5. Do a volume set operation on the mounted volume
6. Restart the glusterd service
7. Execute gluster vol status
8. Check the volume logs
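Step 8 above can be scripted as a quick scan of the glusterd log for the aggregation errors. A minimal sketch; on a live node the file to scan would be the glusterd log (typically under /var/log/glusterfs/), but here the sample lines from this report are inlined so the snippet runs standalone:

```shell
# Count "Failed to aggregate response" occurrences in a glusterd log.
# Sample log lines from this bug are written to a temp file so the
# check is self-contained; substitute the real log path on a live node.
log=$(mktemp)
cat > "$log" <<'EOF'
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
EOF
count=$(grep -c 'Failed to aggregate response' "$log")
echo "aggregation errors found: $count"
rm -f "$log"
```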

Actual results:
*************************************
The logs show the following errors:

volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

Expected results:
There should be no such errors when gluster vol status is executed.

Additional info:
*********************************
Volume Name: newafr
Type: Distributed-Replicate
Volume ID: bd60f186-4bb0-49fa-bdd8-521e07e1b728
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.16.159.197:/rhs/brick1/newafr/b1
Brick2: 10.16.159.210:/rhs/brick1/newafr/b2
Brick3: 10.16.159.197:/rhs/brick1/newafr/b3
Brick4: 10.16.159.210:/rhs/brick1/newafr/b4
Options Reconfigured:
performance.readdir-ahead: on
storage.batch-fsync-delay-usec: 0
server.allow-insecure: on
performance.stat-prefetch: off
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@dhcp159-210 glusterfs]# gluster vol status newafr
Status of volume: newafr
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.16.159.197:/rhs/brick1/newafr/b1		49167	Y	2829
Brick 10.16.159.210:/rhs/brick1/newafr/b2		49165	Y	5690
Brick 10.16.159.197:/rhs/brick1/newafr/b3		49168	Y	2834
Brick 10.16.159.210:/rhs/brick1/newafr/b4		49166	Y	5746
NFS Server on localhost					2049	Y	24418
Self-heal Daemon on localhost				N/A	Y	24425
NFS Server on 10.16.159.236				2049	Y	21568
Self-heal Daemon on 10.16.159.236			N/A	Y	21575
NFS Server on 10.16.159.239				2049	Y	16899
Self-heal Daemon on 10.16.159.239			N/A	Y	16906
NFS Server on 10.16.159.197				2049	Y	17658
Self-heal Daemon on 10.16.159.197			N/A	Y	17665

Comment 2 Atin Mukherjee 2014-07-28 11:06:19 UTC
Surabhi,

Can you please attach the sosreports from all the nodes? Did you execute remove-brick/rebalance or replace-brick in between? This mismatch can be seen when you execute any of these operations.

--Atin

Comment 3 surabhi 2014-07-28 12:36:48 UTC
For this particular test, when these errors were observed, remove-brick and rebalance had not been executed, but several earlier tests did include remove-brick/rebalance operations.

Comment 5 Kaushal 2014-10-28 07:31:35 UTC
This issue is caused by peers not participating in the rebalance failing to store the rebalance task. When a rebalance task is started, the task details are stored in the node_state.info file, but this store was performed only on nodes on which a rebalance process is started. On the non-participating nodes, the task information was present only in memory, so it was lost when glusterd was restarted, which leads to the error logs above.

A simple reproducer for this is:
1. Create a 3 node cluster
2. Create a distribute volume with bricks only on 2 of the peers.
3. Start rebalance on the volume.
4. Restart the 3rd peer.
5. Run 'volume status' from either of the first 2 peers.
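The five steps above map onto gluster CLI commands roughly as follows. This is a sketch, not a standalone script: it assumes a live 3-node cluster, and the hostnames node1-node3, the volume name distvol, and the brick paths are placeholders.

```shell
# On node1: form a 3-node cluster (node2/node3 are placeholder hostnames).
gluster peer probe node2
gluster peer probe node3

# Create a plain distribute volume with bricks on only the first 2 peers,
# so node3 does not participate in the rebalance.
gluster volume create distvol node1:/rhs/brick1/distvol node2:/rhs/brick1/distvol
gluster volume start distvol

# Start rebalance; before the fix, node3 kept the task state only in memory.
gluster volume rebalance distvol start

# On node3: restart glusterd, dropping the in-memory task state.
service glusterd restart

# On node1 or node2: this triggered the "Failed to aggregate response" logs.
gluster volume status distvol
```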

This is not really a serious issue as it doesn't affect any operations. But I will fix it.

Comment 6 Atin Mukherjee 2014-10-31 05:28:48 UTC
Downstream patch : https://code.engineering.redhat.com/gerrit/#/c/35725/

Comment 7 SATHEESARAN 2014-11-24 16:38:02 UTC
Tested the issue with the steps in comment 5:
1. Created a 3-node cluster
2. Created a distribute volume with bricks on the first 2 nodes and started the volume
3. Started rebalance on the volume
4. Restarted 'glusterd' on the third node (node3)

The rebalance status is now persisted in the node_state.info file.
There are no "Failed to aggregate response" error messages in the glusterd logs on any of the nodes in the cluster.
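The persistence can be spot-checked on the non-participating node by looking for rebalance entries in node_state.info (on a live node, under glusterd's working directory, e.g. /var/lib/glusterd/vols/<volname>/node_state.info). A sketch; the key=value lines below are illustrative sample content, not captured from a real node:

```shell
# Simulate a node_state.info file with illustrative rebalance entries;
# on a live node, point grep at the real file instead of this temp copy.
state=$(mktemp)
cat > "$state" <<'EOF'
rebalance_status=1
rebalance_op=19
rebalance-id=bd60f186-4bb0-49fa-bdd8-521e07e1b728
EOF
# After the fix, the rebalance task survives a glusterd restart, so the
# file should carry rebalance entries even on a non-participating peer.
if grep -q '^rebalance' "$state"; then
    result="rebalance state persisted"
else
    result="no rebalance state found"
fi
echo "$result"
rm -f "$state"
```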

Comment 8 Divya 2015-01-06 06:21:01 UTC
Kaushal,

Please review the edited doc text and sign-off.

Comment 9 Kaushal 2015-01-13 07:08:00 UTC
Divya, doc text looks fine.

Comment 11 errata-xmlrpc 2015-01-15 13:39:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html

