Bug 1123732

Summary: Executing volume status for 2X2 dis-rep volume leads to "Failed to aggregate response from node/brick " errors in logs
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: surabhi <sbhaloth>
Component: glusterd Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA QA Contact: SATHEESARAN <sasundar>
Severity: high Docs Contact:
Priority: high    
Version: rhgs-3.0 CC: amukherj, david.macdonald, divya, kaushal, nlevinki, sasundar, sbhaloth, ssamanta, vagarwal, vbellur
Target Milestone: --- Keywords: ZStream
Target Release: RHGS 3.0.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: glusterfs-3.6.0.31-1 Doc Type: Bug Fix
Doc Text:
Previously, the rebalance state of a volume was not saved on peers where rebalance was not started, that is, peers which do not contain bricks belonging to the volume. Hence, if glusterd processes were restarted on these peers, running a volume status command led to error messages in the glusterd log files. With this fix, these error messages no longer appear in the glusterd logs.
Story Points: ---
Clone Of:
: 1157979 (view as bug list)
Environment:
Last Closed: 2015-01-15 13:39:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1157979    
Bug Blocks: 1162694, 1182807, 1183309    

Description surabhi 2014-07-28 07:04:04 UTC
Description of problem:
**************************************************
Created a 2x2 dis-rep volume, mounted it via CIFS, and created a few directories and files on the mount point. Performed the volume set operations required for Samba shares to be mounted via CIFS, restarted glusterd on all the nodes, and checked volume status.
After executing volume status, the following errors appear in the logs:
***********************************************
volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick


Version-Release number of selected component (if applicable):
[root@dhcp159-210 glusterfs]# rpm -qa | grep glusterfs
glusterfs-geo-replication-3.6.0.25-1.el6rhs.x86_64
glusterfs-fuse-3.6.0.25-1.el6rhs.x86_64
glusterfs-rdma-3.6.0.25-1.el6rhs.x86_64
glusterfs-cli-3.6.0.25-1.el6rhs.x86_64
glusterfs-libs-3.6.0.25-1.el6rhs.x86_64
glusterfs-3.6.0.25-1.el6rhs.x86_64
glusterfs-devel-3.6.0.25-1.el6rhs.x86_64
glusterfs-server-3.6.0.25-1.el6rhs.x86_64
glusterfs-debuginfo-3.6.0.25-1.el6rhs.x86_64
samba-glusterfs-3.6.9-168.4.el6rhs.x86_64
glusterfs-api-3.6.0.25-1.el6rhs.x86_64
glusterfs-api-devel-3.6.0.25-1.el6rhs.x86_64


How reproducible:
Tried once.

Steps to Reproduce:
1. Create a 2x2 dis-rep volume.
2. Mount it via CIFS.
3. Create a few directories/files on the mount point.
4. Run an arequal checksum.
5. Perform a volume set operation on the mounted volume.
6. Restart the glusterd service.
7. Execute gluster vol status.
8. Check the volume logs.
(A command sketch of these steps is given below.)
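
This is a minimal sketch only; the hostnames (node1, node2), volume name (testvol), brick paths, CIFS share name, and the arequal-checksum invocation are illustrative assumptions, not values taken from this report:

# Step 1: 2x2 distributed-replicate volume across two nodes
gluster volume create testvol replica 2 \
    node1:/rhs/brick1/testvol/b1 node2:/rhs/brick1/testvol/b2 \
    node1:/rhs/brick1/testvol/b3 node2:/rhs/brick1/testvol/b4
gluster volume start testvol

# Step 2: mount via CIFS from a client; the share name assumes the
# gluster-<volname> naming used by the Samba hook scripts (adjust if different)
mount -t cifs //node1/gluster-testvol /mnt/cifs -o username=smbuser

# Steps 3-4: create some directories/files, then run the arequal checksum
mkdir -p /mnt/cifs/dir{1..5}
for d in /mnt/cifs/dir*; do touch "$d"/file{1..10}; done
arequal-checksum -p /mnt/cifs

# Step 5: any volume set operation will do; this one matches an option
# reconfigured on the volume in this report
gluster volume set testvol performance.stat-prefetch off

# Step 6, on every node (RHEL 6 style, as in this setup)
service glusterd restart

# Steps 7-8
gluster vol status testvol
less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log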

Actual results:
*************************************
The logs show the following errors:

volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

Expected results:
There should be no such errors on execution of gluster vol status.

Additional info:
*********************************
Volume Name: newafr
Type: Distributed-Replicate
Volume ID: bd60f186-4bb0-49fa-bdd8-521e07e1b728
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.16.159.197:/rhs/brick1/newafr/b1
Brick2: 10.16.159.210:/rhs/brick1/newafr/b2
Brick3: 10.16.159.197:/rhs/brick1/newafr/b3
Brick4: 10.16.159.210:/rhs/brick1/newafr/b4
Options Reconfigured:
performance.readdir-ahead: on
storage.batch-fsync-delay-usec: 0
server.allow-insecure: on
performance.stat-prefetch: off
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@dhcp159-210 glusterfs]# gluster vol status newafr
Status of volume: newafr
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick 10.16.159.197:/rhs/brick1/newafr/b1		49167	Y	2829
Brick 10.16.159.210:/rhs/brick1/newafr/b2		49165	Y	5690
Brick 10.16.159.197:/rhs/brick1/newafr/b3		49168	Y	2834
Brick 10.16.159.210:/rhs/brick1/newafr/b4		49166	Y	5746
NFS Server on localhost					2049	Y	24418
Self-heal Daemon on localhost				N/A	Y	24425
NFS Server on 10.16.159.236				2049	Y	21568
Self-heal Daemon on 10.16.159.236			N/A	Y	21575
NFS Server on 10.16.159.239				2049	Y	16899
Self-heal Daemon on 10.16.159.239			N/A	Y	16906
NFS Server on 10.16.159.197				2049	Y	17658
Self-heal Daemon on 10.16.159.197			N/A	Y	17665

Comment 2 Atin Mukherjee 2014-07-28 11:06:19 UTC
Surabhi,

Can you please attach the sosreports of all the nodes? Have you executed remove-brick/rebalance or replace-brick in between? This mismatch can be seen when you execute any of those operations.

--Atin

Comment 3 surabhi 2014-07-28 12:36:48 UTC
For this particular test, when these errors were observed, remove-brick and rebalance had not been executed, but several tests run earlier did include remove-brick/rebalance operations.

Comment 5 Kaushal 2014-10-28 07:31:35 UTC
This issue is caused by peers not participating in the rebalance failing to store the rebalance task. When a rebalance task is started, the task details are stored in the node_state.info file, but this store was performed only on the nodes on which the rebalance process is started. On the non-participating nodes, the task information was not stored and was present only in memory. This meant the information was lost when glusterd was restarted, which leads to the error logs above.
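
A quick way to see whether the task was persisted on a given peer is to read the file directly; this is a sketch assuming the standard /var/lib/glusterd layout (newafr is the volume from this report):

# On a peer that holds no bricks of the volume, before and after restarting glusterd
cat /var/lib/glusterd/vols/newafr/node_state.info
# Without the fix, the rebalance task details are missing here after a restart;
# with the fix they are written on all peers, not only on the peers hosting bricks.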

A simple reproducer for this is (a command sketch follows the list):
1. Create a 3 node cluster
2. Create a distribute volume with bricks only on 2 of the peers.
3. Start rebalance on the volume.
4. Restart the 3rd peer.
5. Run 'volume status' from either of the first 2 peers.
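
A rough translation of the reproducer into commands; node1/node2/node3, the volume name distvol, and the brick paths are placeholders:

# From node1: form the 3-node cluster
gluster peer probe node2
gluster peer probe node3

# Distribute volume with bricks only on node1 and node2
gluster volume create distvol node1:/rhs/brick1/distvol node2:/rhs/brick1/distvol
gluster volume start distvol
gluster volume rebalance distvol start

# On node3, the peer with no bricks in the volume
service glusterd restart

# From node1 or node2
gluster volume status distvol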

This is not really a serious issue, as it doesn't affect any operations, but I will fix it.

Comment 6 Atin Mukherjee 2014-10-31 05:28:48 UTC
Downstream patch : https://code.engineering.redhat.com/gerrit/#/c/35725/

Comment 7 SATHEESARAN 2014-11-24 16:38:02 UTC
Tested the issue with the steps in comment 5:
1. Created a 3-node cluster.
2. Created a distribute volume with bricks on the first 2 nodes and started the volume.
3. Started rebalance on the volume.
4. Restarted glusterd on the third node (node3).

Rebalance status is now persisted in the node_state.info file.
There are no error messages such as "Failed to aggregate response" in the glusterd logs on any of the nodes in the cluster.
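
The verification can be approximated with the checks below; the log path is the default glusterd log on RHGS 3.0 / RHEL 6 (adjust if glusterd logs elsewhere), and the volume name matches the reproducer sketch above:

# On the peer without bricks, after restarting glusterd:
# the rebalance task details should now be present
cat /var/lib/glusterd/vols/distvol/node_state.info

# On every node, after running 'gluster volume status':
# this should return no matches
grep "Failed to aggregate response" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log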

Comment 8 Divya 2015-01-06 06:21:01 UTC
Kaushal,

Please review the edited doc text and sign-off.

Comment 9 Kaushal 2015-01-13 07:08:00 UTC
Divya, doc text looks fine.

Comment 11 errata-xmlrpc 2015-01-15 13:39:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html