Bug 1157979 - Executing volume status for 2X2 dis-rep volume leads to "Failed to aggregate response from node/brick " errors in logs
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1123732
 
Reported: 2014-10-28 07:36 UTC by Kaushal
Modified: 2015-05-14 17:44 UTC
CC: 10 users

Fixed In Version: glusterfs-3.7.0
Clone Of: 1123732
Environment:
Last Closed: 2015-05-14 17:28:08 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kaushal 2014-10-28 07:36:16 UTC
+++ This bug was initially created as a clone of Bug #1123732 +++

Description of problem:
**************************************************
Created a 2x2 distributed-replicate volume and mounted it via CIFS, then created a few directories and files on the mount point. Performed the volume set operations required for Samba shares to be mounted via CIFS, restarted glusterd on all the nodes, and checked the volume status.
After executing volume status, the following errors appear in the logs:
***********************************************
volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick


How reproducible:
Tried once.

Steps to Reproduce (a rough command sketch follows the list):
1. Create a 2x2 dis-rep volume.
2. Mount it via CIFS.
3. Create a few directories/files on the mount point.
4. Run an arequal checksum.
5. Perform a volume set operation on the mounted volume.
6. Restart the glusterd service.
7. Execute gluster vol status.
8. Check the volume logs.
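
A rough command-line sketch of the above steps (not part of the original report; host names, mount point, CIFS share name and the log-file path are assumptions, and the volume-set options are taken from the "Options Reconfigured" list in the additional info below):

# 1-2. create the 2x2 distributed-replicate volume and mount it via CIFS
gluster volume create newafr replica 2 \
    srv1:/rhs/brick1/newafr/b1 srv2:/rhs/brick1/newafr/b2 \
    srv1:/rhs/brick1/newafr/b3 srv2:/rhs/brick1/newafr/b4
gluster volume start newafr
mount -t cifs //srv1/gluster-newafr /mnt/newafr -o user=<user>   # share name is a placeholder

# 3-4. create some data and run the arequal checksum (assuming arequal-checksum is installed)
mkdir -p /mnt/newafr/dir{1..5}
touch /mnt/newafr/dir1/file{1..10}
arequal-checksum /mnt/newafr

# 5. volume set operations typically used for Samba/CIFS access
gluster volume set newafr server.allow-insecure on
gluster volume set newafr storage.batch-fsync-delay-usec 0
gluster volume set newafr performance.stat-prefetch off

# 6-8. restart glusterd on all nodes, query status, check the glusterd log
service glusterd restart          # run on every node
gluster vol status newafr
grep "Failed to aggregate" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log   # log name may differ by version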

Actual results:
*************************************
The logs show the following errors:

volume req for volume newafr
[2014-07-28 05:50:50.986840] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E [glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status] 0-management: Local tasks count (1) and remote tasks count (0) do not match. Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E [glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to aggregate response from  node/brick

Expected results:
There should be no such errors when executing gluster vol status.

Additional info:
*********************************
Volume Name: newafr
Type: Distributed-Replicate
Volume ID: bd60f186-4bb0-49fa-bdd8-521e07e1b728
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: srv1:/rhs/brick1/newafr/b1
Brick2: srv2:/rhs/brick1/newafr/b2
Brick3: srv1:/rhs/brick1/newafr/b3
Brick4: srv2:/rhs/brick1/newafr/b4
Options Reconfigured:
performance.readdir-ahead: on
storage.batch-fsync-delay-usec: 0
server.allow-insecure: on
performance.stat-prefetch: off
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root@srv2 glusterfs]# gluster vol status newafr
Status of volume: newafr
Gluster process						Port	Online	Pid
------------------------------------------------------------------------------
Brick srv1:/rhs/brick1/newafr/b1		        49167	Y	2829
Brick srv2:/rhs/brick1/newafr/b2		        49165	Y	5690
Brick srv1:/rhs/brick1/newafr/b3		        49168	Y	2834
Brick srv2:/rhs/brick1/newafr/b4		        49166	Y	5746
NFS Server on localhost					2049	Y	24418
Self-heal Daemon on localhost				N/A	Y	24425
NFS Server on srv3				        2049	Y	21568
Self-heal Daemon on srv3			        N/A	Y	21575
NFS Server on srv4				        2049	Y	16899
Self-heal Daemon on srv4			        N/A	Y	16906
NFS Server on srv1				        2049	Y	17658
Self-heal Daemon on srv1			        N/A	Y	17665


--- Additional comment from Atin Mukherjee on 2014-07-28 16:36:19 IST ---

Surabhi,

Can you please attach the sosreports of all the nodes? Did you execute remove-brick/rebalance or replace-brick in between? This mismatch can be seen when you execute any of these operations.

--Atin

--- Additional comment from surabhi on 2014-07-28 18:06:48 IST ---

For this particular test, when these errors were observed, remove-brick and rebalance had not been executed, but several earlier tests did include remove-brick/rebalance operations.


--- Additional comment from Kaushal on 2014-10-28 13:01:35 IST ---

This issue is caused by peers that do not participate in a rebalance not storing the rebalance task. When a rebalance task is started, the task details are stored in the node_state.info file, but this store was performed only on the nodes on which a rebalance process is started. On the non-participating nodes, the task information was not stored and was present only in memory. This meant the information was lost when glusterd was restarted, which leads to the error logs above.

A simple reproducer for this (a hedged command sketch follows the list):
1. Create a 3-node cluster.
2. Create a distribute volume with bricks on only 2 of the peers.
3. Start rebalance on the volume.
4. Restart the 3rd peer.
5. Run 'volume status' from either of the first 2 peers.
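
A hedged command sketch of this reproducer (host names, volume name and brick paths are placeholders; node_state.info lives under the standard glusterd working directory /var/lib/glusterd):

# on node1, with node2 and node3 probed into the cluster
gluster peer probe node2
gluster peer probe node3

# bricks only on node1 and node2, so node3 does not participate in the rebalance
gluster volume create distvol node1:/bricks/distvol/b1 node2:/bricks/distvol/b2
gluster volume start distvol
gluster volume rebalance distvol start

# pre-fix, only the participating peers persist the task; compare node1/node2 against node3
grep -i rebal /var/lib/glusterd/vols/distvol/node_state.info

# restart glusterd on node3, then run status from node1 or node2 and check its glusterd log
service glusterd restart          # on node3
gluster volume status distvol     # on node1 or node2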

This is not really a serious issue as it doesn't affect any operations. But I will fix it.

Comment 1 Anand Avati 2014-10-29 07:02:43 UTC
REVIEW: http://review.gluster.org/8998 (glusterd: Store rebalance state on all peers) posted (#1) for review on master by Kaushal M (kaushal)

Comment 2 Anand Avati 2014-10-30 05:27:36 UTC
COMMIT: http://review.gluster.org/8998 committed in master by Krishnan Parthasarathi (kparthas) 
------
commit 96e1c33b681b34124bdc78174a21865623c9795b
Author: Kaushal M <kaushal>
Date:   Tue Oct 28 13:06:50 2014 +0530

    glusterd: Store rebalance state on all peers
    
    The rebalance state was being saved only on the peers participating in
    the rebalance on a rebalance start. This change makes sure all nodes
    save the rebalance state.
    
    Change-Id: I436e5c34bcfb88f7da7378cec807328ce32397bc
    BUG: 1157979
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/8998
    Reviewed-by: Atin Mukherjee <amukherj>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Krishnan Parthasarathi <kparthas>
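
A hedged way to verify the change above (paths assume the default glusterd working directory; the glusterd log-file name may differ between versions): with the patched glusterd, after starting a rebalance the task should also appear in node_state.info on peers that host no bricks of the volume, and 'volume status' after restarting such a peer should no longer log the aggregation errors.

grep -i rebal /var/lib/glusterd/vols/distvol/node_state.info   # on the non-participating peer
gluster volume status distvol                                  # from any peer after the restart
grep "Failed to aggregate response" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log   # should show no new entries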

Comment 3 Niels de Vos 2015-05-14 17:28:08 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

