Bug 1123732
| Summary: | Executing volume status for 2X2 dis-rep volume leads to "Failed to aggregate response from node/brick" errors in logs | | |
| --- | --- | --- | --- |
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | surabhi <sbhaloth> |
| Component: | glusterd | Assignee: | Kaushal <kaushal> |
| Status: | CLOSED ERRATA | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.0 | CC: | amukherj, david.macdonald, divya, kaushal, nlevinki, sasundar, sbhaloth, ssamanta, vagarwal, vbellur |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | RHGS 3.0.3 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.6.0.31-1 | Doc Type: | Bug Fix |
| Doc Text: | Previously, the rebalance state of a volume was not saved on peers where rebalance was not started, that is, peers which do not contain bricks belonging to the volume. Hence, if the glusterd process was restarted on these peers, running a volume status command led to errors appearing in the glusterd log files. With this fix, these error logs no longer appear in the glusterd logs. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1157979 (view as bug list) | Environment: | |
| Last Closed: | 2015-01-15 13:39:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1157979 | | |
| Bug Blocks: | 1162694, 1182807, 1183309 | | |
Description
surabhi
2014-07-28 07:04:04 UTC
Surabhi, can you please attach the sosreports of all the nodes? Have you executed remove-brick/rebalance or replace-brick in between? This mismatch can be seen when you execute any of these operations. --Atin

For this particular test, when these errors were observed, remove-brick and rebalance had not been executed, but several earlier tests did include remove-brick/rebalance operations.

This issue is caused by peers that do not participate in a rebalance not storing the rebalance task. When a rebalance task is started, the task details are stored in the node_state.info file, but this store was performed only on the nodes on which a rebalance process is started. On the non-participating nodes, the task information was not stored and was present only in memory. This meant the information was lost when glusterd was restarted, which leads to the error logs described above.

A simple reproducer (a CLI sketch of these steps is given at the end of this report):

1. Create a 3 node cluster.
2. Create a distribute volume with bricks only on 2 of the peers.
3. Start rebalance on the volume.
4. Restart the 3rd peer.
5. Run 'volume status' from either of the first 2 peers.

This is not really a serious issue as it doesn't affect any operations, but I will fix it.

Downstream patch: https://code.engineering.redhat.com/gerrit/#/c/35725/

Tested the issue with the steps in comment 5:

1. Created a 3 node cluster.
2. Created a distribute volume with bricks on the first 2 nodes and started the volume.
3. Started rebalance on the volume.
4. Restarted 'glusterd' on the third node (node3).

The rebalance status is now persisted in the node_state.info file. There are no "Failed to aggregate response" error messages in the glusterd logs on any of the nodes in the cluster.

Kaushal, please review the edited doc text and sign off.

Divya, the doc text looks fine.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html
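For reference, below is a minimal shell sketch of the reproducer and verification steps above. The peer hostnames (node1/node2/node3), the volume name (distvol), and the brick paths are placeholders, and the glusterd log and node_state.info locations are the usual defaults for glusterd in this release; adjust them to your environment.

```sh
# Build a 3-node cluster (run from node1; node2 and node3 already run glusterd)
gluster peer probe node2
gluster peer probe node3

# Create a distribute volume with bricks only on the first two peers, then start it
# (hostnames, volume name, and brick paths below are illustrative)
gluster volume create distvol node1:/bricks/distvol/b1 node2:/bricks/distvol/b2
gluster volume start distvol

# Start rebalance on the volume
gluster volume rebalance distvol start

# Restart glusterd on the third peer, which holds no bricks of this volume (run on node3)
systemctl restart glusterd          # or: service glusterd restart

# From node1 or node2, run volume status and check the glusterd log for the error
gluster volume status distvol
grep "Failed to aggregate response" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

# With the fix, the rebalance task should also be persisted on node3 (run on node3)
cat /var/lib/glusterd/vols/distvol/node_state.info
```

Before the fix, the node_state.info file on node3 would not contain the rebalance task after the glusterd restart, and the subsequent volume status run would produce the "Failed to aggregate response from node/brick" messages in the glusterd logs.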