Bug 1294410 - Friend update floods can render the cluster incapable of handling other commands
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.7.0
Hardware: Unspecified  OS: Unspecified
Priority: unspecified  Severity: unspecified
Assigned To: bugs@gluster.org
Depends On: 1292749
Blocks: 1291386
Reported: 2015-12-28 00:39 EST by Atin Mukherjee
Modified: 2016-04-19 03:52 EDT

See Also:
Fixed In Version: glusterfs-3.7.7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1292749
Environment:
Last Closed: 2016-04-19 03:52:20 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Description Atin Mukherjee 2015-12-28 00:39:56 EST
+++ This bug was initially created as a clone of Bug #1292749 +++

A flood of glusterd friend updates happens whenever a glusterd restarts and re-establishes all its connections.

In a large cluster (100s of nodes), this can go on for several minutes. During this period the cluster isn't able to respond to commands. Even simple local commands, like `gluster volume list`, take a relatively long time to complete.

When a large number of nodes come back up simultaneously, say due to a network problem, this flood can last for a long time, longer than expected.
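
One way to put a number on the slowdown described above is to time a local command while the peers are reconnecting. The following is a minimal sketch, not part of the original report: `time_command` is a hypothetical helper, and it assumes the `gluster` CLI is installed and on PATH on the node where it runs.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd and return (elapsed wall-clock seconds, exit code).

    Hypothetical helper for illustration; output of the command is discarded.
    """
    start = time.monotonic()
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.monotonic() - start, result.returncode


# Example usage on a cluster node (assumes the gluster CLI is available):
#   elapsed, rc = time_command(["gluster", "volume", "list"])
#   print(f"gluster volume list took {elapsed:.2f}s (exit code {rc})")
```

Running this once on a settled cluster and once shortly after restarting a glusterd would give a rough before/after comparison.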

--- Additional comment from Vijay Bellur on 2015-12-18 04:44:02 EST ---

REVIEW: http://review.gluster.org/12999 (glusterd: reduce friend update flood) posted (#1) for review on master by Kaushal M (kaushal@redhat.com)

--- Additional comment from Vijay Bellur on 2015-12-22 00:09:18 EST ---

REVIEW: http://review.gluster.org/12999 (glusterd: reduce friend update flood) posted (#2) for review on master by Kaushal M (kaushal@redhat.com)

--- Additional comment from Vijay Bellur on 2015-12-22 22:52:32 EST ---

COMMIT: http://review.gluster.org/12999 committed in master by Atin Mukherjee (amukherj@redhat.com) 
------
commit f624abd6885752eeaa8d07101ff00f52af48de26
Author: Kaushal M <kaushal@redhat.com>
Date:   Thu Dec 17 11:13:36 2015 +0530

    glusterd: reduce friend update flood
    
    When in a befriended state, glusterd would broadcast friend updates to
    all other peers whenever an ACC or LOCAL_ACC event occurred.
    
    When a downed glusterd came back up and established connections again,
    this led to a flood of friend updates on the order of N^2 (N
    is the number of peers in the cluster).
    
    In larger clusters this was problematic, and could lead to very long
    times for the cluster to settle down when a peer came back up. Multiple
    peers coming back up at the same time would compound the problem.
    
    Broadcasting of friend updates doesn't have much use in places other
    than during a peer probe. Instead of broadcasting friend updates on
    connection re-establishment, updates can just be exchanged between the
    peers involved in the connection.
    
    This patch changes the glusterd friend state-machine to send updates
    only to the required peer for ACC or LOCAL_ACC events when in befriended
    state. The number of updates sent now is in the order of N.
    
    For a 10 node cluster, the number of updates reduced by 5 times. When
    creating the 10 node cluster, the updates reduced from ~500 to ~150.
    When a glusterd restarted, the number of exchanges reduced from ~160 to
    ~35.
    
    BUG: 1292749
    Change-Id: Ib6072090c7069b081d018cdaa3dc878819ab1d18
    Signed-off-by: Kaushal M <kaushal@redhat.com>
    Reviewed-on: http://review.gluster.org/12999
    Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
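
The N^2 versus N scaling described in the commit message above can be illustrated with a toy counting model. This is a simplified sketch and not the glusterd state machine: the function name `updates_on_restart` is hypothetical, and the model assumes a single restarted peer, one re-established connection per other peer, and one ACC or LOCAL_ACC event handled at each end of every connection.

```python
def updates_on_restart(n_peers: int, broadcast: bool) -> int:
    """Count friend updates sent when one peer restarts (toy model).

    Assumptions (not taken from glusterd source): the restarted peer
    reconnects to each of the other n_peers - 1 peers, and each end of
    every connection handles one ACC/LOCAL_ACC event.
    """
    others = n_peers - 1
    total = 0
    for _ in range(others):      # one re-established connection per other peer
        for _ in range(2):       # one event at each end of the connection
            if broadcast:
                total += others  # old behaviour: update every other peer
            else:
                total += 1       # patched behaviour: update only the peer involved
    return total


for n in (10, 50, 100):
    old = updates_on_restart(n, broadcast=True)
    new = updates_on_restart(n, broadcast=False)
    print(f"N={n}: broadcast={old}, targeted={new}")
```

In this model the broadcast count grows as 2(N-1)^2 and the targeted count as 2(N-1). The absolute values are not meant to reproduce the measured figures quoted in the commit message; only the growth rates are comparable.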
Comment 1 Vijay Bellur 2015-12-28 01:23:59 EST
REVIEW: http://review.gluster.org/13095 (glusterd: reduce friend update flood) posted (#1) for review on release-3.7 by Gaurav Kumar Garg (ggarg@redhat.com)
Comment 2 Vijay Bellur 2015-12-29 01:26:04 EST
COMMIT: http://review.gluster.org/13095 committed in release-3.7 by Atin Mukherjee (amukherj@redhat.com) 
------
commit c0cc93dfe6fc63caeae9448dc689adcf13ea3aae
Author: Gaurav Kumar Garg <garg.gaurav52@gmail.com>
Date:   Mon Dec 28 11:46:54 2015 +0530

    glusterd: reduce friend update flood
    
    This patch is backport of: http://review.gluster.org/#/c/12999/
    
    When in a befriended state, glusterd would broadcast friend updates to
    all other peers whenever an ACC or LOCAL_ACC event occurred.
    
    When a downed glusterd came back up and established connections again,
    this led to a flood of friend updates on the order of N^2 (N
    is the number of peers in the cluster).
    
    In larger clusters this was problematic, and could lead to very long
    times for the cluster to settle down when a peer came back up. Multiple
    peers coming back up at the same time would compound the problem.
    
    Broadcasting of friend updates doesn't have much use in places other
    than during a peer probe. Instead of broadcasting friend updates on
    connection re-establishment, updates can just be exchanged between the
    peers involved in the connection.
    
    This patch changes the glusterd friend state-machine to send updates
    only to the required peer for ACC or LOCAL_ACC events when in befriended
    state. The number of updates sent now is in the order of N.
    
    For a 10 node cluster, the number of updates reduced by 5 times. When
    creating the 10 node cluster, the updates reduced from ~500 to ~150.
    When a glusterd restarted, the number of exchanges reduced from ~160 to
    ~35.
    
      >> BUG: 1292749
      >> Change-Id: Ib6072090c7069b081d018cdaa3dc878819ab1d18
      >> Signed-off-by: Kaushal M <kaushal@redhat.com>
      >> Reviewed-on: http://review.gluster.org/12999
      >> Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
      >> Tested-by: NetBSD Build System <jenkins@build.gluster.org>
      >> Tested-by: Gluster Build System <jenkins@build.gluster.com>
    
    Change-Id: I389de2cc224f0ed627d98ae062209dd4f93e3b19
    BUG: 1294410
    Signed-off-by: Gaurav Kumar Garg <ggarg@redhat.com>
    Signed-off-by: Kaushal M <kaushal@redhat.com>
    Reviewed-on: http://review.gluster.org/13095
    Reviewed-by: Atin Mukherjee <amukherj@redhat.com>
    Tested-by: NetBSD Build System <jenkins@build.gluster.org>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
Comment 3 Kaushal 2016-04-19 03:52:20 EDT
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.7, please open a new bug report.

glusterfs-3.7.7 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://www.gluster.org/pipermail/gluster-users/2016-February/025292.html
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user
