Bug 1650115 - glusterd requests are timing out in a brick multiplex setup
Summary: glusterd requests are timing out in a brick multiplex setup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Mohammed Rafi KC
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1649651
 
Reported: 2018-11-15 11:42 UTC by Mohammed Rafi KC
Modified: 2019-04-09 17:50 UTC
CC: 2 users

Fixed In Version: glusterfs-6.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-25 16:31:57 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:




Links
System: Gluster.org Gerrit ID: 21651 Status: Merged
Summary: glusterd/mux: Optimize brick disconnect handler code
Last Updated: 2018-11-18 06:11:15 UTC

Description Mohammed Rafi KC 2018-11-15 11:42:00 UTC
Description of problem:

When there are a large number of volumes in a brick multiplex setup, glusterd takes a long time to process a brick disconnect. As a result, other glusterd requests are queued up behind it.


Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1. Create a gluster cluster.
2. Enable brick multiplex.
3. Create 1500 volumes (1x3).
4. Stop a single brick process.
5. Execute gluster CLI commands such as peer status (a scripted version of these steps is sketched below).
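
For reference, the reproduction can be scripted with the stock gluster CLI. This is a sketch only: the host names, brick paths, and volume names are illustrative, and it assumes a cluster whose peers are already probed.

# Sketch: illustrative names/paths; assumes peers host1-host3 are already probed.
gluster volume set all cluster.brick-multiplex on
for i in $(seq 1 1500); do
    gluster volume create vol$i replica 3 \
        host1:/bricks/vol$i/b host2:/bricks/vol$i/b host3:/bricks/vol$i/b force
    gluster volume start vol$i
done
# Find one brick PID from 'gluster volume status', kill it, then immediately:
gluster peer status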

Actual results:

peer status failed

Expected results:

peer status should be able to show the correct peer status

Additional info:

Comment 1 Mohammed Rafi KC 2018-11-15 11:43:09 UTC
RCA done by Atin:
<snippet>
When we kill a brick and glusterd gets a disconnect event, we enter a code snippet that turns out to be a very costly loop with too many iterations. At a scale where ~1300 volumes are configured, this thread takes minutes, which causes the other requests to queue up.

                /* For every brick process, scan all of its bricks and compare
                 * paths: O(total bricks) work on each disconnect. */
                if (is_brick_mx_enabled()) {
                        cds_list_for_each_entry (brick_proc, &conf->brick_procs,
                                                 brick_proc_list) {
                                cds_list_for_each_entry (brickinfo_tmp,
                                                         &brick_proc->bricks,
                                                         brick_list) {
                                        if (strcmp (brickinfo_tmp->path,
                                                    brickinfo->path) == 0) {
                                                ret = glusterd_mark_bricks_stopped_by_proc
                                                      (brick_proc);
                                                if (ret) {
                                                        gf_msg (THIS->name,
                                                                GF_LOG_ERROR, 0,
                                                                GD_MSG_BRICK_STOP_FAIL,
                                                                "Unable to stop "
                                                                "bricks of process"
                                                                " to which brick(%s)"
                                                                " belongs",
                                                                brickinfo->path);
                                                        goto out;
                                                }
                                                temp = 1;
                                                break;
                                        }
                                }
                                if (temp == 1)
                                        break;
                        }
                }

</snippet>
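
The fix merged via Gerrit 21651 ("glusterd/mux: Optimize brick disconnect handler code") removes this scan over every brick of every process. As a minimal sketch of the idea (not the merged patch verbatim), the handler can resolve the brick's process directly; glusterd has a lookup of this shape (glusterd_brick_proc_for_port), though its use here is an assumption:

        /* Sketch only: assumes the port-to-process lookup
         * glusterd_brick_proc_for_port() is usable at this point;
         * the merged patch may differ in detail. */
        if (is_brick_mx_enabled()) {
                glusterd_brick_proc_t *brick_proc = NULL;

                /* O(number of brick processes) instead of O(total bricks). */
                ret = glusterd_brick_proc_for_port (brickinfo->port,
                                                    &brick_proc);
                if (!ret && brick_proc) {
                        ret = glusterd_mark_bricks_stopped_by_proc (brick_proc);
                        if (ret) {
                                gf_msg (THIS->name, GF_LOG_ERROR, 0,
                                        GD_MSG_BRICK_STOP_FAIL,
                                        "Unable to stop bricks of process"
                                        " to which brick(%s) belongs",
                                        brickinfo->path);
                                goto out;
                        }
                }
        }

Whichever form the final patch takes, the key point is making the disconnect handler's cost independent of the total brick count, so one dead brick process no longer stalls unrelated glusterd requests.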

Comment 2 Worker Ant 2018-11-15 11:50:04 UTC
REVIEW: https://review.gluster.org/21651 (glusterd/mux: Optimize brick cleanup code) posted (#1) for review on master by mohammed rafi  kc

Comment 3 Worker Ant 2018-11-18 06:11:14 UTC
REVIEW: https://review.gluster.org/21651 (glusterd/mux: Optimize brick disconnect handler code) posted (#6) for review on master by Atin Mukherjee

Comment 4 Shyamsundar 2019-03-25 16:31:57 UTC
This bug is being closed because a release that should address the reported issue has been made available. If the problem is still not fixed with glusterfs-6.0, please open a new bug report.

glusterfs-6.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] https://lists.gluster.org/pipermail/announce/2019-March/000120.html
[2] https://www.gluster.org/pipermail/gluster-users/

