Bug 1352279

Summary: [scale]: Bricks not started after node reboot.
Product: [Community] GlusterFS
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Keywords: Triaged
Reporter: Atin Mukherjee <amukherj>
Assignee: Atin Mukherjee <amukherj>
CC: bsrirama, bugs, kaushal, pprakash, rcyriac, storage-qa-internal
Fixed In Version: glusterfs-3.9.0
Clone Of: 1336267
Bug Blocks: 1336267, 1352817, 1352833
Last Closed: 2017-03-27 18:14:27 UTC
Type: Bug

Comment 1 Atin Mukherjee 2016-07-03 10:32:15 UTC
Description of problem:
=======================
After rebooting the nodes hosting the bricks of 400 volumes, some of the volume bricks failed to start.


Errors in glusterd logs:
=======================
[2016-05-13 08:16:04.924247] E [socket.c:2393:socket_connect_finish]
0-glusterfs: connection to 10.70.36.45:24007 failed (Connection timed out)
[2016-05-13 08:16:05.128728] E [glusterfsd-mgmt.c:1907:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host:
rhs-client21.lab.eng.blr.redhat.com (Transport endpoint is not connected)
[2016-05-13 08:16:05.340730] I [glusterfsd-mgmt.c:1913:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4.


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have two RHGS nodes with 16 GB RAM each.
2. Create 400 1x2 volumes using bricks from both nodes and start all the volumes.
3. Reboot the nodes and check whether all volume bricks are running.

Actual results:
===============
Some of the volume bricks do not start after the node reboot.

Expected results:
=================
All bricks should start after the nodes are rebooted.

Comment 2 Vijay Bellur 2016-07-03 10:33:15 UTC
REVIEW: http://review.gluster.org/14849 (glusterd: compare uuid instead of hostname address resolution) posted (#1) for review on master by Atin Mukherjee (amukherj)

Comment 3 Vijay Bellur 2016-07-05 07:23:39 UTC
COMMIT: http://review.gluster.org/14849 committed in master by Kaushal M (kaushal) 
------
commit 633e6fe265bc2de42dade58dc6a15c285957da76
Author: Atin Mukherjee <amukherj>
Date:   Sun Jul 3 15:51:20 2016 +0530

    glusterd: compare uuid instead of hostname address resolution
    
    In glusterd_get_brickinfo () the brick's hostname is address resolved. This
    adds unnecessary latency since it uses calls like getaddrinfo (). Instead,
    given that the local brick's uuid is already known, a comparison of MY_UUID
    and brickinfo->uuid is much more lightweight than the previous approach.
    
    In scale testing where a cluster hosting ~400 volumes spanning 4 nodes had a
    node go for a reboot, a few of the bricks did not come up. After a few days of
    analysis it was found that glusterd_pmap_sigin () was taking a significant
    amount of latency, and a further code walkthrough revealed this unnecessary
    address resolution. Applying this fix solves the issue and now all the brick
    processes come up on a node reboot.
    
    Change-Id: I299b8660ce0da6f3f739354f5c637bc356d82133
    BUG: 1352279
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14849
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>
    Reviewed-by: Samikshan Bairagya <samikshan>
    Reviewed-by: Kaushal M <kaushal>
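
To make the change above concrete, here is a minimal, illustrative sketch in C of the two matching strategies the commit describes: deciding whether a brick is local by resolving its hostname versus by comparing stored UUIDs. This is not the actual glusterd code; the brickinfo_t struct, the helper names, and the overall flow are hypothetical stand-ins (only glusterd_get_brickinfo, MY_UUID, brickinfo->uuid, and getaddrinfo come from the commit message).

/*
 * Illustrative sketch only -- not the glusterd source.
 * Build with: cc uuid_vs_dns.c -luuid
 */
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <uuid/uuid.h>

/* Hypothetical stand-in for glusterd's brick record. */
typedef struct {
    char   hostname[256];
    uuid_t uuid;            /* UUID of the peer that owns this brick */
} brickinfo_t;

/* Old approach: resolve the brick's hostname and compare it with the local
 * host. getaddrinfo () can hit the resolver and block, and that cost is paid
 * per brick -- noticeable when hundreds of bricks sign in after a reboot. */
static int
brick_is_local_by_hostname (const brickinfo_t *brick, const char *local_host)
{
    struct addrinfo *res = NULL;

    if (getaddrinfo (brick->hostname, NULL, NULL, &res) != 0)
        return 0;
    freeaddrinfo (res);

    /* The real code compares resolved addresses against local interfaces;
     * a plain string comparison stands in for that here. */
    return strcmp (brick->hostname, local_host) == 0;
}

/* New approach: the local node's UUID (MY_UUID in glusterd) and the brick's
 * owner UUID are both already in memory, so a byte comparison is enough --
 * no name resolution at all. */
static int
brick_is_local_by_uuid (const brickinfo_t *brick, const uuid_t my_uuid)
{
    return uuid_compare (brick->uuid, my_uuid) == 0;
}

int
main (void)
{
    brickinfo_t brick = { .hostname = "localhost" };
    uuid_t      my_uuid;

    uuid_generate (my_uuid);
    uuid_copy (brick.uuid, my_uuid);    /* pretend this brick is local */

    printf ("local by hostname: %d\n",
            brick_is_local_by_hostname (&brick, "localhost"));
    printf ("local by uuid:     %d\n",
            brick_is_local_by_uuid (&brick, my_uuid));
    return 0;
}

The point of the commit is simply that the UUID comparison is a local memory operation, while the hostname path can stall on the resolver; with ~400 volumes signing in at once after a reboot those stalls add up, which is consistent with the connection timeouts and "Exhausted all volfile" messages in the logs above.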

Comment 4 Shyamsundar 2017-03-27 18:14:27 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.9.0, please open a new bug report.

glusterfs-3.9.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2016-November/029281.html
[2] https://www.gluster.org/pipermail/gluster-users/