Bug 1352817 - [scale]: Bricks not started after node reboot.
Summary: [scale]: Bricks not started after node reboot.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: 3.8.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1352279 1352833
Blocks: 1336267 glusterfs-3.8.1
 
Reported: 2016-07-05 07:25 UTC by Atin Mukherjee
Modified: 2016-07-08 14:42 UTC
CC List: 6 users

Fixed In Version: glusterfs-3.8.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1352279
Environment:
Last Closed: 2016-07-08 14:42:35 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Atin Mukherjee 2016-07-05 07:25:54 UTC
+++ This bug was initially created as a clone of Bug #1352279 +++

+++ This bug was initially created as a clone of Bug #1336267 +++

Description of problem:
=======================
After rebooting the nodes hosting the bricks of 400 volumes, some of the volume bricks failed to start.


Errors in glusterd logs:
=======================
[2016-05-13 08:16:04.924247] E [socket.c:2393:socket_connect_finish]
0-glusterfs: connection to 10.70.36.45:24007 failed (Connection timed out)
[2016-05-13 08:16:05.128728] E [glusterfsd-mgmt.c:1907:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host:
rhs-client21.lab.eng.blr.redhat.com (Transport endpoint is not connected)
[2016-05-13 08:16:05.340730] I [glusterfsd-mgmt.c:1913:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4.


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have two RHGS nodes with 16 GB RAM each.
2. Create 400 1x2 volumes using both nodes and start all of them.
3. Reboot the nodes and check whether all volume bricks are running.

Actual results:
===============
Bricks not starting after node reboot.

Expected results:
=================
Bricks should start after the nodes are rebooted.


Additional info:

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-05-16 00:08:02 EDT ---

This bug is automatically being proposed for the current z-stream release of Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Atin Mukherjee on 2016-05-16 00:26:52 EDT ---

It looks like GlusterD is not able to communicate with the bricks due to the lack of multi-threaded epoll support in GlusterD.

[2016-05-13 08:16:04.924247] E [socket.c:2393:socket_connect_finish]
0-glusterfs: connection to 10.70.36.45:24007 failed (Connection timed out)
[2016-05-13 08:16:05.128728] E [glusterfsd-mgmt.c:1907:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host:
rhs-client21.lab.eng.blr.redhat.com (Transport endpoint is not connected)
[2016-05-13 08:16:05.340730] I [glusterfsd-mgmt.c:1913:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile

From the above log (especially the first entry), it appears that the brick
process failed to connect to glusterd and the connection timed out. This can
happen in a situation where there is a lot of back pressure on the other side.
Since GlusterD is limited to a single-threaded epoll, communication with the
brick processes happens over a single path; when glusterd tried to start
400-odd brick processes there were 400 RPC connections to handle, which is why
a few of the brick processes got to hear from GlusterD and came up while the
others did not.
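
To illustrate the bottleneck described above, here is a minimal, generic single-threaded epoll loop in C. This is a sketch only, not glusterd code: the port number (24007 is borrowed for flavour), buffer size, and inline handling are arbitrary choices for illustration.

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
    int one = 1;
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_ANY),
                                .sin_port = htons(24007) }; /* arbitrary port for the sketch */

    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(listen_fd, SOMAXCONN) < 0) {
        perror("bind/listen");
        return 1;
    }

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[64];
    for (;;) {
        /* One thread, one epoll_wait(): every ready connection is serviced
         * here serially, so a burst of hundreds of connections queues up. */
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn >= 0) {
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
                }
            } else {
                char buf[4096];
                if (read(events[i].data.fd, buf, sizeof(buf)) <= 0)
                    close(events[i].data.fd); /* peer closed or error */
                /* ... request handling would happen inline, blocking the loop ... */
            }
        }
    }
}

glusterd's real transport is its RPC framework, not a raw socket loop; the point here is only that a single event thread serializes all connection handling.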

Without a brick multiplexing feature in place, scaling the number of volumes is always going to be a challenge, with a different set of problems.

Moving it to 3.2.

--- Additional comment from Rejy M Cyriac on 2016-05-25 02:01:56 EDT ---

The fix for this BZ will NOT be available in time for the 1.0 Release of RHGS Container Converged with OpenShift. Therefore this BZ is being removed from the related Tracker BZ 1332128

--- Additional comment from Atin Mukherjee on 2016-06-28 08:29:24 EDT ---

I tried to enable MT-epoll on a setup of 4 nodes and 400 volumes with bricks spanning all the nodes. After rebooting a node, not all the gluster brick processes came up, and the same error message was seen in a few brick log files. So, in a nutshell, MT-epoll is not going to solve this scalability issue. It's the big lock that is causing the threads to block and time out.

--- Additional comment from Atin Mukherjee on 2016-07-03 06:30:47 EDT ---

Surprisingly, the big lock is not the culprit here. It's the pmap_signin from the brick processes that was consuming a lot of glusterd's bandwidth, and a code walkthrough revealed that we were doing an unnecessary address resolution. Applying the fix solves this problem, and I could see that on reboot glusterd is able to bring up all the brick processes.
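
The idea behind the fix can be sketched as follows. This is a hedged illustration, not the actual patch: the struct, field names, and helper functions below are stand-ins for glusterd's brickinfo and MY_UUID handling, and the example uses libuuid (build with -luuid).

#include <stdio.h>
#include <netdb.h>
#include <sys/socket.h>
#include <uuid/uuid.h>   /* libuuid; link with -luuid */

struct brickinfo {
    char   hostname[256];
    uuid_t uuid;          /* uuid of the peer that owns the brick */
};

/* Old-style check: resolve the brick's hostname to decide whether it is
 * local. getaddrinfo() may trigger DNS lookups, which hurts when hundreds
 * of bricks sign in at once. */
static int brick_is_local_by_hostname(const struct brickinfo *brick)
{
    struct addrinfo hints = { .ai_family = AF_UNSPEC, .ai_socktype = SOCK_STREAM };
    struct addrinfo *res = NULL;

    if (getaddrinfo(brick->hostname, NULL, &hints, &res) != 0)
        return 0;
    /* ... compare res against local interface addresses (omitted) ... */
    freeaddrinfo(res);
    return 1;
}

/* New-style check: the local node's uuid is already known, so a plain uuid
 * comparison is enough and costs almost nothing. */
static int brick_is_local_by_uuid(const struct brickinfo *brick, const uuid_t my_uuid)
{
    return uuid_compare(brick->uuid, my_uuid) == 0;
}

int main(void)
{
    struct brickinfo brick = { .hostname = "rhs-client21.lab.eng.blr.redhat.com" };
    uuid_t my_uuid;

    uuid_generate(my_uuid);         /* stand-in for the node's own uuid (MY_UUID) */
    uuid_copy(brick.uuid, my_uuid); /* pretend the brick belongs to this node */

    printf("by hostname: %d\n", brick_is_local_by_hostname(&brick));
    printf("by uuid:     %d\n", brick_is_local_by_uuid(&brick, my_uuid));
    return 0;
}

The design point is that the uuid comparison is a fixed-cost memory compare, whereas hostname resolution can block for a noticeable time per brick, which adds up across ~400 sign-ins during a reboot.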

--- Additional comment from Atin Mukherjee on 2016-07-03 06:32:15 EDT ---

Description of problem:
=======================
After rebooting the nodes hosting the bricks of 400 volumes, some of the volume bricks failed to start.


Errors in glusterd logs:
=======================
[2016-05-13 08:16:04.924247] E [socket.c:2393:socket_connect_finish]
0-glusterfs: connection to 10.70.36.45:24007 failed (Connection timed out)
[2016-05-13 08:16:05.128728] E [glusterfsd-mgmt.c:1907:mgmt_rpc_notify]
0-glusterfsd-mgmt: failed to connect with remote-host:
rhs-client21.lab.eng.blr.redhat.com (Transport endpoint is not connected)
[2016-05-13 08:16:05.340730] I [glusterfsd-mgmt.c:1913:mgmt_rpc_notify]
0-glusterfsd-mgmt: Exhausted all volfile


Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4.


How reproducible:
=================
Always


Steps to Reproduce:
===================
1. Have two RHGS nodes with 16 GB RAM each.
2. Create 400 1x2 volumes using both nodes and start all of them.
3. Reboot the nodes and check whether all volume bricks are running.

Actual results:
===============
Bricks not starting after node reboot.

Expected results:
=================
Bricks should start after the nodes are rebooted.

--- Additional comment from Vijay Bellur on 2016-07-03 06:33:15 EDT ---

REVIEW: http://review.gluster.org/14849 (glusterd: compare uuid instead of hostname address resolution) posted (#1) for review on master by Atin Mukherjee (amukherj)

--- Additional comment from Vijay Bellur on 2016-07-05 03:23:39 EDT ---

COMMIT: http://review.gluster.org/14849 committed in master by Kaushal M (kaushal) 
------
commit 633e6fe265bc2de42dade58dc6a15c285957da76
Author: Atin Mukherjee <amukherj>
Date:   Sun Jul 3 15:51:20 2016 +0530

    glusterd: compare uuid instead of hostname address resolution
    
    In glusterd_get_brickinfo () the brick's hostname is address-resolved. This adds
    unnecessary latency since it uses calls like getaddrinfo (). Instead, given that
    the local brick's uuid is already known, a comparison of MY_UUID and
    brickinfo->uuid is much more lightweight than the previous approach.
    
    In scale testing with a cluster hosting ~400 volumes spanning 4 nodes, if a
    node goes for a reboot, a few of the bricks don't come up. After a few days of
    analysis it was found that glusterd_pmap_signin () was adding a significant
    amount of latency, and a further code walkthrough revealed this unnecessary
    address resolution. Applying this fix solves the issue and now all the brick
    processes come up on a node reboot.
    
    Change-Id: I299b8660ce0da6f3f739354f5c637bc356d82133
    BUG: 1352279
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14849
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Prashanth Pai <ppai>
    Reviewed-by: Samikshan Bairagya <samikshan>
    Reviewed-by: Kaushal M <kaushal>

Comment 1 Vijay Bellur 2016-07-05 08:38:40 UTC
REVIEW: http://review.gluster.org/14860 (glusterd: compare uuid instead of hostname address resolution) posted (#1) for review on release-3.8 by Atin Mukherjee (amukherj)

Comment 2 Vijay Bellur 2016-07-05 14:27:03 UTC
REVIEW: http://review.gluster.org/14860 (glusterd: compare uuid instead of hostname address resolution) posted (#2) for review on release-3.8 by Niels de Vos (ndevos)

Comment 3 Vijay Bellur 2016-07-05 14:43:27 UTC
COMMIT: http://review.gluster.org/14860 committed in release-3.8 by Atin Mukherjee (amukherj) 
------
commit 4c012e223f89f9515cd3f8ebec1197ec1594218c
Author: Atin Mukherjee <amukherj>
Date:   Sun Jul 3 15:51:20 2016 +0530

    glusterd: compare uuid instead of hostname address resolution
    
    Backport of http://review.gluster.org/14849
    
    In glusterd_get_brickinfo () the brick's hostname is address-resolved. This adds
    unnecessary latency since it uses calls like getaddrinfo (). Instead, given that
    the local brick's uuid is already known, a comparison of MY_UUID and
    brickinfo->uuid is much more lightweight than the previous approach.
    
    In scale testing with a cluster hosting ~400 volumes spanning 4 nodes, if a
    node goes for a reboot, a few of the bricks don't come up. After a few days of
    analysis it was found that glusterd_pmap_signin () was adding a significant
    amount of latency, and a further code walkthrough revealed this unnecessary
    address resolution. Applying this fix solves the issue and now all the brick
    processes come up on a node reboot.
    
    Backport of commit 633e6fe265bc2de42dade58dc6a15c285957da76:
    > Change-Id: I299b8660ce0da6f3f739354f5c637bc356d82133
    > BUG: 1352279
    > Signed-off-by: Atin Mukherjee <amukherj>
    > Reviewed-on: http://review.gluster.org/14849
    > Smoke: Gluster Build System <jenkins.org>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Prashanth Pai <ppai>
    > Reviewed-by: Samikshan Bairagya <samikshan>
    > Reviewed-by: Kaushal M <kaushal>
    
    Change-Id: I299b8660ce0da6f3f739354f5c637bc356d82133
    BUG: 1352817
    Signed-off-by: Atin Mukherjee <amukherj>
    Reviewed-on: http://review.gluster.org/14860
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Niels de Vos <ndevos>
    Smoke: Gluster Build System <jenkins.org>

Comment 4 Niels de Vos 2016-07-08 14:42:35 UTC
This bug is being closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.1, please open a new bug report.

glusterfs-3.8.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.packaging/156
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user

