Description of problem:
=======================
After rebooting the nodes which are hosting the bricks of 400 volumes, some of the volume bricks failed to start.

Errors in glusterd logs:
========================
[2016-05-13 08:16:04.924247] E [socket.c:2393:socket_connect_finish] 0-glusterfs: connection to 10.70.36.45:24007 failed (Connection timed out)
[2016-05-13 08:16:05.128728] E [glusterfsd-mgmt.c:1907:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: rhs-client21.lab.eng.blr.redhat.com (Transport endpoint is not connected)
[2016-05-13 08:16:05.340730] I [glusterfsd-mgmt.c:1913:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile

Version-Release number of selected component (if applicable):
=============================================================
glusterfs-3.7.9-4

How reproducible:
=================
Always

Steps to Reproduce:
===================
1. Have two RHGS nodes with 16 GB RAM each.
2. Create 400 1*2 volumes using both the nodes and start all the volumes (see the sketch below).
3. Reboot the nodes and check that all volume bricks are running.

Actual results:
===============
Bricks do not start after a node reboot.

Expected results:
=================
Bricks should start after the nodes are rebooted.

Additional info:
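For step 2, a loop along the following lines can be used to create and start the volumes. This is only a sketch: the node names (node1, node2) and the brick path /bricks/vol$i/brick are placeholders, not taken from the original setup.

# Create and start 400 1x2 (replica 2) volumes non-interactively.
# node1/node2 and the brick paths are assumed placeholders; "force" is
# only needed if the bricks sit on the root filesystem.
for i in $(seq 1 400); do
    gluster --mode=script volume create vol$i replica 2 \
        node1:/bricks/vol$i/brick node2:/bricks/vol$i/brick force
    gluster --mode=script volume start vol$i
done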
This looks like GlusterD is not able to communicate with the bricks due to the lack of multi-threaded epoll support in GlusterD.

[2016-05-13 08:16:04.924247] E [socket.c:2393:socket_connect_finish] 0-glusterfs: connection to 10.70.36.45:24007 failed (Connection timed out)
[2016-05-13 08:16:05.128728] E [glusterfsd-mgmt.c:1907:mgmt_rpc_notify] 0-glusterfsd-mgmt: failed to connect with remote-host: rhs-client21.lab.eng.blr.redhat.com (Transport endpoint is not connected)
[2016-05-13 08:16:05.340730] I [glusterfsd-mgmt.c:1913:mgmt_rpc_notify] 0-glusterfsd-mgmt: Exhausted all volfile

From the above log (especially the first entry), the brick process failed to connect to glusterd and the connection timed out. This can happen when there is a lot of back pressure on the other side. Since GlusterD is limited to a single-threaded epoll, communication with the brick processes happens over a single path. So while glusterd tried to start 400-odd brick processes, there were 400 RPC connections to handle over that one path; a few of the brick processes got to hear from GlusterD in time and came up, but the others did not. Without a brick multiplexing feature in place, scaling the number of volumes is always going to be a challenge, with a different set of problems each time. Moving it to 3.2.
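One rough way to observe the effect described above after a reboot is to compare the number of brick processes that actually came up against the number of bricks whose logs show the volfile-fetch timeout from this report. A sketch, assuming the default RHGS brick log location:

# Count the brick processes that are actually running...
pgrep -c glusterfsd
# ...versus the bricks whose logs contain the timeout error quoted above.
grep -l "failed (Connection timed out)" /var/log/glusterfs/bricks/*.log | wc -l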
I tried to enable MT-epoll on a setup of 4 nodes and 400 volumes, with bricks spanning all the nodes. After rebooting a node, not all the gluster brick processes came up, and the same error message was seen in a few brick log files. So in a nutshell, MT-epoll is not going to solve this scalability issue. It's the big lock which is causing the threads to block and time out.
Surprisingly, the big lock is not the culprit here. It's the pmap_signin from the brick processes which was consuming a lot of glusterd's bandwidth, and a code walkthrough revealed that we were doing an unnecessary address resolution there. Applying the fix solves this problem: on rebooting, glusterd is able to bring up all the brick processes.
http://review.gluster.org/#/c/14849/ posted for review.
Upstream mainline : http://review.gluster.org/14849
Upstream 3.8 : http://review.gluster.org/14860

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
Verified this bug using the build glusterfs-3.8.4-3; the reported issue is not seen any more. Created 500 1*2 volumes on two nodes having 32 GB RAM each, stopped and started all of the volumes, which worked well, and then rebooted the nodes to check the reported issue: all the volume bricks were up (a sketch of the verification pass follows below). Moving to verified state.
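For reference, the stop/start pass described above can be scripted along these lines. The volume names are assumed placeholders; --mode=script suppresses the confirmation prompts.

# Stop and start all 500 volumes non-interactively.
for i in $(seq 1 500); do
    gluster --mode=script volume stop vol$i
    gluster --mode=script volume start vol$i
done
# After rebooting the nodes, a rough check that no brick is offline:
# the Online column of the status output prints N for a down brick.
gluster volume status all | grep -c " N "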
LGTM :)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html