Bug 1416327

Summary:	Failure on remove-brick when each server has a lot of bricks
Product:	[Community] GlusterFS	Reporter:	Xavi Hernandez <jahernan>
Component:	disperse	Assignee:	Xavi Hernandez <jahernan>
Status:	CLOSED DUPLICATE	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	mainline	CC:	aspandey, bugs, nbalacha, pkarampu, sheggodu
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-10-25 08:29:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Xavi Hernandez 2017-01-25 09:50:32 UTC

Description of problem:

When a remove-brick is attempted on a distributed-disperse 4+1 volume with more than a hundred bricks on each server, the rebalance process sometimes fails.

Version-Release number of selected component (if applicable): mainline


How reproducible:

Occasionally

Steps to Reproduce:
1. Create a big distributed-disperse 4+1 volume using 2 servers and more than 100 bricks on each server
2. Start the volume
3. Try to remove one of the disperse-sets (5 bricks)

Actual results:

Sometimes the rebalance process fails

Expected results:

Rebalance shouldn't fail

Additional info:

Comment 1 Xavi Hernandez 2017-01-30 07:33:06 UTC

The discussion I've had with Pranith and Ashish about this problem:

On 30/01/17 08:23, Xavier Hernandez wrote:
> Hi Ashish,
> 
> On 30/01/17 08:04, Ashish Pandey wrote:
>>
>> Hi Xavi,
>>
>> Our QA team has also filed a bug similar to this. The only difference is
>> that the setup they used is only 3 * (4+2)
>> and they removed 6 bricks.
>> https://bugzilla.redhat.com/show_bug.cgi?id=1417535
> 
> The messages look quite similar.
> 
>>
>> Do you think it is only possible with lots of bricks, or could it also
>> happen with fewer bricks?
> 
> The issue happens when the system is unable to establish the connections
> in less time than the timeout (currently 10 seconds). Probably this can
> happen if there are a lot of connections to make or the system is very
> busy.
> 
>> Also, could you please update the upstream bug with your solution and
>> this mail discussion?
> 
> I'll do.
> 
> Xavi
> 
>>
>> Ashish
>>
>> ------------------------------------------------------------------------
>> *From: *"Pranith Kumar Karampuri" <pkarampu>
>> *To: *"Xavier Hernandez" <xhernandez>
>> *Cc: *"Ashish Pandey" <aspandey>
>> *Sent: *Wednesday, January 25, 2017 6:52:57 PM
>> *Subject: *Re: Start timeout for ec/afr
>>
>>
>>
>> On Wed, Jan 25, 2017 at 5:17 PM, Xavier Hernandez <xhernandez> wrote:
>>
>>     On 25/01/17 12:28, Pranith Kumar Karampuri wrote:
>>
>>
>>
>>         On Wed, Jan 25, 2017 at 4:49 PM, Xavier Hernandez <xhernandez>
>>         wrote:
>>
>>             On 25/01/17 12:08, Pranith Kumar Karampuri wrote:
>>
>>                 Wow, scale problem :-).
>>
>>                 It can happen this way with mounts also, right? Why are we
>>                 considering only the rebalance process?
>>
>>
>>             The problem can happen with mounts also, but it's less visible.
>>             Waiting a little solves the problem. However, an automated task that
>>             mounts the volume and does something immediately after that can
>>             suffer the same problem.
>>
>>                 The reason we did this timeout business is to
>>                 prevent users from getting frustrated at the time of mount
>>                 waiting for it to happen. Rebalance can take that extra minute or
>>                 two till it gets a ping timeout before being operational. So if
>>                 this issue is only with rebalance we can do something different.
>>
>>
>>             Is there a way to detect if we are running as a mount, self-heal,
>>             rebalance, ... ?
>>
>>
>>         Glusterd launches all these processes. So we can launch them with custom
>>         options. Check glusterd_handle_defrag_start() for example.
>>
>>
>>     That's an interesting option.
>>
>>
>>
>>             We could have a different setting for each environment. However, I
>>             still think that letting a mount succeed when the volume is not
>>             fully available is not a good solution.
>>
>>             I think the mount should wait until the volume is as ready as
>>             possible. We can define a timeout for this to avoid an indefinite
>>             wait, but this timeout should be way longer than the current 10 seconds.
>>
>>             On the other hand, when enough bricks are online, we don't need to
>>             force the user to wait for a full timeout if a brick is really down.
>>             In this case a smaller timeout of 5-10 seconds would be enough to
>>             see if there are more bricks available before declaring the volume up.
>>
>>
>>         Will a bigger number of bricks break through the barriers and we will
>>         have to adjust the numbers again?
>>
>>     It can happen. That's why I would make the second timeout configurable.
>>
>>     For example, in a distributed-disperse volume if there aren't enough
>>     bricks online to bring up at least one ec subvolume, mount will have
>>     to wait for the first timeout. It could be 1 minute (or fixed to 10
>>     times the second timeout, for example). This is a big delay, but we
>>     are dealing with a very rare scenario. Probably the mount hang would
>>     be the least of the problems.
>>
>>     If there are few bricks online, enough to bring up one ec subvolume,
>>     then that subvolume will answer reasonably fast. At most the
>>     connection delay + the second timeout value (this could be 5-10
>>     seconds by default, but configurable). DHT brings itself up as soon
>>     as at least one of the subvolumes comes up. So we are ok here: we
>>     don't need every single ec subvolume to report its state
>>     quickly.
>>
>>     If all bricks are online, the mount will have to wait only the time
>>     needed to connect to all bricks. No timeouts will be applied here.
>>
>>
>> Looks good to me.
>>
>>
>>     Xavi
>>
>>             Xavi
>>
>>
>>
>>                 On Wed, Jan 25, 2017 at 3:55 PM, Xavier Hernandez
>>                 <xhernandez> wrote:
>>
>>                     Hi,
>>
>>                     currently we have a start timeout for ec and afr that work very
>>                     similarly, if not identically. Basically, when the PARENT_UP event is
>>                     received, the timer is started. If we receive CHILD_UP/CHILD_DOWN
>>                     events from all children, the timer is cancelled and the appropriate
>>                     event is propagated. If not all bricks have answered when the
>>                     timeout expires, we propagate CHILD_UP/CHILD_DOWN depending on how
>>                     many up children we have.
>>
>>                     There's an issue when one server has a lot of bricks. In this case
>>                     the connection to enough bricks to bring the volume up could take
>>                     more time than the 10 hardcoded seconds (I've filed a bug for this:
>>                     https://bugzilla.redhat.com/show_bug.cgi?id=1416327)
>>
>>
>>                     For mounts this is not a problem. Even if not enough bricks have
>>                     answered in time, the volume will be mounted and eventually the
>>                     remaining bricks will be connected and accessible.
>>
>>                     However, when a rebalance process is started, it immediately tries to
>>                     do operations on the volume once all DHT's subvolumes have answered.
>>                     If they answer as CHILD_DOWN, the rebalance fails without waiting
>>                     for the subvolumes to come online.
>>
>>                     To solve this I've been thinking about the following change for the
>>                     first start of a volume:
>>
>>                     1. Start a timer when PARENT_UP is received
>>
>>                     This will be a worst-case timer. I would set it to at least a
>>                     minute. However I'm not sure if this is really necessary. Maybe
>>                     protocol/client answers relatively fast even if no connection can be
>>                     established.
>>
>>                     2. Start a second timer when the minimum number of bricks are up
>>
>>                     Once we know that the volume can really be started, we'll still wait
>>                     a little more to allow remaining bricks to connect. I would make this
>>                     timeout configurable, with a default value of 5 seconds.
>>
>>                     I think there's no good reason to propagate a CHILD_DOWN event
>>                     before we really know if the volume will be down.
>>
>>                     This solves the rebalance problem and allows for some margin of time
>>                     for bricks to come online before operating on the volume (this
>>                     prevents operations sent just after the CHILD_UP event from causing
>>                     inconsistencies that self-heal will have to solve once the remaining
>>                     bricks come online).
>>
>>                     In this case a mount won't succeed until the main timeout (or
>>                     protocol/client timeout) when not enough bricks are available, but I
>>                     think this is acceptable.
>>
>>                     What do you think ?
>>
>>                 --
>>                 Pranith
>>
>>         --
>>         Pranith
>>
>> -- 
>> Pranith
>>
>
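
For reference, the current single-timer behaviour and the two-timer proposal discussed in the thread above can be compared with a small simulation. This is only a sketch under stated assumptions, not the ec/afr code: the function names (`verdict`, `single_timer`, `two_timers`), the event tuple format and the 60 second worst-case value are made up for illustration, time 0 is the moment PARENT_UP is received, and each child is assumed to report its state at most once.

```python
def verdict(state, min_up):
    """Event to propagate given the child states seen so far."""
    return "CHILD_UP" if sum(state.values()) >= min_up else "CHILD_DOWN"


def single_timer(events, total, min_up, timeout=10.0):
    """Current behaviour: one timer started at PARENT_UP (t = 0).

    events: (time, child_id, is_up) tuples sorted by time.
    Returns (time, event) for the notification sent to the parent.
    """
    state = {}
    for t, child, is_up in events:
        if t > timeout:          # timer fired before this child answered
            break
        state[child] = is_up
        if len(state) == total:  # all children answered: cancel the timer
            return t, verdict(state, min_up)
    return timeout, verdict(state, min_up)


def two_timers(events, total, min_up, max_wait=60.0, grace=5.0):
    """Proposal: a worst-case timer plus a short grace timer that only
    starts once enough children are up to bring the subvolume up."""
    state = {}
    grace_deadline = None        # second timer not armed yet
    for t, child, is_up in events:
        deadline = max_wait if grace_deadline is None else min(max_wait, grace_deadline)
        if t > deadline:         # one of the timers fired first
            break
        state[child] = is_up
        if len(state) == total:  # nothing left to wait for
            return t, verdict(state, min_up)
        if grace_deadline is None and sum(state.values()) >= min_up:
            grace_deadline = t + grace
    deadline = max_wait if grace_deadline is None else min(max_wait, grace_deadline)
    return deadline, verdict(state, min_up)


# 4+1 subvolume whose bricks are all healthy but slow to connect.
slow = [(2, 0, True), (5, 1, True), (9, 2, True), (11, 3, True), (12, 4, True)]
print(single_timer(slow, total=5, min_up=4))  # (10.0, 'CHILD_DOWN')
print(two_timers(slow, total=5, min_up=4))    # (12, 'CHILD_UP')
```

With the slow connections in `slow`, the single timer reports CHILD_DOWN at 10 seconds even though every brick is healthy, which is the case that makes rebalance fail. When one brick is really down and the other four connect within 3 seconds, the two-timer version reports CHILD_UP at 8 seconds (3 + the 5 second grace period) instead of waiting for the full worst-case timeout.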

Comment 2 Xavi Hernandez 2018-10-25 08:29:43 UTC

Most probably this is related to bug #1564600. The previous small value for the connection backlog (only 10) caused connection failures and retries when there were a lot of simultaneous connections, as in this case. The reconnect timer was causing some connections to be delayed too much, beyond the 10-second timeout that EC uses.

Closing this bug as a duplicate of 1564600.

*** This bug has been marked as a duplicate of bug 1564600 ***