Description of problem:
The volume becomes unresponsive when a server in a 4-server distributed-replicate 14x2 cluster is brought back into the cluster after having been down for 2 days. This is probably caused by the self-heal system.

Version-Release number of selected component (if applicable):
3.3git-v3.3.2qa2-3-g3490689

Actual results:
Volume is unresponsive.

Expected results:
A working volume with only slightly higher access times.

Additional info:
Before stor2 was brought back online:
stor1: gluster filehandles: 550, load: 0.87 0.76 0.64 1/385 9521
stor3: gluster filehandles: 649, load: 0.40 0.94 1.21 1/494 16570
stor4: gluster filehandles: 573, load: 0.58 0.55 0.51 1/439 27743

First 5 minutes after stor2 was brought back online:
stor1: gluster filehandles: 596, load: 0.52 0.73 0.72 1/385 10320
stor2: gluster filehandles: 759, load: 28.09 18.32 8.09 25/294 2455
stor3: gluster filehandles: 774, load: 11.76 7.44 3.92 1/499 17568
stor4: gluster filehandles: 683, load: 4.74 3.55 1.86 1/439 28438

After 5 minutes I shut down stor2 to make the volume available again.
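For reference, the per-server figures above can be reproduced with a small monitoring snippet along these lines (hypothetical, not part of GlusterFS; Linux-only, counting open file descriptors of gluster* processes via /proc and reading /proc/loadavg):

```python
# Approximate the "gluster filehandles" and "load" figures quoted above.
# Hypothetical helper; assumes a Linux /proc filesystem.
import os

def gluster_fd_count():
    """Sum open file descriptors across processes whose name starts with 'gluster'."""
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if not f.read().startswith("gluster"):
                    continue
            total += len(os.listdir(f"/proc/{pid}/fd"))
        except OSError:
            continue  # process exited, or fd listing needs more privileges
    return total

def load_averages():
    """Return the 1/5/15-minute load averages as floats."""
    with open("/proc/loadavg") as f:
        return [float(x) for x in f.read().split()[:3]]

print("gluster filehandles:", gluster_fd_count(), ", load:", load_averages())
```

Run once per server (e.g. via ssh in a loop) to get the per-host lines shown above.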
In addition: this issue is not about data-transfer saturation but about IOPS saturation on the individual hard disks, so moving to faster Ethernet or InfiniBand won't help. We see gluster management operations saturating brick IOPS in several cases: replace-brick, rebalance, and now self-heal. To prevent this behaviour, gluster management traffic must be throttled to, say, 100 IOPS per brick. Ideally the limit would be configurable, either as a fixed IOPS value or as a percentage of the brick's capacity.
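GlusterFS 3.3 has no such throttle; as a sketch of the idea being requested, a per-brick token bucket could gate each management I/O (heal/rebalance read or write). Names, rates, and the `acquire()` call site below are assumptions for illustration only:

```python
# Hypothetical per-brick IOPS throttle for management traffic (token bucket).
# max_iops could be a hardcoded limit (e.g. 100) or capacity * percentage.
import time

class IopsThrottle:
    def __init__(self, max_iops):
        self.max_iops = float(max_iops)   # refill rate, tokens per second
        self.tokens = float(max_iops)     # bucket starts full
        self.last = time.monotonic()

    def acquire(self, n=1):
        """Block until n I/O operations may proceed under the configured rate."""
        while True:
            now = time.monotonic()
            # Refill tokens for the elapsed interval, capped at bucket size.
            self.tokens = min(self.max_iops,
                              self.tokens + (now - self.last) * self.max_iops)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.max_iops)

# One throttle per brick; every self-heal/rebalance I/O would call acquire()
# before touching the disk, leaving the remaining IOPS budget for clients.
throttle = IopsThrottle(max_iops=100)
```

The point of the percentage option is that bricks differ in spindle speed, so a fixed 100 IOPS may be too aggressive on slow disks and needlessly conservative on fast ones.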
In addition: moving the stor2 server to a new DNS name (stor5) and IP address and force-replacing a single brick from stor2 to stor5 still DOSes the entire volume. In addition: disabling the self-heal daemon on the volume does NOT help; bringing up the new server with the single brick still DOSes the entire volume.
The "pre-release" version is ambiguous and about to be removed as a choice. If you believe this is still a bug, please change the status back to NEW and choose the appropriate, applicable version for it.