Bug 1166063 - diff heal makes the system unusable
Summary: diff heal makes the system unusable
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pranith Kumar K
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-11-20 10:58 UTC by Pranith Kumar K
Modified: 2016-06-22 09:24 UTC
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-22 09:24:37 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Pranith Kumar K 2014-11-20 10:58:53 UTC
Description of problem:
Here is the mail sent by Lindsay Mathieson:

2 Node replicate setup,

Everything has been stable for days until I had occasion to reboot
one of the nodes. Since then (the past hour) glusterfsd has been pegging
the CPU(s), with utilization ranging from 1% to 1000%!

On average it's around 500%.

This is a VM server, so there are only 27 VM images for a total of
800GB. It's an Intel E5-2620 (12 cores) with 32GB ECC RAM.

- What does glusterfsd do?

- What can I do to fix this?

thanks,
------------------------

We found the root cause: the mount started self-heal of all the VM images using the diff self-heal algorithm, whose per-block checksums consume high CPU on the bricks, which leads to the issue. We need a way to throttle the number of parallel self-heals.
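The kind of throttling being asked for can be sketched with a counting semaphore (a minimal illustration; the names and the cap of 2 are hypothetical, not GlusterFS code):

```python
import threading
import time

# Hypothetical cap on concurrent self-heals; NOT a GlusterFS option.
MAX_PARALLEL_HEALS = 2
heal_slots = threading.Semaphore(MAX_PARALLEL_HEALS)

lock = threading.Lock()
active = []   # files currently being healed
peak = 0      # highest concurrency observed

def heal_file(name):
    """Pretend to heal one VM image, holding a heal slot while working."""
    global peak
    with heal_slots:              # blocks until a slot frees up
        with lock:
            active.append(name)
            peak = max(peak, len(active))
        time.sleep(0.01)          # stand-in for expensive checksum/copy work
        with lock:
            active.remove(name)

threads = [threading.Thread(target=heal_file, args=(f"vm{i}.img",))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With such a cap in place, no matter how many VM images need healing after a brick comes back, at most MAX_PARALLEL_HEALS checksum-heavy heals run at once.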


Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Have a lot of VMs on replicated volume
2. Bring one brick down and do some write activity on all the VMs
3. Bring the brick back up while the VM operations are in progress
4. This will lead to self-heal of all the VMs by the mount.
5. That will cause high CPU usage on bricks because of checksums.

Expected results:
Bricks should not use so much CPU; there should be some form of throttling on the number of parallel self-heals.

Comment 1 Pranith Kumar K 2016-06-22 09:24:37 UTC
With sharding and the full self-heal algorithm this problem doesn't occur, so closing this.
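The difference between the two algorithms explains why this avoids the CPU spike: diff heal checksums every block on both replicas, while full heal simply copies. A toy sketch (illustrative Python, not GlusterFS code; block size and helper names are made up):

```python
import hashlib

BLOCK = 4  # toy block size; real heals use much larger blocks

def full_heal(src, dst):
    # "full" algorithm: copy the whole file, no checksums at all
    return bytes(src)

def diff_heal(src, dst):
    # "diff" algorithm: checksum every block on both replicas and
    # copy only the blocks that differ
    out = bytearray(dst)
    checksums = 0
    for off in range(0, len(src), BLOCK):
        a = src[off:off + BLOCK]
        b = dst[off:off + BLOCK]
        checksums += 2  # one checksum per replica, per block
        if hashlib.md5(a).digest() != hashlib.md5(b).digest():
            out[off:off + BLOCK] = a
    return bytes(out), checksums

src = b"AAAABBBBCCCCDDDD"   # "good" copy
dst = b"AAAAXXXXCCCCDDDD"   # stale copy: one block differs
healed, n_checksums = diff_heal(src, dst)
```

Even when only one block differs, diff heal checksums every block on both sides, which is where the brick CPU goes; full heal trades network bandwidth for CPU, and sharding keeps the per-heal unit small.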

