Bug 1166063

Summary: diff heal makes the system unusable
Product: [Community] GlusterFS
Component: replicate
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Version: mainline
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Reporter: Pranith Kumar K <pkarampu>
Assignee: Pranith Kumar K <pkarampu>
CC: bugs, lindsay.mathieson
Keywords: Triaged
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-06-22 09:24:37 UTC

Description Pranith Kumar K 2014-11-20 10:58:53 UTC
Description of problem:
Here is the mail sent by Lindsay Mathieson:

2 Node replicate setup,

Everything has been stable for days until I had occasion to reboot
one of the nodes. Since then (past hour) glusterfsd has been pegging
the CPU(s), utilization ranging from 1% to 1000%!

On average it's around 500%.

This is a VM server, so there are only 27 VM images for a total of
800GB. It's an Intel E5-2620 (12 cores) with 32GB ECC RAM.

- What does glusterfsd do?

- What can I do to fix this?

thanks,
------------------------

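We found the root cause: once the brick came back, the mount started self-heal on all the VM images at the same time, and every one of those heals uses the diff self-heal algorithm. Diff self-heal computes checksums on the bricks to find the blocks that differ, and running that many checksum computations in parallel is what consumes the CPU on the bricks. We need a way to throttle the number of parallel self-heals.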
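
To illustrate why diff self-heal is CPU-heavy, here is a minimal Python sketch of the general idea: checksum every block on both copies and copy only the blocks that differ. This is a simplified model and not the GlusterFS replicate code; the function names, the tiny block size and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

# Hypothetical block size, kept tiny so the example is easy to follow.
BLOCK_SIZE = 4


def block_checksums(data, block_size=BLOCK_SIZE):
    """Checksum each fixed-size block of data (this is the CPU-heavy part)."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def diff_heal(source, sink, block_size=BLOCK_SIZE):
    """Make sink match source, copying only the blocks whose checksums differ.

    Returns (healed_sink, blocks_copied, blocks_checksummed).
    """
    # Bring the sink to the source's length before comparing block by block.
    healed = bytearray(sink[:len(source)].ljust(len(source), b"\0"))
    src_sums = block_checksums(source, block_size)
    sink_sums = block_checksums(bytes(healed), block_size)

    copied = 0
    for idx, (src_sum, sink_sum) in enumerate(zip(src_sums, sink_sums)):
        if src_sum != sink_sum:
            start = idx * block_size
            healed[start:start + block_size] = source[start:start + block_size]
            copied += 1
    return bytes(healed), copied, len(src_sums) + len(sink_sums)


if __name__ == "__main__":
    good = b"aaaabbbbccccdddd"
    stale = b"aaaaXXXXccccdddd"
    healed, copied, checksummed = diff_heal(good, stale)
    print(healed == good, copied, checksummed)
```

In this model only one block is copied, but every block on both copies is still checksummed. With large VM images that checksum work dominates, and it is multiplied by the number of files being healed at once.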


Version-Release number of selected component (if applicable):


How reproducible:
always

Steps to Reproduce:
1. Have a lot of VMs on a replicated volume
2. Bring one brick down and do some write activity on all the VMs
3. Bring the brick back up while the VM operations are in progress
4. This will lead to self-heal of all the VMs by the mount.
5. That will cause high CPU usage on bricks because of checksums.

Expected results:
Bricks should not use so much CPU. There should be some kind of throttling of parallel self-heals.
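
One possible shape for such throttling, sketched in Python under stated assumptions: a semaphore caps how many heals run at the same time, however many files are queued. The names run_throttled, fake_heal and max_parallel are hypothetical and do not correspond to any GlusterFS function or volume option.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(heal_one, files, max_parallel):
    """Heal every file, letting at most max_parallel heals run concurrently.

    Returns the highest number of heals observed running at the same time.
    """
    gate = threading.BoundedSemaphore(max_parallel)
    lock = threading.Lock()
    active = 0
    peak = 0

    def worker(name):
        nonlocal active, peak
        with gate:  # blocks while max_parallel heals are already running
            with lock:
                active += 1
                peak = max(peak, active)
            try:
                heal_one(name)
            finally:
                with lock:
                    active -= 1

    # One thread per file, mimicking a mount that starts all heals at once;
    # the semaphore is what actually limits the concurrency.
    with ThreadPoolExecutor(max_workers=len(files) or 1) as pool:
        list(pool.map(worker, files))
    return peak


def fake_heal(name):
    """Stand-in for a real heal; sleeps briefly to simulate work."""
    time.sleep(0.01)


if __name__ == "__main__":
    images = ["vm-%d.img" % i for i in range(27)]
    print(run_throttled(fake_heal, images, max_parallel=2))
```

With 27 images and max_parallel=2, all 27 heals still complete, but no more than two of them are checksumming at any moment.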

Comment 1 Pranith Kumar K 2016-06-22 09:24:37 UTC
With sharding and the full self-heal algorithm, this problem doesn't happen, so closing this.