Bug 717010

Summary: customer reports interactivity regression between 6.0 and 6.1 kernels with high IO
Product: Red Hat Enterprise Linux 6 Reporter: Steven Dake <sdake>
Component: kernel Assignee: Larry Woodman <lwoodman>
Status: CLOSED DUPLICATE QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high Docs Contact:
Priority: high    
Version: 6.1 CC: aquini, asalkeld, cluster-maint, cww, djansa, fdinitto, jcastillo, lhh, nicolas, omer.sen, rnelson, sbradley, sdake, vgoyal
Target Milestone: rc Keywords: Regression
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 709758 Environment:
Last Closed: 2011-10-17 19:53:52 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On: 709758    
Bug Blocks:    
Attachments:
Description Flags
sar data none

Comment 13 Ricky Nelson 2011-09-07 20:16:41 UTC
Created attachment 521997 [details]
sar data

Comment 19 Larry Woodman 2011-09-13 18:14:00 UTC
How easy is it to reproduce this problem?  If I can reproduce it myself, I can try to see which 6.1 kernel version the problem started happening in.  If I can't reproduce it, maybe I can give someone a series of 6.1 development/interim kernels so we can try to pinpoint the change that's responsible for the regression.

Larry Woodman

Comment 20 Nicolas Ross 2011-09-13 18:36:07 UTC
You can also refer to bug # 709758 and our support case # 477247 for this issue.

But basically, when we discovered this, our cluster was just beginning to take shape. The setup was as simple as:

- 8 nodes cluster
- Fibre channel attached storage
- Some service (like apache/php, with no traffic) running on top of gfs2, where the FS is mounted on more than one node.

Waiting a few hours was enough to see a large spike in CPU, and during the issue the ssh console was not usable at all (i.e. slow as hell to respond).

Comment 21 Larry Woodman 2011-09-19 14:14:04 UTC
Is this bug a duplicate of BZ709758?


Larry

Comment 22 Steven Dake 2011-09-19 15:09:07 UTC
Larry,

709758 is the corosync part of this bug where we converted mutex to spinlocks.  After we did this, the customer still had problems where the system would still be unusable periodically but only with RHEL6.1 kernels.  The customer was able to identify the system misbehaved with only a specific kernel (with the updated packages from 709758).

The RHEL process requires separate bugs for different components even if they have the same problem.

Comment 23 Steven Dake 2011-09-19 18:33:06 UTC
comment #22 should read "converted spinlocks to mutexes".

Comment 29 Ricky Nelson 2011-10-17 19:53:52 UTC

*** This bug has been marked as a duplicate of bug 710265 ***