Description of problem:
GFS has been exhibiting strange latency behavior. After a few hours of a customer running their TIBCO application, write access to the filesystem becomes extremely slow. The customer can make the latency disappear in either of two ways:
1. Unmounting the filesystem and remounting it on the nodes experiencing latency.
2. Running updatedb on all the nodes in the cluster at the same time.

Version-Release number of selected component (if applicable):
GFS-modules-smp-6.0.2.25
GFS-6.0.2.25
kernel-smp-2.4.21-32.0.1.EL

How reproducible:
We've only seen it at one customer's site, but they are able to reproduce it, usually within 4 hours.

Steps to Reproduce:
1. Run the customer's TIBCO test.

Actual results:
Once the customer's test starts exhibiting the latency, simply running `dd if=/dev/zero of=/gfs/filename count=16384 bs=8092` demonstrates considerable latency.

Expected results:
No performance degradation.

Additional info:
Created attachment 118992 [details]
statistics-gathering script

This script gathers system statistics from many different counters, including gfs_tool, gulm_tool, iostat, vmstat, and ps.
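For reference, a minimal sketch of the kind of sampling loop the attached script performs. The mount point `/gfs`, the log file name, and the intervals are assumptions for illustration; the real script (attachment 118992) collects more counters. Tool invocations are guarded so the sketch degrades gracefully on a box without the GFS utilities installed:

```shell
#!/bin/sh
# Periodically append system counters to a log so samples taken before and
# during the high-latency window can be compared side by side.
LOG=${1:-gfs-stats.log}      # where samples accumulate
INTERVAL=${2:-1}             # seconds between samples
SAMPLES=${3:-2}              # how many samples to take

i=0
while [ "$i" -lt "$SAMPLES" ]; do
    {
        echo "=== sample $i: $(date) ==="
        # GFS-specific counters, only if the tool exists on this node
        # (/gfs is an assumed mount point)
        command -v gfs_tool >/dev/null 2>&1 && gfs_tool counters /gfs
        # generic system counters, each guarded the same way
        command -v vmstat >/dev/null 2>&1 && vmstat 1 2
        command -v iostat >/dev/null 2>&1 && iostat -x 1 2
        command -v ps     >/dev/null 2>&1 && ps axo pid,stat,pcpu,comm | head -20
    } >> "$LOG" 2>&1
    i=$((i + 1))
    [ "$i" -lt "$SAMPLES" ] && sleep "$INTERVAL"
done
```

Run it on every node in the cluster so the per-node counters can be correlated once the latency appears.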
Created attachment 118993 [details]
debugging counters

This patch adds more debugging counters to GFS to get a better idea of what is going on during periods of high latency. It reports additional information about various GFS functions, including fcntl/flock, llseek, write, open, statfs, and more. It also prints a statfs message when "debug_statfs" is enabled as a mount option.
One test that seemed to help somewhat was increasing the number of lt_lock_partitions: with more partitions, the latency did not show up until after 10 hours of load. However, the filesystem still eventually slowed down.
We believe the problem is an unexpected side effect of an optimization (of all things). The filesystem is divided into portions called resource groups, which contain the allocation bitmaps for portions of the addressable space. Machines develop "preferences" for resource groups so as not to compete for the same areas of the disk as other machines.

We believe that some machines are developing long lists of preferred resource groups as they allocate files. If another machine comes along and deletes those files, it must acquire the locks for them, revoking the locks from the machine that prefers those resource groups. When the first machine wishes to begin allocating again, it _tries_ for locks on resource groups in its preferred list. If it fails to acquire one (which it will if the deallocating machine holds it), it simply moves on to the next entry in the preferred list. As the preferred lists grow over time, so does the number of lock requests (due to the "try" nature of the lock operation). This, we believe, is ultimately what is causing your latency.

We are currently testing a patch that allows users to set a threshold for the number of "try" failures. When the threshold is hit, either the resource group is removed from the preferred list, or the lock is acquired without the TRY flag. The action taken is user-configurable, but lock-without-try is the default. We still need to test whether this is reproducible on RHEL4.
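The threshold behavior described above can be sketched as a toy model. All names here are illustrative, not actual GFS code: the list of lock states, the threshold value, and the function names are assumptions made purely to show the control flow (failed TRY requests accumulate until the threshold trips, at which point the node stops walking its preferred list and takes a blocking lock):

```shell
#!/bin/sh
# Toy model of the patched allocation path.  Each argument to allocate() is
# the state of one resource group in the node's preferred list: 1 means
# another node holds its lock (so a TRY request fails), 0 means it is free.
TRY_THRESHOLD=2   # illustrative value; the real threshold is user-settable

try_lock() {
    # Succeeds (exit 0) only when the resource group is free.
    [ "$1" -eq 0 ]
}

allocate() {
    failures=0
    for held in "$@"; do
        if try_lock "$held"; then
            echo "got rgrp via TRY after $failures failed tries"
            return
        fi
        failures=$((failures + 1))
        if [ "$failures" -ge "$TRY_THRESHOLD" ]; then
            # Default action in the patch: stop trying, lock without TRY.
            # (The alternative action drops the rgrp from the preferred list.)
            echo "threshold hit: blocking lock (no TRY) after $failures failures"
            return
        fi
    done
    echo "exhausted preferred list"
}

# First two preferred rgrps are held elsewhere, third is free; the threshold
# trips before the free one is ever reached.
allocate 1 1 0
```

Without the threshold, a long preferred list full of contended resource groups generates one failed TRY request per entry on every allocation, which is the request storm blamed for the latency.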
Patches are available and about to undergo regression testing.
Created attachment 120087 [details] script to reproduce latency
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0269.html