Bug 168700
Summary: GFS latency issues after running tests for an hour
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 3
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Adam "mantis" Manthei <amanthei>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: GFS Bugs <gfs-bugs>
CC: axel.thimm, cfeist, jbrassow
Fixed In Version: RHBA-2006-0269
Doc Type: Bug Fix
Last Closed: 2006-03-27 18:11:10 UTC
Description
Adam "mantis" Manthei, 2005-09-19 16:56:24 UTC
Created attachment 118992 [details]
statistics gathering script

This script gathers system statistics from many different counters, including gfs_tool, gulm_tool, iostat, vmstat and ps.
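For reference, the kind of collection loop such a script performs can be sketched in Python. Everything here is illustrative (the command lists, the mount path, the interval and log name are assumptions); the actual attachment is the authoritative script.

```python
# Illustrative sketch of a statistics-gathering loop: periodically
# snapshot several system counters into a timestamped log file.
# Command arguments and the /mnt/gfs path are placeholders.
import shutil
import subprocess
import time

COMMANDS = [
    ["gfs_tool", "counters", "/mnt/gfs"],  # GFS per-filesystem counters
    ["vmstat"],
    ["iostat"],
    ["ps", "aux"],
]

def snapshot(commands=COMMANDS):
    """Run each available command once; return {tool_name: output}."""
    out = {}
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            continue  # skip tools not installed on this node
        res = subprocess.run(cmd, capture_output=True, text=True)
        out[cmd[0]] = res.stdout
    return out

def gather(interval=60, rounds=None, log="stats.log"):
    """Append a timestamped snapshot to `log` every `interval` seconds."""
    n = 0
    while rounds is None or n < rounds:
        with open(log, "a") as f:
            f.write("=== %s ===\n" % time.ctime())
            for name, text in snapshot().items():
                f.write("--- %s ---\n%s\n" % (name, text))
        n += 1
        if rounds is None or n < rounds:
            time.sleep(interval)
```

Running `gather()` in the background while the load test executes produces a log that can be correlated with the onset of the latency.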
Created attachment 118993 [details]
debugging counters

This patch adds more debugging counters to GFS to get a better idea of what is going on during periods of high latency. It reports additional information about various GFS functions, including fcntl/flock, llseek, write, open, statfs and more. It also adds a statfs message that prints when "debug_statfs" is enabled as a mount option.
One test that seemed to help somewhat was increasing the number of lt_lock_partitions. With that change the latency did not show up until after 10 hours of load; however, the system still eventually slowed down.

We believe the problem is an unexpected side-effect of an optimization (of all things). The file system is divided into portions called resource groups, which contain the allocation bitmaps for portions of the addressable space. Machines develop "preferences" for resource groups so that they do not compete for the same areas of the disk as other machines. We believe that some machines are developing long lists of preferred resource groups as they allocate files. If another machine comes along and deletes those files, it must acquire the locks for them, revoking the locks from the machine that prefers those resource groups. When the first machine wishes to begin allocating again, it _tries_ for locks on the resource groups in its preferred list. If it fails to acquire one (which it will while the deallocating machine holds it), it simply moves on to the next entry in the list. As preferred lists grow over time, so does the number of lock requests (due to the 'try' nature of the lock operation). This, we believe, is ultimately what causes the latency.

We are currently testing a patch that allows users to set a threshold on the number of "try" failures. If the threshold is hit, either the resource group is removed from the preferred list, or the lock is acquired without the TRY flag. The action taken is user-configurable, but lock-without-try is the default.

Still need to test whether this is reproducible on RHEL4.

Patches are available and about to undergo regression testing.

Created attachment 120087 [details]
script to reproduce latency
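The threshold behavior in the patch described above — count consecutive failed TRY requests per preferred resource group, then either drop the group from the list or fall back to a blocking lock — can be sketched as follows. This is an illustrative Python model, not the GFS source; all names (`alloc_from_rgrp`, `TRY_THRESHOLD`, the dictionary fields) are hypothetical.

```python
# Illustrative model of the fix: track failed TRY requests per preferred
# resource group (rgrp); when a user-set threshold is hit, either drop
# the rgrp from the preferred list or acquire the lock without the TRY
# flag. All names are hypothetical -- this is not GFS source code.

TRY_THRESHOLD = 3  # stands in for the user-configurable threshold

def alloc_from_rgrp(rg, contended, drop_from_list=False):
    """One allocation attempt; returns True if the rgrp lock was taken."""
    if not contended:            # the TRY request succeeded
        rg["try_failures"] = 0
        return True
    rg["try_failures"] += 1
    if rg["try_failures"] < TRY_THRESHOLD:
        return False             # give up; move to the next preferred rgrp
    if drop_from_list:
        rg["preferred"] = False  # stop re-trying this rgrp on every pass
        return False
    return True                  # default: take the lock without TRY (blocks)
```

With the default action, a contended resource group is skipped on the first two passes, and on the third the machine blocks for the lock instead of skipping again — capping the number of wasted "try" requests that previously grew with the preferred list.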
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0269.html