Red Hat Bugzilla – Bug 683155
gfs2: creating large files suddenly slow to a crawl
Last modified: 2013-10-03 20:27:01 EDT
Description of problem:
When you create a large file in gfs2, it can suddenly slow
down to a crawl. Performance drops dramatically and does not recover.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. lvcreate -L 500G -n roth_lv roth_vg
2. mkfs.gfs2 -O -t bobs_roth:roth_lv -p lock_dlm -j 16 /dev/roth_vg/roth_lv
3. mount -t gfs2 /dev/roth_vg/roth_lv /mnt/gfs2
4. dd if=/dev/zero of=/mnt/gfs2/zeroes bs=1M count=512000
When the file hits 32G, it will stop making progress.
The file should continue to grow.
This upstream patch fixes the problem:
Requesting ack flags for 5.7.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update release.
Created attachment 482968 [details]
Here is the initial RHEL5 patch for this issue.
I built and (quickly) tested a kernel that contains this patch.
The kernel is: 2.6.18-247.el5.bz683155.x86_64.rpm
I uploaded it to my people page:
It contains these patches:
gfs2: filesystem sporadic slow performance problem with large files 
gfs2: avoid hangs while reclaiming unlinked metadata 
gfs2: release rgrp glocks properly on unlink error 
gfs2: auto-tune glock hold time 
make queue_delayed_work execute immediately if delay==0 
I have confirmed that this doesn't occur in RHEL6 or upstream, so it is specific to RHEL5.
I posted this patch to rhkernel-list for review. It was tested
on my roth cluster and received positive customer feedback.
Changing status to POST.
Patch(es) available in kernel-2.6.18-248.el5
Detailed testing feedback is always welcomed.
How's progress for this going into the release kernels?
It is scheduled for 5.7
*** Bug 704046 has been marked as a duplicate of this bug. ***
Can we have a hotfix kernel for evaluation please?
Feel free to try the kernel rpms located on my people page:
This kernel has not undergone Red Hat's quality testing process,
so there are no guarantees, but it contains all the latest GFS2 patches.
Please note that, as Bob mentions, this is a test kernel rather than a hotfix. Hotfixes can only be issued once we have verified that a change addresses the reported problem; we can issue a test kernel via a support case, and any supported hotfix also needs to be requested and provided via the support tickets.
I don't see a support case linked to this bug for you. Please could you file one (or update the existing one, if there is one already and for some reason it has not been linked to this bug) with a very brief note referencing bug 683155?
Does this follow on from the -256 kernel in
http://people.redhat.com/jwilson/el5/256.el5/ or is it a parallel development?
(The -256 kernel has a kdump fix we need, as well as possibly incorporating
fixes for bz #666080 (checking on that at the moment))
I've noted these BZs in tx 00333468 as I suspect they have a bearing on the hangs we're seeing (but not necessarily the slow read performance, which appears to be directory-structure related).
Bryn: see also Tx 00335837 - identical symptoms to the start of this BZ
Checked the file size and calculated I/O rate while streaming a large file:
filename            bytes/second     file size (bytes)
/mnt/dash0/filler 68036061.866667 31331344384
/mnt/dash0/filler 68036061.866667 33372426240
/mnt/dash0/filler 3697322.666667 33483345920
/mnt/dash0/filler 695773.866667 33504219136
/mnt/dash0/filler 832443.733333 33529192448
/mnt/dash0/filler 834901.333333 33554239488
/mnt/dash0/filler 74798421 30412554240
/mnt/dash0/filler 71687373 32563175424
/mnt/dash0/filler 68123853 34606891008
/mnt/dash0/filler 73819750 36821483520
/mnt/dash0/filler 70883738 38947995648
/mnt/dash0/filler 70185916 41053573120
/mnt/dash0/filler 73120700 43247194112
/mnt/dash0/filler 67897071 520676028416
/mnt/dash0/filler 68036062 522717110272
500000+0 records in
500000+0 records out
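The per-sample rates above can be gathered with a small polling script. This is a sketch rather than the script actually used in the report; it samples the file size at a fixed interval with stat(1) and prints the apparent write rate in bytes/second (the FILE and INTERVAL defaults are illustrative):

```shell
#!/bin/sh
# Sketch: poll a growing file and report its apparent write rate.
# FILE and INTERVAL are illustrative defaults, not values from the report.
FILE=${1:-/mnt/dash0/filler}
INTERVAL=${2:-30}

prev=$(stat -c %s "$FILE")
while sleep "$INTERVAL"; do
    cur=$(stat -c %s "$FILE")
    # Bytes written since the last sample, divided by the sample interval.
    printf '%s %s %s\n' "$FILE" $(( (cur - prev) / INTERVAL )) "$cur"
    prev=$cur
done
```

Run it in a second terminal while the dd from the reproduction steps is writing; on an affected kernel the printed rate collapses once the file size passes roughly 33 GB, matching the table above.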
I see similar results on -262.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
GFS2 (Global File System 2) keeps track of a list of resource groups to allow better performance when allocating blocks. Previously, when a user created a large file on GFS2, the file system could run out of allocation space because its search was confined to the recently used resource groups. With this update, GFS2 traverses the MRU (Most Recently Used) list so that all available resource groups can be used: if a large span of blocks is already in use, GFS2 moves on and allocates blocks from another resource group.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.
Removing external tracker bug with the id 'DOC-65879' as it is not valid for this tracker
Removing external tracker bug with the id 'DOC-61679' as it is not valid for this tracker