+++ This bug was initially created as a clone of Bug #222871 +++
Crosswrite from GFS1 to GFS2 for RHEL5

Description of problem:
Our test case for gfs_fsck stress has not changed between RHEL4 and
RHEL5, but the time it takes to run has doubled.

RHEL4: ~40 minutes
<start tag="gfs_fsck_stress" pid="6745" time="Wed Jan 3 17:29:14 2007" type="cmd"/>
<pass tag="gfs_fsck_stress" pid="6745" time="Wed Jan 3 18:11:25 2007" type="cmd"/>

RHEL5: ~90 minutes
<start tag="gfs_fsck_stress" pid="8381" time="Mon Jan 15 16:16:57 2007" type="cmd"/>
<pass tag="gfs_fsck_stress" pid="8381" time="Mon Jan 15 17:49:32 2007" type="cmd"/>

Version-Release number of selected component (if applicable):
gfs2-utils-0.1.24-1.el5.i386
gfs-utils-0.1.10-1.el5.i386
kernel-2.6.18-1.3002.el5.i686
kmod-gfs-0.1.16-2.2.6.18_1.3002.el5.i686

How reproducible:
Every time.

Steps to Reproduce:
1. run gfs_fsck_stress from sts

Actual results:
See above times.

Expected results:
It should take roughly the same amount of time.

Additional info:
I need to add more intermediate timestamps to the test case to see
which parts are taking longer.

-- Additional comment from rpeterso on 2007-01-18 12:42 EST --

My initial tests reflect just the opposite: RHEL5 is much faster.
I ran gfs_fsck -y on an 8GB GFS file system from my two trin clusters
(one RHEL4, the other RHEL5) on nearly identical hardware, including
the same file system on the same SAN, visible to both clusters.
Here are the results:

System  RHEL  REAL       USER      SYS
------- ----  ---------  --------  ---------
trin-09   5   0m20.567s  0m4.248s  0m2.580s
trin-10   5   0m20.423s  0m4.121s  0m2.673s
trin-11   5   0m20.268s  0m4.146s  0m2.584s
trin-12   4   0m29.904s  0m4.343s  0m12.152s
trin-13   4   0m32.040s  0m4.389s  0m12.112s
trin-14   4   0m30.140s  0m4.286s  0m12.216s

As you can see, RHEL5 performed the same fsck on the same GFS file
system about a third (33%) faster than RHEL4 (~20 seconds vs. ~30
seconds).

My fear, of course, is that this is a regression where one of my
recent changes to gfs_fsck (made for both RHEL4 and RHEL5) impacted
the performance of both.  That has yet to be proven.  It could also
be a specific code path that was impacted, involving a specific kind
of file system damage.

I need to run the same test (gfs_fsck_stress) on both trin clusters,
but I don't know the requirements of the test yet (i.e. how much
storage space is needed).

-- Additional comment from rpeterso on 2007-01-18 13:35 EST --

Same test, same hardware, different file system (39G vs. 8G):

System  RHEL  REAL       USER       SYS
------- ----  ---------  ---------  ---------
trin-09   5   0m54.502s  0m21.201s  0m13.071s
trin-10   5   0m55.976s  0m21.676s  0m12.529s
trin-11   5   0m54.559s  0m21.201s  0m12.632s
trin-12   4   1m34.916s  0m22.840s  0m59.349s
trin-13   4   1m34.714s  0m22.046s  0m59.991s
trin-14   4   1m34.779s  0m22.054s  1m0.053s

Again, this shows RHEL5 is approximately 42% faster than RHEL4
(~55 sec vs. ~95 sec).

-- Additional comment from nstraz on 2007-01-18 22:49 EST --

Here are the numbers I get.  I ran the same hardware in both cases,
tank-0{2,3,4,5}: once with RHEL5-Server-20070112.3 and once with
RHEL4-U4.

gfs_fsck_stress should be started with the cluster up and one or more
file systems mounted.  For my runs I only had one 915GB file system
mounted.
The following operations are performed:
ck1    - umount the file system from all nodes and run fsck
mkfs   - remake the file system
ck2    - fsck again
eattr  - create some extended attributes on the file system
ck3    - fsck again
mktree - create a tree of files on the file system
ck4    - fsck again
recov  - start a load, reboot all nodes, and recover the file system
ck5    - fsck again

Step    RHEL4  RHEL5
ck1       286    730
mkfs        4      3
ck2       290    745
eattr     105     82
ck3       296    745
mktree     86    686
ck4       313    748
recov            337
ck5              743

My RHEL4 run failed during the recovery step, but the overall
slowdown is quite apparent on my nodes.

-- Additional comment from nstraz on 2007-01-19 10:47 EST --

Created an attachment (id=145989)
gfs_fsck_stress output from RHEL4

-- Additional comment from nstraz on 2007-01-19 10:48 EST --

Created an attachment (id=145990)
gfs_fsck_stress output from RHEL5

-- Additional comment from rpeterso on 2007-01-19 13:01 EST --

Recreated and solved.  This problem is not a difference between RHEL4
and RHEL5.  The slowdown, as I feared in comment #1, was due to
recent changes to gfs_fsck, namely the code added to periodically
report the percent complete.  The code was making a system call
(get time of day) for every block in the file system, and I didn't
realize it would affect performance that much.

I changed the code so that it only checks the time of day once per
one percent of the file system processed, and this method brings back
(approximately) the performance seen before the change.  I tested a
soon-to-be-attached patch under both RHEL4 and RHEL5, and for the
file system mentioned in comment #2, my times were as follows:

RHEL4: went from 1m34s down to 0m42.2s
RHEL5: went from 0m55s down to 0m38s

-- Additional comment from rpeterso on 2007-01-19 13:31 EST --

Created an attachment (id=146014)
Patch to fix the problem

This patch was tested for RHEL5 on trin-09.  I also changed the
Makefile to use -O2 compiler optimization, which brought the RHEL4
time down further, from 42s to 39s.
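For illustration, here is a minimal sketch of the throttling idea
described above.  This is not the attached patch (id=146014); the
function and variable names are hypothetical.  The point is simply
that gettimeofday() is called at most once per one percent of blocks
processed, instead of once per block:

/* Minimal sketch (hypothetical names, not the actual gfs_fsck code).
 * Old behavior: gettimeofday() on every block.
 * New behavior: gettimeofday() only at each 1% checkpoint. */
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

static void report_progress(uint64_t cur_block, uint64_t total_blocks,
                            struct timeval *last_report)
{
        uint64_t one_percent = total_blocks / 100;

        /* Skip the syscall entirely except at 1% boundaries. */
        if (one_percent == 0 || cur_block % one_percent != 0)
                return;

        struct timeval now;
        gettimeofday(&now, NULL);    /* at most ~100 calls per pass */

        if (now.tv_sec > last_report->tv_sec) {   /* once a second */
                printf("\r%llu%% complete",
                       (unsigned long long)(cur_block * 100 / total_blocks));
                fflush(stdout);
                *last_report = now;
        }
}

int main(void)
{
        /* Hypothetical block count standing in for a real fs scan. */
        uint64_t total_blocks = 10000000;
        struct timeval last = { 0, 0 };

        for (uint64_t b = 0; b < total_blocks; b++)
                report_progress(b, total_blocks, &last);

        printf("\n");
        return 0;
}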
Fix committed to CVS at HEAD and RHEL5. Setting status to Modified.
ack for 5.1 in case it's not in 5.0 yet
Fixing Product Name. Cluster Suite components were integrated into Enterprise Linux for version 5.0.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0579.html