+++ This bug was initially created as a clone of Bug #222871 +++
Crosswrite from GFS1 to GFS2 for RHEL5

Description of problem:
Our test case for gfs_fsck stress has not changed between RHEL4 and
RHEL5, but the time it takes to run has doubled.

RHEL4: ~40 minutes
<start tag="gfs_fsck_stress" pid="6745" time="Wed Jan 3 17:29:14 2007" type="cmd"/>
<pass tag="gfs_fsck_stress" pid="6745" time="Wed Jan 3 18:11:25 2007" type="cmd"/>

RHEL5: ~90 minutes
<start tag="gfs_fsck_stress" pid="8381" time="Mon Jan 15 16:16:57 2007" type="cmd"/>
<pass tag="gfs_fsck_stress" pid="8381" time="Mon Jan 15 17:49:32 2007" type="cmd"/>

Version-Release number of selected component (if applicable):
gfs2-utils-0.1.24-1.el5.i386
gfs-utils-0.1.10-1.el5.i386
kernel-2.6.18-1.3002.el5.i686
kmod-gfs-0.1.16-2.2.6.18_1.3002.el5.i686

How reproducible:
Every time.

Steps to Reproduce:
1. run gfs_fsck_stress from sts

Actual results:
See above times.

Expected results:
It should take roughly the same amount of time.

Additional info:
I need to add more intermediate timestamps to the test case to see
which parts are taking longer.

-- Additional comment from rpeterso on 2007-01-18 12:42 EST --

My initial tests reflect just the opposite: RHEL5 is much faster.
I ran gfs_fsck -y on an 8GB GFS file system from my two trin clusters
(one RHEL4, the other RHEL5) on nearly identical hardware, including
the same file system on the same SAN, visible to both clusters.
Here are the results:

System  RHEL  REAL       USER      SYS
------- ----  ---------  --------  ---------
trin-09   5   0m20.567s  0m4.248s  0m2.580s
trin-10   5   0m20.423s  0m4.121s  0m2.673s
trin-11   5   0m20.268s  0m4.146s  0m2.584s
trin-12   4   0m29.904s  0m4.343s  0m12.152s
trin-13   4   0m32.040s  0m4.389s  0m12.112s
trin-14   4   0m30.140s  0m4.286s  0m12.216s

As you can see, RHEL5 performed the same fsck on the same GFS file
system about a third (33%) faster than RHEL4 (~20 seconds vs. ~30
seconds).

My fear, of course, is that this is a regression where one of my
recent changes to gfs_fsck (made for both RHEL4 and RHEL5) impacted
the performance of both.  That has yet to be proven.  It could also
be a specific code path that was impacted, involving a specific kind
of file system damage.

I need to run the same test (gfs_fsck_stress) on both trin clusters,
but I don't know the requirements of the test yet (i.e. how much
storage space is needed).

-- Additional comment from rpeterso on 2007-01-18 13:35 EST --

Same test, same hardware, different file system (39G vs. 8G):

System  RHEL  REAL       USER       SYS
------- ----  ---------  ---------  ---------
trin-09   5   0m54.502s  0m21.201s  0m13.071s
trin-10   5   0m55.976s  0m21.676s  0m12.529s
trin-11   5   0m54.559s  0m21.201s  0m12.632s
trin-12   4   1m34.916s  0m22.840s  0m59.349s
trin-13   4   1m34.714s  0m22.046s  0m59.991s
trin-14   4   1m34.779s  0m22.054s  1m0.053s

Again, this shows RHEL5 is approximately 42% faster than RHEL4
(~55 sec vs. ~95 sec).

-- Additional comment from nstraz on 2007-01-18 22:49 EST --

Here are the numbers I get.  I ran the same hardware in both cases,
tank-0{2,3,4,5}: once with RHEL5-Server-20070112.3 and once with
RHEL4-U4.

gfs_fsck_stress should be started with the cluster up and one or more
file systems mounted.  For my runs I only had one 915GB file system
mounted.
The following operations are performed:
ck1    - umount the file system from all nodes and run fsck
mkfs   - remake the file system
ck2    - fsck again
eattr  - create some extended attributes on the file system
ck3    - fsck again
mktree - create a tree of files on the file system
ck4    - fsck again
recov  - start a load, reboot all nodes, and recover the file system
ck5    - fsck again

Step    RHEL4  RHEL5
ck1       286    730
mkfs        4      3
ck2       290    745
eattr     105     82
ck3       296    745
mktree     86    686
ck4       313    748
recov            337
ck5              743

My RHEL4 run failed during the recovery step, but the overall
slowdown is quite apparent on my nodes.

-- Additional comment from nstraz on 2007-01-19 10:47 EST --

Created an attachment (id=145989)
gfs_fsck_stress output from RHEL4

-- Additional comment from nstraz on 2007-01-19 10:48 EST --

Created an attachment (id=145990)
gfs_fsck_stress output from RHEL5

-- Additional comment from rpeterso on 2007-01-19 13:01 EST --

Recreated and solved.  This problem is not a difference between RHEL4
and RHEL5.  The slowdown, as I feared in comment #1, was due to
recent changes to gfs_fsck, namely the code added to periodically
report the percent complete.  The code was making a system call
(get time of day) for every block in the file system, and I didn't
realize it would affect performance that much.

I changed the code so that it only checks the time of day once per
one percent of the file system processed, and this method brings back
(approximately) the performance seen before the change.  I tested a
soon-to-be-attached patch under both RHEL4 and RHEL5, and for the
file system mentioned in comment #2, my times were as follows:

RHEL4: went from 1m34s down to 0m42.2s
RHEL5: went from 0m55s down to 0m38s

-- Additional comment from rpeterso on 2007-01-19 13:31 EST --

Created an attachment (id=146014)
Patch to fix the problem

This patch was tested for RHEL5 on trin-09.  I also changed the
Makefile to use -O2 compiler optimization, which brought the RHEL4
time down further, from 42s to 39s.
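For illustration, here is a minimal sketch of the throttling idea
described above.  This is not the attached patch (id=146014); the
function and variable names are hypothetical.  The point is simply
that gettimeofday() is called at most once per one percent of blocks
processed, instead of once per block:

/* Minimal sketch (hypothetical names, not the actual gfs_fsck code).
 * Old behavior: gettimeofday() on every block.
 * New behavior: gettimeofday() only at each 1% checkpoint. */
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

static void report_progress(uint64_t cur_block, uint64_t total_blocks,
                            struct timeval *last_report)
{
        uint64_t one_percent = total_blocks / 100;

        /* Skip the syscall entirely except at 1% boundaries. */
        if (one_percent == 0 || cur_block % one_percent != 0)
                return;

        struct timeval now;
        gettimeofday(&now, NULL);    /* at most ~100 calls per pass */

        if (now.tv_sec > last_report->tv_sec) {   /* once a second */
                printf("\r%llu%% complete",
                       (unsigned long long)(cur_block * 100 / total_blocks));
                fflush(stdout);
                *last_report = now;
        }
}

int main(void)
{
        /* Hypothetical block count standing in for a real fs scan. */
        uint64_t total_blocks = 10000000;
        struct timeval last = { 0, 0 };

        for (uint64_t b = 0; b < total_blocks; b++)
                report_progress(b, total_blocks, &last);

        printf("\n");
        return 0;
}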
Fix committed to CVS at HEAD and RHEL5. Setting status to Modified.
ack for 5.1 in case it's not in 5.0 yet
Fixing Product Name. Cluster Suite components were integrated into Enterprise Linux for version 5.0.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0579.html