494372 – GFS2: gfs2_grow very slow with 1k file system block size

Bug 494372 - GFS2: gfs2_grow very slow with 1k file system block size

Summary: GFS2: gfs2_grow very slow with 1k file system block size

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	gfs2-utils
Sub Component:
Version:	5.3
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	5.5
Assignee:	Robert Peterson
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:	485451
Blocks:
TreeView+	depends on / blocked

Reported:	2009-04-06 15:52 UTC by Robert Peterson
Modified:	2010-01-12 03:42 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	485451
Environment:
Last Closed:	2009-11-19 14:25:24 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Robert Peterson 2009-04-06 15:52:13 UTC

+++ This bug was initially created as a clone of Bug #485451 +++
It's been noted that gfs2_grow is also slow with 1K block size.
Any potential fix for gfs2_grow would be vastly different from the
fix I did for gfs_grow in bug #485451 because gfs2_grow is so
different.  So I need to start over with analysis on what's slowing
it down.  If gfs2_grow performance is acceptable, I may close this
bug as WORKSFORME or some such.

Description of problem:

gfs_grow takes a lot longer to run when the file system block size is 1k vs 4k.

Here is a list of gfs_grow run times for a completed 4k run and a still running 1k run.

4k 
Feb 12 14:52:20 6(sec)
Feb 12 14:52:33 6(sec)
Feb 12 14:52:50 7(sec)
Feb 12 14:55:16 84(sec)
Feb 12 14:57:32 97(sec)
Feb 12 14:59:23 127(sec)
Feb 12 15:01:00 71 (sec)
Feb 12 15:04:00 110(sec)
Feb 12 15:06:45 130(sec)
Feb 12 15:08:57 145(sec)
Feb 12 15:12:04 132(sec)
Feb 12 15:15:52 146(sec)
Feb 12 15:17:17 102(sec)
Feb 12 15:20:29 124(sec)

1k 
Feb 12 16:54:35 5509(sec)
Feb 12 20:04:43 11191(sec)
Feb 13 00:17:17 15055(sec)
Feb 13 05:28:58 18461(sec)

Version-Release number of selected component (if applicable):
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.31-3.el5
kernel-2.6.18-129.gfs2abhi.001

How reproducible:
100%

Steps to Reproduce:
1. mkfs -t gfs -b 1024 ... /dev/f
2. 
3.
  
Actual results:


Expected results:


Additional info:

--- Additional comment from nstraz on 2009-02-13 11:22:33 EDT ---

I should note that both runs are running with a load on the file system while the gfs_grow is running.

--- Additional comment from nstraz on 2009-02-13 14:42:32 EDT ---

Created an attachment (id=331865)
Output log from repetitive gfs_grow test

I ran a test which started with a 500GB file system, then grew the file system by 500GB five times.  I got the following runtimes in seconds

GFS 1k: 145 255 368 483 600
GFS 4k: 4 5 5 5 5 4
GFS2 1k: 70 69 153 241 306
GFS2 4k: 17 3 4 6 7

This shows that as the file system size grows, gfs_grow and gfs2_grow take
longer to complete.

Attached is the full output log from my test run.

--- Additional comment from nstraz on 2009-02-18 13:26:26 EDT ---

gfs2_grow also takes up a ton of memory while growing a 1k block size file system.

Here's top while growing a file system by 500G:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21510 root      18   0 1024m 722m  496 R  1.0 72.1   1:11.50 gfs2_grow

Test output:

lvextend -l +100%FREE growfs/grow1 on dash-03
  Extending logical volume grow1 to 3.87 TB
  Logical volume grow1 successfully resized
growing grow1 on dash-02
FS: Mount Point: /mnt/grow1
FS: Device:      /dev/mapper/growfs-grow1
FS: Size:        3638531454 (0xd8df957e)
FS: RG size:     523982 (0x7fece)
DEV: Size:       4158324736 (0xf7db0000)
The file system grew by 507610MB.

--- Additional comment from rpeterso on 2009-02-23 15:24:17 EDT ---

Created an attachment (id=332967)
Try 1 patch

This is still a preliminary (proof-of-concept) patch.

In my tests with a load running on a single node (the same node doing
gfs_grow) I found that writing the new RGs was very slow; even slower
than writing the new rindex entries.  The times looked like this:

gather_info: 220 ms.
create_rgrps: 14 ms.
write rgrps: 68537 ms.
write rindex: 48621 ms.

With this patch, the times are now:

write rgrps: 1011 ms.
write rindex: 21334 ms.

FWIW, this was extending a gfs file system from 5G to 305G.

The speedup in rgrps is simple: Instead of doing:
(1) open, (2) lseek, (3) write, (4) sync, (5) close for every block
it now does one open, one lseek per RG, several consecutive
writes, then one sync and one close.  This should be pretty safe
because if the gfs_grow is interrupted or loses power during that
write, the next gfs_grow will recalculate the same starting block
number and rewrite the RGs in their entirety, and since the rindex
hasn't been written at that point, the new rindex should be accurate.

The speedup in rindex writing was achieved by writing ten rindex
entries at a shot rather than one per write.  We could easily change
it to more than ten, since I defined it as a #define.  In theory,
we could get fancy and try to write the number of entries that best
fits onto our block size, after calculating where in the block we
started on that first rindex append.  I'm not sure it's worth it:
Aside from our QA, how often are customers executing this code?

This isn't guaranteed to be quite as safe.  Still, at this point, the
RGs have been written out and synced to disk.  If the rindex write is
interrupted, you're probably just as likely to run into the same
problem with a single 96-byte write as you are with a 960-byte write.
Perhaps Steve or Abhi can look this over and see if the can punch
holes in my theory.

--- Additional comment from nstraz on 2009-02-26 12:29:48 EDT ---

I've been testing a gfs_grow which writes 40 rindex entries at a time and the speed still has not improved enough.

strace shows that the ioctls are taking a long time:

[root@dash-01 ~]# strace -T -p 24674
Process 24674 attached - interrupt to quit
ioctl(3, 0x472d, 0x7fffbda5a600)        = 3840 <38.554636>
ioctl(3, 0x472d, 0x7fffbda5a600)        = 3840 <249.566196>
ioctl(3, 0x472d, 0x7fffbda5a600)        = 3840 <33.458248>
ioctl(3, 0x472d, 0x7fffbda5a600)        = 3840 <418.865492>
ioctl(3, 0x472d

There is a lot of lock traffic going on.  I'm trying to analyze the DLM network traffic with wireshark.  I captured the extra section of the DLM packets with the following command:

tshark -R dlm3.m.extra -z proto,colinfo,dlm3.m.extra,dlm3.m.extra

then piped it through a perl script to decode the extra data into something more readable.  I ran the capture for 26:24 on dash-01, where a gfs_grow was in progress.

571965 locks captured, 361 lock ops per second.
    138 inode, 571821 RG, 7 quota

59357 unique locks found
      4 inode, 59351 RG, 2 quota

--- Additional comment from syeghiay on 2009-03-03 15:24:58 EDT ---

Set pm_ack for 5.4 gfs-utils blocker-rc.

--- Additional comment from rpeterso on 2009-03-31 18:44:27 EDT ---

Created an attachment (id=337417)
Try 2 patch

I verified that the holdup here is in doing this:

	error = gfs_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE,
				 LM_FLAG_PRIORITY | GL_SYNC | GL_NOCANCEL_OTHER,
				 &i_gh);

This often takes 1-2 seconds--yes seconds--to acquire the rindex glock.
Part of the slowdown is that the GL_SYNC forces it to sync to disk,
including the journal entry, and the inplace_reserve forces it to
re-read the rindex file, possibly on all nodes, because the version
number of the inode glock is changed.

FYI, the glock_purge "feature" doesn't help the speed at all.

This patch is much like the previous one.  The biggest difference is
that it tries to write 43 rindex entries per write rather than 1 in
the original and 10 or 40 in previous incarnations.  The magic number
of 43 was calculated from (the 4K page size / 96 bytes per rindex entry).

The goal here is to allocate a full page of data if we can and write
that to the rindex file so the reads and journals don't have to fight
with each other quite as much.  This should probably be tweaked so
that it retrieves the current rindex size and calculates exactly how
best to do it to minimize disk IO.  Perhaps tomorrow I'll have another
go at that concept.  For now, this patch allows the grow test to run
in approximately 20 minutes per iteration, which is considerably
better than before.

Beyond the tweaks mentioned (which I'll try tomorrow) there's really
nothing more I can improve upon here unless I change the gfs kernel
code to introduce specialized code for gfs_grow, such as a new
performance-tweaked ioctl to use instead of hfile_write.

--- Additional comment from syeghiay on 2009-04-01 09:34:07 EDT ---

Removed Conditional NAK - Capacity from Devel Whiteboard since devel_ack+ and qa_ack+ exist.

--- Additional comment from rpeterso on 2009-04-01 14:01:07 EDT ---

*** Bug 486408 has been marked as a duplicate of this bug. ***

--- Additional comment from rpeterso on 2009-04-02 22:36:55 EDT ---

Created an attachment (id=337968)
Try 3 patch

This patch seems to work just as well but has an added safety
measure for near-full file systems.  It also has a big comment
explaining why I'm doing what I'm doing.

--- Additional comment from rpeterso on 2009-04-03 17:17:48 EDT ---

Created an attachment (id=338126)
Try 4 patch

This patch is similar to its predecessor, but the chunk size is
calculated dynamically based on the block size.  That way, performance
should be improved as well for systems with bigger block sizes, such as
ia64.

Comment 1 Robert Peterson 2009-05-06 14:03:28 UTC

Nate, can you please verify that gfs2_grow's performance is indeed
"very slow"?  I ran it a few times and although 1k blocks were a
bit slower than 4k blocks, it still didn't seem outrageous to me,
at least on my hardware (the SAN on my roth cluster).  I started
with a 200GB lv, lvextended it by +500G and gfs2_grow took only
9.6 seconds, as opposed to the same hardware with 4K blocks, which
took 6.7 seconds.

I'm not saying there's no problem, but given that a typical gfs2
user probably won't be doing gfs2_grow more than once a month at
most, 3 seconds doesn't seem like much of a sacrifice.  I'm not sure
it's worth messing with.  If, however, the problem impacts the
ability of qe to get through their tests in a timely manner, then
it might be well worth tweaking the code a bit.

Comment 2 Robert Peterson 2009-11-19 14:25:24 UTC

Closing due to six months of NEEDINFO.

Note You need to log in before you can comment on or make changes to this bug.