Bug 190950 - Small write and gfs_fsync performance issue
Summary: Small write and gfs_fsync performance issue
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Wendy Cheng
QA Contact: GFS Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2006-05-07 02:29 UTC by Wendy Cheng
Modified: 2018-10-19 20:43 UTC
3 users

Fixed In Version: RHBA-2006-0561
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-08-10 21:35:14 UTC
Embargoed:


Attachments
A draft patch for this issue (860 bytes, patch)
2006-05-07 02:37 UTC, Wendy Cheng
Revised patch. (1.03 KB, patch)
2006-05-08 05:44 UTC, Wendy Cheng
another revise - add file op statistics call back. (1.03 KB, patch)
2006-05-08 16:33 UTC, Wendy Cheng
Newest revised patch - will check this code into CVS. (823 bytes, patch)
2006-05-12 22:21 UTC, Wendy Cheng


Links:
Red Hat Product Errata RHBA-2006:0561 (normal, SHIPPED_LIVE): GFS-kernel bug fix update, last updated 2006-08-10 04:00:00 UTC

Description Wendy Cheng 2006-05-07 02:29:57 UTC
Description of problem:

The customer supplied a test program that is said to closely represent
their running environment. Besides showing roughly 10x less bandwidth
(from the application's point of view) compared with EXT3, GFS also
generates 200MB of disk I/O for 8MB of application data vs.
EXT3's 38MB of disk I/O.

The test program does the following:

 1. Create an 8MB temp file and write to it sequentially.
 2. Enable its own pthread mutex locks.
 3. Start the timer.
 4. Loop 8192 times:
        1. "write" 1024 bytes of data at a random offset
        2. "fdatasync", followed by "fsync", after every write
 5. Close the file.
 6. Stop the timer.
 7. Calculate bandwidth and latency from the time statistics
    collected between steps 3 and 6.

The program is capable of multi-threaded runs (and has many other
features), but we're focusing on the single-threaded scenario covering
steps 1 to 7; a rough sketch of that core loop follows.
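
For reference, here is a minimal C sketch of the loop described above.
This is not the customer's actual test source; the file name, buffer
contents, and error handling are made up, and only the I/O pattern
(8192 random 1KB writes, each followed by fdatasync() and fsync())
matches the description.

#define _XOPEN_SOURCE 600            /* for pwrite() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define FILE_SIZE (8 * 1024 * 1024)  /* 8MB temp file */
#define IO_SIZE   1024               /* 1KB per write */
#define LOOPS     8192

int main(void)
{
	char buf[IO_SIZE];
	struct timeval start, stop;
	double secs;
	int fd, i;

	memset(buf, 'x', sizeof(buf));

	/* Step 1: create the temp file and fill it sequentially. */
	fd = open("fstest.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) { perror("open"); return 1; }
	for (i = 0; i < FILE_SIZE / IO_SIZE; i++)
		if (write(fd, buf, IO_SIZE) != IO_SIZE) { perror("write"); return 1; }

	/* Steps 3-4: timed loop of random-offset writes, each followed
	   by fdatasync() and fsync(). */
	gettimeofday(&start, NULL);
	for (i = 0; i < LOOPS; i++) {
		off_t off = (off_t)(rand() % (FILE_SIZE / IO_SIZE)) * IO_SIZE;
		if (pwrite(fd, buf, IO_SIZE, off) != IO_SIZE) { perror("pwrite"); return 1; }
		fdatasync(fd);
		fsync(fd);
	}

	/* Steps 5-6: close the file, then stop the timer. */
	close(fd);
	gettimeofday(&stop, NULL);

	/* Step 7: bandwidth and latency from the collected timings. */
	secs = (stop.tv_sec - start.tv_sec) +
	       (stop.tv_usec - start.tv_usec) / 1e6;
	printf("bandwidth: %.2f KB/s, avg latency: %.3f ms\n",
	       (LOOPS * IO_SIZE) / 1024.0 / secs, secs * 1000.0 / LOOPS);
	return 0;
}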

Version-Release number of selected component (if applicable):
GFS 6.1

How reproducible:
Each and every time

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 2 Wendy Cheng 2006-05-07 02:37:59 UTC
Created attachment 128700 [details]
A draft patch for this issue

The performance gap is traced to the GFS lock implementation inside gfs_fsync().
It adds the GL_SYNC flag to its shared inode lock, and this flag introduces
repeated page writes and metadata flushes, among many other things.

The uploaded test patch tries to remedy the issue by:

1. Replacing the shared lock with an exclusive lock.
2. Borrowing the Linux VFS layer's generic_osync_inode() (used by the O_SYNC
   code path) to flush the local in-core inode to disk, instead of the
   original GFS inode_go_sync().
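
Roughly, the shape of this draft looks like the sketch below. This is
not the attached diff; the GFS names (get_v2ip, gfs_glock_nq_init,
gfs_glock_dq_uninit, LM_ST_EXCLUSIVE) are paraphrased from the GFS1
sources, and the generic_osync_inode() argument list varies across 2.6
kernels, so treat every name and signature here as approximate:

static int gfs_fsync_sketch(struct file *file, struct dentry *dentry,
			    int datasync)
{
	struct inode *inode = dentry->d_inode;
	struct gfs_inode *ip = get_v2ip(inode);	/* assumed GFS1 accessor */
	struct gfs_holder i_gh;
	int error;

	/* Exclusive lock instead of the old shared + GL_SYNC request. */
	error = gfs_glock_nq_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &i_gh);
	if (error)
		return error;

	/* Reuse the O_SYNC helper to flush data, metadata, and the inode;
	   check the argument list against the kernel tree being built. */
	error = generic_osync_inode(inode, inode->i_mapping,
				    OSYNC_DATA | OSYNC_METADATA | OSYNC_INODE);

	gfs_glock_dq_uninit(&i_gh);
	return error;
}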

After the changes, the (application) bandwidth jumps from 240.94 KB/s up to
2.67 MB/s, very close (and almost equal) to ext3's, under the lock_nolock
mount option.

The ramifications of this change are unknown at this moment - still under
heavy testing.

Comment 3 Wendy Cheng 2006-05-07 03:25:21 UTC
There is a (newly added) exclusive lock in the code path - so (I hope)
there will be no data corruption due to this change. The major concern
here is whether other nodes will have metadata hanging around in
memory. That data may sooner or later get flushed to disk, but we may
violate how fsync is supposed to work.

The excessive flushes in the original code are there to make sure *all*
data and metadata are synced to disk *across* the cluster. That is a
very difficult job. Even on a local filesystem, the inode race is one of
the top bug generators. There have been plenty of nasty examples in
previous RHEL updates.

Other than testing the above patch, inode_go_sync() & friends are being
re-examined and we may find a good fix there. Right now, consider the
above patch a work-around.

Comment 4 Wendy Cheng 2006-05-08 05:44:34 UTC
Created attachment 128727 [details]
Revised patch.

Issues were found with the previous patch that uses the generic_osync_inode()
VFS call. So instead, we go the inode_go_sync() route. After a few revisions,
we simply pull gfs_log_flush_glock() out of inode_go_sync() and let gfs_fsync()
call it directly.
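
A sketch of the resulting shape of gfs_fsync() is below. Again, this is
not the attached patch; the lock mode, the ordering, and the helper
names (get_v2ip, gfs_glock_nq_init, gfs_log_flush_glock) are paraphrased
from the GFS1 sources and should be treated as assumptions:

static int gfs_fsync_sketch(struct file *file, struct dentry *dentry,
			    int datasync)
{
	struct gfs_inode *ip = get_v2ip(dentry->d_inode);
	struct gfs_holder gh;
	int error;

	error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
	if (error)
		return error;

	/* (data page writeback elided in this sketch) */

	/* Flush this glock's incore log entries here, directly, now that
	   inode_go_sync() no longer does it as part of GL_SYNC. */
	gfs_log_flush_glock(ip->i_gl);

	gfs_glock_dq_uninit(&gh);
	return error;
}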

After the new fix, GFS obtains 1 MB/s bandwidth, compared to the original 240.94
KB/s. 

Most likely, we'll settle on this patch.

Comment 8 Wendy Cheng 2006-05-08 16:33:52 UTC
Created attachment 128750 [details]
another revise - add file op statistics call back.

Comment 15 Wendy Cheng 2006-05-09 03:39:11 UTC
On my machine (6G RAM, FC storage):

./fstest -d -r -l -S 1 -b 1k -s xm

if x=10 (10M file)

ext3: 3.33 MB/s
gfs1: 1.00 MB/s (after) 238.14 KB/s (before)

if x=8 (8M file)

ext3: 2.67 MB/s 
gfs1: 1.00 MB/s (after)  240.94 KB/s (before)

Comment 18 Wendy Cheng 2006-05-12 22:21:05 UTC
Created attachment 128964 [details]
Newest revised patch - will check this code into CVS.

Comment 25 Wendy Cheng 2006-05-25 14:53:02 UTC
Another unexpected (good) side effect of this work is that it keeps GFS's VFS
inode state consistent with the VFS layer's control structure. Before this
change, GFS totally ignored the VFS inode state: after it had synced the data
to disk, the VFS inode->i_state remained dirty, and while it was syncing the
data to disk, the VFS inode state was never properly set to I_LOCK.
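
For context, this is the i_state bookkeeping the VFS itself does around
inode writeback (paraphrased from fs/fs-writeback.c in 2.6-era kernels,
not GFS code), which GFS previously bypassed; treat it as an
illustration only:

	spin_lock(&inode_lock);
	inode->i_state |= I_LOCK;	/* inode is being written back */
	inode->i_state &= ~I_DIRTY;	/* clear the dirty bits up front */
	spin_unlock(&inode_lock);

	/* ... write the inode and its dirty pages to disk ... */

	spin_lock(&inode_lock);
	inode->i_state &= ~I_LOCK;	/* writeback finished */
	wake_up_inode(inode);		/* wake waiters in __wait_on_inode() */
	spin_unlock(&inode_lock);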

Comment 28 Red Hat Bugzilla 2006-08-10 21:35:19 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0561.html


