Description of problem:
Customer supplies a test program that is said to represent their
running environment well. Besides getting 10x less bandwidth (from the
application's point of view) when compared with EXT3, GFS also
generates 200MB of disk I/O for 8MB of application data, vs.
EXT3's 38MB of disk I/O.
The test program does the following:
1. Create an 8MB temp file and sequentially write to it.
2. Enable its own pthread mutex locks
3. Start the timer
4. Loop 8192 times:
1. "write" 1024 bytes of data at a random offset
2. "fdatasync", followed by "fsync", after every write
5. Close the file
6. Stop the timer
7. Calculate bandwidth and latency based on time statistics
collected between steps 3 and 6.
The program is capable of multi-threaded runs (and has many other
features) but we're focusing on the scenario from steps 1 to 7
using one single thread.
Version-Release number of selected component (if applicable):
How reproducible:
Each time and every time
Steps to Reproduce:
Created attachment 128700 [details]
A draft patch for this issue
The performance gap is found in the GFS lock implementation inside gfs_fsync().
It adds the GL_SYNC flag to its shared inode lock. This global flag introduces
repeated page writes and metadata flushes, among many other things.
The uploaded test patch tries to remedy the issue by:
1. Replacing the shared lock with an exclusive lock
2. Borrowing the Linux VFS layer's generic_osync_inode() (used by the O_SYNC
code path) to flush the local in-core inode to disk, instead of the original
GL_SYNC flush.
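The two changes above can be sketched in pseudocode. This is a sketch of the idea, not the attached patch; the OSYNC_* flags are the ones the 2.6 O_SYNC path passes to generic_osync_inode(), and whether the patch uses all three is an assumption:

```
/* pseudocode sketch of the patched gfs_fsync() */
gfs_fsync(file, dentry, datasync):
	acquire EXCLUSIVE glock on ip->i_gl        /* was: SHARED with GL_SYNC */
	generic_osync_inode(inode, inode->i_mapping,
			    OSYNC_DATA | OSYNC_METADATA | OSYNC_INODE)
	release glock
```

The exclusive glock keeps other nodes from touching the inode while the local flush runs, which is what lets the GL_SYNC cluster-wide sync machinery be bypassed.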
After the changes, the (application) bandwidth jumps from 240.94 KB/s up to
2.67 MB/s, very close (and almost equal) to ext3's under a lock_nolock mount.
The ramification of this change is unknown at this moment - still under heavy
testing.
There is a (newly added) exclusive lock in the code path - so (I hope)
there will be no data corruption due to this change. The major concern
here is whether other nodes will have metadata hanging around in
memory. These data may sooner or later get flushed to disk, but we may
violate how fsync is supposed to work.
The excessive flushes in the original code are there to make sure *all* data
and metadata are synced to disk *across* the cluster. It is a
very difficult job. Even on a local filesystem, the inode race is one of
the top bug generators. There have been plenty of nasty examples in
previous RHEL updates.
Other than testing the above patch, inode_go_sync() & friends are
being re-examined and we may find a good fix there. Right now, consider
the above patch a work-around.
Created attachment 128727 [details]
There are issues found with the previous patch that uses the generic_osync_inode()
VFS call. So instead, we go the inode_go_sync() route. After a few revisions, we
simply yank gfs_log_flush_glock() out of inode_go_sync() and let gfs_fsync()
call it directly.
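The new approach can be sketched in pseudocode. Names follow the GFS source tree; the exact locking and page write-back steps here are assumptions, not the committed patch:

```
/* pseudocode: gfs_fsync() flushes the log for this glock directly,
 * instead of relying on inode_go_sync() to do it via GL_SYNC */
gfs_fsync(file, dentry, datasync):
	acquire shared glock on ip->i_gl (no GL_SYNC)
	write back and wait on the inode's dirty pages
	gfs_log_flush_glock(ip->i_gl)      /* moved out of inode_go_sync() */
	release glock
```

Moving the log flush into gfs_fsync() keeps the fsync durability guarantee for this inode while avoiding the repeated whole-glock sync work that GL_SYNC triggers on every call.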
After the new fix, GFS obtains 1 MB/s bandwidth, compared to the original
240.94 KB/s. Most likely, we'll settle down with this patch.
Created attachment 128750 [details]
Another revision - adds a file op statistics callback.
On my machine (6G RAM with FC storage):
./fstest -d -r -l -S 1 -b 1k -s xm
if x=10 (10M file)
ext3: 3.33 MB/s
gfs1: 1.00 MB/s (after) 238.14 KB/s (before)
if x=8 (8M file)
ext3: 2.67 MB/s
gfs1: 1.00 MB/s (after) 240.94 KB/s (before)
Created attachment 128964 [details]
Newest revised patch - will check this code into CVS.
Another unexpected (good) side effect of this work is that it keeps GFS's VFS
inode state consistent with the VFS layer's control structure. Before this
change, GFS totally ignored the VFS inode state - after it had synced the data
to disk, its VFS inode state remained dirty; and while it was syncing the data
to disk, the VFS inode state was never properly set to I_LOCK.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.