Description of problem:
Customer supplies a test program that is said to represent their running environment well. Besides delivering 10x less bandwidth (from the application's point of view) compared with EXT3, GFS also generates 200MB of disk I/O for 8MB of application data, versus EXT3's 38MB of disk I/O.

The test program does the following:
1. Create an 8MB temp file and sequentially write to it.
2. Enable its own pthread mutex locks.
3. Start the timer.
4. Loop 8192 times:
   1. "write" 1024 bytes of data at a random offset.
   2. "fdatasync", followed by "fsync", after every write.
5. Close the file.
6. Stop the timer.
7. Calculate bandwidth and latency from the time statistics collected between steps 3 and 6.

The program is capable of multi-threaded runs (and many other features) but we're focusing on the scenario from steps 1 to 7 using one single thread.

Version-Release number of selected component (if applicable):
GFS 6.1

How reproducible:
Each and every time

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
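The single-thread scenario described above (steps 1-7, minus the mutex step) can be sketched as a small userspace C helper. This is a hedged reconstruction, not the customer's actual test program: the function name sync_write_bench is hypothetical, the mutex setup of step 2 is omitted since only one thread runs, and callers may scale the file size and iteration count down from the original 8MB / 8192 writes.

```c
/* Sketch of the single-thread write+fdatasync+fsync benchmark.
 * Hypothetical helper; returns application bandwidth in KB/s,
 * or -1.0 on error. */
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

double sync_write_bench(const char *path, size_t file_size,
                        int iters, size_t blk)
{
    char *buf = malloc(blk);
    if (!buf)
        return -1.0;
    memset(buf, 'x', blk);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        free(buf);
        return -1.0;
    }

    /* Step 1: create the temp file and fill it sequentially. */
    for (size_t off = 0; off < file_size; off += blk)
        if (write(fd, buf, blk) != (ssize_t)blk)
            goto fail;

    /* Step 3: start the timer. */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);

    /* Step 4: random-offset writes, each followed by
     * fdatasync() and then fsync(). */
    for (int i = 0; i < iters; i++) {
        off_t off = (off_t)(rand() % (int)(file_size / blk)) * (off_t)blk;
        if (pwrite(fd, buf, blk, off) != (ssize_t)blk)
            goto fail;
        if (fdatasync(fd) != 0)
            goto fail;
        if (fsync(fd) != 0)
            goto fail;
    }

    /* Steps 5-7: close the file, stop the timer, compute bandwidth. */
    close(fd);
    gettimeofday(&t1, NULL);
    double secs = (double)(t1.tv_sec - t0.tv_sec) +
                  (double)(t1.tv_usec - t0.tv_usec) / 1e6;
    if (secs <= 0.0)
        secs = 1e-6; /* avoid divide-by-zero on very fast storage */
    free(buf);
    unlink(path);
    return ((double)iters * (double)blk / 1024.0) / secs;

fail:
    close(fd);
    free(buf);
    return -1.0;
}
```

Running this on GFS vs. ext3 mounts of the same storage is the kind of comparison the numbers in this report came from.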
Created attachment 128700 [details]
A draft patch for this issue

The performance gap is found in the GFS lock implementation inside gfs_fsync(), which adds the GL_SYNC flag to its shared inode lock. This global flag introduces repeated page writes and metadata flushes, among many other things. The uploaded test patch tries to remedy the issue by:

1. Replacing the shared lock with an exclusive lock.
2. Borrowing the Linux VFS layer's generic_osync_inode() (used by the O_SYNC code path) to flush the local in-core inode to disk, instead of the original GFS inode_go_sync().

After the changes, the (application) bandwidth jumps from 240.94 KB/s up to 2.67 MB/s, very close (almost equal) to ext3's under the lock_nolock mount option. The ramifications of this change are unknown at this moment - still under heavy testing.
There is a (newly added) exclusive lock in the code path - so (I hope) there will be no data corruption due to this change. The major concern here is whether other nodes will have metadata hanging around in memory. That data may sooner or later get flushed to disk, but in the meantime we may violate how fsync is supposed to work. The excessive flushes in the original code are there to make sure *all* data and metadata are synced to disk *across* the cluster. It is a very difficult job. Even on a local filesystem, the inode race is one of the top bug generators. There have been plenty of nasty examples in previous RHEL updates. Besides testing the above patch, inode_go_sync() & friends are being re-examined and we may find a good fix there. Right now, consider the above patch a work-around.
Created attachment 128727 [details]
Revised patch

Issues were found with the previous patch's use of the generic_osync_inode() VFS call. So instead, we go the inode_go_sync() route. After a few revisions, we simply yank gfs_log_flush_glock out of inode_go_sync() and let gfs_fsync call it directly. With the new fix, GFS obtains 1 MB/s bandwidth, compared to the original 240.94 KB/s. Most likely, we'll settle on this patch.
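The shape of the final fix can be sketched in pseudocode. This is only an illustration of the description above, not the actual patch: the real signatures of inode_go_sync() and gfs_log_flush_glock() take more arguments than shown here.

```
# Before: every glock sync triggered a full log flush.
inode_go_sync(gl):
    write back dirty pages for gl's inode
    flush inode metadata
    gfs_log_flush_glock(gl)        # flushed on every glock sync

# After (sketch): the log flush is pulled out of the glock sync
# path, and fsync calls it directly only when it needs it.
inode_go_sync(gl):
    write back dirty pages for gl's inode
    flush inode metadata
    # log flush removed from here

gfs_fsync(file):
    acquire the inode glock
    sync the inode (pages + metadata)
    gfs_log_flush_glock(...)       # now invoked directly by fsync
    release the glock
```

The effect is that the repeated, cluster-wide flush work implied by GL_SYNC on every glock operation is reduced to the single flush an fsync actually requires.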
Created attachment 128750 [details]
Another revision - adds a file-op statistics callback.
On my machine (6GB RAM, FC storage):

./fstest -d -r -l -S 1 -b 1k -s xm

if x=10 (10M file):
  ext3: 3.33 MB/s
  gfs1: 1.00 MB/s (after), 238.14 KB/s (before)

if x=8 (8M file):
  ext3: 2.67 MB/s
  gfs1: 1.00 MB/s (after), 240.94 KB/s (before)
Created attachment 128964 [details] Newest revised patch - will check this code into CVS.
Another unexpected (good) side effect of this work is that it keeps GFS's VFS inode state consistent with the VFS layer's control structure. Before this change, GFS totally ignored the VFS inode state: after it had synced the data to disk, its vfs inode->i_state remained dirty, and while it was syncing the data to disk, the VFS inode state was never properly set to I_LOCK.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0561.html