Bug 230803 - Fix GFS readahead setting
Fix GFS readahead setting
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs (Show other bugs)
All Linux
medium Severity medium
: ---
: ---
Assigned To: Wendy Cheng
Dean Jansa
Depends On:
  Show dependency treegraph
Reported: 2007-03-02 18:11 EST by Wendy Cheng
Modified: 2010-01-11 22:14 EST (History)
4 users (show)

See Also:
Fixed In Version: RHBA-2007-0998
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2007-11-21 16:14:22 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)
Tentative patch for sequential read performance fix (4.62 KB, patch)
2007-03-15 12:08 EDT, Wendy Cheng
no flags Details | Diff

  None (edit)
Description Wendy Cheng 2007-03-02 18:11:26 EST
Description of problem:
[cluster-devel] mailing list

Date: Fri, 2 Mar 2007 17:54:08 +0100
From: Mathieu Avila <mathieu.avila@seanodes.com>


I'm having troubles with the read-ahead feature in GFS, when doing
long sequential reads. I understand that there are 3 different code
paths when doing a read:
- If it is accessed in Direct I/O then i get into:
do_read_direct (ops_file.c :: 406)
otherwise it will go in "do_read_buf", and therefore :
- if the inode is stuffed or its data is journaled, i get into :
do_read_readi(file, buf, size, offset); (ops_file.c :: 376)
- otherwise i get into the common kernel functions:
generic_file_read(file, buf, size, offset); (ops_file.c :: 378)

- In the 2nd case, the tunable parameter "max_readahead" is used to get
blocks ahead in gfs_start_ra.
- In the 3rd case, everything is handled by the kernel code, and the
read-ahead value is the one associated to the "struct file*" :
file->f_ra.ra_pages . This value is set by default by the kernel, by
using the read-ahead value that comes from the block device. The block
device here is the diapered ones (diaper.c). The block device structure
is initialized to zero, so that i get a 0 read-ahead device.

I've changed this by initializing the read ahead of the "struct file*"
when entering the read_gfs function and it worked successfully. I was
able to get a factor 2 improvement  in reading unstuffed unjournaled
files, in buffered mode.
But i'm not sure that this is correct according to the
lock states and data coherency among nodes. So, did the diaper
read-ahead was volontarily set to zero to avoid those kind of
coherency problems ? If not, can we safely set it to the value of the
underlying block device, so that the kernel will do the
same than what is performed in do_read_readi/gfs_start_ra ?

Thanks in advance,


Version-Release number of selected component (if applicable):

Additional info:
Comment 1 Wendy Cheng 2007-03-02 18:12:27 EST
The comment is correct. It has been an oversight of GFS implementation.
Comment 2 Wendy Cheng 2007-03-02 18:17:53 EST
A quick fix might be in fill_super(), right after GFS registers the diaper 
device, the readahead value can be pulled from the ra_pages of backing_dev_info. 
Comment 4 Wendy Cheng 2007-03-13 22:09:40 EDT
I'll try to get the test patch done today. 
Comment 5 Wendy Cheng 2007-03-14 14:49:04 EDT
well, after reading the source code carefully, I see diaper device does
get its backing_dev_info from the real device (in diaper_get()):

    363         diaper = bdget_disk(gd, 0);
    364         if (!diaper)
    365                 goto fail_remove;
    367         down(&diaper->bd_sem);
    368         if (!diaper->bd_openers) {
    369                 diaper->bd_disk = gd;
    370                 diaper->bd_contains = diaper;
    371                 bd_set_size(diaper, 0x7FFFFFFFFFFFFFFFULL);
    372                 diaper->bd_inode->i_data.backing_dev_info = &gd->queue->
    373         } else
    374                 printk("GFS: diaper: reopening\n");

This is further confirmed by tracing from vfs super block in the entry 
of gfs_write, into s_bdev->bd_inode->i_mapping by examining the run time 
kernel memory:

crash> struct backing_dev_info 0x10119b1e6e0
struct backing_dev_info {
  ra_pages = 32,
  state = 0,
  memory_backed = 0,
  congested_fn = 0xffffffffa033beb0,
  congested_data = 0x1007dbcda80,
  unplug_io_fn = 0xffffffff8024e2c2 <blk_backing_dev_unplug>,
  unplug_io_data = 0x10119b1e490

So GFS doesn't miss any readahead. However, 32 may be good for general 
purpose readahead, not for large sequential write ? 

We happen to have a customer complaning about large sequential read
performance. Will give out a tunable to see whether it could improve
the number. If proved useful, we'll take it for RHEL 4.5
Comment 6 Wendy Cheng 2007-03-14 14:53:09 EDT
Or, depending on the device, the driver may not set readahead for its queue
at all. So a tunable would be even more proper. 
Comment 7 Wendy Cheng 2007-03-15 12:04:28 EDT
VFS layer happens to offer a backing device info block in its super block
structure to allow filesystem to override the default setting of its under-
lying block device. 

After adding this function into GFS as a tunable (seq_readahead), a quick 
and dirty dd test with 2G file shows the following result on my test 

default dd read: 27.76s 
2048 readahead: 15.19s.

Will continue this testing.
Comment 8 Wendy Cheng 2007-03-15 12:08:41 EDT
Created attachment 150135 [details]
Tentative patch for sequential read performance fix
Comment 9 Wendy Cheng 2007-03-15 12:10:48 EDT
The above patch is not finalized yet - needs some clean up but is good for
test runs. 
Comment 10 Wendy Cheng 2007-03-15 12:15:58 EDT
ok, after few loops, it is un-mistakenly 2x performance gain. Mathieu, thanks
for bringing this issue up !
Comment 12 Kiersten (Kerri) Anderson 2007-04-10 11:14:34 EDT
Moving to proposed for 4.6 - too late in the cycle for 4.5
Comment 13 Wendy Cheng 2007-07-18 12:06:20 EDT
Our TAM just tested this patch out 
Comment 14 Wendy Cheng 2007-07-18 12:19:27 EDT
Sorry, hit submit key too soon..

This patch seems to significantly help IPTV read performance. The tunable is 
set by "gfs_tool settune <mount> seq_readahead <val>" command. The <val> we 
use is subject to the file size. Assuming 2G file with 2048 readahead is good,
in IPTV test runs, with file size between 400-500M, we use 256 readahead. The 
change made us passing ext3 number and TAM seems to be very pleased with the 
Comment 15 Wendy Cheng 2007-08-28 11:00:28 EDT
Code checked into CVS. Move bugzilla into modified state. 
Comment 19 errata-xmlrpc 2007-11-21 16:14:22 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.