Bug 230803

Summary: Fix GFS readahead setting
Product: [Retired] Red Hat Cluster Suite
Component: gfs
Version: 4
Hardware: All
OS: Linux
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Wendy Cheng <nobody+wcheng>
Assignee: Wendy Cheng <nobody+wcheng>
QA Contact: Dean Jansa <djansa>
CC: hlawatschek, kanderso, mathieu.avila, rkenna
Fixed In Version: RHBA-2007-0998
Doc Type: Bug Fix
Last Closed: 2007-11-21 21:14:22 UTC
Attachments:
Tentative patch for sequential read performance fix

Description Wendy Cheng 2007-03-02 23:11:26 UTC
Description of problem:
[cluster-devel] mailing list

Date: Fri, 2 Mar 2007 17:54:08 +0100
From: Mathieu Avila <mathieu.avila>

Hello,

I'm having trouble with the read-ahead feature in GFS when doing
long sequential reads. I understand that there are three different code
paths for a read:
- If the file is accessed with Direct I/O, it goes into
do_read_direct (ops_file.c :: 406);
otherwise it goes into "do_read_buf", and from there:
- if the inode is stuffed or its data is journaled, into
do_read_readi(file, buf, size, offset); (ops_file.c :: 376)
- otherwise into the common kernel function:
generic_file_read(file, buf, size, offset); (ops_file.c :: 378)

- In the second case, the tunable parameter "max_readahead" is used to
read blocks ahead in gfs_start_ra.
- In the third case, everything is handled by the kernel code, and the
read-ahead value is the one associated with the "struct file *":
file->f_ra.ra_pages. The kernel sets this value by default from the
read-ahead value of the block device. The block device here is the
diapered one (diaper.c), and its structure is initialized to zero, so I
get a device with a read-ahead of 0.
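
In code, that dispatch looks roughly like this (a paraphrased sketch of
what is described above, not the verbatim ops_file.c source; get_v2ip,
gfs_is_stuffed, and gfs_is_jdata are assumed helper names):

    static ssize_t gfs_read(struct file *file, char __user *buf,
                            size_t size, loff_t *offset)
    {
            /* Path 1: Direct I/O bypasses the page cache. */
            if (file->f_flags & O_DIRECT)
                    return do_read_direct(file, buf, size, offset);
            return do_read_buf(file, buf, size, offset);
    }

    static ssize_t do_read_buf(struct file *file, char __user *buf,
                               size_t size, loff_t *offset)
    {
            struct gfs_inode *ip = get_v2ip(file->f_mapping->host);

            /* Path 2: stuffed or data-journaled inodes; readahead is
             * driven by gfs_start_ra() and the max_readahead tunable. */
            if (gfs_is_stuffed(ip) || gfs_is_jdata(ip))
                    return do_read_readi(file, buf, size, offset);

            /* Path 3: the generic page-cache path; readahead is governed
             * by file->f_ra.ra_pages, which comes from the (diapered)
             * block device's backing_dev_info. */
            return generic_file_read(file, buf, size, offset);
    }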

I've changed this by initializing the read-ahead of the "struct file *"
when entering the read_gfs function, and it worked: I was able to get a
factor-of-2 improvement when reading unstuffed, unjournaled files in
buffered mode.
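
Roughly, the change amounts to this at the top of the read path (a
sketch only; gfs_real_bdev() is a hypothetical helper for reaching the
real, un-diapered device):

    /* Give the file a non-zero readahead window on entry to the read
     * path, taken from the underlying (real) block device. */
    struct block_device *real = gfs_real_bdev(file->f_mapping->host->i_sb);

    if (file->f_ra.ra_pages == 0)
            file->f_ra.ra_pages =
                    real->bd_inode->i_mapping->backing_dev_info->ra_pages;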
But I'm not sure this is correct with respect to the
lock states and data coherency among nodes. So, was the diaper
read-ahead deliberately set to zero to avoid that kind of
coherency problem? If not, can we safely set it to the value of the
underlying block device, so that the kernel does the same thing as
what is performed in do_read_readi/gfs_start_ra?

Thanks in advance,

--
Mathieu


Version-Release number of selected component (if applicable):
ALL

Additional info:

Comment 1 Wendy Cheng 2007-03-02 23:12:27 UTC
The comment is correct. It has been an oversight in the GFS implementation.

Comment 2 Wendy Cheng 2007-03-02 23:17:53 UTC
A quick fix might be in fill_super(): right after GFS registers the diaper
device, the readahead value can be pulled from the ra_pages of the
backing_dev_info.
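
A hypothetical sketch of that idea (variable names are illustrative;
this is not the attached patch):

    /* In fill_super(), right after the diaper device is registered:
     * copy the real device's readahead window onto the diaper. */
    diaper->bd_inode->i_mapping->backing_dev_info->ra_pages =
            real->bd_inode->i_mapping->backing_dev_info->ra_pages;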
       

Comment 4 Wendy Cheng 2007-03-14 02:09:40 UTC
I'll try to get the test patch done today. 

Comment 5 Wendy Cheng 2007-03-14 18:49:04 UTC
Well, after reading the source code carefully, I see the diaper device does
get its backing_dev_info from the real device (in diaper_get()):

    diaper = bdget_disk(gd, 0);
    if (!diaper)
            goto fail_remove;

    down(&diaper->bd_sem);
    if (!diaper->bd_openers) {
            diaper->bd_disk = gd;
            diaper->bd_contains = diaper;
            bd_set_size(diaper, 0x7FFFFFFFFFFFFFFFULL);
            diaper->bd_inode->i_data.backing_dev_info =
                    &gd->queue->backing_dev_info;
    } else
            printk("GFS: diaper: reopening\n");

This is further confirmed by tracing from the VFS super block at the entry
of gfs_write into s_bdev->bd_inode->i_mapping, and examining the run-time
kernel memory:

crash> struct backing_dev_info 0x10119b1e6e0
struct backing_dev_info {
  ra_pages = 32,
  state = 0,
  memory_backed = 0,
  congested_fn = 0xffffffffa033beb0,
  congested_data = 0x1007dbcda80,
  unplug_io_fn = 0xffffffff8024e2c2 <blk_backing_dev_unplug>,
  unplug_io_data = 0x10119b1e490
}

So GFS doesn't miss readahead entirely. However, while 32 pages (128 KB
with 4 KB pages) may be good for general-purpose readahead, it may not be
enough for large sequential reads.

We happen to have a customer complaining about large sequential read
performance. I will provide a tunable to see whether it improves the
numbers. If it proves useful, we'll take it for RHEL 4.5.

Comment 6 Wendy Cheng 2007-03-14 18:53:09 UTC
Or, depending on the device, the driver may not set readahead for its queue
at all. So a tunable would be even more appropriate.

Comment 7 Wendy Cheng 2007-03-15 16:04:28 UTC
VFS layer happens to offer a backing device info block in its super block
structure to allow filesystem to override the default setting of its under-
lying block device. 
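
A minimal sketch of that approach, assuming a per-mount copy of the
device's backing_dev_info (the names sdp, sd_vfs, and sd_ra_bdi are
illustrative, not taken from the attached patch):

    /* Keep a filesystem-owned backing_dev_info so the override does not
     * disturb other users of the block device, and let the tunable
     * drive its ra_pages. */
    static void gfs_apply_seq_readahead(struct gfs_sbd *sdp, unsigned long pages)
    {
            struct address_space *mapping =
                    sdp->sd_vfs->s_bdev->bd_inode->i_mapping;

            sdp->sd_ra_bdi = *mapping->backing_dev_info;  /* inherit defaults */
            sdp->sd_ra_bdi.ra_pages = pages;              /* override window */
            mapping->backing_dev_info = &sdp->sd_ra_bdi;
    }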

After adding this to GFS as a tunable (seq_readahead), a quick-and-dirty
dd test with a 2 GB file shows the following results on my test cluster:

default dd read: 27.76s 
2048 readahead: 15.19s.
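
That is roughly a 1.8x speedup (27.76 / 15.19 ≈ 1.83).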

Will continue this testing.


Comment 8 Wendy Cheng 2007-03-15 16:08:41 UTC
Created attachment 150135 [details]
Tentative patch for sequential read performance fix

Comment 9 Wendy Cheng 2007-03-15 16:10:48 UTC
The above patch is not finalized yet; it needs some cleanup but is good for
test runs.

Comment 10 Wendy Cheng 2007-03-15 16:15:58 UTC
OK, after a few loops, it is unmistakably a 2x performance gain. Mathieu,
thanks for bringing this issue up!

Comment 12 Kiersten (Kerri) Anderson 2007-04-10 15:14:34 UTC
Moving to proposed for 4.6 - too late in the cycle for 4.5

Comment 13 Wendy Cheng 2007-07-18 16:06:20 UTC
Our TAM just tested this patch out 

Comment 14 Wendy Cheng 2007-07-18 16:19:27 UTC
Sorry, hit the submit key too soon...

This patch seems to help IPTV read performance significantly. The tunable is
set with the "gfs_tool settune <mount> seq_readahead <val>" command. The <val>
we use depends on the file size: assuming a readahead of 2048 is good for a
2 GB file, in IPTV test runs with file sizes between 400-500 MB we use a
readahead of 256. The change let us surpass the ext3 numbers, and the TAM
seems to be very pleased with the results.
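
For scale: if seq_readahead is counted in pages like the kernel's
ra_pages (an assumption; the unit is not stated here), then with 4 KiB
pages a value of 2048 corresponds to an 8 MiB readahead window and 256
to 1 MiB.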


Comment 15 Wendy Cheng 2007-08-28 15:00:28 UTC
Code checked into CVS. Moving this bugzilla into MODIFIED state.

Comment 19 errata-xmlrpc 2007-11-21 21:14:22 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0998.html