Description of problem:

[cluster-devel] mailing list
Date: Fri, 2 Mar 2007 17:54:08 +0100
From: Mathieu Avila <mathieu.avila>

Hello,

I'm having trouble with the read-ahead feature in GFS when doing long sequential reads. I understand that there are 3 different code paths when doing a read:

- If the file is accessed with direct I/O, I get into do_read_direct (ops_file.c::406); otherwise it goes through "do_read_buf", and then:
- if the inode is stuffed or its data is journaled, I get into do_read_readi(file, buf, size, offset) (ops_file.c::376);
- otherwise I get into the common kernel function generic_file_read(file, buf, size, offset) (ops_file.c::378).

In the 2nd case, the tunable parameter "max_readahead" is used to read blocks ahead in gfs_start_ra.

In the 3rd case, everything is handled by the kernel code, and the read-ahead value is the one associated with the "struct file *": file->f_ra.ra_pages. This value is set by default by the kernel, using the read-ahead value that comes from the block device. The block device here is the diaper device (diaper.c). Its block device structure is initialized to zero, so I end up with a device whose read-ahead is 0.

I've changed this by initializing the read-ahead of the "struct file *" when entering the GFS read function, and it worked: I was able to get a factor-of-2 improvement when reading unstuffed, unjournaled files in buffered mode. But I'm not sure this is correct with respect to lock states and data coherency among nodes.

So, was the diaper read-ahead deliberately set to zero to avoid that kind of coherency problem? If not, can we safely set it to the value of the underlying block device, so that the kernel does the same thing that do_read_readi/gfs_start_ra already does?

Thanks in advance,

--
Mathieu

Version-Release number of selected component (if applicable): ALL

Additional info:
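For illustration, here is a minimal sketch of the workaround described above. It is not the actual GFS code; the helper name is made up and the field names simply follow the 2.6 VFS. The idea is to seed the per-file readahead state from the real block device before falling through to generic_file_read(), so the kernel readahead window is no longer zero:

    #include <linux/fs.h>
    #include <linux/blkdev.h>

    /* Hypothetical helper (not from the GFS source): copy the underlying
     * device's readahead window into this file's readahead state if it
     * is currently zero. */
    static void gfs_seed_file_readahead(struct file *file,
                                        struct block_device *real_bdev)
    {
            struct backing_dev_info *bdi =
                    real_bdev->bd_inode->i_mapping->backing_dev_info;

            if (bdi && file->f_ra.ra_pages == 0)
                    file->f_ra.ra_pages = bdi->ra_pages;
    }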
The comment is correct. This was an oversight in the GFS implementation.
A quick fix might be in fill_super(): right after GFS registers the diaper device, the readahead value can be pulled from the ra_pages of the real device's backing_dev_info.
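A rough sketch of that idea, for illustration only (the helper name is made up; gd is the real device's gendisk, as in diaper.c, and as the later comment shows the diaper already shares the real device's backing_dev_info, so this copy turns out not to be needed):

    #include <linux/blkdev.h>
    #include <linux/genhd.h>

    /* Illustrative only: propagate the real device's readahead setting to
     * the diaper block device right after the diaper is registered. */
    static void gfs_inherit_readahead(struct block_device *diaper,
                                      struct gendisk *gd)
    {
            diaper->bd_inode->i_mapping->backing_dev_info->ra_pages =
                    gd->queue->backing_dev_info.ra_pages;
    }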
I'll try to get the test patch done today.
Well, after reading the source code carefully, I see the diaper device does get its backing_dev_info from the real device, in diaper_get():

    diaper = bdget_disk(gd, 0);
    if (!diaper)
            goto fail_remove;

    down(&diaper->bd_sem);
    if (!diaper->bd_openers) {
            diaper->bd_disk = gd;
            diaper->bd_contains = diaper;
            bd_set_size(diaper, 0x7FFFFFFFFFFFFFFFULL);
            diaper->bd_inode->i_data.backing_dev_info =
                    &gd->queue->backing_dev_info;
    } else
            printk("GFS: diaper: reopening\n");

This is further confirmed by tracing from the VFS super block at the entry of gfs_write into s_bdev->bd_inode->i_mapping and examining the run-time kernel memory:

    crash> struct backing_dev_info 0x10119b1e6e0
    struct backing_dev_info {
      ra_pages = 32,
      state = 0,
      memory_backed = 0,
      congested_fn = 0xffffffffa033beb0,
      congested_data = 0x1007dbcda80,
      unplug_io_fn = 0xffffffff8024e2c2 <blk_backing_dev_unplug>,
      unplug_io_data = 0x10119b1e490
    }

So GFS doesn't miss any readahead. However, 32 pages may be fine for general-purpose readahead but not for large sequential reads, and we happen to have a customer complaining about large sequential read performance. I'll add a tunable to see whether it improves the numbers. If it proves useful, we'll take it for RHEL 4.5.
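For scale (assuming 4 KiB pages, as on x86-64): ra_pages = 32 corresponds to 32 x 4 KiB = 128 KiB of readahead per sequential stream, which matches the usual 2.6 block-layer default.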
Or, depending on the device, the driver may not set readahead for its queue at all, so a tunable would be even more appropriate.
The VFS layer happens to offer a backing device info block reachable from its super block structure, which allows a filesystem to override the default setting of its underlying block device. After adding this to GFS as a tunable (seq_readahead), a quick and dirty dd test with a 2G file shows the following result on my test cluster: default dd read: 27.76s; with a readahead of 2048: 15.19s. Will continue this testing.
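For illustration, one plausible shape for applying such a tunable looks like the sketch below. This is not the actual patch (attachment 150135); the helper name is made up. A nonzero value, in pages, overrides the readahead window the generic read path will use for the file, while zero leaves the device default alone:

    #include <linux/fs.h>

    /* Sketch only: apply the seq_readahead tunable (in pages) to a file
     * before the generic read path runs.  Zero means "use the default". */
    static inline void gfs_apply_seq_readahead(struct file *file,
                                               unsigned long seq_readahead)
    {
            if (seq_readahead)
                    file->f_ra.ra_pages = seq_readahead;
    }

With 4 KiB pages, seq_readahead = 2048 is an 8 MiB window, versus the 128 KiB implied by the default ra_pages = 32.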
Created attachment 150135 [details] Tentative patch for sequential read performance fix
The above patch is not finalized yet - it needs some cleanup - but it is good for test runs.
OK, after a few loops, it is unmistakably a 2x performance gain. Mathieu, thanks for bringing this issue up!
Moving to proposed for 4.6 - too late in the cycle for 4.5
Our TAM just tested this patch out
Sorry, hit the submit key too soon. This patch seems to help IPTV read performance significantly. The tunable is set with the "gfs_tool settune <mount> seq_readahead <val>" command. The <val> we use depends on the file size: taking 2048 as a good readahead for a 2G file, in the IPTV test runs, with file sizes between 400-500M, we use a readahead of 256. The change let us beat the ext3 numbers, and the TAM seems very pleased with the results.
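By the same arithmetic (assuming 4 KiB pages), 256 pages is a 1 MiB readahead window for the 400-500M files, versus the 8 MiB window (2048 pages) used for the 2G case.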
Code checked into CVS. Moving the bugzilla to the MODIFIED state.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0998.html