Description of problem:

[cluster-devel] mailing list
Date: Fri, 2 Mar 2007 17:54:08 +0100
From: Mathieu Avila <mathieu.avila>

Hello,

I'm having trouble with the read-ahead feature in GFS when doing long sequential reads. I understand that there are 3 different code paths when doing a read:

- If the file is accessed with direct I/O, I get into do_read_direct (ops_file.c::406); otherwise it goes through "do_read_buf", and then:
- if the inode is stuffed or its data is journaled, I get into do_read_readi(file, buf, size, offset) (ops_file.c::376);
- otherwise I get into the common kernel function generic_file_read(file, buf, size, offset) (ops_file.c::378).

In the 2nd case, the tunable parameter "max_readahead" is used to read blocks ahead in gfs_start_ra.

In the 3rd case, everything is handled by the kernel code, and the read-ahead value is the one associated with the "struct file *": file->f_ra.ra_pages. This value is set by default by the kernel, using the read-ahead value that comes from the block device. The block device here is the diaper device (diaper.c). Its block device structure is initialized to zero, so I end up with a device whose read-ahead is 0.

I've changed this by initializing the read-ahead of the "struct file *" when entering the GFS read function, and it worked: I was able to get a factor-of-2 improvement when reading unstuffed, unjournaled files in buffered mode. But I'm not sure this is correct with respect to lock states and data coherency among nodes.

So, was the diaper read-ahead deliberately set to zero to avoid that kind of coherency problem? If not, can we safely set it to the value of the underlying block device, so that the kernel does the same thing that do_read_readi/gfs_start_ra already does?

Thanks in advance,

--
Mathieu

Version-Release number of selected component (if applicable): ALL

Additional info:
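For illustration, here is a minimal sketch of the workaround described above. It is not the actual GFS code; the helper name is made up and the field names simply follow the 2.6 VFS. The idea is to seed the per-file readahead state from the real block device before falling through to generic_file_read(), so the kernel readahead window is no longer zero:

    #include <linux/fs.h>
    #include <linux/blkdev.h>

    /* Hypothetical helper (not from the GFS source): copy the underlying
     * device's readahead window into this file's readahead state if it
     * is currently zero. */
    static void gfs_seed_file_readahead(struct file *file,
                                        struct block_device *real_bdev)
    {
            struct backing_dev_info *bdi =
                    real_bdev->bd_inode->i_mapping->backing_dev_info;

            if (bdi && file->f_ra.ra_pages == 0)
                    file->f_ra.ra_pages = bdi->ra_pages;
    }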
The comment is correct. This was an oversight in the GFS implementation.
A quick fix might be in fill_super(): right after GFS registers the diaper device, the readahead value can be pulled from the ra_pages of the real device's backing_dev_info.
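A rough sketch of that idea, for illustration only (the helper name is made up; gd is the real device's gendisk, as in diaper.c, and as the later comment shows the diaper already shares the real device's backing_dev_info, so this copy turns out not to be needed):

    #include <linux/blkdev.h>
    #include <linux/genhd.h>

    /* Illustrative only: propagate the real device's readahead setting to
     * the diaper block device right after the diaper is registered. */
    static void gfs_inherit_readahead(struct block_device *diaper,
                                      struct gendisk *gd)
    {
            diaper->bd_inode->i_mapping->backing_dev_info->ra_pages =
                    gd->queue->backing_dev_info.ra_pages;
    }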
I'll try to get the test patch done today.
Well, after reading the source code carefully, I see the diaper device does get its backing_dev_info from the real device, in diaper_get():

    diaper = bdget_disk(gd, 0);
    if (!diaper)
            goto fail_remove;

    down(&diaper->bd_sem);
    if (!diaper->bd_openers) {
            diaper->bd_disk = gd;
            diaper->bd_contains = diaper;
            bd_set_size(diaper, 0x7FFFFFFFFFFFFFFFULL);
            diaper->bd_inode->i_data.backing_dev_info =
                    &gd->queue->backing_dev_info;
    } else
            printk("GFS: diaper: reopening\n");

This is further confirmed by tracing from the VFS super block at the entry of gfs_write into s_bdev->bd_inode->i_mapping and examining the run-time kernel memory:

    crash> struct backing_dev_info 0x10119b1e6e0
    struct backing_dev_info {
      ra_pages = 32,
      state = 0,
      memory_backed = 0,
      congested_fn = 0xffffffffa033beb0,
      congested_data = 0x1007dbcda80,
      unplug_io_fn = 0xffffffff8024e2c2 <blk_backing_dev_unplug>,
      unplug_io_data = 0x10119b1e490
    }

So GFS doesn't miss any readahead. However, 32 pages may be fine for general-purpose readahead but not for large sequential reads, and we happen to have a customer complaining about large sequential read performance. I'll add a tunable to see whether it improves the numbers. If it proves useful, we'll take it for RHEL 4.5.
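For scale (assuming 4 KiB pages, as on x86-64): ra_pages = 32 corresponds to 32 x 4 KiB = 128 KiB of readahead per sequential stream, which matches the usual 2.6 block-layer default.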
Or, depending on the device, the driver may not set readahead for its queue at all, so a tunable would be even more appropriate.
The VFS layer happens to offer a backing device info block reachable from its super block structure, which allows a filesystem to override the default setting of its underlying block device. After adding this to GFS as a tunable (seq_readahead), a quick and dirty dd test with a 2G file shows the following result on my test cluster: default dd read: 27.76s; with a readahead of 2048: 15.19s. Will continue this testing.
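For illustration, one plausible shape for applying such a tunable looks like the sketch below. This is not the actual patch (attachment 150135); the helper name is made up. A nonzero value, in pages, overrides the readahead window the generic read path will use for the file, while zero leaves the device default alone:

    #include <linux/fs.h>

    /* Sketch only: apply the seq_readahead tunable (in pages) to a file
     * before the generic read path runs.  Zero means "use the default". */
    static inline void gfs_apply_seq_readahead(struct file *file,
                                               unsigned long seq_readahead)
    {
            if (seq_readahead)
                    file->f_ra.ra_pages = seq_readahead;
    }

With 4 KiB pages, seq_readahead = 2048 is an 8 MiB window, versus the 128 KiB implied by the default ra_pages = 32.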
Created attachment 150135 [details] Tentative patch for sequential read performance fix
The above patch is not finalized yet - it needs some cleanup - but it is good for test runs.
OK, after a few loops, it is unmistakably a 2x performance gain. Mathieu, thanks for bringing this issue up!
Moving to proposed for 4.6 - too late in the cycle for 4.5
Our TAM just tested this patch out
Sorry, hit the submit key too soon. This patch seems to help IPTV read performance significantly. The tunable is set with the "gfs_tool settune <mount> seq_readahead <val>" command. The <val> we use depends on the file size: taking 2048 as a good readahead for a 2G file, in the IPTV test runs, with file sizes between 400-500M, we use a readahead of 256. The change let us beat the ext3 numbers, and the TAM seems very pleased with the results.
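By the same arithmetic (assuming 4 KiB pages), 256 pages is a 1 MiB readahead window for the 400-500M files, versus the 8 MiB window (2048 pages) used for the 2G case.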
Code checked into CVS. Moving the bugzilla to the MODIFIED state.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0998.html