Description of problem:

We were performing some ext4 tests on a 16 TB GPT partition and ran into this issue when writing a single large file with dd and syncing afterwards. The problem: dd is fast (cached) but the subsequent sync is *very* slow.

# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.9423 seconds, 658 MB/s
0.01user 441.40system 7:26.10elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+794minor)pagefaults 0swaps

dd:   ~16 seconds
sync: ~7 minutes

(The same test finishes in 57s with xfs!)

Here's the "iostat -m /dev/sdb 1" output during the dd write:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    6,62   19,35    0,00   74,03

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             484,00         0,00       242,00          0        242

"iostat -m /dev/sdb 1" during the sync looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   12,48    0,00    0,00   87,52

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb              22,00         0,00         8,00          0          8

However, the sync performance is fine if we
* use xfs or
* use ext3 or
* disable the ext4 journal or
* disable ext4 extents (with enabled journal)

Version-Release number of selected component (if applicable):
2.6.18-186.el5 x86_64

Other kernels we've tried:
* Bad:  2.6.18-164.el5 (CentOS 5.4 DVD kernel)
* Bad:  2.6.18-164.11.1.el5 (CentOS 5.4 update kernel)
* Bad:  2.6.18-186.el5 (Red Hat Server 5.5 Beta kernel)
* Bad:  2.6.18-190.el5 (Red Hat Test)
* Good: 2.6.32 vanilla
* Good: 2.6.33 vanilla

How reproducible:
Every time, on three separate machines.

Steps to Reproduce:
1. Format a large partition (16 TB on a GPT partition in our case).
2. Mount it with "ext4,noatime,data=writeback".
3.
/usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"

Actual results:

# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.9423 seconds, 658 MB/s
0.01user 441.40system 7:26.10elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+794minor)pagefaults 0swaps

The sync should be finished in <1 minute.

Expected results:
Much faster sync performance. The write rate during sync drops as low as 8 MB/s.

Additional info:

Caveat: On 2010-02-26 I already reported the problem on the linux-ext4 mailing list (thread: http://www.spinics.net/lists/linux-ext4/index.html#17672).

We first noticed this problem on CentOS 5.4 machines, but now we have also verified it with Red Hat Server 5.5 Beta x86_64. Here is some more information from our initial test on CentOS 5.4.

Here's a kernel profile of the test:

# readprofile -r
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB_3 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.8261 seconds, 663 MB/s
0.01user 448.55system 7:32.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+788minor)pagefaults 0swaps

# readprofile -m /boot/System.map-2.6.18-190.el5 | sort -nr -k 3 | head -15
3450304 default_idle                 43128.8000
   9532 mod_zone_page_state            733.2308
  58594 find_get_pages                 537.5596
  58499 find_get_pages_tag             427.0000
  72404 __set_page_dirty_nobuffers     310.7468
  10740 __wake_up_bit                  238.6667
   7786 unlock_page                    165.6596
   1996 dec_zone_page_state            153.5385
  12230 clear_page_dirty_for_io         63.0412
   5938 page_waitqueue                  60.5918
  14440 release_pages                   41.7341
  12664 __mark_inode_dirty              34.6011
   5281 copy_user_generic_unrolled      30.7035
    323 redirty_page_for_writepage      26.9167
  15537 write_cache_pages               18.9939

Here are three call traces from the "sync" command:

sync          R running task       0  5041   5032                     (NOTLB)
 00000000ffffffff 000000000000000c
 ffff81022bb16510 00000000001ec2a3 ffff81022bb16548 ffffffffffffff10
 ffffffff800d1964 0000000000000010 0000000000000286 ffff8101ff56bb48
 0000000000000018 ffff81033686e970
Call Trace:
 [<ffffffff800d1964>] page_mkclean+0x255/0x281
 [<ffffffff8000eab5>] find_get_pages_tag+0x34/0x89
 [<ffffffff800f4467>] write_cache_pages+0x21b/0x332
 [<ffffffff886fad9d>] :ext4:__mpage_da_writepage+0x0/0x162
 [<ffffffff886fc2f1>] :ext4:ext4_da_writepages+0x317/0x4fe
 [<ffffffff8005b1f0>] do_writepages+0x20/0x2f
 [<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
 [<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
 [<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
 [<ffffffff800f3bd6>] sync_inodes+0x11/0x29
 [<ffffffff800e1bb0>] do_sync+0x12/0x5a
 [<ffffffff800e1c06>] sys_sync+0xe/0x12
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

sync          R running task       0  5041   5032                     (NOTLB)
 ffff81022a3e4e88 ffffffff8000e930 ffff8101ff56bee8 ffff8101ff56bcd8
 0000000000000000 ffffffff886fbdbc ffff8101ff56bc68 ffff8101ff56bc68
 00000000002537c3 000000000001dfcd 0000000000000007 ffff8101ff56bc90
Call Trace:
 [<ffffffff8000e9c4>] __set_page_dirty_nobuffers+0xde/0xe9
 [<ffffffff8001b694>] find_get_pages+0x2f/0x6d
 [<ffffffff886f8170>] :ext4:mpage_da_submit_io+0xd0/0x12c
 [<ffffffff886fc31d>] :ext4:ext4_da_writepages+0x343/0x4fe
 [<ffffffff8005b1f0>] do_writepages+0x20/0x2f
 [<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
 [<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
 [<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
 [<ffffffff800f3bd6>] sync_inodes+0x11/0x29
 [<ffffffff800e1bb0>] do_sync+0x12/0x5a
 [<ffffffff800e1c06>] sys_sync+0xe/0x12
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

sync          R running task       0  5353   5348                     (NOTLB)
 ffff810426e04048 0000000000001200 0000000100000001 0000000000000001
 0000000100000000 ffff8103dcecadf8
 ffff810426dfd000 ffff8102ebd81b48 ffff810426e04048 ffff810239bbbc40
 0000000000000008 00000000008447f8
Call Trace:
 [<ffffffff801431db>] elv_merged_request+0x1e/0x26
 [<ffffffff8000c02b>] __make_request+0x324/0x401
 [<ffffffff8005c6cf>] cache_alloc_refill+0x106/0x186
 [<ffffffff886f6bf9>] :ext4:walk_page_buffers+0x65/0x8b
 [<ffffffff8000e9c4>] __set_page_dirty_nobuffers+0xde/0xe9
 [<ffffffff886fbd42>] :ext4:ext4_writepage+0x9b/0x333
 [<ffffffff886f8170>] :ext4:mpage_da_submit_io+0xd0/0x12c
 [<ffffffff886fc31d>] :ext4:ext4_da_writepages+0x343/0x4fe
 [<ffffffff8005b1f0>] do_writepages+0x20/0x2f
 [<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
 [<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
 [<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
 [<ffffffff800f3bd6>] sync_inodes+0x11/0x29
 [<ffffffff800e1bb0>] do_sync+0x12/0x5a
 [<ffffffff800e1c06>] sys_sync+0xe/0x12
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

I've tried some more options. These do *not* influence the (bad) result:
* data=writeback and data=ordered
* disabling/enabling uninit_bg
* max_sectors_kb=512 or 4096
* io scheduler: cfq or noop

Some background information about the system:

OS: CentOS 5.4 x86_64 / Red Hat Server 5.5 x86_64
Memory: 16 GB
CPUs: 2x Quad-Core Opteron 2356
IO scheduler: CFQ

# df -h /dev/sdb{1,2}
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              15T  9.9G   14T   1% /mnt/large
/dev/sdb2             7.3T  179M  7.0T   1% /mnt/small

(parted) print
Model: easyRAID easyRAID_Q16PS (scsi)
Disk /dev/sdb: 24,0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name   Flags
 1      1049kB  16,0TB  16,0TB  ext3         large
 2      16,0TB  24,0TB  8003GB  ext3         small

("start 1049kB" is at sector 2048)

sdb is a FC HW-RAID (easyRAID_Q16PS) and consists of a RAID6 volume created from 14 disks with chunk size 128kb.
QLogic Fibre Channel HBA Driver: 8.03.01.04.05.05-k
QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.1 hdma+, host#=8, fw=4.04.09 (486)

The ext4 filesystem was created with

mkfs.ext4 -T ext4,largefile4 -E stride=32,stripe-width=$((32*(14-2))) /dev/sdb1

or

mkfs.ext4 -T ext4 /dev/sdb1

Mount options: defaults,noatime,data=writeback

I'm willing to test patches if somebody wants to debug the problem.
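For the record, the stride/stripe-width numbers in the first mkfs line simply follow from the array geometry (128 KiB chunks, 14 disks, 2 of them parity in RAID6, 4 KiB filesystem blocks). A quick sketch of the arithmetic (the variable names are mine, just for illustration):

```shell
# RAID6 geometry used above
chunk_kb=128
block_kb=4
disks=14
parity=2
stride=$((chunk_kb / block_kb))               # fs blocks per RAID chunk
stripe_width=$((stride * (disks - parity)))   # blocks per full data stripe
echo "stride=$stride stripe-width=$stripe_width"
```

This reproduces the stride=32,stripe-width=384 passed to mkfs.ext4 above.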
Does it show up with normal data=ordered mode as well? I think on the list you said that it did...

If you are highly motivated, capturing blktrace info during the slow writeout post-sync could be interesting. Let me know if you need info about how to run blktrace ... I'll try to reproduce here as well, of course.

This could be either bad writeback interaction, or ext4's buffer-at-a-time IO behavior (both somewhat related, I guess...)

Thanks,
-Eric
Yes, it's the same with data=ordered (see above, "These do *not* influence the (bad) result", for the list of things I've tried). I'll see what I can do regarding blktrace, but you'll probably have to wait until Tuesday before I can resume testing.
Thanks, I appreciate any testing you're willing to do, but it's a simple enough test that I should be able to find a similar machine to replicate it here, as well. Thanks, -Eric
Here's the result of the blktrace (/dev/sdb1 is mounted on /mnt/large):

# pwd
/root/blktrace
# blktrace -b 4096 -n 16 /dev/sdb &
[1] 3734
# /usr/bin/time dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17.5677 seconds, 597 MB/s
0.01user 12.49system 0:17.56elapsed 71%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+212minor)pagefaults 0swaps
# /usr/bin/time sync
0.00user 325.90system 5:28.54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+145minor)pagefaults 0swaps
# kill -15 3734
Device: /dev/sdb
  CPU  0:                    0 events,    25132 KiB data
  CPU  1:                    0 events,    66271 KiB data
  CPU  2:                    0 events,    36997 KiB data
  CPU  3:                    0 events,    50769 KiB data
  CPU  4:                    0 events,    28761 KiB data
  CPU  5:                    0 events,   130447 KiB data
  CPU  6:                    0 events,    34952 KiB data
  CPU  7:                    0 events,    30460 KiB data
  Total:                     0 events (dropped 0),   403786 KiB data
[1]+  Done                    blktrace -b 4096 -n 16 /dev/sdb
# blkparse sdb >events

During the dd the trace looks like this:

  8,16   1     1088     0.000476679  3688  A   W 48501576 + 8 <- (8,17) 48499528
  8,16   1     1089     0.000476765  3688  Q   W 48501576 + 8 [pdflush]
  8,16   1     1090     0.000477033  3688  M   W 48501576 + 8 [pdflush]
  8,16   1     1091     0.000477904  3688  A   W 48501584 + 8 <- (8,17) 48499536
  8,16   1     1092     0.000477989  3688  Q   W 48501584 + 8 [pdflush]
  8,16   1     1093     0.000478190  3688  M   W 48501584 + 8 [pdflush]
  8,16   1     1094     0.000479064  3688  A   W 48501592 + 8 <- (8,17) 48499544
  8,16   1     1095     0.000479149  3688  Q   W 48501592 + 8 [pdflush]
  8,16   1     1096     0.000479380  3688  M   W 48501592 + 8 [pdflush]
  8,16   1     1097     0.000480771  3688  A   W 48501600 + 8 <- (8,17) 48499552
  8,16   1     1098     0.000480855  3688  Q   W 48501600 + 8 [pdflush]
  8,16   1     1099     0.000481057  3688  M   W 48501600 + 8 [pdflush]
  8,16   1     1100     0.000482007  3688  A   W 48501608 + 8 <- (8,17) 48499560
  8,16   1     1101     0.000482091  3688  Q   W 48501608 + 8 [pdflush]
  8,16   1     1102     0.000482320  3688  M   W 48501608 + 8 [pdflush]
  8,16   1     1103     0.000483212  3688  A   W 48501616 + 8 <- (8,17) 48499568

During the sync it looks like this:

  8,16   5  2497292   345.234556150  3753  A   W 70477736 + 8 <- (8,17) 70475688
  8,16   5  2497293   345.234556233  3753  Q   W 70477736 + 8 [sync]
  8,16   5  2497294   345.234556455  3753  M   W 70477736 + 8 [sync]
  8,16   5  2497295   345.234557455  3753  A   W 70477744 + 8 <- (8,17) 70475696
  8,16   5  2497296   345.234557539  3753  Q   W 70477744 + 8 [sync]
  8,16   5  2497297   345.234557693  3753  M   W 70477744 + 8 [sync]
  8,16   5  2497298   345.234558511  3753  A   W 70477752 + 8 <- (8,17) 70475704
  8,16   5  2497299   345.234558595  3753  Q   W 70477752 + 8 [sync]
  8,16   5  2497300   345.234558749  3753  M   W 70477752 + 8 [sync]
  8,16   5  2497301   345.234559742  3753  A   W 70477760 + 8 <- (8,17) 70475712
  8,16   5  2497302   345.234559927  3753  Q   W 70477760 + 8 [sync]
  8,16   5  2497303   345.234560176  3753  M   W 70477760 + 8 [sync]
  8,16   5  2497304   345.234561042  3753  A   W 70477768 + 8 <- (8,17) 70475720
  8,16   5  2497305   345.234561125  3753  Q   W 70477768 + 8 [sync]
  8,16   5  2497306   345.234561364  3753  M   W 70477768 + 8 [sync]
  8,16   5  2497307   345.234562254  3753  A   W 70477776 + 8 <- (8,17) 70475728
  8,16   5  2497308   345.234562338  3753  Q   W 70477776 + 8 [sync]
  8,16   5  2497309   345.234562494  3753  M   W 70477776 + 8 [sync]
  8,16   5  2497310   345.234563316  3753  A   W 70477784 + 8 <- (8,17) 70475736
  8,16   5  2497311   345.234563399  3753  Q   W 70477784 + 8 [sync]
  8,16   5  2497312   345.234563621  3753  M   W 70477784 + 8 [sync]
  8,16   5  2497313   345.234564502  3753  A   W 70477792 + 8 <- (8,17) 70475744
  8,16   5  2497314   345.234564586  3753  Q   W 70477792 + 8 [sync]
  8,16   5  2497315   345.234564739  3753  M   W 70477792 + 8 [sync]
  8,16   5  2497316   345.234565691  3753  A   W 70477800 + 8 <- (8,17) 70475752

The summary:

CPU0 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     159,804,  639,216KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    1,152,  439,936KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      158,114,  632,456KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            80                Timer unplugs:          73
CPU1 (sdb):
 Reads Queued:           2,        8KiB   Writes Queued:     420,749,    1,682MiB
 Read Dispatches:        2,        8KiB   Write Dispatches:    4,062,    1,880MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        1,        4KiB   Writes Completed:    2,539,    1,291MiB
 Read Merges:            0,        0KiB   Write Merges:      416,887,    1,667MiB
 Read depth:             1                Write depth:            33
 IO unplugs:            97                Timer unplugs:          91
CPU2 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     229,993,  919,972KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    5,343,    1,888MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        3,       12KiB   Writes Completed:   13,897,    5,168MiB
 Read Merges:            0,        0KiB   Write Merges:      227,615,  910,460KiB
 Read depth:             1                Write depth:            33
 IO unplugs:           178                Timer unplugs:          99
CPU3 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     323,216,    1,292MiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    1,650,  632,824KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      320,069,    1,280MiB
 Read depth:             1                Write depth:            33
 IO unplugs:           106                Timer unplugs:         102
CPU4 (sdb):
 Reads Queued:           1,        4KiB   Writes Queued:     183,059,  732,236KiB
 Read Dispatches:        1,        4KiB   Write Dispatches:    1,012,  388,924KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      181,248,  724,992KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            64                Timer unplugs:          61
CPU5 (sdb):
 Reads Queued:           2,        8KiB   Writes Queued:     827,703,    3,310MiB
 Read Dispatches:        2,        8KiB   Write Dispatches:    8,156,    3,919MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        2,        8KiB   Writes Completed:    7,782,    3,782MiB
 Read Merges:            0,        0KiB   Write Merges:      820,692,    3,282MiB
 Read depth:             1                Write depth:            33
 IO unplugs:           115                Timer unplugs:          88
CPU6 (sdb):
 Reads Queued:           1,        4KiB   Writes Queued:     222,307,  889,228KiB
 Read Dispatches:        1,        4KiB   Write Dispatches:    1,522,  585,076KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      220,008,  880,032KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            94                Timer unplugs:          90
CPU7 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     193,729,  774,916KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    1,321,  505,844KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      191,709,  766,836KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            87                Timer unplugs:          81

Total (sdb):
 Reads Queued:           6,       24KiB   Writes Queued:      2,560K,   10,242MiB
 Read Dispatches:        6,       24KiB   Write Dispatches:   24,218,   10,242MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        6,       24KiB   Writes Completed:   24,218,   10,242MiB
 Read Merges:            0,        0KiB   Write Merges:       2,536K,   10,145MiB
 IO unplugs:           821                Timer unplugs:         685

Throughput (R/W): 0KiB/s / 29,649KiB/s
Events (sdb): 7,760,280 entries
Skips: 0 forward (0 -   0.0%)

Input file sdb.blktrace.0 added
Input file sdb.blktrace.1 added
Input file sdb.blktrace.2 added
Input file sdb.blktrace.3 added
Input file sdb.blktrace.4 added
Input file sdb.blktrace.5 added
Input file sdb.blktrace.6 added
Input file sdb.blktrace.7 added

I still have the entire log but it is rather long (35 MB compressed with xz). Could you replicate the problem on your test system?
I still need to find a system to reserve with similar hardware to do the test, I'm afraid. Do you still have the original blktrace files? Could you make them available on some side-channel? (Probably too big to attach to the bz ...) Thanks, -Eric
Yes, I still have the files. Do you maybe have an FTP server?
Yes, you can follow: http://kbase.redhat.com/faq/docs/DOC-2113 and after that, I think just let me know what unique filename you chose. Thanks, -Eric
I've uploaded the xz compressed events file and a tarball of the sdb.blktrace.* files:

$ md5sum 572930-*
c5ab90074c327f8a1e7283919771c850  572930-events.xz
c37d49fa9eec4f560f1854159ddc9bc0  572930-sdb.blktrace.tar.xz
Got them, thanks - nothing jumping out in the blktrace data so far, I'm afraid, but still looking.
FWIW: I've now reproduced the problem on a (much smaller) 500 GB ext4 partition (on the same 20 TB block device with GPT label).

I could even reproduce it on the tiny 16 GB former swap partition (reformatted as ext4) on the local 250 GB SATA system disk (MSDOS label) of the test system. (I used "mkfs.ext4 -T ext4,largefile /dev/sda2" to format the former swap partition.) On this 16 GB partition the dd+sync took 2m28s with ext3 and 7m26s with ext4.

I.e. you probably don't need to reserve a big machine to reproduce the issue yourself (and the bug title is a little bit misleading).
Okay, now it's getting interesting: I've just discovered that the problem disappears if I boot the test system with the kernel parameter "mem=2048M"!

Here are the numbers for the 10 GB dd+sync test:

16 TB ext4 with 16 GB system RAM: 7m26s
16 TB ext4 with  2 GB system RAM: 0m49s

I.e. there are now three known tricks to work around the bug:
1. Disable ext4 extents
2. Disable the ext4 journal
3. mem=2048M
(4. use ext3 or xfs)
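A possible explanation for why mem=2048M helps, assuming the 2.6.18 default vm.dirty_ratio of 40% (I haven't verified the value on the test box):

```shell
# With less RAM, far less of the 10 GB file may sit dirty in the page
# cache, so dd itself is throttled into writing most of the data out and
# the final sync has much less left to push (and to rescan).
dirty_ratio=40   # assumed default; see /proc/sys/vm/dirty_ratio
for ram_mb in 16384 2048; do
  echo "${ram_mb} MB RAM -> ~$((ram_mb * dirty_ratio / 100)) MB dirty allowed"
done
```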
Ok, I have finally been able to reproduce the problem - on a 100G or so filesystem, on a large-memory box. So, that's a good clue about the restricted memory; since the original dd can't fit in memory, the write out behavior is different. Now that I have a place to reproduce it I probably won't need to bother you with more questions, but thank you for all the info you've provided!
Well, TBH I'm not seeing quite the dismal performance that you are, but it's still slower than ext3 and xfs.

xfs:  dd - 35.5466 seconds, 295 MB/s   sync - 0m53.619s
ext3: dd - 54.2255 seconds, 193 MB/s   sync - 0m51.485s
ext4: dd - 33.876 seconds, 310 MB/s    sync - 2m35.596s

This was a 10g buffered dd to a single spindle, 137Gb, on a 16G box, cfq, defaults for all mkfs & mount.
IMHO even that is dismal compared to ext3. And vanilla kernels don't have this problem (did you try 2.6.33, too?). Unfortunately, I don't have a good idea what could explain the difference between our ext4 results.

If you find the bug and create a patch that pushes the ext4 performance close to ext3 levels on your system, I'll be more than happy to test it on my 5.5 beta test machine. We'll see if our results are in the same ballpark or if there's still a second issue which only triggers on my system and explains the difference between our bad runs.

(FWIW: I'll be offline until the 29th.)
Today I'm seeing more like 4:30 for the run; I had been inadvertently running on a xen kernel but I can't imagine that'd make it any faster, so need to see if that was the change that mattered or not. I'm working on this one, don't worry, I'll be happy to have you test anything I come up with. -Eric
I know what's happening here, but still not sure why it differs from upstream or quite the best way to fix it.

When we do a sync, basically write_cache_pages is told to sync the whole range of the file. It starts from the first dirty page, and loops.... Nothing stops it until it gets to the end of the file.*

Then we submit IO on all that ... but only actually write out a small-ish amount. So we come back, scan the -entire- file again, and again write out a small portion. Rinse, repeat, and the nr of loops in write_cache_pages goes down by a bit each time. This is why it's only apparent on large files.

If I short-circuit __mpage_da_write_page at the end:

	if ((mpd->next_page - mpd->first_page) > 2048)
		return MPAGE_DA_EXTENT_TAIL;

to stop when we've got as many pages (well, in this case blocks) as mballoc will handle, then we have the fast behavior again.

This isn't really an acceptable fix yet, though; I need to see why we differ from upstream - perhaps something else masks the problem upstream.

* the reason ext4 w/o extents is fast is:

	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
		/*
		 * With non-extent format we are limited by the journal
		 * credit available. Total credit needed to insert
		 * nrblocks contiguous blocks is dependent on the
		 * nrblocks. So limit nrblocks.
		 */
		goto flush_it;

and -this- short-circuits the scan-to-the-end-of-the-file for us. Just a crazy side-effect.
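The cost of that rescan is easy to ballpark. A back-of-the-envelope sketch (illustrative numbers, not measured) of how many page-cache lookups write_cache_pages ends up doing if every pass restarts at the first dirty page, walks to EOF, but only completes ~2048 pages:

```shell
# 10 GB file = ~2,560,000 4 KiB pages; assume ~2048 pages actually
# written per pass (the mballoc-sized chunk from the short-circuit above)
total_pages=2560000
per_pass=2048
scanned=0
dirty=$total_pages
while [ "$dirty" -gt 0 ]; do
  scanned=$((scanned + dirty))   # each pass re-walks every remaining page
  dirty=$((dirty - per_pass))
done
echo "$scanned"
```

That works out to roughly 1.6 billion page lookups for a 2.56-million-page file - quadratic in file size, which matches the profile above being dominated by find_get_pages/find_get_pages_tag.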
Turns out this is likely a problem upstream as well for some forms of sync, so I'll likely get a patch upstream and backport it. -Eric
Did you report it to/provide a patch for upstream? I would like to follow the discussion if possible.
Yes, it's here: http://marc.info/?l=linux-ext4&m=126987658403214&w=2 The same patch should apply to the rhel5 ext4 codebase, if you're interested in testing it. Are you able to build a new ext4.ko or would you like me to do that for you?
Thanks. Unfortunately, nobody responded so far.

I've downloaded the RHEL 5.5 final kernel 2.6.18-194.el5 source rpm and recompiled it on my 5.5 beta test system. I can confirm that 2.6.18-194.el5 still has the problem.

Then I've applied your patch. With the patched ext4 the problem is gone!

# /usr/bin/time bash -c "dd if=/dev/zero of=10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17.8429 seconds, 588 MB/s
0.01user 14.92system 0:47.87elapsed 31%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+743minor)pagefaults 0swaps
Good news, I'll make sure it gets in upstream and see about getting it into the next rhel5 release. Thanks for testing! -Eric
Hm, according to the kernel rpm changelog, the fix didn't make it into the 5.5 update kernel-2.6.18-194.3.1.el5.
This wasn't discovered or resolved in time for 5.5, I'm afraid. -Eric
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I've just read the new 2.6.18-194.8.1.el5 kernel changelog: unfortunately, the patch is still missing. You posted the fix at the end of March, and comment #24 suggested that it would be included in the next kernel update. What's the status?
Karsten, by "next update" I'm afraid I meant RHEL5.6... tech previews don't usually get z-stream updates.
55138e0bc29c0751e2152df9ad35deea542f29b3 upstream ("ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks") should be doing this same limiting. However, it introduced another, similar scan-forward problem. :(

Anyway, 5.6 will have a massive update; I will post links to test kernels when available.

Thanks,
-Eric
A test kernel for RHEL5.6 is available at http://people.redhat.com/jwilson/el5/219.el5/ with many ext4 updates, including changes which may have resolved this bug. I would appreciate testing and feedback, thanks! -Eric
(Actually my original patch did make it upstream, just as an aside; c445e3e0a5c2804524dec6e55f66d63f6bc5bc3e is the commit)
http://people.redhat.com/jwilson/el5/219.el5/ doesn't exist anymore. But I've downloaded the latest jwilson kernel 2.6.18-225.el5.x86_64.rpm and grepped the kernel rpm changelog for the bug id "572930" but there was no match. Was the patch dropped again or does it still make sense to test this newer version (to make sure the fix makes it into 5.6)?
Karsten, any recent kernel under http://people.redhat.com/jwilson/el5/ should be fine to test. You won't see this bug because I didn't include it in the list of fixed bugs in the ext4 rebase, but if you find that it's resolved, I'll just dup this one. Thanks, -Eric
I've finally found some time and another available test system (running CentOS 5.5 right now, not vanilla RHEL 5.5, sorry; the old system is already in production).

First, I've tried the latest 5.5 kernel 2.6.18-194.17.1.el5:

root@test fs1 # /usr/bin/time bash -c "dd if=/dev/zero of=/net/test/fs1/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 6.07717 seconds, 1.7 GB/s
0.00user 352.95system 5:59.35elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+835minor)pagefaults 0swaps

Then I've rebooted into jwilson's kernel 2.6.18-225.el5.x86_64 and repeated the test:

root@test fs1 # /usr/bin/time bash -c "dd if=/dev/zero of=/net/test/fs1/10GB_2 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 6.96489 seconds, 1.5 GB/s
0.00user 338.68system 5:45.53elapsed 98%CPU (0avgtext+0avgdata 6400maxresident)k
0inputs+0outputs (8major+827minor)pagefaults 0swaps
root@test fs1 # uname -a
Linux test 2.6.18-225.el5 #1 SMP Mon Sep 27 10:32:19 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

=> Only a couple of seconds faster.

Now the same test after reformatting the LV with ext3 (still 2.6.18-225.el5). (The filesystem is on an LV which is located on a single-disk VG.)

root@test fs1 # /usr/bin/time bash -c "dd if=/dev/zero of=/net/test/fs1/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 79.4931 seconds, 132 MB/s
0.00user 11.84system 1:59.04elapsed 9%CPU (0avgtext+0avgdata 6416maxresident)k
0inputs+0outputs (0major+835minor)pagefaults 0swaps

=> Much faster.

So the problem is still not fixed in 2.6.18-225.el5.x86_64, at least for me. Could you please test it again, too? I would like to know if it is really fixed on your system.
Well, my fix made it in but perhaps something else unfixed it. I'll double check, thanks.
Ah, too many trees, too many commits ... the patch that I had here which originally fixed it made it upstream very late, and thus isn't in rhel6 or rhel5 yet, I'll push to both places. Sorry for the confusion, and thanks for the testing.
This should fix it; it's the upstream commit:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a3f138..27604aa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2371,6 +2371,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/*
+	 * XXX Don't go larger than mballoc is willing to allocate
+	 * This is a stopgap solution. We eventually need to fold
+	 * mpage_da_submit_io() into this function and then call
+	 * ext4_get_blocks() multiple times in a loop
+	 */
+	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
+		goto flush_it;
+
 	/* check if the reserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {

If you have any chance to test it, that'd be great; I ran it on a 16G RHEL5.6 machine but only slow disks - throughput didn't differ much but CPU usage was WAY down, so on faster storage I think I'd have seen better results. I'll find a better machine to test on.

Thanks,
-Eric
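As a sanity check on the cap the patch introduces (assuming a 4 KiB-block filesystem):

```shell
# The patch caps nrblocks at 8 MiB worth of blocks per allocation
# request; with 4 KiB blocks that is the same 2048 figure used in the
# earlier short-circuit experiment.
block_size=4096
cap=$((8 * 1024 * 1024 / block_size))
echo "$cap"
```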
The improvements are certainly there ;)

Timing the 10G buffered dd + sync on a 16G box, with just a single spindle, I got this w/o the patch:

10485760000 bytes (10 GB) copied, 373.516 seconds, 28.1 MB/s
0.01user 500.23system 6:14.05elapsed 133%CPU (0avgtext+0avgdata 6704maxresident)k

and this with:

10485760000 bytes (10 GB) copied, 107.205 seconds, 97.8 MB/s
0.01user 53.60system 5:01.56elapsed 17%CPU (0avgtext+0avgdata 6704maxresident)k

Sorry the fix for this got delayed, but it will be in 5.6 now.

Thanks,
-Eric
Would you mind repeating the test with ext3 on your machine, too? I would like to see if the fixed ext4 result is approx as fast as ext3.
Here's ext3 vs. old ext4 vs. patched ext4:

old ext4:
10485760000 bytes (10 GB) copied, 294.416 seconds, 35.6 MB/s
0.01user 487.24system 6:01.17elapsed 134%CPU (0avgtext+0avgdata 6688maxresident)k

ext3:
10485760000 bytes (10 GB) copied, 245.194 seconds, 42.8 MB/s
0.02user 50.53system 5:31.36elapsed 15%CPU (0avgtext+0avgdata 6704maxresident)k

patched ext4:
10485760000 bytes (10 GB) copied, 258.2 seconds, 40.6 MB/s
0.01user 32.97system 5:26.37elapsed 10%CPU (0avgtext+0avgdata 6704maxresident)k

That was just one run of each; a bit more careful testing & averaging would give a better idea of how close it is. TBH I'm still looking into ext4 writeback upstream as well; I'm sure there are more improvements to be made...

Thanks,
-Eric
in kernel-2.6.18-227.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Retested, verified, can be closed.

$ /usr/bin/time bash -c "dd if=/dev/zero of=/RHTSspareLUN1/10GBFILE bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 49.3003 seconds, 213 MB/s
0.00user 6.33system 0:52.42elapsed 12%CPU (0avgtext+0avgdata 6720maxresident)k
0inputs+0outputs (0major+875minor)pagefaults 0swaps

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
$ uname -a
Linux ibm-x3650m3-01.lab.eng.brq.redhat.com 2.6.18-236.el5 #1 SMP Mon Dec 6 18:41:54 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html