Description of problem:

We were performing some ext4 tests on a 16 TB GPT partition and ran into this issue when writing a single large file with dd and syncing afterwards. The problem: dd is fast (cached) but the subsequent sync is *very* slow.

# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.9423 seconds, 658 MB/s
0.01user 441.40system 7:26.10elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+794minor)pagefaults 0swaps

dd:   ~16 seconds
sync: ~7 minutes

(The same test finishes in 57s with xfs!)

Here's the "iostat -m /dev/sdb 1" output during the dd write:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    6,62   19,35    0,00   74,03

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb             484,00         0,00       242,00          0        242

"iostat -m /dev/sdb 1" during the sync looks like this:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00   12,48    0,00    0,00   87,52

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdb              22,00         0,00         8,00          0          8

However, the sync performance is fine if we
* use xfs or
* use ext3 or
* disable the ext4 journal or
* disable ext4 extents (with enabled journal)

Version-Release number of selected component (if applicable):
2.6.18-186.el5 x86_64

Other kernels we've tried:
* Bad:  2.6.18-164.el5 (CentOS 5.4 DVD kernel)
* Bad:  2.6.18-164.11.1.el5 (CentOS 5.4 update kernel)
* Bad:  2.6.18-186.el5 (Red Hat Server 5.5 Beta kernel)
* Bad:  2.6.18-190.el5 (Red Hat Test)
* Good: 2.6.32 vanilla
* Good: 2.6.33 vanilla

How reproducible:
Every time, on three separate machines.

Steps to Reproduce:
1. Format a large partition (16 TB on a GPT partition in our case).
2. Mount it with "ext4,noatime,data=writeback".
3.
/usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"

Actual results:

# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.9423 seconds, 658 MB/s
0.01user 441.40system 7:26.10elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+794minor)pagefaults 0swaps

The sync should be finished in <1 minute.

Expected results:
Much faster sync performance. The write rate during sync drops as low as 8 MB/s.

Additional info:

Caveat: On 2010-02-26 I already reported the problem on the linux-ext4 mailing list (thread: http://www.spinics.net/lists/linux-ext4/index.html#17672).

We first noticed this problem on CentOS 5.4 machines, but now we have also verified it with Red Hat Server 5.5 Beta x86_64. Here is some more information from our initial test on CentOS 5.4.

Here's a kernel profile of the test:

# readprofile -r
# /usr/bin/time bash -c "dd if=/dev/zero of=/mnt/large/10GB_3 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 15.8261 seconds, 663 MB/s
0.01user 448.55system 7:32.89elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+788minor)pagefaults 0swaps

# readprofile -m /boot/System.map-2.6.18-190.el5 | sort -nr -k 3 | head -15
3450304 default_idle                 43128.8000
   9532 mod_zone_page_state            733.2308
  58594 find_get_pages                 537.5596
  58499 find_get_pages_tag             427.0000
  72404 __set_page_dirty_nobuffers     310.7468
  10740 __wake_up_bit                  238.6667
   7786 unlock_page                    165.6596
   1996 dec_zone_page_state            153.5385
  12230 clear_page_dirty_for_io         63.0412
   5938 page_waitqueue                  60.5918
  14440 release_pages                   41.7341
  12664 __mark_inode_dirty              34.6011
   5281 copy_user_generic_unrolled      30.7035
    323 redirty_page_for_writepage      26.9167
  15537 write_cache_pages               18.9939

Here are three call traces from the "sync" command:

sync          R running task       0  5041   5032                     (NOTLB)
 00000000ffffffff 000000000000000c
 ffff81022bb16510 00000000001ec2a3 ffff81022bb16548 ffffffffffffff10
 ffffffff800d1964 0000000000000010 0000000000000286 ffff8101ff56bb48
 0000000000000018 ffff81033686e970
Call Trace:
 [<ffffffff800d1964>] page_mkclean+0x255/0x281
 [<ffffffff8000eab5>] find_get_pages_tag+0x34/0x89
 [<ffffffff800f4467>] write_cache_pages+0x21b/0x332
 [<ffffffff886fad9d>] :ext4:__mpage_da_writepage+0x0/0x162
 [<ffffffff886fc2f1>] :ext4:ext4_da_writepages+0x317/0x4fe
 [<ffffffff8005b1f0>] do_writepages+0x20/0x2f
 [<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
 [<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
 [<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
 [<ffffffff800f3bd6>] sync_inodes+0x11/0x29
 [<ffffffff800e1bb0>] do_sync+0x12/0x5a
 [<ffffffff800e1c06>] sys_sync+0xe/0x12
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

sync          R running task       0  5041   5032                     (NOTLB)
 ffff81022a3e4e88 ffffffff8000e930 ffff8101ff56bee8 ffff8101ff56bcd8
 0000000000000000 ffffffff886fbdbc ffff8101ff56bc68 ffff8101ff56bc68
 00000000002537c3 000000000001dfcd 0000000000000007 ffff8101ff56bc90
Call Trace:
 [<ffffffff8000e9c4>] __set_page_dirty_nobuffers+0xde/0xe9
 [<ffffffff8001b694>] find_get_pages+0x2f/0x6d
 [<ffffffff886f8170>] :ext4:mpage_da_submit_io+0xd0/0x12c
 [<ffffffff886fc31d>] :ext4:ext4_da_writepages+0x343/0x4fe
 [<ffffffff8005b1f0>] do_writepages+0x20/0x2f
 [<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
 [<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
 [<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
 [<ffffffff800f3bd6>] sync_inodes+0x11/0x29
 [<ffffffff800e1bb0>] do_sync+0x12/0x5a
 [<ffffffff800e1c06>] sys_sync+0xe/0x12
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

sync          R running task       0  5353   5348                     (NOTLB)
 ffff810426e04048 0000000000001200 0000000100000001 0000000000000001
 0000000100000000 ffff8103dcecadf8
 ffff810426dfd000 ffff8102ebd81b48 ffff810426e04048 ffff810239bbbc40
 0000000000000008 00000000008447f8
Call Trace:
 [<ffffffff801431db>] elv_merged_request+0x1e/0x26
 [<ffffffff8000c02b>] __make_request+0x324/0x401
 [<ffffffff8005c6cf>] cache_alloc_refill+0x106/0x186
 [<ffffffff886f6bf9>] :ext4:walk_page_buffers+0x65/0x8b
 [<ffffffff8000e9c4>] __set_page_dirty_nobuffers+0xde/0xe9
 [<ffffffff886fbd42>] :ext4:ext4_writepage+0x9b/0x333
 [<ffffffff886f8170>] :ext4:mpage_da_submit_io+0xd0/0x12c
 [<ffffffff886fc31d>] :ext4:ext4_da_writepages+0x343/0x4fe
 [<ffffffff8005b1f0>] do_writepages+0x20/0x2f
 [<ffffffff8002fefc>] __writeback_single_inode+0x1ae/0x328
 [<ffffffff8004a667>] wait_on_page_writeback_range+0xd6/0x12e
 [<ffffffff80020f0d>] sync_sb_inodes+0x1b5/0x26f
 [<ffffffff800f3b0b>] sync_inodes_sb+0x99/0xa9
 [<ffffffff800f3b78>] __sync_inodes+0x5d/0xaa
 [<ffffffff800f3bd6>] sync_inodes+0x11/0x29
 [<ffffffff800e1bb0>] do_sync+0x12/0x5a
 [<ffffffff800e1c06>] sys_sync+0xe/0x12
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0

I've tried some more options. These do *not* influence the (bad) result:
* data=writeback and data=ordered
* disabling/enabling uninit_bg
* max_sectors_kb=512 or 4096
* io scheduler: cfq or noop

Some background information about the system:

OS: CentOS 5.4 x86_64 / Red Hat Server 5.5 x86_64
Memory: 16 GB
CPUs: 2x Quad-Core Opteron 2356
IO scheduler: CFQ

# df -h /dev/sdb{1,2}
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb1              15T  9.9G   14T   1% /mnt/large
/dev/sdb2             7.3T  179M  7.0T   1% /mnt/small

(parted) print
Model: easyRAID easyRAID_Q16PS (scsi)
Disk /dev/sdb: 24,0TB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name   Flags
 1      1049kB  16,0TB  16,0TB  ext3         large
 2      16,0TB  24,0TB  8003GB  ext3         small

("start 1049kB" is at sector 2048)

sdb is a FC HW-RAID (easyRAID_Q16PS) and consists of a RAID6 volume created from 14 disks with chunk size 128kb.
QLogic Fibre Channel HBA Driver: 8.03.01.04.05.05-k
QLogic QLE2462 - PCI-Express Dual Channel 4Gb Fibre Channel HBA
ISP2432: PCIe (2.5Gb/s x4) @ 0000:01:00.1 hdma+, host#=8, fw=4.04.09 (486)

The ext4 filesystem was created with

mkfs.ext4 -T ext4,largefile4 -E stride=32,stripe-width=$((32*(14-2))) /dev/sdb1

or

mkfs.ext4 -T ext4 /dev/sdb1

Mount options: defaults,noatime,data=writeback

I'm willing to test patches if somebody wants to debug the problem.
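For the record, the stride/stripe-width numbers in the first mkfs line simply follow from the array geometry (128 KiB chunks, 14 disks, 2 of them parity in RAID6, 4 KiB filesystem blocks). A quick sketch of the arithmetic (the variable names are mine, just for illustration):

```shell
# RAID6 geometry used above
chunk_kb=128
block_kb=4
disks=14
parity=2
stride=$((chunk_kb / block_kb))               # fs blocks per RAID chunk
stripe_width=$((stride * (disks - parity)))   # blocks per full data stripe
echo "stride=$stride stripe-width=$stripe_width"
```

This reproduces the stride=32,stripe-width=384 passed to mkfs.ext4 above.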
Does it show up with normal data=ordered mode as well? I think on the list you said that it did...

If you are highly motivated, capturing blktrace info during the slow writeout post-sync could be interesting. Let me know if you need info about how to run blktrace ... I'll try to reproduce here as well, of course.

This could be either bad writeback interaction, or ext4's buffer-at-a-time IO behavior (both somewhat related, I guess...)

Thanks,
-Eric
Yes, it's the same with data=ordered (see above, "These do *not* influence the (bad) result", for the list of things I've tried). I'll see what I can do regarding blktrace, but you'll probably have to wait until Tuesday before I can resume testing.
Thanks, I appreciate any testing you're willing to do, but it's a simple enough test that I should be able to find a similar machine to replicate it here, as well. Thanks, -Eric
Here's the result of the blktrace (/dev/sdb1 is mounted on /mnt/large):

# pwd
/root/blktrace
# blktrace -b 4096 -n 16 /dev/sdb &
[1] 3734
# /usr/bin/time dd if=/dev/zero of=/mnt/large/10GB bs=1M count=10000
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17.5677 seconds, 597 MB/s
0.01user 12.49system 0:17.56elapsed 71%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+212minor)pagefaults 0swaps
# /usr/bin/time sync
0.00user 325.90system 5:28.54elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+145minor)pagefaults 0swaps
# kill -15 3734
Device: /dev/sdb
  CPU  0:                    0 events,    25132 KiB data
  CPU  1:                    0 events,    66271 KiB data
  CPU  2:                    0 events,    36997 KiB data
  CPU  3:                    0 events,    50769 KiB data
  CPU  4:                    0 events,    28761 KiB data
  CPU  5:                    0 events,   130447 KiB data
  CPU  6:                    0 events,    34952 KiB data
  CPU  7:                    0 events,    30460 KiB data
  Total:                     0 events (dropped 0),   403786 KiB data
[1]+  Done                    blktrace -b 4096 -n 16 /dev/sdb
# blkparse sdb >events

During the dd the trace looks like this:

  8,16   1     1088     0.000476679  3688  A   W 48501576 + 8 <- (8,17) 48499528
  8,16   1     1089     0.000476765  3688  Q   W 48501576 + 8 [pdflush]
  8,16   1     1090     0.000477033  3688  M   W 48501576 + 8 [pdflush]
  8,16   1     1091     0.000477904  3688  A   W 48501584 + 8 <- (8,17) 48499536
  8,16   1     1092     0.000477989  3688  Q   W 48501584 + 8 [pdflush]
  8,16   1     1093     0.000478190  3688  M   W 48501584 + 8 [pdflush]
  8,16   1     1094     0.000479064  3688  A   W 48501592 + 8 <- (8,17) 48499544
  8,16   1     1095     0.000479149  3688  Q   W 48501592 + 8 [pdflush]
  8,16   1     1096     0.000479380  3688  M   W 48501592 + 8 [pdflush]
  8,16   1     1097     0.000480771  3688  A   W 48501600 + 8 <- (8,17) 48499552
  8,16   1     1098     0.000480855  3688  Q   W 48501600 + 8 [pdflush]
  8,16   1     1099     0.000481057  3688  M   W 48501600 + 8 [pdflush]
  8,16   1     1100     0.000482007  3688  A   W 48501608 + 8 <- (8,17) 48499560
  8,16   1     1101     0.000482091  3688  Q   W 48501608 + 8 [pdflush]
  8,16   1     1102     0.000482320  3688  M   W 48501608 + 8 [pdflush]
  8,16   1     1103     0.000483212  3688  A   W 48501616 + 8 <- (8,17) 48499568

During the sync it looks like this:

  8,16   5  2497292   345.234556150  3753  A   W 70477736 + 8 <- (8,17) 70475688
  8,16   5  2497293   345.234556233  3753  Q   W 70477736 + 8 [sync]
  8,16   5  2497294   345.234556455  3753  M   W 70477736 + 8 [sync]
  8,16   5  2497295   345.234557455  3753  A   W 70477744 + 8 <- (8,17) 70475696
  8,16   5  2497296   345.234557539  3753  Q   W 70477744 + 8 [sync]
  8,16   5  2497297   345.234557693  3753  M   W 70477744 + 8 [sync]
  8,16   5  2497298   345.234558511  3753  A   W 70477752 + 8 <- (8,17) 70475704
  8,16   5  2497299   345.234558595  3753  Q   W 70477752 + 8 [sync]
  8,16   5  2497300   345.234558749  3753  M   W 70477752 + 8 [sync]
  8,16   5  2497301   345.234559742  3753  A   W 70477760 + 8 <- (8,17) 70475712
  8,16   5  2497302   345.234559927  3753  Q   W 70477760 + 8 [sync]
  8,16   5  2497303   345.234560176  3753  M   W 70477760 + 8 [sync]
  8,16   5  2497304   345.234561042  3753  A   W 70477768 + 8 <- (8,17) 70475720
  8,16   5  2497305   345.234561125  3753  Q   W 70477768 + 8 [sync]
  8,16   5  2497306   345.234561364  3753  M   W 70477768 + 8 [sync]
  8,16   5  2497307   345.234562254  3753  A   W 70477776 + 8 <- (8,17) 70475728
  8,16   5  2497308   345.234562338  3753  Q   W 70477776 + 8 [sync]
  8,16   5  2497309   345.234562494  3753  M   W 70477776 + 8 [sync]
  8,16   5  2497310   345.234563316  3753  A   W 70477784 + 8 <- (8,17) 70475736
  8,16   5  2497311   345.234563399  3753  Q   W 70477784 + 8 [sync]
  8,16   5  2497312   345.234563621  3753  M   W 70477784 + 8 [sync]
  8,16   5  2497313   345.234564502  3753  A   W 70477792 + 8 <- (8,17) 70475744
  8,16   5  2497314   345.234564586  3753  Q   W 70477792 + 8 [sync]
  8,16   5  2497315   345.234564739  3753  M   W 70477792 + 8 [sync]
  8,16   5  2497316   345.234565691  3753  A   W 70477800 + 8 <- (8,17) 70475752

The summary:

CPU0 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     159,804,  639,216KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    1,152,  439,936KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      158,114,  632,456KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            80                Timer unplugs:          73
CPU1 (sdb):
 Reads Queued:           2,        8KiB   Writes Queued:     420,749,    1,682MiB
 Read Dispatches:        2,        8KiB   Write Dispatches:    4,062,    1,880MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        1,        4KiB   Writes Completed:    2,539,    1,291MiB
 Read Merges:            0,        0KiB   Write Merges:      416,887,    1,667MiB
 Read depth:             1                Write depth:            33
 IO unplugs:            97                Timer unplugs:          91
CPU2 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     229,993,  919,972KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    5,343,    1,888MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        3,       12KiB   Writes Completed:   13,897,    5,168MiB
 Read Merges:            0,        0KiB   Write Merges:      227,615,  910,460KiB
 Read depth:             1                Write depth:            33
 IO unplugs:           178                Timer unplugs:          99
CPU3 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     323,216,    1,292MiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    1,650,  632,824KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      320,069,    1,280MiB
 Read depth:             1                Write depth:            33
 IO unplugs:           106                Timer unplugs:         102
CPU4 (sdb):
 Reads Queued:           1,        4KiB   Writes Queued:     183,059,  732,236KiB
 Read Dispatches:        1,        4KiB   Write Dispatches:    1,012,  388,924KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      181,248,  724,992KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            64                Timer unplugs:          61
CPU5 (sdb):
 Reads Queued:           2,        8KiB   Writes Queued:     827,703,    3,310MiB
 Read Dispatches:        2,        8KiB   Write Dispatches:    8,156,    3,919MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        2,        8KiB   Writes Completed:    7,782,    3,782MiB
 Read Merges:            0,        0KiB   Write Merges:      820,692,    3,282MiB
 Read depth:             1                Write depth:            33
 IO unplugs:           115                Timer unplugs:          88
CPU6 (sdb):
 Reads Queued:           1,        4KiB   Writes Queued:     222,307,  889,228KiB
 Read Dispatches:        1,        4KiB   Write Dispatches:    1,522,  585,076KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      220,008,  880,032KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            94                Timer unplugs:          90
CPU7 (sdb):
 Reads Queued:           0,        0KiB   Writes Queued:     193,729,  774,916KiB
 Read Dispatches:        0,        0KiB   Write Dispatches:    1,321,  505,844KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        0,        0KiB   Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB   Write Merges:      191,709,  766,836KiB
 Read depth:             1                Write depth:            33
 IO unplugs:            87                Timer unplugs:          81

Total (sdb):
 Reads Queued:           6,       24KiB   Writes Queued:      2,560K,   10,242MiB
 Read Dispatches:        6,       24KiB   Write Dispatches:   24,218,   10,242MiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:        6,       24KiB   Writes Completed:   24,218,   10,242MiB
 Read Merges:            0,        0KiB   Write Merges:       2,536K,   10,145MiB
 IO unplugs:           821                Timer unplugs:         685

Throughput (R/W): 0KiB/s / 29,649KiB/s
Events (sdb): 7,760,280 entries
Skips: 0 forward (0 -   0.0%)

Input file sdb.blktrace.0 added
Input file sdb.blktrace.1 added
Input file sdb.blktrace.2 added
Input file sdb.blktrace.3 added
Input file sdb.blktrace.4 added
Input file sdb.blktrace.5 added
Input file sdb.blktrace.6 added
Input file sdb.blktrace.7 added

I still have the entire log but it is rather long (35 MB compressed with xz). Could you replicate the problem on your test system?
I still need to find a system to reserve with similar hardware to do the test, I'm afraid. Do you still have the original blktrace files? Could you make them available on some side-channel? (Probably too big to attach to the bz ...) Thanks, -Eric
Yes, I still have the files. Do you maybe have an FTP server?
Yes, you can follow: http://kbase.redhat.com/faq/docs/DOC-2113 and after that, I think just let me know what unique filename you chose. Thanks, -Eric
I've uploaded the xz compressed events file and a tarball of the sdb.blktrace.* files:

$ md5sum 572930-*
c5ab90074c327f8a1e7283919771c850  572930-events.xz
c37d49fa9eec4f560f1854159ddc9bc0  572930-sdb.blktrace.tar.xz
Got them, thanks - nothing jumping out in the blktrace data so far, I'm afraid, but still looking.
FWIW: I've now reproduced the problem on a (much smaller) 500 GB ext4 partition (on the same 20 TB block device with GPT label).

I could even reproduce it on the tiny 16 GB former swap partition (reformatted as ext4) on the local 250 GB SATA system disk (MSDOS label) of the test system. (I used "mkfs.ext4 -T ext4,largefile /dev/sda2" to format the former swap partition.) On this 16 GB partition the dd+sync took 2m28s with ext3 and 7m26s with ext4.

I.e. you probably don't need to reserve a big machine to reproduce the issue yourself (and the bug title is a little bit misleading).
Okay, now it's getting interesting: I've just discovered that the problem disappears if I boot the test system with the kernel parameter "mem=2048M"!

Here are the numbers for the 10 GB dd+sync test:

16 TB ext4 with 16 GB system RAM: 7m26s
16 TB ext4 with  2 GB system RAM: 0m49s

I.e. there are now three known tricks to work around the bug:
1. Disable ext4 extents
2. Disable the ext4 journal
3. mem=2048M
(4. use ext3 or xfs)
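A possible explanation for why mem=2048M helps, assuming the 2.6.18 default vm.dirty_ratio of 40% (I haven't verified the value on the test box):

```shell
# With less RAM, far less of the 10 GB file may sit dirty in the page
# cache, so dd itself is throttled into writing most of the data out and
# the final sync has much less left to push (and to rescan).
dirty_ratio=40   # assumed default; see /proc/sys/vm/dirty_ratio
for ram_mb in 16384 2048; do
  echo "${ram_mb} MB RAM -> ~$((ram_mb * dirty_ratio / 100)) MB dirty allowed"
done
```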
Ok, I have finally been able to reproduce the problem - on a 100G or so filesystem, on a large-memory box. So, that's a good clue about the restricted memory; since the original dd can't fit in memory, the write out behavior is different. Now that I have a place to reproduce it I probably won't need to bother you with more questions, but thank you for all the info you've provided!
Well, TBH I'm not seeing quite the dismal performance that you are, but it's still slower than ext3 and xfs.

xfs:  dd - 35.5466 seconds, 295 MB/s   sync - 0m53.619s
ext3: dd - 54.2255 seconds, 193 MB/s   sync - 0m51.485s
ext4: dd - 33.876 seconds, 310 MB/s    sync - 2m35.596s

This was a 10g buffered dd to a single spindle, 137Gb, on a 16G box, cfq, defaults for all mkfs & mount.
IMHO even that is dismal compared to ext3. And vanilla kernels don't have this problem (did you try 2.6.33, too?). Unfortunately, I don't have a good idea what could explain the difference between our ext4 results.

If you find the bug and create a patch that pushes the ext4 performance close to ext3 levels on your system, I'll be more than happy to test it on my 5.5 beta test machine. We'll see if our results are in the same ballpark or if there's still a second issue which only triggers on my system and explains the difference between our bad runs.

(FWIW: I'll be offline until the 29th.)
Today I'm seeing more like 4:30 for the run; I had been inadvertently running on a xen kernel but I can't imagine that'd make it any faster, so need to see if that was the change that mattered or not. I'm working on this one, don't worry, I'll be happy to have you test anything I come up with. -Eric
I know what's happening here, but still not sure why it differs from upstream or quite the best way to fix it.

When we do a sync, basically write_cache_pages is told to sync the whole range of the file. It starts from the first dirty page, and loops.... Nothing stops it until it gets to the end of the file.*

Then we submit IO on all that ... but only actually write out a small-ish amount. So we come back, scan the -entire- file again, and again write out a small portion. Rinse, repeat, and the nr of loops in write_cache_pages goes down by a bit each time. This is why it's only apparent on large files.

If I short-circuit __mpage_da_write_page at the end:

	if ((mpd->next_page - mpd->first_page) > 2048)
		return MPAGE_DA_EXTENT_TAIL;

to stop when we've got as many pages (well, in this case blocks) as mballoc will handle, then we have the fast behavior again.

This isn't really an acceptable fix yet, though; I need to see why we differ from upstream - perhaps something else masks the problem upstream.

* the reason ext4 w/o extents is fast is:

	if (nrblocks >= EXT4_MAX_TRANS_DATA) {
		/*
		 * With non-extent format we are limited by the journal
		 * credit available. Total credit needed to insert
		 * nrblocks contiguous blocks is dependent on the
		 * nrblocks. So limit nrblocks.
		 */
		goto flush_it;

and -this- short-circuits the scan-to-the-end-of-the-file for us. Just a crazy side-effect.
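The cost of that rescan is easy to ballpark. A back-of-the-envelope sketch (illustrative numbers, not measured) of how many page-cache lookups write_cache_pages ends up doing if every pass restarts at the first dirty page, walks to EOF, but only completes ~2048 pages:

```shell
# 10 GB file = ~2,560,000 4 KiB pages; assume ~2048 pages actually
# written per pass (the mballoc-sized chunk from the short-circuit above)
total_pages=2560000
per_pass=2048
scanned=0
dirty=$total_pages
while [ "$dirty" -gt 0 ]; do
  scanned=$((scanned + dirty))   # each pass re-walks every remaining page
  dirty=$((dirty - per_pass))
done
echo "$scanned"
```

That works out to roughly 1.6 billion page lookups for a 2.56-million-page file - quadratic in file size, which matches the profile above being dominated by find_get_pages/find_get_pages_tag.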
Turns out this is likely a problem upstream as well for some forms of sync, so I'll likely get a patch upstream and backport it. -Eric
Did you report it to/provide a patch for upstream? I would like to follow the discussion if possible.
Yes, it's here: http://marc.info/?l=linux-ext4&m=126987658403214&w=2 The same patch should apply to the rhel5 ext4 codebase, if you're interested in testing it. Are you able to build a new ext4.ko or would you like me to do that for you?
Thanks. Unfortunately, nobody responded so far.

I've downloaded the RHEL 5.5 final kernel 2.6.18-194.el5 source rpm and recompiled it on my 5.5 beta test system. I can confirm that 2.6.18-194.el5 still has the problem.

Then I've applied your patch. With the patched ext4 the problem is gone!

# /usr/bin/time bash -c "dd if=/dev/zero of=10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 17.8429 seconds, 588 MB/s
0.01user 14.92system 0:47.87elapsed 31%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+743minor)pagefaults 0swaps
Good news, I'll make sure it gets in upstream and see about getting it into the next rhel5 release. Thanks for testing! -Eric
Hm, according to the kernel rpm changelog, the fix didn't make it into the 5.5 update kernel-2.6.18-194.3.1.el5.
This wasn't discovered or resolved in time for 5.5, I'm afraid. -Eric
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I've just read the new 2.6.18-194.8.1.el5 kernel changelog: unfortunately, the patch is still missing. You posted the fix at the end of March, and comment #24 suggested that it would be included in the next kernel update. What's the status?
Karsten, by "next update" I'm afraid I meant RHEL5.6... tech previews don't usually get z-stream updates.
55138e0bc29c0751e2152df9ad35deea542f29b3 upstream ("ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks") should be doing this same limiting. However, it introduced another, similar scan-forward problem. :(

Anyway, 5.6 will have a massive update; I will post links to test kernels when available.

Thanks,
-Eric
A test kernel for RHEL5.6 is available at http://people.redhat.com/jwilson/el5/219.el5/ with many ext4 updates, including changes which may have resolved this bug. I would appreciate testing and feedback, thanks! -Eric
(Actually my original patch did make it upstream, just as an aside; c445e3e0a5c2804524dec6e55f66d63f6bc5bc3e is the commit)
http://people.redhat.com/jwilson/el5/219.el5/ doesn't exist anymore. But I've downloaded the latest jwilson kernel 2.6.18-225.el5.x86_64.rpm and grepped the kernel rpm changelog for the bug id "572930" but there was no match. Was the patch dropped again or does it still make sense to test this newer version (to make sure the fix makes it into 5.6)?
Karsten, any recent kernel under http://people.redhat.com/jwilson/el5/ should be fine to test. You won't see this bug because I didn't include it in the list of fixed bugs in the ext4 rebase, but if you find that it's resolved, I'll just dup this one. Thanks, -Eric
I've finally found some time and another available test system (running CentOS 5.5 right now, not vanilla RHEL 5.5, sorry; the old system is already in production).

First, I've tried the latest 5.5 kernel 2.6.18-194.17.1.el5:

root@test fs1 # /usr/bin/time bash -c "dd if=/dev/zero of=/net/test/fs1/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 6.07717 seconds, 1.7 GB/s
0.00user 352.95system 5:59.35elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+835minor)pagefaults 0swaps

Then I've rebooted into jwilson's kernel 2.6.18-225.el5.x86_64 and repeated the test:

root@test fs1 # /usr/bin/time bash -c "dd if=/dev/zero of=/net/test/fs1/10GB_2 bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 6.96489 seconds, 1.5 GB/s
0.00user 338.68system 5:45.53elapsed 98%CPU (0avgtext+0avgdata 6400maxresident)k
0inputs+0outputs (8major+827minor)pagefaults 0swaps
root@test fs1 # uname -a
Linux test 2.6.18-225.el5 #1 SMP Mon Sep 27 10:32:19 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

=> Only a couple of seconds faster.

Now the same test after reformatting the LV with ext3 (still 2.6.18-225.el5). (The filesystem is on an LV which is located on a single-disk VG.)

root@test fs1 # /usr/bin/time bash -c "dd if=/dev/zero of=/net/test/fs1/10GB bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 79.4931 seconds, 132 MB/s
0.00user 11.84system 1:59.04elapsed 9%CPU (0avgtext+0avgdata 6416maxresident)k
0inputs+0outputs (0major+835minor)pagefaults 0swaps

=> Much faster.

So the problem is still not fixed in 2.6.18-225.el5.x86_64, at least for me. Could you please test it again, too? I would like to know if it is really fixed on your system.
Well, my fix made it in but perhaps something else unfixed it. I'll double check, thanks.
Ah, too many trees, too many commits ... the patch that I had here which originally fixed it made it upstream very late, and thus isn't in rhel6 or rhel5 yet, I'll push to both places. Sorry for the confusion, and thanks for the testing.
This should fix it; it's the upstream commit:

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a3f138..27604aa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2371,6 +2371,15 @@ static void mpage_add_bh_to_extent(struct mpage_da_data *mpd,
 	sector_t next;
 	int nrblocks = mpd->b_size >> mpd->inode->i_blkbits;
 
+	/*
+	 * XXX Don't go larger than mballoc is willing to allocate
+	 * This is a stopgap solution. We eventually need to fold
+	 * mpage_da_submit_io() into this function and then call
+	 * ext4_get_blocks() multiple times in a loop
+	 */
+	if (nrblocks >= 8*1024*1024/mpd->inode->i_sb->s_blocksize)
+		goto flush_it;
+
 	/* check if the reserved journal credits might overflow */
 	if (!(EXT4_I(mpd->inode)->i_flags & EXT4_EXTENTS_FL)) {
 		if (nrblocks >= EXT4_MAX_TRANS_DATA) {

If you have any chance to test it, that'd be great; I ran it on a 16G RHEL5.6 machine but only slow disks - throughput didn't differ much but CPU usage was WAY down, so on faster storage I think I'd have seen better results. I'll find a better machine to test on.

Thanks,
-Eric
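As a sanity check on the cap the patch introduces (assuming a 4 KiB-block filesystem):

```shell
# The patch caps nrblocks at 8 MiB worth of blocks per allocation
# request; with 4 KiB blocks that is the same 2048 figure used in the
# earlier short-circuit experiment.
block_size=4096
cap=$((8 * 1024 * 1024 / block_size))
echo "$cap"
```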
The improvements are certainly there ;)

Timing the 10G buffered dd + sync on a 16G box, with just a single spindle, I got this w/o the patch:

10485760000 bytes (10 GB) copied, 373.516 seconds, 28.1 MB/s
0.01user 500.23system 6:14.05elapsed 133%CPU (0avgtext+0avgdata 6704maxresident)k

and this with:

10485760000 bytes (10 GB) copied, 107.205 seconds, 97.8 MB/s
0.01user 53.60system 5:01.56elapsed 17%CPU (0avgtext+0avgdata 6704maxresident)k

Sorry the fix for this got delayed, but it will be in 5.6 now.

Thanks,
-Eric
Would you mind repeating the test with ext3 on your machine, too? I would like to see if the fixed ext4 result is approx as fast as ext3.
Here's ext3 vs. old ext4 vs. patched ext4:

old ext4:
10485760000 bytes (10 GB) copied, 294.416 seconds, 35.6 MB/s
0.01user 487.24system 6:01.17elapsed 134%CPU (0avgtext+0avgdata 6688maxresident)k

ext3:
10485760000 bytes (10 GB) copied, 245.194 seconds, 42.8 MB/s
0.02user 50.53system 5:31.36elapsed 15%CPU (0avgtext+0avgdata 6704maxresident)k

patched ext4:
10485760000 bytes (10 GB) copied, 258.2 seconds, 40.6 MB/s
0.01user 32.97system 5:26.37elapsed 10%CPU (0avgtext+0avgdata 6704maxresident)k

That was just one run of each; a bit more careful testing & averaging would give a better idea of how close it is. TBH I'm still looking into ext4 writeback upstream as well; I'm sure there are more improvements to be made...

Thanks,
-Eric
in kernel-2.6.18-227.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Retested, verified, can be closed.

$ /usr/bin/time bash -c "dd if=/dev/zero of=/RHTSspareLUN1/10GBFILE bs=1M count=10000 && sync"
10000+0 records in
10000+0 records out
10485760000 bytes (10 GB) copied, 49.3003 seconds, 213 MB/s
0.00user 6.33system 0:52.42elapsed 12%CPU (0avgtext+0avgdata 6720maxresident)k
0inputs+0outputs (0major+875minor)pagefaults 0swaps

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 Beta (Tikanga)
$ uname -a
Linux ibm-x3650m3-01.lab.eng.brq.redhat.com 2.6.18-236.el5 #1 SMP Mon Dec 6 18:41:54 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html