Bug 867041 - RHEL5: on 32-bit PAE system, a process restricted from HIGHMEM may get blocked indefinitely inside balance_dirty_pages
Status: CLOSED DUPLICATE of bug 965359
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.8
Hardware: i686 Linux
Priority: high, Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Frantisek Hrbata
QA Contact: Red Hat Kernel QE team
Keywords: patch
Depends On:
Blocks: 836232 984996
Reported: 2012-10-16 11:27 EDT by Jon
Modified: 2013-11-20 03:45 EST

Doc Type: Bug Fix
Last Closed: 2013-11-20 03:45:46 EST
Type: Bug

Attachments
reproducer script, systemtap script, and logs showing mkfs stuck for over 3 minutes even after dd exits, also showing get_dirty_limits() returns different *pdirty and *pbackground values (126.90 KB, application/octet-stream)
2013-05-18 20:56 EDT, Dave Wysochanski

Description Jon 2012-10-16 11:27:16 EDT
Description of problem:
- While multiple threads perform buffered writes, an apparent race condition leaves random threads in D (uninterruptible sleep) state.  The affected thread is then never woken up or coalesced.

Version-Release number of selected component (if applicable):
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
2.6.18-308.el5PAE #1 SMP Fri Jan 27 17:40:09 EST 2012 i686 i686 i386 GNU/Linux

How reproducible:
- Consistently

Steps to Reproduce:
1.  Start dd on a unique LUN
2.  Start mkfs.ext3 on a different unique LUN
3.  A random process (dd or mkfs) enters D state indefinitely
  
Actual results:
Process enters D state indefinitely and does not return.  

Expected results:
Processes transition between the D and R states and complete their tasks.

Additional info:

Host: 
---
Red Hat Enterprise Linux Server release 6.3 (Santiago)
2.6.32-279.5.2.el6.x86_64 #1 SMP Tue Aug 14 11:36:39 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
qemu-kvm-0.12.1.2-2.295.el6_3.1.x86_64
libvirt-0.9.10-21.el6_3.4.x86_64

Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
# grep -c processor /proc/cpuinfo 
24

MemTotal:  65923168 kB

KVM Guest: 
---
Red Hat Enterprise Linux Server release 5.8 (Tikanga)
2.6.18-308.el5PAE #1 SMP Fri Jan 27 17:40:09 EST 2012 i686 i686 i386 GNU/Linux

Intel Xeon E312xx (Sandy Bridge)
# grep -c processor /proc/cpuinfo 
20

MemTotal:  8181712 kB
7808MB HIGHMEM available
896MB LOWMEM available

On node 0 totalpages: 2228224
  DMA zone: 4096 pages, LIFO batch:0
  Normal zone: 225280 pages, LIFO batch:31
  HighMem zone: 1998848 pages, LIFO batch:31

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc Virtio block device
00:07.0 RAM memory: Red Hat, Inc Virtio memory balloon
Comment 1 Jon 2012-10-16 11:34:39 EDT
- kick off dd, buffered io: 
---
[root@node0 ~]# dd if=/dev/zero of=/vasm/d8e12a1f-e249-4e6a-8ff7-fc2255cd4c46/dd.file

/dev/vdb1 on /vasm/d8e12a1f-e249-4e6a-8ff7-fc2255cd4c46 type ext3 (rw,noexec,nosuid)

- Start mkfs on different virtual disk, separate controller
[root@node0 ~]# strace mkfs.ext3 -v -E stride=16 /dev/vdc1

- mkfs.ext3 enters D state

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
 8024 root      18   0  3968  588  500 R 67.5  0.0   0:43.13 dd                                                                              
 7376 root      10  -5     0    0    0 D  7.0  0.0   0:03.25 kjournald                                                                       
13847 apl       23   0 1189m  43m 7604 S  3.0  0.5  58:24.85 java                                                                            
  407 root      15   0     0    0    0 D  0.3  0.0   0:43.06 pdflush                                                                         
  408 root      15   0     0    0    0 D  0.3  0.0   0:17.39 pdflush                                                                         
 7401 root      16   0 33568  29m  840 D  0.3  0.4   0:06.11 mkfs.ext3                                                                       
13042 apl       18   0 44160  21m 2356 S  0.3  0.3   5:45.08 procctl.pl                                                                      
22204 root      15   0  2560 1276  844 R  0.3  0.0   0:02.77 top                                                                             
    1 root      15   0  2164  732  636 S  0.0  0.0   0:02.20 init  


- No IO issued to /dev/vdc1, buffered requests to vdb1:

Device:   rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
vdb       0.00 21836.00  0.00 4784.00     0.00 105936.00    44.29   142.87   30.17   0.21 100.20
vdb1      0.00 21836.00  0.00 4784.00     0.00 105936.00    44.29   142.87   30.17   0.21 100.20
vdc       0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
vdc1      0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
       

- mkfs enters D state on write (output from strace): 

write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768) = 32768
_llseek(3, 235956207616, [235956207616], SEEK_SET) = 0
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32768

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
 8024 root      18   0  3968  588  500 R 67.8  0.0   1:02.86 dd                                                                              
 7376 root      10  -5     0    0    0 R  7.3  0.0   0:05.35 kjournald                                                                       
13847 apl       23   0 1189m  43m 7604 S  3.3  0.5  58:25.28 java                                                                            
  409 root      10  -5     0    0    0 S  1.7  0.0   1:01.76 kswapd0                                                                         
  407 root      15   0     0    0    0 D  0.7  0.0   0:43.23 pdflush                                                                         
  408 root      15   0     0    0    0 R  0.7  0.0   0:17.57 pdflush                                                                         
 1794 apl       18   0 93468  14m 6452 S  0.3  0.2   0:00.41 sm_writer                                                                       
 7401 root      16   0 33568  29m  840 D  0.3  0.4   0:06.12 mkfs.ext3                                                                       
          
- Plenty of lowmem:
LowTotal:       710624 kB
LowFree:        497588 kB
SwapTotal:     2097144 kB
SwapFree:      2076216 kB
Dirty:         1422296 kB
Writeback:           0 kB
AnonPages:      859872 kB
Mapped:          99192 kB
Slab:           190904 kB

- kill dd (CTRL+C)
- # sync
- restart dd

- notice dd switches to serialized 4k request size and bypasses vfs/cache: 
---
Device:      rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
vdb          0.00     0.00 38.00  4.00   152.00    16.00     8.00     1.07   25.50  23.71  99.60
vdb1         0.00     0.00 38.00  4.00   152.00    16.00     8.00     1.07   25.50  23.71  99.60
vdc          0.00 84388.00  0.00 4751.00     0.00 88996.00    37.46    95.23   19.87   0.17  82.60
vdc1         0.00 84388.00  0.00 4751.00     0.00 88996.00    37.46    95.23   19.87   0.17  82.60

- dd is now in D state: 
- mkfs resumed
---
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
22439 root      16   0 11132 4056 2440 R 24.3  0.0   0:56.72 sshd                                                                            
 7400 root      16   0  1992  644  552 R 20.6  0.0   0:20.47 strace                                                                          
 7401 root      15   0 33568  29m  840 T 15.3  0.4   0:14.77 mkfs.ext3                                                                       
  409 root      10  -5     0    0    0 S  2.0  0.0   1:06.35 kswapd0                                                                         
13528 root      18   0  3968  564  484 D  2.0  0.0   0:01.75 dd                                                                              

- # sync (dd awakes, mkfs enters D state)

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                         
13528 root      18   0  3968  592  500 R 76.2  0.0   2:25.23 dd                                                                              
 7376 root      10  -5     0    0    0 D  7.3  0.0   0:32.70 kjournald                                                                       
  409 root      10  -5     0    0    0 S  2.3  0.0   1:09.13 kswapd0                                                                         
  408 root      15   0     0    0    0 D  0.7  0.0   0:20.19 pdflush                                                                         
14327 apl       18   0  181m  29m  11m S  0.7  0.4  43:55.86 cam_mretr                                                                       
  407 root      15   0     0    0    0 D  0.3  0.0   0:45.81 pdflush                                                                         
 7401 root      16   0 33568  29m  840 D  0.3  0.4   0:15.50 mkfs.ext3                                                                        

- dd now buffering
---
Time: 02:22:06 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    3.60   17.43    0.00   78.47

Device:     rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
vdb         0.00 21057.00  1.00 4541.00     4.00 102156.00    44.98   143.56   31.83   0.22 100.20
vdb1        0.00 21057.00  1.00 4541.00     4.00 102156.00    44.98   143.56   31.83   0.22 100.20
vdc         0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
vdc1        0.00     0.00  0.00  0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
Comment 2 Jon 2012-10-16 11:36:19 EDT
Default: 
  # cat /proc/sys/vm/lowmem_reserve_ratio 
  256	256	32

Altered: 
vm.lowmem_reserve_ratio = 256 256 9
vm.dirty_expire_centisecs=500
vm.dirty_writeback_centisecs=100

Result: 
No Change.
Comment 3 Jon 2012-10-16 11:37:41 EDT
crash> bt 0xf7e5f000
PID: 14553  TASK: f7e5f000  CPU: 18  COMMAND: "mkfs.ext3"
 #0 [f1858cb0] schedule at c0622a9a
 #1 [f1858d28] schedule_timeout at c06231de
 #2 [f1858d4c] io_schedule_timeout at c0622caf

 #3 [f1858d5c] blk_congestion_wait at c04e557f
 #4 [f1858d80] balance_dirty_pages_ratelimited_nr at c045d956

 #5 [f1858dd0] generic_file_buffered_write at c0459cb8
 #6 [f1858e48] __generic_file_aio_write_nolock at c045a2b2
 #7 [f1858ec8] generic_file_aio_write_nolock at c045a5df
 #8 [f1858ef0] generic_file_write_nolock at c045a976
 #9 [f1858f74] blkdev_file_write at c047d5aa
#10 [f1858f84] vfs_write at c0476f65
#11 [f1858f9c] sys_write at c047758c
#12 [f1858fb8] system_call at c0404f44
    EAX: ffffffda  EBX: 00000003  ECX: 09c61898  EDX: 00008000 
    DS:  007b      ESI: 00008000  ES:  007b      EDI: 00000048
    SS:  007b      ESP: bf944f54  EBP: bf944f98
    CS:  0073      EIP: 00666402  ERR: 00000004  EFLAGS: 00000246

crash> set 14553
    PID: 14553
COMMAND: "mkfs.ext3"
   TASK: f7e5f000  [THREAD_INFO: f1858000]
    CPU: 18
  STATE: TASK_UNINTERRUPTIBLE 

----8<----
#10 [f1858f84] vfs_write at c0476f65
    [RA: c0477591  SP: f1858f88  FP: f1858f9c  SIZE: 24]
    f1858f88: f1858fa4  e9afc540  fffffff7  00000048  
    f1858f98: f1858000  c0477591 
---->8----

crash> p ((struct file*)0xe9afc540)->f_dentry->d_name.name
$3 = (const unsigned char *) 0xf7a7f534 "vdc2"

crash> p ((struct file*)0xe9afc540)->f_vfsmnt->mnt_devname
$6 = 0xf7a25640 "/dev"
Comment 4 Jon 2012-10-16 14:51:06 EDT
Issue is not repeatable using 64-bit guest os.
Comment 5 Dave Wysochanski 2013-05-18 20:56:06 EDT
Created attachment 749897
reproducer script, systemtap script, and logs showing mkfs stuck for over 3 minutes even after dd exits, also showing get_dirty_limits() returns different *pdirty and *pbackground values
Comment 6 Dave Wysochanski 2013-05-18 21:39:42 EDT
I think there may be something really weird going on in 32-bit PAE systems.  And I think it may have to do with get_dirty_limits.

I just ran the reproducer while running a systemtap script, and even after the 'dd' exited, mkfs was hung in that loop for at least 3 minutes.  This is the first time I've observed this type of behavior.  This makes no sense to me.

The question is why the normal pdflush and background writeout don't get the system back to a normal state where mkfs can continue, even after 'dd' has exited.  It looks like mkfs is stuck inside balance_dirty_pages() because of unexpectedly low numbers coming back from get_dirty_limits() (see below).  In essence, mkfs is waiting for the system-wide count of dirty and writeback pages to drop below a dirty_thresh that is much lower than it should be.  Since mkfs did not dirty those pages itself, and calls into writeback_inodes() with a non-NULL bdi (so it only writes back its own device), it just spins there waiting.  As a result, this condition doesn't become true without some explicit 'sync':
		if (nr_reclaimable + global_page_state(NR_WRITEBACK) <=
			dirty_thresh)
				break;

See files testrun-capturing-mkfs-stuck-for-over-3-minutes.txt and stap-capturing-mkfs-stuck-for-over-3-minutes.txt in https://bugzilla.redhat.com/attachment.cgi?id=749897

The main question seems to now be, why are we getting a different *pdirty and *pbackground number from get_dirty_limits() when mkfs calls it vs dd or pdflush?   On the 32-bit PAE system, mkfs calls get_dirty_limits() and gets back numbers that are 1/4 what they are when 'dd' or 'pdflush' calls this same function.  This same discrepancy is not seen on a 64-bit machine running the same reproducer and stap - the pdirty and pbackground are the same for dd, pdflush, and mkfs.  I can't spot in the code what would be causing the difference.  I'll have to look more later.

reproducer testbed:

[Sun May 19 00:05:24 2013] writeback_inodes.call: pid = 23840, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:05:24 2013] writeback_inodes.return: pid = 23840, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:05:24 2013] get_dirty_limits.return: pid = 23840, execname = pdflush, *pbackground = 86507, *pdirty = 360448
[Sun May 19 00:05:24 2013] get_dirty_limits.return: pid = 24662, execname = dd, *pbackground = 86507, *pdirty = 360448
[Sun May 19 00:05:24 2013] get_dirty_limits.return: pid = 24662, execname = dd, *pbackground = 86507, *pdirty = 360448
...
[Sun May 19 00:05:29 2013] writeback_inodes.call: pid = 24665, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:05:29 2013] writeback_inodes.return: pid = 24665, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:05:29 2013] writeback_inodes.call: pid = 24165, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0xf60eaf94, nr_to_write = 1024
[Sun May 19 00:05:29 2013] writeback_inodes.return: pid = 24165, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0xf60eaf94, nr_to_write = 1024
[Sun May 19 00:05:29 2013] get_dirty_limits.return: pid = 24667, execname = mkfs.ext3, *pbackground = 21627, *pdirty = 90112
[Sun May 19 00:05:29 2013] writeback_inodes.call: pid = 24667, execname = mkfs.ext3, bdi = 0xcbd78cac, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1536
[Sun May 19 00:05:29 2013] writeback_inodes.return: pid = 24667, execname = mkfs.ext3, bdi = 0xcbd78cac, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1536
[Sun May 19 00:05:29 2013] get_dirty_limits.return: pid = 24667, execname = mkfs.ext3, *pbackground = 21627, *pdirty = 90112
[Sun May 19 00:05:29 2013] get_dirty_limits.return: pid = 24662, execname = dd, *pbackground = 86507, *pdirty = 360448
[Sun May 19 00:05:29 2013] get_dirty_limits.return: pid = 23840, execname = pdflush, *pbackground = 86507, *pdirty = 360448
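
To make the effect of the two thresholds concrete, here is a minimal userspace sketch of the loop-exit test quoted above. It is not kernel code: may_stop_throttling(), check_example() and EXAMPLE_DIRTY_PAGES are names made up for this illustration, and the page count of 300000 is a hypothetical value, not one taken from the logs. Only the two *pdirty values (360448 and 90112) come from the trace.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical system-wide dirty page count; NOT taken from the logs. */
#define EXAMPLE_DIRTY_PAGES 300000UL

/* Mirrors the loop-exit test quoted from balance_dirty_pages():
 * a writer may leave the throttling loop once dirty plus writeback
 * pages are at or below its own dirty threshold. */
bool may_stop_throttling(unsigned long nr_reclaimable,
                         unsigned long nr_writeback,
                         unsigned long dirty_thresh)
{
    return nr_reclaimable + nr_writeback <= dirty_thresh;
}

/* Same system-wide page counts, two different per-caller thresholds. */
void check_example(void)
{
    /* dd sees *pdirty = 360448 and is allowed to continue. */
    assert(may_stop_throttling(EXAMPLE_DIRTY_PAGES, 0, 360448UL));
    /* mkfs.ext3 sees *pdirty = 90112 and keeps waiting. */
    assert(!may_stop_throttling(EXAMPLE_DIRTY_PAGES, 0, 90112UL));
}
```

Under these assumptions, any dirty page count between 90112 and 360448 lets dd keep writing while mkfs.ext3 stays in the loop, which would be consistent with the hang observed here.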


my 64-bit testbed: *pbackground and *pdirty always the same for pdflush, dd, and mkfs.ext3

[Sun May 19 00:47:47 2013] writeback_inodes.return: pid = 10286, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:47:47 2013] writeback_inodes.return: pid = 10284, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 10280, execname = dd, *pbackground = 49788, *pdirty = 99576
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 10280, execname = dd, *pbackground = 49788, *pdirty = 99576
[Sun May 19 00:47:47 2013] writeback_inodes.call: pid = 10280, execname = dd, bdi = 0xffff8101539c2e18, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1536
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 10291, execname = mkfs.ext3, *pbackground = 49788, *pdirty = 99576
[Sun May 19 00:47:47 2013] writeback_inodes.call: pid = 10291, execname = mkfs.ext3, bdi = 0xffff8101519e9aa8, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1536
[Sun May 19 00:47:47 2013] writeback_inodes.return: pid = 10291, execname = mkfs.ext3, bdi = 0xffff8101519e9aa8, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1536
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 10291, execname = mkfs.ext3, *pbackground = 49788, *pdirty = 99576
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 9471, execname = pdflush, *pbackground = 49788, *pdirty = 99576
[Sun May 19 00:47:47 2013] writeback_inodes.call: pid = 9471, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:47:47 2013] writeback_inodes.return: pid = 9471, execname = pdflush, bdi = 0x0, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1024
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 10288, execname = pdflush, *pbackground = 49788, *pdirty = 99576
[Sun May 19 00:47:47 2013] get_dirty_limits.return: pid = 10283, execname = pdflush, *pbackground = 49788, *pdirty = 99576
Comment 7 Dave Wysochanski 2013-05-18 21:48:20 EDT
For the reproducer at least, it looks like the difference is the CONFIG_HIGHMEM clause at the top of get_dirty_limits(), which reduces available_memory, and as a result we get different values for *pdirty and *pbackground.  (It's *pdirty that we really care about, since that affects exit from the loop.)  So it seems this bug is that a process which is restricted from using HIGHMEM may get blocked indefinitely inside balance_dirty_pages().

This does not happen for 'dd'.

[Sun May 19 01:43:17 2013] get_dirty_limits.return: pid = 27351, execname = pdflush, *pbackground = 86507, *pdirty = 360448, mapping = 0x0
[Sun May 19 01:43:17 2013] get_dirty_limits.call: pid = 27678, execname = dd, *pbackground = 0, *pdirty = 481603584, mapping->flags = 0x200d2   <-------------- __GFP_HIGHMEM set for 'dd'
[Sun May 19 01:43:17 2013] get_dirty_limits.return: pid = 27678, execname = dd, *pbackground = 86507, *pdirty = 360448, mapping = 0xddd6dd04
....
[Sun May 19 01:45:59 2013] get_dirty_limits.return: pid = 27685, execname = mkfs.ext3, *pbackground = 21627, *pdirty = 90112, mapping = 0xcbd28a70
[Sun May 19 01:45:59 2013] get_dirty_limits.call: pid = 27685, execname = mkfs.ext3, *pbackground = 21627, *pdirty = 90112, mapping->flags = 0x200d0  <-------- __GFP_HIGHMEM not set for 'mkfs'
[Sun May 19 01:45:59 2013] get_dirty_limits.return: pid = 27685, execname = mkfs.ext3, *pbackground = 21627, *pdirty = 90112, mapping = 0xcbd28a70
[Sun May 19 01:45:59 2013] writeback_inodes.call: pid = 27685, execname = mkfs.ext3, bdi = 0xcbd78cac, sync_mode = 0, older_than_this = 0x0, nr_to_write = 1536


static void
get_dirty_limits(long *pbackground, long *pdirty,
					struct address_space *mapping)
{
	int background_ratio;		/* Percentages */
	int dirty_ratio;
	int unmapped_ratio;
	long background;
	long dirty;
	unsigned long available_memory = total_pages;
	struct task_struct *tsk;

#ifdef CONFIG_HIGHMEM
	/*
	 * If this mapping can only allocate from low memory,
	 * we exclude high memory from our count.
	 */
	if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
		available_memory -= totalhigh_pages;
#endif


	unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
				global_page_state(NR_ANON_PAGES)) * 100) /
					total_pages;

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else {
		dirty_ratio = vm_dirty_ratio;

		/* if vm_dirty_ratio is 100 dont limit to 1/2 unmapped_ratio */
		if ((dirty_ratio > unmapped_ratio / 2) && (dirty_ratio != 100))
			dirty_ratio = unmapped_ratio / 2;

		if (dirty_ratio < 5)
			dirty_ratio = 5;

		dirty = (dirty_ratio * available_memory) / 100;
	}

	if (dirty_background_bytes)
		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
	else {
		background_ratio = dirty_background_ratio;
		if (background_ratio >= dirty_ratio)
			background_ratio = dirty_ratio / 2;

		background = (background_ratio * available_memory) / 100;
	}
	tsk = current;
	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
		background += background / 4;
		dirty += dirty / 4;
	}
	*pbackground = background;
	*pdirty = dirty;
}

include/linux/gfp.h
#define __GFP_HIGHMEM   ((__force gfp_t)0x02u)

static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
        return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
}
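
As a quick cross-check of the two mapping->flags values in the systemtap output earlier in this comment, the following userspace sketch tests the 0x02 bit directly. GFP_HIGHMEM_BIT, mapping_allows_highmem() and check_logged_flags() are made-up names for this illustration. The sketch also skips the __GFP_BITS_MASK step of mapping_gfp_mask(), on the assumption that the mask covers bit 0x02.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric value of __GFP_HIGHMEM as quoted from include/linux/gfp.h. */
#define GFP_HIGHMEM_BIT 0x02u

/* Hypothetical helper: true if a mapping with these flags may
 * allocate its page cache pages from high memory. */
bool mapping_allows_highmem(unsigned long mapping_flags)
{
    return (mapping_flags & GFP_HIGHMEM_BIT) != 0;
}

/* Self-check against the two flag values logged by the stap script. */
void check_logged_flags(void)
{
    assert(mapping_allows_highmem(0x200d2UL));   /* dd's file mapping */
    assert(!mapping_allows_highmem(0x200d0UL));  /* mkfs.ext3 on the block device */
}
```

The two logged values differ only in that bit, which matches the annotations on the log lines: dd's mapping takes the full-memory path in get_dirty_limits(), while the mapping used by mkfs.ext3 has totalhigh_pages subtracted.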


crash> p totalhigh_pages
totalhigh_pages = $2 = 1081341
crash> p total_pages
total_pages = $3 = 1441792
crash> pd (1441792 - 1081341)
$4 = 360451
crash> p vm_dirty_ratio
vm_dirty_ratio = $5 = 25
crash> p 360451 / 4
$6 = 90112
crash> p 1441792 / 4
$7 = 360448
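
The crash arithmetic above can be reproduced with a small userspace model of the ratio path in get_dirty_limits(). This is a simplified sketch, not the kernel function: model_dirty_limit() is a made-up name, and the model deliberately ignores the unmapped_ratio clamp, vm_dirty_bytes / dirty_background_bytes, and the PF_LESS_THROTTLE adjustment.

```c
#include <assert.h>
#include <stdbool.h>

/* Values printed by crash in this comment. */
#define TOTAL_PAGES      1441792UL
#define TOTALHIGH_PAGES  1081341UL
#define VM_DIRTY_RATIO   25UL

/*
 * Simplified model of the ratio-based limit in get_dirty_limits().
 * highmem_ok corresponds to the mapping's gfp mask having
 * __GFP_HIGHMEM set.
 */
unsigned long model_dirty_limit(unsigned long total, unsigned long high,
                                unsigned long ratio, bool highmem_ok)
{
    unsigned long available = total;

    assert(high <= total);
    if (!highmem_ok)
        available -= high;       /* the CONFIG_HIGHMEM clause */
    return ratio * available / 100;
}
```

With the values above, model_dirty_limit(TOTAL_PAGES, TOTALHIGH_PAGES, VM_DIRTY_RATIO, true) gives 360448 and the same call with false gives 90112, matching the *pdirty values logged for dd and mkfs.ext3. A background ratio of 6 is not shown anywhere in this bug; it is only inferred here because it reproduces the logged *pbackground values (86507 and 21627).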
Comment 9 Lachlan McIlroy 2013-05-19 03:16:19 EDT
Great work Dave - it would certainly explain the problem.

This patch looks like it addresses this very issue but it unconditionally uses the smaller dirty value so it will cause the throttle to kick in much earlier than expected (although consistently earlier).

commit dc6e29da9162fa8fa2a9e798569c0f6e87975614
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date:   Mon Jan 29 16:37:38 2007 -0800

    Fix balance_dirty_page() calculations with CONFIG_HIGHMEM
    
    This makes balance_dirty_page() always base its calculations on the
    amount of non-highmem memory in the machine, rather than try to base it
    on total memory and then falling back on non-highmem memory if the
    mapping it was writing wasn't highmem capable.
    
    This not only fixes a situation where two different writers can have
    wildly different notions about what is a "balanced" dirty state, but it
    also means that people with highmem machines don't run into an OOM
    situation when regular memory fills up with dirty pages.
    
    We used to try to handle the latter case by scaling down the dirty_ratio
    if the machine had a lot of highmem pages in page_writeback_init(), but
    it wasn't aggressive enough for some situations, and since basing the
    dirty ratio on highmem memory was broken in the first place, let's just
    stop doing so.
    
    (A variation of this theme fixed Justin Piszcz's OOM problem when
    copying an 18GB file on a RAID setup).
    
    Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Justin Piszcz <jpiszcz@lucidpixels.com>
    Cc: Andrew Morton <akpm@osdl.org>
    Cc: Neil Brown <neilb@suse.de>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Randy Dunlap <rdunlap@xenotime.net>
    Cc: Christoph Lameter <clameter@sgi.com>
    Cc: Jens Axboe <jens.axboe@oracle.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Adrian Bunk <bunk@stusta.de>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1d2fc89..be0efbd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -133,11 +133,9 @@ get_dirty_limits(long *pbackground, long *pdirty,
 
 #ifdef CONFIG_HIGHMEM
        /*
-        * If this mapping can only allocate from low memory,
-        * we exclude high memory from our count.
+        * We always exclude high memory from our count.
         */
-       if (mapping && !(mapping_gfp_mask(mapping) & __GFP_HIGHMEM))
-               available_memory -= totalhigh_pages;
+       available_memory -= totalhigh_pages;
 #endif
 
 
@@ -526,28 +524,25 @@ static struct notifier_block __cpuinitdata ratelimit_nb = {
 };
 
 /*
- * If the machine has a large highmem:lowmem ratio then scale back the default
- * dirty memory thresholds: allowing too much dirty highmem pins an excessive
- * number of buffer_heads.
+ * Called early on to tune the page writeback dirty limits.
+ *
+ * We used to scale dirty pages according to how total memory
+ * related to pages that could be allocated for buffers (by
+ * comparing nr_free_buffer_pages() to vm_total_pages.
+ *
+ * However, that was when we used "dirty_ratio" to scale with
+ * all memory, and we don't do that any more. "dirty_ratio"
+ * is now applied to total non-HIGHPAGE memory (by subtracting
+ * totalhigh_pages from vm_total_pages), and as such we can't
+ * get into the old insane situation any more where we had
+ * large amounts of dirty pages compared to a small amount of
+ * non-HIGHMEM memory.
+ *
+ * But we might still want to scale the dirty_ratio by how
+ * much memory the box has..
  */
 void __init page_writeback_init(void)
 {
-       long buffer_pages = nr_free_buffer_pages();
-       long correction;
-
-       correction = (100 * 4 * buffer_pages) / vm_total_pages;
-
-       if (correction < 100) {
-               dirty_background_ratio *= correction;
-               dirty_background_ratio /= 100;
-               vm_dirty_ratio *= correction;
-               vm_dirty_ratio /= 100;
-
-               if (dirty_background_ratio <= 0)
-                       dirty_background_ratio = 1;
-               if (vm_dirty_ratio <= 0)
-                       vm_dirty_ratio = 1;
-       }
        mod_timer(&wb_timer, jiffies + dirty_writeback_interval);
        writeback_set_ratelimit();
        register_cpu_notifier(&ratelimit_nb);
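
For comparison, here is a rough userspace sketch of what the first hunk of this patch would mean for the testbed numbers from comment 7. patched_dirty_limit() is a made-up name, and the sketch assumes vm_dirty_ratio stays at 25. That is only an assumption: the same commit also removes the boot-time scaling in page_writeback_init(), so the effective ratio on a patched kernel may differ.

```c
#include <assert.h>

/*
 * Simplified post-patch arithmetic: high memory is always excluded,
 * regardless of the mapping's gfp mask, so every writer gets the
 * same limit.
 */
unsigned long patched_dirty_limit(unsigned long total, unsigned long high,
                                  unsigned long ratio)
{
    assert(high <= total);
    return ratio * (total - high) / 100;
}
```

Under that assumption, patched_dirty_limit(1441792, 1081341, 25) returns 90112 for dd, pdflush and mkfs.ext3 alike. That is one quarter of the 360448 pages dd was previously allowed, which illustrates the earlier but consistent throttling mentioned in comment 9.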
Comment 12 RHEL Product and Program Management 2013-07-24 00:12:29 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
