Bug 1401012
| Summary: | OOM but no swap used | | |
| --- | --- | --- | --- |
| Product: | [Fedora] Fedora | Reporter: | Peter Backes <rtc> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 25 | CC: | andreas.kriegl, bugs, buhrt, cz172638, dtucker, frank, gansalmon, ichavero, itamar, jforbes, jonathan, kernel-maint, madhu.chinakonda, mchehab, paolini, pbrobinson, stepglenn, trevor |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-04-19 18:14:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description: Peter Backes, 2016-12-02 14:54:32 UTC
[a little more detail and links to the live threads]

I am fighting the same issue after upgrading a KVM guest to F25.

It appears you have the same problem (many) others are having with still running a PAE kernel. There seems to be recent work on a fix: https://bugzilla.redhat.com/show_bug.cgi?id=1373339 An older ticket from when the problem seems to have started: https://bugzilla.redhat.com/show_bug.cgi?id=1075185

For my testing I have munin logging the guest, and I am now at only 3GB of 10GB of (guest-assigned) memory with no swap used... and rsync, Tomcat, and anything else that moves now gets nailed. I am testing:

echo 1 > /proc/sys/vm/overcommit_memory

All my 64-bit hosts and guests are fine. Only the busy 32-bit PAE guest is getting OOM-killed.

This is still happening on kernel-PAE-4.9.0-1.fc26.i686.

32-bit x86 is a low priority for the Fedora kernel team and relies on greater community effort for support. You are more likely to get feedback by reporting issues directly upstream.

I also see this when running rsync to an F24 32-bit PAE or non-PAE target. A folder with a large tree always dies on the destination with the OOM. Early F24 32-bit kernels did not do it.

This may be the bug I am seeing (large dir scans cause the oom-killer, then a hang/panic, on 32-bit PAE). I have git bisected my bug in the kernel to:

commit b2e18757f2c9d1cdd746a882e9878852fdec9501
Author: Mel Gorman <mgorman>
Date: Thu Jul 28 15:45:37 2016 -0700
mm, vmscan: begin reclaiming pages on a per-node basis

I have posted about it to the LKML and those guys are helping out; some work has already been done on it and I am testing their patches now. Keep an eye on the LKML thread I started: https://lkml.org/lkml/2017/1/11/182 I'll report back if/when a patch fixes the problem, and hopefully someone here in Fedora-land can apply the patch to a Fedora 24 errata kernel.

Trevor: from your LKML post:
> I tuned the RAM down because around 8GB the PAE kernel has massive IO speed issues
I suggest trying "echo 1 > /proc/sys/vm/highmem_is_dirtyable" as per https://bugzilla.redhat.com/show_bug.cgi?id=1373339#c24 which made a huge difference for me.

Trevor: also:
> [it's] rsync or rdiff-backup also doing big dir scans
In my case it was rsync.
> when I do "find /" manually I can't trigger the bug.
I think it's the file reads, not the directory scans. If that's the case, execing dd from find to read the files should trigger it:

find / -type f -exec dd of=/dev/null if='{}' \;

The kernel guys (Mel & Michal) have been working with me a lot on this issue, but so far no patch solves the problem. We are continuing, and those interested can google the LKML thread referenced in comment #5. However, we seem to have a workaround that everyone here can employ immediately. Add to your kernel boot options:

cgroup_disable=memory

Usually you'd put this in the /etc/default/grub file in the GRUB_CMDLINE_LINUX section, regenerate your grub.cfg, and reboot. With that option my testing didn't crash over 3 nights, when 90% of the time it crashes in 1 night and 100% within 2 nights. I'm hazy as to what the option actually does, but I suppose it's relatively harmless, at least compared to the oom/panic alternative. I'll post again when we have a real patch to solve this bug.

On a 4.8.16-200 PAE the cgroup_disable=memory didn't help; I still get OOM kills when rsync tries to scan a very large folder (ext3, dir_index) for updating.

Hmm, did you confirm the cgroup_disable=memory actually took effect by checking /proc/cmdline?
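For completeness, here is the whole workaround as commands. This is a minimal sketch assuming a BIOS install of Fedora with grub2 (EFI systems keep grub.cfg under /boot/efi/EFI/fedora/ instead), so adjust paths to taste:

# append the option to the default kernel command line
sudo sed -i 's/^GRUB_CMDLINE_LINUX="/&cgroup_disable=memory /' /etc/default/grub
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
# after the reboot, confirm the option actually took effect:
grep -q cgroup_disable=memory /proc/cmdline && echo "workaround active"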
Perhaps it also has to do with the amount of RAM. If yours is very large, try constraining it to something like 6G with another boot option: mem=6G. PAE gets worse the more RAM you have. On my box with 16GB, if I constrain it to 6G and add cgroup_disable=memory, it didn't oom or crash for 3+ days. Without cgroup_disable=memory it oom/crashes in 1 day 90% of the time, and always within 2 days. I'll report back when I get some more solutions/ideas. However, it definitely appears that if you want >4GB of RAM you need to think about switching to a 64-bit OS. I'm still trying to figure out a way to trick dnf into allowing an "upgrade" to 64-bit (kernel and userland). Complete wipe/reinstalls would be a major pain.

Certainly, cgroup_disable=memory doesn't stop all OOM killing, and maybe it shouldn't. Remember that the 4.7 kernel totally reworked the OOM management (for good or bad), and while this issue is one that affects only 32-bit systems, there are lots of other valid reasons for an OOM to occur. I know that I'm also seeing 1 or 2 a week on 64-bit systems, where previously I didn't see any. With regard to "upgrading" to a 64-bit system, I've done it in the past (somewhere around F10, with yum) and you can do it, with lots of effort. I know I needed to put together a few scripts to edit RPM lists. Even now I still find a few funny RPMs lurking around.

We apparently have the same issue on many boxes with kernel-PAE after upgrading to F25 (possibly the issue is related to kernels 4.8 or 4.9). The problem is systematic on a computer that we use to do large backups (with rsync); however, it also occasionally happens with large system upgrades using dnf. No problem at all with kernels up to 4.7. I tried adding cgroup_disable=memory to the kernel boot options, but it did not help. This is a snippet from /var/log/messages:

[...]
Jan 28 21:29:32 solaris kernel: Linux version 4.9.5-200.fc25.i686+PAE (mockbuild.fedoraproject.org) (gcc version 6.3.1 20161221 (Red Hat 6.3.1-1) (GCC) ) #1 SMP Fri Jan 20 12:43:13 UTC 2017
[...]
Jan 28 21:29:32 solaris kernel: Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.9.5-200.fc25.i686+PAE root=UUID=a7916f45-3648-405b-b127-9ab6044d566b ro LANG=en_US.UTF-8 cgroup_disable=memory
[...]
Jan 28 21:33:40 solaris kernel: rsync invoked oom-killer: gfp_mask=0x25000c0(GFP_KERNEL_ACCOUNT), nodemask=0, order=0, oom_score_adj=0
Jan 28 21:33:40 solaris kernel: rsync cpuset=/ mems_allowed=0
Jan 28 21:33:40 solaris kernel: CPU: 0 PID: 1432 Comm: rsync Tainted: G W 4.9.5-200.fc25.i686+PAE #1
Jan 28 21:33:40 solaris kernel: Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
Jan 28 21:33:40 solaris kernel: e8ddbc48 dcb6c6c0 e8ddbd74 f6b43600 e8ddbc78 dc9e7c10 00000206 e8ddbc78
Jan 28 21:33:40 solaris kernel: dcb7237f dd273940 e8ddbc7c eabdb180 dcf889b4 f6b43600 dd17cb55 e8ddbd74
Jan 28 21:33:40 solaris kernel: e8ddbcbc dc9810bd dc87717a e8ddbca8 dc980d49 00000003 00000000 000000b7
Jan 28 21:33:40 solaris kernel: Call Trace:
Jan 28 21:33:40 solaris kernel: [<dcb6c6c0>] dump_stack+0x58/0x78
Jan 28 21:33:40 solaris kernel: [<dc9e7c10>] dump_header+0x64/0x1a6
Jan 28 21:33:40 solaris kernel: [<dcb7237f>] ? ___ratelimit+0x9f/0x100
Jan 28 21:33:40 solaris kernel: [<dc9810bd>] oom_kill_process+0x1fd/0x3c0
Jan 28 21:33:40 solaris kernel: [<dc87717a>] ? has_capability_noaudit+0x1a/0x30
Jan 28 21:33:40 solaris kernel: [<dc980d49>] ? oom_badness.part.13+0xc9/0x140
Jan 28 21:33:40 solaris kernel: [<dc981585>] out_of_memory+0xf5/0x2b0
[...]

Maurizio, how much RAM do you have?
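As an aside, if you want the /proc/sys/vm experiments from the earlier comments (overcommit_memory, highmem_is_dirtyable) to survive a reboot, here is a sketch using a sysctl drop-in. The file name is made up, and the two values are just the experiments from above, not a recommendation:

# hypothetical drop-in; remove the file to undo the experiment
cat <<'EOF' | sudo tee /etc/sysctl.d/90-pae-oom-test.conf
vm.overcommit_memory = 1
vm.highmem_is_dirtyable = 1
EOF
sudo sysctl --system   # reload all sysctl config without rebooting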
It seems that for cgroup_disable=memory to work you also need to limit your RAM. Also add mem=6G to the boot command (or really anything 2-6G; I have good success with 3G, and also with 6G). Confirm with top or meminfo once you're booted. The folks for whom cgroup_disable=memory doesn't work have, I'm betting, gobs of RAM, certainly >8G. All PAE problems (OOM and I/O problems especially) get worse the more RAM you have. If someone here still sees it oom/crash with the option on AND RAM limited, I'd be shocked.

Linus (and others) have already said you shouldn't run PAE with >4G. I run a ton of systems with PAE >4G just because going onsite to put heads on them to wipe/reinstall them as 64-bit would be a major undertaking. However, once I figure out a way to upgrade headlessly (if possible), I'm going to do it. It's obvious that PAE is the ugly stepchild that Linus hates, that RH says they don't support (much) (see comment #3), and that gets almost no testing by kernel devs compared to 64-bit. This is the 4th PAE-only kernel bug that I've bisected/LKML'd in the last year, and it's getting to be a pain. I'm just a lowly user, not an LKML wizard. I'm sure the main reason everyone here is using PAE is the same as mine: we've just kept upgrading these boxes (which long ago were 32-bit only) without ever reinstalling from scratch. I suppose some might have 32-bit-only apps/drivers, but even that's not an imperative reason anymore. I kind of wish Linus would just come out and say "PAE shouldn't be run and is unsupported with RAM >X GB", where X is whatever (4 or 8GB). Pretending it's a normal, supported, tested OS is just a myth these days. Yet that's the impression we all get: "Oh, you have 64GB of RAM, PAE is perfect!" when it's not.

The LKML thread is ongoing (~30 emails already), and I'm doing a ton of testing. A fix patch may actually be out there already; I just need to confirm it. https://lkml.org/lkml/2017/1/11/182

Right. I have 16GB of RAM. I just rebooted with the addition of mem=6G:

# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.9.5-200.fc25.i686+PAE root=UUID=a7916f45-3648-405b-b127-9ab6044d566b ro LANG=en_US.UTF-8 cgroup_disable=memory mem=6G

(verified with top). I tested the backup script... and this is the outcome:

----------------------- /var/log/messages --------------------
Jan 28 23:38:52 solaris systemd-journald: System journal (/var/log/journal/) is 816.1M, max 1.9G, 1.1G free.
Jan 28 23:38:55 solaris kernel: rsync cpuset=/ mems_allowed=0
Jan 28 23:38:55 solaris kernel: CPU: 1 PID: 1562 Comm: rsync Tainted: G W 4.9.5-200.fc25.i686+PAE #1
Jan 28 23:38:55 solaris kernel: Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
Jan 28 23:38:55 solaris kernel: ef8d9ba4 ca76c6c0 ef8d9cd0 f0f99200 ef8d9bd4 ca5e7c10 00000206 ef8d9bd4
Jan 28 23:38:55 solaris kernel: ca77237f cae73940 ef8d9bd8 f6b598c0 cab889b4 f0f99200 cad7cb55 ef8d9cd0
Jan 28 23:38:55 solaris kernel: ef8d9c18 ca5810bd ca47717a ef8d9c04 ca580d49 00000003 00000000 0000015b
Jan 28 23:38:55 solaris kernel: Call Trace:
Jan 28 23:38:55 solaris kernel: [<ca76c6c0>] dump_stack+0x58/0x78
Jan 28 23:38:55 solaris kernel: [<ca5e7c10>] dump_header+0x64/0x1a6
Jan 28 23:38:55 solaris kernel: [<ca77237f>] ? ___ratelimit+0x9f/0x100
Jan 28 23:38:55 solaris kernel: [<ca5810bd>] oom_kill_process+0x1fd/0x3c0
Jan 28 23:38:55 solaris kernel: [<ca47717a>] ? has_capability_noaudit+0x1a/0x30
Jan 28 23:38:55 solaris kernel: [<ca580d49>] ? oom_badness.part.13+0xc9/0x140
[...]
Jan 28 23:38:57 solaris kernel: Mem-Info:
Jan 28 23:38:57 solaris kernel: active_anon:7386 inactive_anon:165 isolated_anon:0#012 active_file:65552 inactive_file:1001287 isolated_file:64#012 unevictable:0 dirty:0 writeback:0 unstable:0#012 slab_reclaimable:189633 slab_unreclaimable:10445#012 mapped:15763 shmem:271 pagetables:593 bounce:0#012 free:61656 free_pcp:741 free_cma:0
Jan 28 23:38:57 solaris kernel: Node 0 active_anon:29544kB inactive_anon:660kB active_file:262208kB inactive_file:4005148kB unevictable:0kB isolated(anon):0kB isolated(file):256kB mapped:63052kB dirty:0kB writeback:0kB shmem:1084kB writeback_tmp:0kB unstable:0kB pages_scanned:11348367 all_unreclaimable? yes
Jan 28 23:38:57 solaris kernel: DMA free:3184kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15904kB mlocked:0kB slab_reclaimable:12420kB slab_unreclaimable:300kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jan 28 23:38:57 solaris kernel: lowmem_reserve[]: 0 780 5223 5223
Jan 28 23:38:57 solaris kernel: Normal free:3484kB min:3536kB low:4420kB high:5304kB active_anon:0kB inactive_anon:0kB active_file:1028kB inactive_file:0kB unevictable:0kB writepending:0kB present:892920kB managed:815220kB mlocked:0kB slab_reclaimable:746112kB slab_unreclaimable:41480kB kernel_stack:1392kB pagetables:0kB bounce:0kB free_pcp:1920kB local_pcp:392kB free_cma:0kB
Jan 28 23:38:57 solaris kernel: lowmem_reserve[]: 0 0 35543 35543
Jan 28 23:38:57 solaris kernel: HighMem free:239956kB min:512kB low:5544kB high:10576kB active_anon:29544kB inactive_anon:660kB active_file:261044kB inactive_file:4005096kB unevictable:0kB writepending:0kB present:4549576kB managed:4549576kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:2372kB bounce:0kB free_pcp:1044kB local_pcp:108kB free_cma:0kB
Jan 28 23:38:57 solaris kernel: lowmem_reserve[]: 0 0 0 0
Jan 28 23:38:57 solaris kernel: DMA: 12*4kB (UE) 8*8kB (UE) 4*16kB (UE) 10*32kB (U) 6*64kB (UM) 2*128kB (U) 2*256kB (UM) 1*512kB (M) 1*1024kB (M) 0*2048kB 0*4096kB = 3184kB
Jan 28 23:38:57 solaris kernel: Normal: 93*4kB (UMEH) 45*8kB (MEH) 102*16kB (UMEH) 25*32kB (UMH) 5*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3484kB
Jan 28 23:38:57 solaris kernel: HighMem: 273*4kB (UM) 340*8kB (UM) 175*16kB (UM) 66*32kB (UM) 41*64kB (UM) 8*128kB (UM) 9*256kB (UM) 14*512kB (U) 7*1024kB (U) 5*2048kB (U) 49*4096kB (UM) = 239956kB
Jan 28 23:38:58 solaris kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jan 28 23:38:58 solaris kernel: 1067131 total pagecache pages
Jan 28 23:38:58 solaris kernel: 0 pages in swap cache
Jan 28 23:38:58 solaris kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 28 23:38:58 solaris kernel: Free swap = 8388604kB
Jan 28 23:38:58 solaris kernel: Total swap = 8388604kB
Jan 28 23:38:58 solaris kernel: 1364619 pages RAM
Jan 28 23:38:58 solaris kernel: 1137394 pages HighMem/MovableOnly
Jan 28 23:38:58 solaris kernel: 19444 pages reserved
Jan 28 23:38:58 solaris kernel: 0 pages hwpoisoned
Jan 28 23:38:58 solaris kernel: [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Jan 28 23:38:58 solaris kernel: [ 972] 0 972 10059 3770 14 3 0 0 systemd-journal
Jan 28 23:38:58 solaris kernel: [ 996] 0 996 4039 545 8 3 0 0 lvmetad
Jan 28 23:38:58 solaris kernel: [ 1009] 0 1009 3640 1176 8 3 0 -1000 systemd-udevd
Jan 28 23:38:58 solaris kernel: [ 1088] 0 1088 3492 752 6 3 0 -1000 auditd
Jan 28 23:38:58 solaris kernel: [ 1109] 0 1109 1470 950 6 3 0 0 smartd
[...]
Jan 28 23:38:59 solaris kernel: Out of memory: Kill process 1116 (rsyslogd) score 2 or sacrifice child
Jan 28 23:38:59 solaris kernel: Killed process 1116 (rsyslogd) total-vm:754420kB, anon-rss:572kB, file-rss:29460kB, shmem-rss:0kB
Jan 28 23:38:59 solaris kernel: systemd-journal invoked oom-killer: gfp_mask=0x2420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), nodemask=0, order=0, oom_score_adj=0
[...]

It is quite frustrating... I had to reinstall an old F23 kernel (4.4.7-300.fc23.i686+PAE) and boot from it to have my backups working as expected :-(

Very weird. There must be some other mitigating factor on my boxes that allows them not to oom with a similar cmdline. You also seem to be able to make it oom on demand, and I have yet to do that... I have to wait overnight for some regular jobs to hit before it ooms for me. What exactly are you running (exact command/config)? It would be so much easier for me to debug this problem if I could reproduce it on demand in minutes, not days. Also, when yours ooms, does it oom and recover, or oom a bunch of times until the system hangs? I get actual hangs/freezes about 75% of the time. Sometimes I can reboot before it hangs completely. On rare occasions it just semi-recovers on its own.

I *may* finally have a good 4.9 or 4.10 kernel that I'm testing right now; I can send you the link / build instructions if you want to git and compile it yourself. I'll know for sure after tonight whether we finally have patches that solve this problem. It would actually be very helpful if those of you who have the problem "worse" than me could git/build and confirm the fix solves the bug on your boxes too.

4.7.10 (F23) is the last Fedora kernel that doesn't have this bug, AFAIK. That's the one I boot when I need it stable; it never exhibits this bug. 4.8.8 was the first one I used that did (I never tried the Fedora-ized 4.8.0-4.8.7, but from my reading/bisecting I'm pretty sure the problem started at 4.8.0). Of course 4.4.7 would be bug-free too, if that's the only one you can get your hands on.

This is the bash script that (usually) causes the oom:

---------------------------------------------------------------
#!/bin/bash
hostname=`hostname`
echo "$hostname: $0"
mp=/mnt/cantor_home_backup
mount -o nfsvers=3 192.168.1.196:/exports/home ${mp}
if [ $? -eq 0 ] ; then
    rsync -aqol ${mp}/matem/ /home/backup/home-backup/matem
    rsync -aqol ${mp}/fisica/ /home/backup/home-backup/fisica
    rsync -aqol ${mp}/tesi/ /home/backup/home-backup/tesi
    rsync -aqol ${mp}/staff/ /home/backup/home-backup/staff
    rsync -aqol ${mp}/pc/ /home/backup/home-backup/pc
    rsync -aqol ${mp}/informatica/ /home/backup/home-backup/informatica
    rsync -aqol ${mp}/scienzeamb/ /home/backup/home-backup/scienzeamb
    rsync -aqol ${mp}/profiles/ /home/backup/home-backup/profiles
    umount ${mp}
else
    echo "Unable to perform home backup"
fi
----------------------------------------------------------------

So it involves an NFS mount (version 3, because version 4 has its own problems) and an rsync. The source folders are large home directories:

# du -sxh *
13G   fisica
405M  informatica
151G  matem
177G  pc
35M   profiles
148K  scienzeamb
8.2G  staff
4.0K  temp
65G   tesi

Caching of NFS-filesystem data in RAM might be a trigger for the problem...
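If the page cache really is the trigger, one cheap way to watch it happen during a backup run is to poll /proc/meminfo. LowFree/LowTotal only exist on 32-bit highmem kernels, and they show the lowmem (Normal zone) headroom that the OOM dumps above show being exhausted; this is just a diagnostic sketch, not part of anyone's posted procedure:

# in one terminal, start the backup script; in another:
watch -n 5 'grep -E "MemFree|Cached|LowTotal|LowFree" /proc/meminfo'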
The LKML guys, specifically Michal Hocko, seem to have produced a working fix! Their latest attempt has survived 3 nights on my box without any cgroup/mem boot options, and for me that means the bug is solved. However, since others here said that the cgroup/mem boot options didn't help them at all (when they did for me), it might be prudent for another tester to try the fix and report back whether it really solves this bug, or just mitigates it on my box for whatever reason.

You'll need to compile the kernel from a git tree. It's not too hard:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git MHocko
cd MHocko
git checkout since-4.9
# Get a Fedora-ized config file; you can use mine (I found it really tricky
# to get the right config for PAE on a 64-bit-capable box):
#   https://tecnopolis.ca/LinuxTest/dotconfig
# Save it in the MHocko directory as .config. You shouldn't have to answer
# more than a few config questions, or something isn't working right.
# Compile everything:
nice make bzImage && nice make modules
# Install:
nice make modules_install && nice make install

If you get errors during the compile, you need to dnf install some more dev tools / libraries. Try:

dnf builddep --downloadonly kernel

See what it wants to do, then take out the --downloadonly to actually install. I think sometimes you need a couple more things; the errors will guide you. Warning: the build can take 2+ hours! Change your grub.cfg to boot from the newly installed kernel and reboot. Remove the cgroup/memory boot options. Hit it with your oom-causing workloads as much as you can. Report back here! Since this fixes it for me, I'm trying to get the patches into 4.9 stable so F24 can pick it up. Good luck everyone!

I tried out the MHocko fixed kernel... Everything seems fine :-) My usual rsync backup completed with no problems, and I am now forcing an rsync onto an empty directory just to stress the system more. "top" looks normal. At the moment I see:

KiB Mem : 16542660 total, 7949088 free, 108540 used, 8485032 buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 15797028 avail Mem

Let's hope now that this patch gets into the next kernel-PAE update! Thank you, Trevor (and the kernel guys), for your *enormous* effort on this problem. I will add a new entry here as soon as the full rsync has completed.

The rsync *did* complete without problems... This is a quite *strong* indication that the proposed patch solves the problem!

The LKML guys have solved this problem and I have tested their newest/final(?) fix extensively. Michal reports: "I will send this patch to 4.9+ stable as soon as it hits Linus tree." That means we should be able to get this into F24 (please please!) and F25 soon-ish, once it comes down from upstream. If anyone wants the details, or wants to compile one of the test kernels for early testing, google this:

"mm, vmscan: commit makes PAE kernel crash nightly" site:lkml.org

Great news! I also tested the proposed patch on my kernel-PAE box with F25. It used to OOM every night during a backup rsync (involving more than 200GB of data); now, with the 'fixed' 4.9 kernel, it has been running smoothly for six days. I am looking forward to an updated F25 kernel-PAE... since the same problem also hits other computers in our department at random, and I cannot fix all of them by hand.

On F24 4.9.7-101 i686 PAE I am unable to trigger the OOM bug. My thanks as well.

Mark, I'm not sure whether 4.9.7-101 has the fix for our bug in it or not. I'm guessing it doesn't? So be careful, as it may still blow up on you. Maybe a Fedora kernel maintainer can comment. It seems to be quite difficult, using the web interfaces, to find out which patches are in a specific errata release, unless a bz is specifically linked to the release.
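From the upstream side you can at least answer "is commit X in tag Y" against a kernel git tree. The sketch below uses the bisected bad commit from earlier, since the fix's hash isn't posted in this thread; substitute it once known. (Fedora carries extra patches on top of upstream, so this is only a first approximation for errata kernels.)

# list the release tags that already contain a commit:
git tag --contains b2e18757f2c9d1cdd746a882e9878852fdec9501
# or ask about one specific tag:
git merge-base --is-ancestor b2e18757f2c9d1cdd746a882e9878852fdec9501 v4.9 \
    && echo "commit is in v4.9"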
Mmh, I will shortly check kernel-PAE-core-4.9.8-201.fc25.i686.rpm on our box. The last one we checked was kernel-PAE-core-4.9.5-200.fc25.i686 (it crashed with the OOM bug).

Nope. kernel-PAE-core-4.9.8-201.fc25.i686 just crashed with the usual OOM one minute ago (it took just a few minutes of an rsync for kswapd0 CPU usage to jump up, and then I lost control). After a reboot with a remote "racadm serveraction powercycle" from the management interface, I can verify that it OOMed. For the record:

Feb 11 19:19:39 boot into kernel-PAE-core-4.9.8-201.fc25.i686
Feb 11 19:24:55 systemd-journal invoked oom-killer...

In /var/log/messages I also find a strange set of lines:

Feb 11 19:19:39 solaris kernel: ------------[ cut here ]------------
Feb 11 19:19:39 solaris kernel: WARNING: CPU: 0 PID: 0 at arch/x86/kernel/apic/apic.c:2065 __generic_processor_info+0x28c/0x360
Feb 11 19:19:39 solaris kernel: Only 31 processors supported.Processor 32/0x60 and the rest are ignored.
Feb 11 19:19:39 solaris kernel: Modules linked in:
Feb 11 19:19:39 solaris kernel: CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.8-201.fc25.i686+PAE #1
Feb 11 19:19:39 solaris kernel: Hardware name: Dell Inc. PowerEdge R320/0KM5PX, BIOS 2.4.2 01/29/2015
Feb 11 19:19:39 solaris kernel: ce25be30 cdb6c8c0 ce25be74 ce161070 ce25be60 cd86c95a ce1591c0 ce25be94
Feb 11 19:19:39 solaris kernel: 00000000 ce161070 00000811 cd8498fc 00000811 00000060 00000015 00000020
Feb 11 19:19:39 solaris kernel: ce25be80 cd86c9c6 00000009 00000000 ce25be74 ce1591c0 ce25be94 e86116e0
Feb 11 19:19:39 solaris kernel: Call Trace:
Feb 11 19:19:39 solaris kernel: [<cdb6c8c0>] dump_stack+0x58/0x78
[ --- 15 similar lines removed --- ]
Feb 11 19:19:39 solaris kernel: ---[ end trace ea3830f176c360f7 ]---

But this machine has only 4 processors (from /proc/cpuinfo).

I just tried the new kernel 4.9.10-200.fc25.i686+PAE, but the problem is still there; rsync produces an oom:

# grep oom-killer /var/log/messages
Feb 21 14:53:02 solaris kernel: rsync invoked oom-killer: gfp_mask=0x2420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), nodemask=0, order=0, oom_score_adj=0
Feb 21 14:54:30 solaris kernel: rsync invoked oom-killer: gfp_mask=0x24000c0(GFP_KERNEL), nodemask=0, order=0, oom_score_adj=0
Feb 21 14:55:50 solaris kernel: jbd2/sda4-8 invoked oom-killer: gfp_mask=0x2420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), nodemask=0, order=0, oom_score_adj=0

:-(

I'm pretty sure the commit that fixes this bug is NOT in any vanilla release yet. There was a notice to LKML from a bot that the commit causes an 11% hit on FS performance, so maybe it's all delayed until that is resolved. Until then, we must use the workarounds or our own custom-built kernels.

*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through, and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-200.fc25. Please test this kernel update (or a newer one) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26 and are still experiencing this issue, please change the version to Fedora 26. If you experience different issues, please open a new bug report for those.
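For anyone retesting per the update above, a quick sketch of pulling the rebased kernel and confirming what's running (the package glob follows Fedora's PAE naming; reboot before checking):

sudo dnf upgrade 'kernel-PAE*'
sudo reboot
uname -r   # expect 4.10.9-200.fc25.i686+PAE or newer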
Unfortunately, I am no longer able to reproduce the problem. A few weeks ago we moved our file server to new hardware and reinstalled the OS. That file server is the NFS server which the box experiencing the OOM problem mounts as a client during the rsync. Since that upgrade we have never experienced the OOM (let me stress it again: we did not upgrade the box experiencing the OOM; we upgraded the box that is backed up, over an NFS mount, by the machine that experienced the OOM). Since the move, the problem has not occurred with kernel 4.9.10-200.fc25.i686+PAE (the one from Comment 29 above), nor with the current 4.10.6-200.fc25.i686+PAE kernel. I will soon test the newer version, but I have a strong feeling that we shall not see the problem again. It seems it was due to some strange combination of ingredients that included the specific NFS version on the client box, or perhaps some particular timing, or whatever.

After upgrading from 4.9.5 to 4.10.8 I did not see any OOM anymore.

Excellent. Closing this as fixed; if others find that it is not, feel free to reopen.
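For future readers wanting to re-verify on a PAE box before reopening: the simplest trigger reported in this thread was reading lots of files, e.g. the find/dd one-liner from an earlier comment. This sketch adds -xdev (an assumption, to stay on one filesystem) and watches the kernel log for OOM events while it runs:

# churn the page cache by reading every file once:
find / -xdev -type f -exec dd of=/dev/null if='{}' \; 2>/dev/null &
# in another terminal, follow the kernel log for the killer:
dmesg -w | grep -i oom-killer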