Bug 1373339 - Since kernel 4.7.2 oom-killer continually kills firefox and qemu-system-x86 [NEEDINFO]
Summary: Since kernel 4.7.2 oom-killer continually kills firefox and qemu-system-x86
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 24
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-06 04:03 UTC by Louis van Dyk
Modified: 2017-04-28 17:22 UTC
CC List: 19 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2017-04-28 17:22:06 UTC
Type: Bug
Embargoed:
jforbes: needinfo?


Attachments
Extract of /var/log/messages (146.55 KB, text/plain)
2016-09-06 04:03 UTC, Louis van Dyk
no flags Details
/var/log/messages extract of OOM Killer in action (75.33 KB, text/plain)
2016-11-07 04:45 UTC, redhatbugzilla
no flags Details
Extract of /var/log/messages (same as pastebin) (81.33 KB, text/plain)
2016-11-07 08:28 UTC, Jasper Siero
no flags Details
Extract of /var/log/messages 20161117 with another oom-killer but without containers running on the system (17.36 KB, text/plain)
2016-11-18 12:05 UTC, Jasper Siero
no flags Details
Oom-killer killed named again. Extract of the logs (17.35 KB, text/plain)
2016-11-29 07:58 UTC, Jasper Siero
no flags Details
console log from oom (340.14 KB, text/plain)
2016-12-11 23:54 UTC, Darren Tucker
no flags Details

Description Louis van Dyk 2016-09-06 04:03:21 UTC
Created attachment 1198061 [details]
Extract of /var/log/messages

Description of problem:
I can no longer work! Since updating to 4.7.2, the oom-killer has repeatedly been killing firefox and my KVM virtual machine. I am also running evolution and copying some files to an NTFS external USB drive.

Version-Release number of selected component (if applicable):
kernel-4.7.2-201.fc24.x86_64
kernel-core-4.7.2-201.fc24.x86_64
kernel-devel-4.7.2-201.fc24.x86_64
kernel-headers-4.7.2-201.fc24.x86_64
kernel-modules-4.7.2-201.fc24.x86_64
kernel-modules-extra-4.7.2-201.fc24.x86_64


How reproducible:
It has killed my VM three times in the past 2 hours, and firefox once.

Steps to Reproduce:
1. Boot. Start evolution, firefox, and a KVM guest; copy files with Nemo to an external HDD.
2. Work normally.
3. Out of the blue, the KVM guest is killed.

Actual results:


Expected results:
This did not happen in kernel 4.6.  I expect a stable experience.

Additional info:

Comment 1 Jasper Siero 2016-11-01 11:29:45 UTC
We have also had this problem since updating the kernel on Fedora 23 from 4.6.4-201 to 4.7.3-100. There is plenty of memory free (14 GB), but the oom-killer kills our named daemon, and we have seen it happen with other processes as well.

When we went back to the old 4.6.4 kernel the problem did not occur, so it looks like the problem started with the 4.7 kernels.

We have now upgraded to Fedora 24 with the latest 4.7.9-200 kernel and the problem is back again, so it is not solved yet.


This thread from Linus Torvalds is also about the oom-killer not behaving correctly since kernel 4.7:

http://www.spinics.net/lists/linux-mm/msg113661.html

This problem is not specific to Firefox and qemu-system-x86, so maybe the title should be changed.

Comment 2 Jasper Siero 2016-11-01 11:44:35 UTC
Some additional logs of oom-killer:
http://pastebin.com/avbh4UH4

Comment 3 redhatbugzilla 2016-11-07 04:44:46 UTC
I have had this problem on multiple 4.7 kernels and still have it on 4.8.4-200.fc24.x86_64.

For me it normally kills virtualbox virtual machines.

I have a 16GB machine; typically when the OOM killer kills VirtualBox, there is around 95% of swap free, about 10GB of RAM used by buff/cache, and about 6GB of RAM in use in total (as reported by top).

I've attached a /var/log/messages extract as well.
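For anyone wanting to pull the same kind of extract, a minimal sketch (assuming a systemd journal and rsyslog writing to /var/log/messages; the grep pattern and context line counts are arbitrary choices, not from this report):

# Kernel OOM messages for the current boot
journalctl -k -b | grep -iE 'invoked oom-killer|Out of memory|oom_reaper'

# Or from the syslog file, with some context around each kill
grep -i -B 2 -A 20 'invoked oom-killer' /var/log/messages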

Comment 4 redhatbugzilla 2016-11-07 04:45:43 UTC
Created attachment 1217855 [details]
/var/log/messages extract of OOM Killer in action

Comment 5 redhatbugzilla 2016-11-07 04:47:12 UTC
Comment on attachment 1217855 [details]
/var/log/messages extract of OOM Killer in action

Total system RAM 16GB, Total Swap 10GB. At the time of the kill, approx RAM in use 6GB, RAM used by buff/cache 10GB, Swap free >9GB

Comment 6 Jasper Siero 2016-11-07 08:28:51 UTC
Created attachment 1217919 [details]
Extract of /var/log/messages (same as pastebin)

Comment 7 redhatbugzilla 2016-11-10 04:38:42 UTC
Of interest: I have also noticed that on an almost regular basis (something like every 15-30 minutes) I get a huge surge in the number of kworker processes. My system normally sits at around 100 kworker processes, but the regular surge takes it to around 1200. This does not seem to change the amount of RAM allocated, free, or used for buffers in any large way.

For that "problem", I found this link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1626564

That report talks about SLUB vs. SLAB, which, being related to memory allocation, got me wondering about this bug, which also involves memory allocation.

As of writing this comment I am running kernel 4.8.6-201.fc24.x86_64, which is configured to use SLUB, so the above "bug" is not directly relevant; it just seems curious that I am experiencing similar symptoms (a large number of kworkers) alongside this bug's memory allocation problems.

grep -iE 'sl[aou]b' /boot/config-$(uname -r)
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLAB_FREELIST_RANDOM=y
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLABINFO=y
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
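To quantify the kworker surge mentioned above, a rough sketch (the one-minute interval and log path are arbitrary choices):

# Count kworker threads once a minute and append to a log
while true; do
    echo "$(date '+%F %T') $(ps --no-headers -e -o comm | grep -c '^kworker')"
    sleep 60
done >> /tmp/kworker-count.log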

Comment 8 Jasper Siero 2016-11-16 15:55:08 UTC
Now testing again with the newest 4.7.9-200.fc24.x86_64 kernel, but without the containers that were previously running on this system. The oom-killer problem has not occurred yet (only a day of testing so far), but I think this is interesting to share.

Comment 9 redhatbugzilla 2016-11-16 20:34:58 UTC
I have used kernel 4.7.9-200.fc24.x86_64 and the problem manifested itself.
I am currently using 4.8.6-201.fc24.x86_64 and the problem still happens.

I used to hit this problem anywhere from several times a day to at least once a week.

I may have found a workaround, described below:

For 9 days now I have been flushing the buffer/cache every 2 hours using "sync && echo 1 > /proc/sys/vm/drop_caches", and the problem has not happened again. On my system a 2-hour interval means the buffer cache never uses all available memory, so I assume allocations always succeed and the OOM killer is never needed. I used to have around 10GB in buffer/cache; now it only gets to around 6GB, always leaving about 4GB of completely free memory.
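A minimal sketch of this workaround as a root cron entry (the file name is hypothetical, the 2-hour interval matches the description above, and dropping caches only masks the underlying problem):

# /etc/cron.d/drop-caches (hypothetical): flush clean page cache every 2 hours
0 */2 * * * root sync && echo 1 > /proc/sys/vm/drop_caches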

Comment 10 customercare 2016-11-17 10:34:30 UTC
Kernel: 4.7.10-100.fc23.i686+PAE
Ram: 10 GB
Utilization of Ram: 1.5 GB 

OOM Killer randomly kills processes out of the blue. 

Nov 14 00:20:53 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 00:50:42 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:06:40 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:26:56 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:16 s36 kernel: proftpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:23 s36 kernel: proftpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:31 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:38 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:41 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:45 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:47 s36 kernel: xe-update-guest invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:27:52 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:28:02 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:28:20 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:28:25 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:30:02 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:30:22 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:32:36 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:32:43 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:33:06 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:42:19 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:42:54 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 02:43:18 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 03:33:31 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:10 s36 kernel: /usr/sbin/munin invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:14 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:20 s36 kernel: kworker/u8:2 invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:25 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:28 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:36 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:49 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:15:53 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:16:41 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 06:16:48 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 07:00:02 s36 kernel: bash invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 07:42:08 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 08:54:06 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 09:00:00 s36 kernel: dovecot invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 14 09:03:47 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 02:31:49 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 10:48:01 s36 kernel: java invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 10:51:13 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 14:26:50 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 15:05:40 s36 kernel: systemd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 16:07:32 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 17:02:00 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 17:46:42 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 20:28:50 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 22:09:07 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 22:35:00 s36 kernel: dovecot invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 23:27:27 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 23:50:03 s36 kernel: bash invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 23:50:41 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 16 23:50:52 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 00:20:28 s36 kernel: java invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 02:23:51 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 02:23:58 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 02:23:58 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 02:26:24 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 06:17:06 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 06:17:08 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 06:17:27 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 06:17:34 s36 kernel: proftpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 06:18:29 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 06:19:01 s36 kernel: exim invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 09:15:09 s36 kernel: httpd invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 09:59:52 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0
Nov 17 10:29:24 s36 kernel: mysqld invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0

This needs to be fixed ASAP.

Comment 11 Jasper Siero 2016-11-18 12:05:19 UTC
Created attachment 1221869 [details]
Extract of /var/log/messages 20161117 with another oom-killer but without containers running on the system

Comment 12 Jasper Siero 2016-11-29 07:58:21 UTC
Created attachment 1225684 [details]
Oom-killer killed named again. Extract of the logs

Comment 13 customercare 2016-11-29 09:01:21 UTC
(In reply to Jasper Siero from comment #12)
> Created attachment 1225684 [details]
> Oom-killer killed named again. Extract of the logs

The OOM killer does not target daemons by name; it selects them based on the following conditions:

1. the process is relatively new in the process list
2. it has gained a lot of memory in this "short" time

So it will kill ANY process that meets this logic.
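To see how the kernel currently ranks a particular process, its oom_score and oom_score_adj can be read from /proc; a rough sketch (named is just an example target):

PID=$(pidof named)              # any PID will do
cat /proc/$PID/oom_score        # current badness score; higher means more likely victim
cat /proc/$PID/oom_score_adj    # adjustment, range -1000..1000

# Workaround, not a fix: make a critical daemon much less likely to be chosen
echo -500 > /proc/$PID/oom_score_adj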

Comment 14 Jasper Siero 2016-11-29 09:18:32 UTC
Thanks, I understand. Named is usually the process that gets killed, but you are right that this is because of the rules you mentioned. The numbers and statistics showing why the oom-killer did this can be found in the logs.
The oom-killer should not be killing anything on this machine, because there is enough memory available.

Comment 15 Rolf Fokkens 2016-12-03 17:22:24 UTC
https://www.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.8.12:

commit 7838fbe25a95ce2cd6e8ae27a76d369365da89d4
Author: Michal Hocko <mhocko>
Date:   Tue Nov 29 17:25:15 2016 +0100

    mm, oom: stop pre-mature high-order OOM killer invocations
    
    31e49bfda184 ("mm, oom: protect !costly allocations some more for
    !CONFIG_COMPACTION") was an attempt to reduce chances of pre-mature OOM
    killer invocation for high order requests. It seemed to work for most
    users just fine but it is far from bullet proof and obviously not
    sufficient for Marc who has reported pre-mature OOM killer invocations
    with 4.8 based kernels. 4.9 will all the compaction improvements seems
    to be behaving much better but that would be too intrusive to backport
    to 4.8 stable kernels. Instead this patch simply never declares OOM for
    !costly high order requests. We rely on order-0 requests to do that in
    case we are really out of memory. Order-0 requests are much more common
    and so a risk of a livelock without any way forward is highly unlikely.

Comment 16 Jeff Buhrt 2016-12-05 14:51:09 UTC
I am fighting the same issue after upgrading a 32-bit PAE KVM guest to F25. It appears to be the same old problem that (many) others are having when still using a PAE kernel.
4.8.8-100.fc23 is where the problem started for me (versus some earlier F23 kernel).

An older ticket from when the problem seems to have started:
https://bugzilla.redhat.com/show_bug.cgi?id=1075185

For my testing I have munin logging the guest, and I am now at only 3GB of the 10GB of (guest-assigned) memory, no swap used... and rsync, Tomcat, and anything else that moves now gets nailed. The journalctl logs from the guest look reasonable as well (no actual low memory). Log info is available if anyone wants it.
I am testing:
echo 1 > /proc/sys/vm/overcommit_memory

All my 64bit F25 hosts and guests are fine. Only the 32bit PAE guest is getting OOM killed.

Comment 17 Jasper Siero 2016-12-05 15:20:11 UTC
I don't think it's the same bug/problem you mentioned, because we are not using a PAE kernel and have been running 64-bit since the original installation; Fedora is running on a physical machine (not a VM). The problem started with the new 4.7 kernel (the 4.6 kernel runs without problems).

Comment 18 customercare 2016-12-05 16:16:29 UTC
The kernel devs introduced a new OOM algorithm in 4.7, which will be replaced in 4.9 with something more aggressive (back to the old behavior).

You could try out a 4.9-rc kernel.

Comment 19 Jeff Buhrt 2016-12-08 22:13:57 UTC
(In reply to Jasper Siero from comment #17)
> I don't think it's the same bug/problem you mentioned because we are not
> using a pae kernel and running 64 bit since the original installation,
> Fedora is running on a physical machine (not a vm). The problem started with
> the new 4.7 kernel (4.6 kernel runs without problems).

Jasper, it would be interesting to know whether the newer OOM killer is at fault or possibly PAE. Sorry, I didn't notice the x86_64 in the kernel list; it wasn't in the header. Given that the OOM killer is now dormant for me, I am guessing it was the OOM changes rather than just PAE. It is ironic, though, that none of my x86_64 hosts or guests experienced the OOM issue.

I would suggest that anyone having the problem try the following after rebooting:
# cat /proc/sys/vm/overcommit_memory
assuming it is 0...
# echo 1 > /proc/sys/vm/overcommit_memory

For me the VM that had the problem is now stable. After updating the system I couldn't go above 4GB of the 10GB of assigned memory before the OOM killer started randomly killing processes. Now the VM is back to its old stable self after changing overcommit_memory.

Hopefully the 4.9 kernel helps long term.
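If the overcommit_memory setting helps, it can be made persistent across reboots via sysctl; a sketch (the drop-in file name is arbitrary):

# /etc/sysctl.d/90-overcommit.conf (hypothetical name)
vm.overcommit_memory = 1

# apply all sysctl configuration now, without rebooting
sysctl --system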

Comment 20 Darren Tucker 2016-12-09 01:42:35 UTC
I downgraded from the PAE kernel to the vanilla i686 kernel and have observed at least one OOM kill with that kernel (kernel-4.8.11-200.fc24.i686), although it did resolve the other problem I was having with the PAE kernel (very slow disk writes: 1MB/s to an SSD compared to ~70MB/s with the i686 kernel on the same hardware).

I've tried setting vm.overcommit_memory and taken out my cron job for drop_caches and I'll see if that helps.

Comment 21 Darren Tucker 2016-12-11 23:54:49 UTC
Created attachment 1230675 [details]
console log from oom

Same problem with kernel-4.8.12-200.fc24.i686. I ran a "free; sleep 60" loop on the console and rsynced all the file systems to another machine (the regular backup). After about 10 minutes the OOM killer was invoked:

[ 1291.074643] rsync invoked oom-killer: gfp_mask=0x2420848(GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE), order=0, oom_score_adj=0
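For reference, the monitoring loop mentioned above is simply something along these lines (a sketch; the 60-second interval is from the comment):

while true; do date; free -m; sleep 60; done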

Comment 22 Trevor Cordes 2016-12-19 06:31:12 UTC
Jeff/Darren & other PAE people, please see my post here for a workaround re: slow I/O:
https://muug.ca/pipermail/roundtable/2016-June/004669.html

I'm pretty sure your bugs are unrelated to (or at least significantly different from) Louis' (the original poster's) bug, so you should start a new bug if you haven't already. I personally have stopped filing PAE bugs because no kernel dev cares anymore, especially if you are using >4GB RAM. If you can get them to care, great! Otherwise I'm looking for ways to do remote, headless updates from 32-bit PAE to 64-bit to get off PAE completely.

Comment 23 Jeff Buhrt 2016-12-20 21:03:43 UTC
Two more updates on the problem. Maybe this will help someone until the OOM killer/kernel is fixed.

1. (Same 10GB KVM guest system as above) running 4.8.10-300.fc25.i686+PAE with cat /proc/sys/vm/overcommit_memory = 1. Still OOM kills. The best (ugly) hack so far is:
cat /etc/crontab
* 0,2,3,4,5,6,7,9,11,13,16,18,20 * * * root sync && echo 1 > /proc/sys/vm/drop_caches

[Clear the disk cache during the times it is expected to be expanding cache use. Disk read I/O is now a lot higher, but the last OOM kill was 12/9, before adding the cron entry above. 11 days without a workplace OOM kill!]

2. A second physical machine running 4.8.8-300.fc25.i686 (non-PAE) as a 4GB rsync server is now getting OOM kills. I updated it to the newest release, 4.8.13-300.fc25.i686, to see if it keeps killing... (no overcommit_memory changes made).

[Given that an in-place upgrade from i386 to x86_64 doesn't exist, and that this is a big filesystem, it will be a while before I blow it away just to run a 64-bit kernel.]


Summary: I agree it may not be a PAE problem but an i386 issue that this and other tickets are referencing in general. I have not seen any OOM kills on any x86_64 machines (physical or virtual guest).

Comment 24 Darren Tucker 2016-12-21 04:04:58 UTC
(In reply to Trevor Cordes from comment #22)
> Jeff/Darren & other PAE people, please see my post here for a workaround re:
> slow I/O:
> https://muug.ca/pipermail/roundtable/2016-June/004669.html

Wow, that makes 2 orders of magnitude difference!  The machine has 8G, and since I could try this without rebooting:

# uname -r
4.10.0-0.rc0.git4.1.vanilla.knurd.1.fc24.i686+PAE

# cat /proc/sys/vm/highmem_is_dirtyable
0
# dd if=/dev/zero of=/zero bs=1M count=8
8388608 bytes (8.4 MB, 8.0 MiB) copied, 3.41237 s, 2.5 MB/s

# echo 1 >/proc/sys/vm/highmem_is_dirtyable
# dd if=/dev/zero of=/zero bs=1M count=8
8388608 bytes (8.4 MB, 8.0 MiB) copied, 0.04042 s, 208 MB/s

> I'm pretty sure your bugs are unrelated (or at lease significantly different
> from) Louis' (original poster) bug

Actually I think they might be tangentially related: slow paging caused by poor IO performance might be driving up the memory pressure.

Anyway, thanks for the tip!

Comment 25 customercare 2016-12-21 08:53:19 UTC
> [Given an upgrade from i386 to x86_64 doesn't exist, but it being a big
> filesystem it will be a while before I blow it away just to run a 64bit
> kernel.]

Because of this situation, and because i686 will at some point be removed entirely, I played around a bit with how to more or less auto-upgrade to 64-bit.

The fastest way is to save your data (dump SQL to a file), power off the VM, get direct disk access, and move / to /old_system.

Then move a fresh 64-bit template onto the VM, move /old_system/etc/fstab back to /etc/, adjust the passwd, group, and shadow files, and start it up.

Took me 5 minutes. Restore /home, install your rpm packages, and reimport the SQL databases.

You could also use dnf to install 64-bit packages, change the boot entry to 64-bit, and work your way through a lot of binary data that needs to be removed, including the old rpm packages. People have done that. But I believe the "fresh" way is easier.

Comment 26 Jeff Buhrt 2016-12-21 19:43:06 UTC
1) Darren I tested your settings in a guest VM:
80*1MB file, /proc/sys/vm/highmem_is_dirtyable=0: 604 MB/s, 572 MB/s
80*1MB file, /proc/sys/vm/highmem_is_dirtyable=1: 855 MB/s, 875 MB/s
8000*1MB file, /proc/sys/vm/highmem_is_dirtyable=0: 33.8 MB/s, 89.3 MB/s
8000*1MB file, /proc/sys/vm/highmem_is_dirtyable=1: 117 MB/s, 120 MB/s

So for the smaller linear writes, roughly a 30% increase.

The test system: a KVM guest with 10GB of memory allocated, running 4.8.10-300.fc25.i686+PAE on top of a 4.8.12-300.fc25.x86_64 physical machine, with raw LVM guest partitions on top of MD-mirrored 3TB drives.

Using the following test (each pass alternates highmem_is_dirtyable between 0 and 1, then times an 80 x 1MB dd write plus the subsequent sync):
sync;sleep 5; sync

echo 0 >/proc/sys/vm/highmem_is_dirtyable
/bin/time dd if=/dev/zero of=/data/zero bs=1M count=80
time sync
rm -f /data/zero

sync;sleep 5; sync

echo 1 >/proc/sys/vm/highmem_is_dirtyable
/bin/time dd if=/dev/zero of=/data/zero bs=1M count=80
time sync
rm -f /data/zero
sync;sleep 5; sync

echo 0 >/proc/sys/vm/highmem_is_dirtyable
/bin/time dd if=/dev/zero of=/data/zero bs=1M count=80
time sync
rm -f /data/zero

sync;sleep 5; sync

echo 1 >/proc/sys/vm/highmem_is_dirtyable
/bin/time dd if=/dev/zero of=/data/zero bs=1M count=80
time sync
rm -f /data/zero



2) customercare: my problem with 32-bit guests is certifying the older 32-bit C code in a 64-bit world, which is not something I want to take on now.
- What looks like a show-stopper for me is that I can no longer use 'cp -al' on a USB drive plugged into a 32-bit PAE machine acting as an rsnapshot server. The painful part of not being able to 'upgrade' from 32-bit to 64-bit in place is the 13M+ inodes and ~1TB used. I am at the point of installing and running with 2 new (mirrored) OS drives and keeping the existing 2-drive array for the files. [Yuck. The problem, besides buying another 3TB drive (or two) or running 4 drives, is the time to do this versus someone deciding to drop i686 and/or PAE support and ship an overly aggressive OOM killer at the same time.]

Comment 27 Justin M. Forbes 2017-04-11 14:47:17 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.

Fedora 24 has now been rebased to 4.10.9-100.fc24. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 28 Justin M. Forbes 2017-04-28 17:22:06 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the 
relevant data from the latest kernel you are running and any data that might have been requested previously.

