915936 – Fedora 18 persistent oom-killer

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	18
Hardware:	i686
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-02-26 19:39 UTC by Jeff Hardy
Modified:	2013-10-19 01:06 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-10-19 01:06:12 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Full trace of oom-killer that began recent spiral (20.96 KB, text/plain) 2013-02-26 19:40 UTC, Jeff Hardy
List of all invocations of oom-killer in spiral (3.54 KB, text/plain) 2013-02-26 19:42 UTC, Jeff Hardy
List of all applications that have triggered oom-killer since 2013-02-06 (716 bytes, text/plain) 2013-02-26 19:43 UTC, Jeff Hardy
Full log from boot to OOM death (191.14 KB, text/plain) 2013-04-08 05:06 UTC, Jeff Hardy
Output of vmstat every 60s, system state last 40m (5.80 KB, text/plain) 2013-04-08 05:09 UTC, Jeff Hardy
Output of free, system state 50m before end (259 bytes, text/plain) 2013-04-08 05:15 UTC, Jeff Hardy
Output of ps, system state 50m before end (11.35 KB, text/plain) 2013-04-08 05:21 UTC, Jeff Hardy
lsmod, after a few hours of X (4.16 KB, text/plain) 2013-05-02 18:01 UTC, Jeff Hardy
slabinfo, after a few hours of X (10.68 KB, text/plain) 2013-05-02 18:02 UTC, Jeff Hardy
Stack trace on 3.9.9-201 (12.45 KB, text/plain) 2013-07-22 19:10 UTC, Jeff Hardy

Description Jeff Hardy 2013-02-26 19:39:12 UTC

Description of problem:
After approximately 4-8 hours of light desktop use in X, oom-killer is invoked in a spiral that ultimately leads to a black screen and an unusable desktop, even though plenty of RAM remains free.  A hardware reset is required.  The behavior is consistent across XFCE, LXDE, and KDE.


Version-Release number of selected component (if applicable):
kernel-3.7.6-201.fc18.i686
xorg-x11-server-Xorg-1.13.2-2.fc18.i686


How reproducible:
Always within 4-6 hours of login.


Steps to Reproduce:
1. Boot machine
2. Login to X
3. Load terminal, firefox, thunderbird, pidgin, wait 4-6 hours.
  
Actual results:
Desktop becomes unresponsive, screen goes blank, and keyboard input is ignored.  After a hard reset, the logs show oom-killer was invoked dozens of times.

Expected results:
Working desktop.

Additional info:
Fedup upgrade from a fully-working Fedora 17.

Comment 1 Jeff Hardy 2013-02-26 19:40:50 UTC

Created attachment 703094 [details]
Full trace of oom-killer that began recent spiral

Comment 2 Jeff Hardy 2013-02-26 19:42:35 UTC

Created attachment 703095 [details]
List of all invocations of oom-killer in spiral

Comment 3 Jeff Hardy 2013-02-26 19:43:46 UTC

Created attachment 703096 [details]
List of all applications that have triggered oom-killer since 2013-02-06

Comment 4 Jeff Hardy 2013-04-08 04:58:31 UTC

I have reproduced this by booting the system and leaving it unattended in runlevel 3.  The ensuing attachments reflect system state some 36 hours after boot with no real activity.  I tracked ps, free, and vmstat every 60 seconds, and grabbed the full system log since boot.  vmstat shows iowait climbing sharply approximately 51 minutes before the end.
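
A minimal sketch of the kind of periodic low-memory sampling described in this comment, assuming Python is available on the affected machine. It reads /proc/meminfo directly instead of running free; the LowTotal and LowFree fields are only reported by kernels built with highmem support (as on a 32-bit install like this one). The function names and the 60-second default are illustrative and are not taken from the reporter's actual setup.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def lowmem_summary(fields):
    """Return (LowFree, LowTotal) in kB, or None if the kernel does not report them."""
    if "LowTotal" not in fields or "LowFree" not in fields:
        return None
    return fields["LowFree"], fields["LowTotal"]


def sample(count, interval=60, path="/proc/meminfo"):
    """Print a timestamped low-memory reading `count` times, `interval` seconds apart."""
    for i in range(count):
        with open(path) as f:
            summary = lowmem_summary(parse_meminfo(f.read()))
        print(time.strftime("%Y-%m-%d %H:%M:%S"), summary, flush=True)
        if i + 1 < count:
            time.sleep(interval)
```

Calling sample(60) would log one reading per minute for an hour; redirecting the output to a file on disk keeps the last readings available after a hard reset.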

Comment 5 Jeff Hardy 2013-04-08 05:06:57 UTC

Created attachment 732547 [details]
Full log from boot to OOM death

Comment 6 Jeff Hardy 2013-04-08 05:09:21 UTC

Created attachment 732549 [details]
Output of vmstat every 60s, system state last 40m

Comment 7 Jeff Hardy 2013-04-08 05:15:34 UTC

Created attachment 732550 [details]
Output of free, system state 50m before end

Comment 8 Jeff Hardy 2013-04-08 05:21:33 UTC

Created attachment 732551 [details]
Output of ps, system state 50m before end

Comment 9 Dave Jones 2013-04-09 01:22:44 UTC

The problem in every one of those traces is that you ran out of DMA memory.
It doesn't matter that there's memory free, because that memory (highmem) isn't suitable for DMA.

Taking just one example:

Mar 17 06:19:26 fritzdesk kernel: [133178.494441] DMA free:3464kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15796kB managed:5816kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes

Mar 17 06:19:26 fritzdesk kernel: [133178.502596] Normal free:9336kB min:3720kB low:4648kB high:5580kB active_anon:0kB inactive_anon:0kB active_file:40kB inactive_file:76kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:881880kB managed:831704kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:7936kB slab_unreclaimable:36260kB kernel_stack:1136kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:16567 all_unreclaimable? yes

Here it seems that your workload used up all of the 'normal' zone memory, and once this was exhausted it fell back to using up the DMA zone memory too.
Once that reached critical levels, the oom killer stepped in. The normal zone had reclaimable slab caches, but reclaiming those wouldn't have helped in allocating a DMA page.

There's not really anything the kernel can do when this happens other than kill a process that's using up the memory.
If it didn't, the box would just hang indefinitely waiting for DMA-able memory to become free.

You might try disabling all the stuff you don't need (iscsi, ksmtuned etc), but at best that's probably just going to put things off before it inevitably happens again.

You *might* be able to make things run a little better by changing up some of the /proc/sys/vm/ sysctls, but the effort involved might not make it a worthwhile exercise.

Adding more RAM might cause some more things to move into highmem, freeing up lowmem too.

Comment 10 Jeff Hardy 2013-04-23 00:37:23 UTC

I am far from an expert in this area, but the odd thing is that this same hardware ran Fedora 17 without issue, and practically every other Fedora before that.  I attached the process list and so forth hoping that it might provide some clue, but as I indicated, even a workload of nothing more than boot to runlevel 3 will eventually trigger this, and that is something I have never experienced before on any box.  I will try some of your suggestions.  Unfortunately, my current less-than-acceptable mitigation is panic on oom, and reboot on panic.
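
For reference, the "panic on oom, reboot on panic" mitigation corresponds to two kernel sysctls, vm.panic_on_oom and kernel.panic. The sketch below only reads and interprets their current values; the exact values used on this machine are not stated in this report, and the helper names are hypothetical.

```python
import os


def read_sysctl(name, root="/proc/sys"):
    """Read a sysctl such as 'vm.panic_on_oom' from its file under /proc/sys."""
    path = os.path.join(root, *name.split("."))
    with open(path) as f:
        return f.read().strip()


def oom_reboot_policy(root="/proc/sys"):
    """Summarize whether an OOM condition panics the kernel and whether a panic reboots."""
    panic_on_oom = int(read_sysctl("vm.panic_on_oom", root))
    panic_timeout = int(read_sysctl("kernel.panic", root))
    return {
        # 0 lets the oom-killer pick a victim; nonzero values panic instead.
        "panics_on_oom": panic_on_oom != 0,
        # 0 means stay halted after a panic; any nonzero value triggers a reboot.
        "reboots_after_panic": panic_timeout != 0,
    }
```

With both entries true, an OOM condition turns into an automatic reboot rather than the oom-killer spiral described above.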

Comment 11 Rik van Riel 2013-04-23 00:51:20 UTC

ZONE_NORMAL is just around 830MB in available (managed) memory, of which:
- 9MB free
- 8MB reclaimable slab
- 36MB unreclaimable slab
- 1MB kernel stack
- active/inactive pages: a few kB

That means a total of about 54MB out of roughly 830MB has been accounted for. The DMA zone has been totally exhausted, too.
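
The accounting above can be reproduced mechanically from the Normal zone line quoted in Comment 9. The following is a sketch, not a kernel tool: the choice of which counters count as "accounted for" mirrors the list above and is an assumption, the helper names are made up for illustration, and the zone line is abridged to the fields used.

```python
import re

# Normal zone line from Comment 9, abridged to the counters used below.
NORMAL_ZONE = (
    "Normal free:9336kB min:3720kB low:4648kB high:5580kB "
    "active_anon:0kB inactive_anon:0kB active_file:40kB inactive_file:76kB "
    "present:881880kB managed:831704kB "
    "slab_reclaimable:7936kB slab_unreclaimable:36260kB "
    "kernel_stack:1136kB pagetables:0kB"
)

# Counters treated as "accounted for" (assumption: same categories as the list above).
ACCOUNTED = (
    "free", "active_anon", "inactive_anon", "active_file", "inactive_file",
    "slab_reclaimable", "slab_unreclaimable", "kernel_stack", "pagetables",
)


def parse_zone_line(line):
    """Extract every 'name:NNNkB' counter from a zone report line."""
    return {name: int(value) for name, value in re.findall(r"(\w+):(\d+)kB", line)}


def accounted_kb(zone):
    """Sum the counters that explain where the zone's memory went."""
    return sum(zone.get(name, 0) for name in ACCOUNTED)


zone = parse_zone_line(NORMAL_ZONE)
print(accounted_kb(zone))                    # 54784
print(zone["managed"] - accounted_kb(zone))  # 776920
```

By this count roughly 777,000 kB of the managed Normal zone does not show up in any of the listed counters, which is the gap that points at a kernel-side consumer.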

This means that either some kernel driver or X is eating up your memory. Can you check your kernel to see what is using up all the memory?

You may be able to check (and exclude?) the graphics subsystem by looking at the contents of this file: /sys/class/drm/ttm/memory_accounting/kernel/used_memory

Comment 12 Jeff Hardy 2013-05-02 17:50:34 UTC

After running X for a few hours, nothing too heavy:

$ cat /sys/class/drm/ttm/memory_accounting/kernel/used_memory
210

Comment 13 Jeff Hardy 2013-05-02 18:01:46 UTC

Created attachment 742832 [details]
lsmod, after a few hours of X

Comment 14 Jeff Hardy 2013-05-02 18:02:32 UTC

Created attachment 742833 [details]
slabinfo, after a few hours of X
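
One way to look for a kernel-side consumer in a slabinfo dump like this one is to rank caches by the memory their slabs occupy. The sketch below assumes the slabinfo version 2.1 column layout (pagesperslab as the sixth column, num_slabs as the second value after "slabdata") and a 4096-byte page size; the function names are illustrative.

```python
def parse_slabinfo(text, page_size=4096):
    """Map each slab cache name to the approximate bytes held by its slabs."""
    usage = {}
    for line in text.splitlines():
        # Skip the version banner and the column header.
        if line.startswith("slabinfo") or line.startswith("#"):
            continue
        head, sep, tail = line.partition(": slabdata")
        cols = head.split()
        data = tail.split()
        if not sep or len(cols) < 6 or len(data) < 2:
            continue
        pages_per_slab = int(cols[5])
        num_slabs = int(data[1])
        usage[cols[0]] = num_slabs * pages_per_slab * page_size
    return usage


def top_caches(usage, n=10):
    """Return the n largest caches as (name, bytes) pairs, biggest first."""
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:n]
```

Reading /proc/slabinfo usually requires root; running top_caches(parse_slabinfo(text)) on samples taken a few hours apart would show whether one cache keeps growing.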

Comment 15 Josh Boyer 2013-07-02 20:35:33 UTC

Are you still seeing this issue with an updated F18 and the 3.9.6 or newer kernels?

Comment 16 Jeff Hardy 2013-07-08 17:32:22 UTC

Still seeing the issue on 3.8.1-201.fc18.i686.  Will update to the latest and report back.

Comment 17 Jeff Hardy 2013-07-22 19:08:48 UTC

I have a different issue with system lockup on 3.9.9-201.fc18.i686, typically 15-45 minutes after boot.  "Lockup" consists of frozen X session, no keyboard input, and eventual monitor blank.  If at CLI, stack trace usually appears.  Attached trace usually makes it to the logs.  Updated to 3.9.10-200.fc18.i686 and will report back.

Comment 18 Jeff Hardy 2013-07-22 19:10:12 UTC

Created attachment 777022 [details]
Stack trace on 3.9.9-201

Comment 19 Jeff Hardy 2013-08-21 15:39:40 UTC

Running now on 3.10.7-100.fc18.i686 and that seems to have resolved the issue.  Uptime currently nearing two days, which seemed an impossible feat when I opened this issue:

$ uptime
 11:36:51 up 1 day, 21:45,  9 users,  load average: 0.25, 0.22, 0.17

Comment 20 Justin M. Forbes 2013-10-18 21:17:05 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 18 kernel bugs.

Fedora 18 has now been rebased to 3.11.4-101.fc18.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 19, and are still experiencing this issue, please change the version to Fedora 19.

If you experience different issues, please open a new bug report for those.
