Description of problem:
Changing nothing but the kernel, my system goes from 100% stable on 3.11.10-200.fc19 to having oom-killer fits after 1 to 24 hours of use on any 3.12 or 3.13 kernel, including the very latest 3.13.5-103.fc19. This is without any other change in workload or setup.

The symptom: oom-killer activates at random, while large amounts of RAM and swap are free, and keeps activating every few seconds until I can (sometimes) get to a console to issue a reboot, or the kernel eventually hangs, freezing the entire computer. When oom-killer hits, my GUI becomes extremely slow (a single click takes about a minute); only by switching to a console can I do anything, and the console remains quite responsive. If I can issue a reboot in time, it reboots normally.

Running top during the problem shows plenty of free RAM (8GB total, counting reclaimable buffer/cache) and 99% unused swap (8GB total). In many cases I've barely started much of anything compared to my normal workload when it happens.

I noticed that %wa was quite high, at 60%+. So I ran iotop, and this last time it showed something weird: xfce4-panel and sawfish were at 99% IO even though their read/write totals were negligible. I'm not sure whether the other instances were also high on sawfish/xfce, but I will check next time.

I have the oom-killer logs from the times this hit, if they are of any use to anyone; I will attach some samples. I also noticed that twice when the bug hit, a high-IO process was running in the background: once it was a find of the whole fs, and once I was doing another find-type crawl of a large fs. I don't know whether the other hits also coincided with high IO; I will watch for it next time.

I tried to get an strace on sawfish/xfce; strace attached, but then it froze and no output was ever generated.

Others seeing this bug: please run iotop (make sure it's installed beforehand), see if anything is chewing up high IO, and note whether the R/W of that same process is strangely low.
Check top to see what your %wa looks like, and see who the top CPU users are. If anyone has other ideas of what to check, let me know and I can get it for you pretty quickly. I thought this bug was the same as bug #1064076, but it was not.

Version-Release number of selected component (if applicable):
The following kernels were tested and show the bug:
3.12.9-201.fc19
3.13.5-103.fc19
Last known good kernel: 3.11.10-200.fc19

How reproducible:
Always, but occurring at a random point between 1 and 24 hours after booting into a "bad" kernel.

Steps to Reproduce:
1. Boot into a 3.12 or 3.13 kernel
2. Run normal workloads
3. Wait for oom-killer problems

Actual results:
oom-killer starts running constantly until the system hangs or I can reboot.

Expected results:
As with the 3.11 kernel, the system should run fine and never invoke oom-killer for no apparent reason.

Additional info:
8GB RAM, 8GB swap
Graphical workstation-type load, plus some server activities.
"Weird" things I do: xfce + sawfish wm
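For anyone following along: the %wa figure top reports can also be sampled directly from /proc/stat, where field 6 of the aggregate "cpu" line is cumulative iowait time in ticks. A minimal sketch (my own, not part of the original report's tooling):

```shell
#!/bin/sh
# Read cumulative iowait ticks from /proc/stat (field 6 of the "cpu " line,
# per proc(5)), sample twice, and report the delta over one second.
iowait() { awk '/^cpu /{print $6}' /proc/stat; }

a=$(iowait)
sleep 1
b=$(iowait)
echo "iowait ticks in the last second: $((b - a))"
```

A sustained large delta here corresponds to the 60%+ %wa observed in top.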
Created attachment 873210 [details] oom-killer bits from /var/log/messages today

I can see from this that once again it all happened while my big "find" was running. My system periodically does a find / of the whole fs, and this heavy IO seems to be related to the oom-killer triggering. I will keep looking for clues to this connection.
Created attachment 873211 [details] grep oom- messages*, showing how mental the kernel gets and what seems to trigger it
Further investigation into the I/O theory. I've identified 4 distinct times this bug hit, and each time it was during a find-type fs crawl of the entire root tree, causing heavy random read I/O. I may be on a wild goose chase, but so far this is the theme.

Feb 10 14:02:40: expunge
Feb 11 05:22:37: bak-find-list
Feb 13 05:32:06: bak-find-list
Mar 11 05:25:05: bak-find-list

These are scripts I have written and used for at least a decade. As you can see, bak-find-list runs from a nightly cron job. "expunge", a perl script, does a find-type traversal of the whole directory tree. Its memory usage is not large (just enough for the recursive traversal stack); if needed, I could provide a memory profile from a test run. bak-find-list is even simpler, it just runs

/bin/nice -n19 /usr/bin/ionice -c2 -n7 -t find $dirs -xdev -ls 2>/dev/null > $Ddest/find-list-$host

where $dirs is basically every locally attached fs on the box. I am going to run this manually now to see if I can trigger the bug and gather further output from /proc.
I could not recreate the bug on demand so soon after a reboot just by running bak-find-list. I suspect it only triggers after the system has had time to get "messy", with processes being created and exiting. I will try running it periodically to make the bug hit. I should also note that the kernel taint I have is from the nvidia binary driver, which I am forced to use because of kernel/mythtv bug https://code.mythtv.org/trac/ticket/11729, not due to be fixed until 3.14.
It happened again, at 05:23:33 on March 12, and once again within 3 minutes of cron launching the find process. Once again the machine hung, and didn't even respond to alt-sysrq. This time it invoked oom-killer 77 times over 6 minutes before hanging. No new data was captured, because I am asleep at 5am; at least it seems I can reliably reproduce the bug simply by waiting until 5am. I will stick with 3.11 until I get ideas on how to pinpoint this problem.
I should add that this is happening on a 32-bit (PAE) kernel, so that's another "odd" thing about my setup; I'm seeing more and more bugs lately that are specific to 32-bit. I'm attempting a bisect, but have run into yet another 32-bit bug, this time in Igor Gnatenko's bisect helper tools.
I tried 3.13.9-100.fc19 overnight for the first time and it too has the bug. This time, however, instead of a panic or complete freeze, the system survived, albeit with almost every service, GUI, etc. killed. When I woke up I was able to log in and reboot normally, though I forgot to run my script to capture the state info. The kernel invoked oom-killer 68 times, starting at 05:35:43 and continuing for the 3 hours following; 90% of them occurred in just the 4 minutes after 5:35. I still have no way to bisect, as Igor's tools still crash on my 32-bit system. I am running out of ideas.
I am also seeing lockups associated with oom-killer, on Ubuntu's 3.13.0-24.46 kernel as well as mainline v3.14.1. Repro steps are roughly:

0) Machine is a DigitalOcean paravirtualized VM, 64-bit, 512MB RAM, no swap.
1) Start a multithreaded Python workload that takes increasing amounts of memory. In my case it also has bursty CPU usage averaging ~20% over several minutes. After a few hours, process memory reaches physical RAM and oom-killer is invoked.
2) In most cases, CPU usage goes to 100% and the machine becomes unresponsive to ping, SSH, and the (virtual) console. CPU usage looks like this: http://i.imgur.com/7wIhEUD.png

My process does some I/O, but I do not believe the lockups are I/O related as in your case. I have not tried 3.11 yet but am about to.

My symptoms are very similar to those described in this post (scroll to the bottom for the useful info): http://blog.nitrous.io/2014/03/10/stability-and-a-linux-oom-killer-bug.html

They suggest upgrading past 3.14 in order to run a kernel with these commits: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?id=4d4048be8a93769350efa31d2482a038b7de73d0&qt=range&q=9853a407b97d8d066b5a865173a4859a3e69fd8a...4d4048be8a93769350efa31d2482a038b7de73d0

but this did not appear to be effective in my case: v3.14.1 is still problematic, and I've confirmed that it includes those commits.
I cannot reproduce on 3.11 because oom-killer never gets invoked. Instead, so far, the large process's allocations (mmap) fail with ENOMEM. It is unclear to me whether this happens deterministically, or whether oom-killer could still be invoked if a different process allocates and touches a sufficiently large area of memory.
Catlin, after so much investigation over these months, your bug sounds to me more like bug #1064076 (and there was one other similar one I can't recall at the moment). With my issue I'm pretty sure nothing is running that eats memory without bounds; my biggest memory hog is firefox (usually 1 or 2 GB), and I have no indication that anything I run is dying with ENOMEM on 3.11. I still want to capture some basic stats the next time I get the oom problem, and have a script ready to do so, but this is my main machine and I need it stable, so I don't like to crash it "on purpose" :-) Your nitrous.io link sounds very similar, though in many cases my system does not freeze to ping/ssh/console while it oom-kills dozens of processes a minute. It has frozen, but not always. Plus, I'm nearly positive I'm never actually getting anywhere close to OOM in the first place; it seems like 3.12/3.13 are calling oom-killer for no reason. Fedora people: is there any chance we can get 3.14 out for Fedora 19, to test whether the new commits fix anything for me? Or backport the relevant patches?
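For reference, a capture script of the kind mentioned above might look like this. This is my own sketch, not the reporter's actual script; the output path and file selection are assumptions:

```shell
#!/bin/sh
# Snapshot the memory-related /proc state into a timestamped directory, so
# there is something to examine after an oom-killer episode.
# (Hypothetical sketch; the destination path is an assumption.)
dest="${TMPDIR:-/tmp}/oomsnap-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$dest"

# Global memory counters, per-zone state, and the buddy allocator's free lists
for f in meminfo vmstat zoneinfo buddyinfo; do
    cat "/proc/$f" > "$dest/$f" 2>/dev/null
done

# Process table, to see what was alive and how big it was
ps axo pid,stat,pcpu,pmem,rss,args > "$dest/ps" 2>/dev/null || true

echo "snapshot written to $dest"
```

Run from cron every minute (or from a tight loop) while waiting for the bug, the zoneinfo/buddyinfo snapshots in particular would show whether the Normal zone was genuinely short of order-1 pages when oom-killer fired.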
F19 is being rebased to 3.14 this week. You should be able to just use the F20 kernels if you want to test though.
In FC20 the same problem was observed on my 32-bit DELL machine with vmlinuz-3.13.10-200.fc20.i686+PAE running RAID1 with md0 and md1. Below is a snapshot of /var/log/messages that was taken after the system locked up. May 11 01:39:20 spruce kernel: [695803.519110] raid-check invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0 May 11 01:39:20 spruce kernel: [695803.519121] raid-check cpuset=/ mems_allowed=0 May 11 01:39:20 spruce kernel: [695803.519130] CPU: 0 PID: 11135 Comm: raid-check Not tainted 3.13.10-200.fc20.i686+PAE #1 May 11 01:39:20 spruce kernel: [695803.519134] Hardware name: Dell Inc. Dell DM051 /0HJ054, BIOS A05 03/31/2006 May 11 01:39:20 spruce kernel: [695803.519138] 00000000 00000000 c0111da4 c09c3fab e88ec380 c0111dfc c09c1d81 c0b7720c May 11 01:39:20 spruce kernel: [695803.519153] e88ec5b8 003000d0 00000001 00000000 c043c603 00000015 c0111e0c c09ca45c May 11 01:39:20 spruce kernel: [695803.519165] ffffffff 00000000 00000206 c0111dfc c069f783 003000d0 00000001 00000000 May 11 01:39:20 spruce kernel: [695803.519178] Call Trace: May 11 01:39:20 spruce kernel: [695803.519194] [<c09c3fab>] dump_stack+0x41/0x52 May 11 01:39:20 spruce kernel: [695803.519205] [<c09c1d81>] dump_header.isra.10+0x73/0x1bc May 11 01:39:20 spruce kernel: [695803.519216] [<c043c603>] ? native_apic_icr_read+0x23/0x40 May 11 01:39:20 spruce kernel: [695803.519225] [<c09ca45c>] ? apic_timer_interrupt+0x34/0x3c May 11 01:39:20 spruce kernel: [695803.519235] [<c069f783>] ? ___ratelimit+0x83/0xf0 May 11 01:39:20 spruce kernel: [695803.519244] [<c051a47b>] oom_kill_process+0x16b/0x2b0 May 11 01:39:20 spruce kernel: [695803.519252] [<c063dedc>] ? security_capable_noaudit+0x1c/0x30 May 11 01:39:20 spruce kernel: [695803.519260] [<c045e65a>] ? 
has_capability_noaudit+0x1a/0x30 May 11 01:39:20 spruce kernel: [695803.519268] [<c051a9ab>] out_of_memory+0x22b/0x260 May 11 01:39:20 spruce kernel: [695803.519276] [<c051f000>] __alloc_pages_nodemask+0x830/0x8b0 May 11 01:39:20 spruce kernel: [695803.519285] [<c04541f2>] copy_process.part.36+0xd2/0x1090 May 11 01:39:20 spruce kernel: [695803.519292] [<c045533a>] do_fork+0x9a/0x340 May 11 01:39:20 spruce kernel: [695803.519298] [<c04556a5>] SyS_clone+0x25/0x30 May 11 01:39:20 spruce kernel: [695803.519305] [<c09d118d>] sysenter_do_call+0x12/0x28 May 11 01:39:20 spruce kernel: [695803.519309] Mem-Info: May 11 01:39:20 spruce kernel: [695803.519312] DMA per-cpu: May 11 01:39:20 spruce kernel: [695803.519316] CPU 0: hi: 0, btch: 1 usd: 0 May 11 01:39:20 spruce kernel: [695803.519320] CPU 1: hi: 0, btch: 1 usd: 0 May 11 01:39:20 spruce kernel: [695803.519323] Normal per-cpu: May 11 01:39:20 spruce kernel: [695803.519327] CPU 0: hi: 186, btch: 31 usd: 0 May 11 01:39:20 spruce kernel: [695803.519331] CPU 1: hi: 186, btch: 31 usd: 200 May 11 01:39:20 spruce kernel: [695803.519334] HighMem per-cpu: May 11 01:39:20 spruce kernel: [695803.519337] CPU 0: hi: 186, btch: 31 usd: 0 May 11 01:39:20 spruce kernel: [695803.519341] CPU 1: hi: 186, btch: 31 usd: 30 May 11 01:39:20 spruce kernel: [695803.519349] active_anon:74319 inactive_anon:91025 isolated_anon:0 May 11 01:39:20 spruce kernel: [695803.519349] active_file:8554 inactive_file:75918 isolated_file:2 May 11 01:39:20 spruce kernel: [695803.519349] unevictable:0 dirty:4 writeback:22 unstable:0 May 11 01:39:20 spruce kernel: [695803.519349] free:320406 slab_reclaimable:4991 slab_unreclaimable:15137 May 11 01:39:20 spruce kernel: [695803.519349] mapped:8740 shmem:18291 pagetables:1861 bounce:0 May 11 01:39:20 spruce kernel: [695803.519349] free_cma:0 May 11 01:39:20 spruce kernel: [695803.519363] DMA free:3368kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB 
isolated(anon):0kB isolated(file):0kB present:15996kB managed:5840kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:120kB slab_unreclaimable:1916kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes May 11 01:39:20 spruce kernel: [695803.519365] lowmem_reserve[]: 0 839 3015 3015 May 11 01:39:20 spruce kernel: [695803.519376] Normal free:27460kB min:3692kB low:4612kB high:5536kB active_anon:15004kB inactive_anon:16744kB active_file:244kB inactive_file:344kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:16828kB slab_reclaimable:19844kB slab_unreclaimable:58632kB kernel_stack:1944kB pagetables:688kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no May 11 01:39:20 spruce kernel: [695803.519379] lowmem_reserve[]: 0 0 17412 17412 May 11 01:39:20 spruce kernel: [695803.519389] HighMem free:1250796kB min:512kB low:2904kB high:5300kB active_anon:282272kB inactive_anon:347356kB active_file:33972kB inactive_file:303328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:16kB writeback:88kB mapped:34936kB shmem:56336kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:6756kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no May 11 01:39:20 spruce kernel: [695803.519391] lowmem_reserve[]: 0 0 0 0 May 11 01:39:20 spruce kernel: [695803.519396] DMA: 154*4kB (EM) 90*8kB (M) 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 3368kB May 11 01:39:20 spruce kernel: [695803.519416] Normal: 6671*4kB (M) 15*8kB (MR) 9*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27396kB May 11 01:39:20 spruce kernel: [695803.519433] HighMem: 30*4kB (UM) 10596*8kB (UM) 3631*16kB (UM) 9241*32kB (UM) 5652*64kB (UM) 2577*128kB (M) 441*256kB (M) 15*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 1250856kB May 11 01:39:20 spruce kernel: [695803.519454] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB May 11 01:39:20 spruce kernel: [695803.519456] 106670 total pagecache pages May 11 01:39:20 spruce kernel: [695803.519459] 3917 pages in swap cache May 11 01:39:20 spruce kernel: [695803.519462] Swap cache stats: add 35963, delete 32046, find 3612812/3613910 May 11 01:39:20 spruce kernel: [695803.519464] Free swap = 2990240kB May 11 01:39:20 spruce kernel: [695803.519466] Total swap = 3080188kB May 11 01:39:20 spruce kernel: [695803.519468] 783911 pages RAM May 11 01:39:20 spruce kernel: [695803.519470] 557194 pages HighMem/MovableOnly May 11 01:39:20 spruce kernel: [695803.519472] 0 pages reserved May 11 01:39:20 spruce kernel: [695803.519474] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name May 11 01:39:20 spruce kernel: [695803.519486] [ 510] 0 510 15389 6775 30 42 0 systemd-journal May 11 01:39:20 spruce kernel: [695803.519490] [ 518] 0 518 3823 103 7 229 0 lvmetad May 11 01:39:20 spruce kernel: [695803.519494] [ 525] 0 525 3493 166 10 354 -1000 systemd-udevd May 11 01:39:20 spruce kernel: [695803.519500] [ 634] 0 634 3628 181 6 60 -1000 auditd yada...yada...yada.... 
May 11 01:39:20 spruce kernel: [695803.519797] [11129] 0 11129 2288 418 9 101 0 crond May 11 01:39:20 spruce kernel: [695803.519800] [11135] 0 11135 1642 312 8 0 0 raid-check May 11 01:39:20 spruce kernel: [695803.519804] [11214] 989 11214 11902 9818 35 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519808] [11215] 989 11215 12915 10212 38 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519811] [11216] 989 11216 13015 10925 40 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519815] [11242] 989 11242 7462 5362 25 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519818] [12411] 989 12411 9531 7522 31 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519822] [12467] 989 12467 11484 9203 36 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519826] [12520] 989 12520 14602 11610 41 0 0 BackupPC_dump May 11 01:39:20 spruce kernel: [695803.519830] [15611] 0 15611 2288 438 9 101 0 crond May 11 01:39:20 spruce kernel: [695803.519833] [15613] 0 15613 1641 283 8 0 0 sh May 11 01:39:20 spruce kernel: [695803.519837] [15662] 0 15662 1375 115 7 0 0 sleep May 11 01:39:20 spruce kernel: [695803.519840] Out of memory: Kill process 767 (polkitd) score 32 or sacrifice child May 11 01:39:20 spruce kernel: [695803.519845] Killed process 767 (polkitd) total-vm:267572kB, anon-rss:196552kB, file-rss:2560kB May 11 01:39:20 spruce kernel: raid-check invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0 May 11 01:39:21 spruce kernel: raid-check cpuset=/ mems_allowed=0 May 11 01:39:21 spruce systemd: polkit.service: main process exited, code=killed, status=9/KILL May 11 01:39:23 spruce systemd: Unit polkit.service entered failed state. May 11 01:39:26 spruce gnome-session: PolicyKit daemon disconnected from the bus. May 11 01:39:29 spruce gnome-session: We are no longer a registered authentication agent. 
May 11 01:39:29 spruce kernel: CPU: 0 PID: 11135 Comm: raid-check Not tainted 3.13.10-200.fc20.i686+PAE #1 May 11 01:39:30 spruce kernel: Hardware name: Dell Inc. Dell DM051 /0HJ054, BIOS A05 03/31/2006 May 11 01:39:30 spruce kernel: 00000000 00000000 c0111da4 c09c3fab e88ec380 c0111dfc c09c1d81 c0b7720c May 11 01:39:31 spruce kernel: e88ec5b8 003000d0 00000001 00000000 c043c603 00000015 c0111e0c c09ca45c May 11 01:39:31 spruce kernel: ffffffff 00000000 00000206 c0111dfc c069f783 003000d0 00000001 00000000 May 11 01:39:31 spruce kernel: Call Trace: May 11 01:39:32 spruce kernel: [<c09c3fab>] dump_stack+0x41/0x52 May 11 01:39:32 spruce kernel: [<c09c1d81>] dump_header.isra.10+0x73/0x1bc May 11 01:39:33 spruce kernel: [<c043c603>] ? native_apic_icr_read+0x23/0x40 May 11 01:39:33 spruce kernel: [<c09ca45c>] ? apic_timer_interrupt+0x34/0x3c May 11 01:39:33 spruce kernel: [<c069f783>] ? ___ratelimit+0x83/0xf0 May 11 01:39:33 spruce kernel: [<c051a47b>] oom_kill_process+0x16b/0x2b0 May 11 01:39:33 spruce kernel: [<c063dedc>] ? security_capable_noaudit+0x1c/0x30 May 11 01:39:34 spruce kernel: [<c045e65a>] ? 
has_capability_noaudit+0x1a/0x30 May 11 01:39:34 spruce kernel: [<c051a9ab>] out_of_memory+0x22b/0x260 May 11 01:39:34 spruce kernel: [<c051f000>] __alloc_pages_nodemask+0x830/0x8b0 May 11 01:39:35 spruce kernel: [<c04541f2>] copy_process.part.36+0xd2/0x1090 May 11 01:39:35 spruce kernel: [<c045533a>] do_fork+0x9a/0x340 May 11 01:39:35 spruce kernel: [<c04556a5>] SyS_clone+0x25/0x30 May 11 01:39:35 spruce kernel: [<c09d118d>] sysenter_do_call+0x12/0x28 May 11 01:39:36 spruce kernel: Mem-Info: May 11 01:39:36 spruce kernel: DMA per-cpu: May 11 01:39:36 spruce kernel: CPU 0: hi: 0, btch: 1 usd: 0 May 11 01:39:37 spruce kernel: CPU 1: hi: 0, btch: 1 usd: 0 May 11 01:39:37 spruce kernel: Normal per-cpu: May 11 01:39:38 spruce kernel: CPU 0: hi: 186, btch: 31 usd: 0 May 11 01:39:38 spruce kernel: CPU 1: hi: 186, btch: 31 usd: 200 May 11 01:39:39 spruce kernel: HighMem per-cpu: May 11 01:39:39 spruce kernel: CPU 0: hi: 186, btch: 31 usd: 0 May 11 01:39:39 spruce kernel: CPU 1: hi: 186, btch: 31 usd: 30 May 11 01:39:40 spruce systemd: Failed to mark scope session-12956.scope as abandoned : Stale file handle May 11 01:39:41 spruce kernel: active_anon:74319 inactive_anon:91025 isolated_anon:0 active_file:8554 inactive_file:75918 isolated_file:2 unevictable:0 dirty:4 writeback:22 unstable:0 free:320406 slab_reclaimable:4991 slab_unreclaimable:15137 mapped:8740 shmem:18291 pagetables:1861 bounce:0 free_cma:0 May 11 01:39:44 spruce kernel: DMA free:3368kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:5840kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:120kB slab_unreclaimable:1916kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
yes May 11 01:39:44 spruce kernel: lowmem_reserve[]: 0 839 3015 3015 May 11 01:39:45 spruce kernel: Normal free:27460kB min:3692kB low:4612kB high:5536kB active_anon:15004kB inactive_anon:16744kB active_file:244kB inactive_file:344kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:16828kB slab_reclaimable:19844kB slab_unreclaimable:58632kB kernel_stack:1944kB pagetables:688kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no May 11 01:39:45 spruce kernel: lowmem_reserve[]: 0 0 17412 17412 May 11 01:39:45 spruce kernel: HighMem free:1250796kB min:512kB low:2904kB high:5300kB active_anon:282272kB inactive_anon:347356kB active_file:33972kB inactive_file:303328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:16kB writeback:88kB mapped:34936kB shmem:56336kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:6756kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no May 11 01:39:45 spruce kernel: lowmem_reserve[]: 0 0 0 0 May 11 01:39:46 spruce kernel: DMA: 154*4kB (EM) 90*8kB (M) 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 3368kB May 11 01:39:46 spruce kernel: Normal: 6671*4kB (M) 15*8kB (MR) 9*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27396kB May 11 01:39:47 spruce kernel: HighMem: 30*4kB (UM) 10596*8kB (UM) 3631*16kB (UM) 9241*32kB (UM) 5652*64kB (UM) 2577*128kB (M) 441*256kB (M) 15*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 1250856kB May 11 01:39:47 spruce kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB May 11 01:39:47 spruce kernel: 106670 total pagecache pages May 11 01:39:48 spruce kernel: 3917 pages in swap cache May 11 01:39:48 spruce kernel: Swap cache stats: add 35963, delete 32046, find 3612812/3613910 May 11 01:39:49 spruce kernel: Free swap = 2990240kB May 11 01:39:49 spruce kernel: Total swap = 3080188kB May 11 01:39:50 spruce kernel: 783911 pages RAM May 11 01:39:50 spruce kernel: 557194 pages HighMem/MovableOnly May 11 01:39:50 spruce kernel: 0 pages reserved May 11 01:39:50 spruce kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name May 11 01:39:51 spruce kernel: [ 510] 0 510 15389 6775 30 42 0 systemd-journal May 11 01:39:51 spruce kernel: [ 518] 0 518 3823 103 7 229 0 lvmetad May 11 01:39:51 spruce kernel: [ 525] 0 525 3493 166 10 354 -1000 systemd-udevd NOTE: This RAID Status below taken following upgrade to 3.14.3-200.fc20.i686+PAE [root@spruce tmp]# /usr/local/bin/raid_status --------------------------------------------------------------------------------- /dev/md0: Version : 1.0 Creation Time : Mon Oct 25 18:52:00 2010 Raid Level : raid1 Array Size : 511988 (500.07 MiB 524.28 MB) Used Dev Size : 511988 (500.07 MiB 524.28 MB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Update Time : Mon May 12 22:55:19 2014 
State : clean Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Name : localhost.localdomain:0 UUID : 8a8ef0df:e25cee96:66fcc1cf:1043d370 Events : 1324 Number Major Minor RaidDevice State 3 8 1 0 active sync /dev/sda1 2 8 17 1 active sync /dev/sdb1 --------------------------------------------------------------------------------- /dev/md1: Version : 1.1 Creation Time : Mon Oct 25 18:52:03 2010 Raid Level : raid1 Array Size : 976247676 (931.02 GiB 999.68 GB) Used Dev Size : 976247676 (931.02 GiB 999.68 GB) Raid Devices : 2 Total Devices : 2 Persistence : Superblock is persistent Intent Bitmap : Internal Update Time : Mon May 12 23:20:43 2014 State : active Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 Name : localhost.localdomain:1 UUID : d0200041:57c65bc2:870f2f1e:b84fa712 Events : 192907 Number Major Minor RaidDevice State 0 8 2 0 active sync /dev/sda2 1 8 18 1 active sync /dev/sdb2 --------------------------------------------------------------------------------- Personalities : [raid1] md1 : active raid1 sda2[0] sdb2[1] 976247676 blocks super 1.1 [2/2] [UU] bitmap: 3/8 pages [12KB], 65536KB chunk md0 : active raid1 sda1[3] sdb1[2] 511988 blocks super 1.0 [2/2] [UU] unused devices: <none> --------------------------------------------------------------------------------- [root@spruce tmp]# The system responded to "ping", but "ssh" failed and the console was frozen and unresponsive. I reported the above in bug 1096414 for FC20, but the consensus seemed to be that this is a different problem, which sounds very much like the problem reported in this FC19 bug.
Yes! Finally someone with (so far) similar symptoms to mine. I am starting to think this is 32-bit only, which is why almost no one is seeing it. I am also running RAID1. And again, it seems high I/O (raid-check) triggered the bug. I suppose it's impossible to run 3.11.x on F20?? Has the bug hit on 3.14.x yet? I haven't seen 3.14 released for F19 yet. How often do you see this bug? If I leave 3.12 or 3.13 running I get it nearly once a day. Try generating high I/O, like a find -ls over the whole fs, to see if you can trigger it on demand. Thanks!
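The on-demand trigger suggested here is essentially the bak-find-list invocation from earlier in this bug. A sketch of it, pointed at a throwaway tree so it is safe to paste anywhere; to actually stress a system, set "target" to / as the real script does:

```shell
#!/bin/sh
# Reproduce a bak-find-list style crawl at low CPU and I/O priority.
# ionice -c2 -n7 = best-effort class, lowest priority (util-linux).
# Here "target" is a throwaway directory; point it at / for a real test.
target=$(mktemp -d)
touch "$target/a" "$target/b"
out=$(mktemp)

nice -n 19 ionice -c2 -n7 find "$target" -xdev -ls > "$out" 2>/dev/null \
    || find "$target" -xdev -ls > "$out"   # fall back if ionice is unavailable

wc -l < "$out"
```

Even at the lowest best-effort I/O priority this generates the heavy random-read metadata load that seems to precede each oom-killer storm.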
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through, and several of them have gone stale, so we are doing a mass update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.14.4-100.fc19. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 20 and are still experiencing this issue, please change the version to Fedora 20. If you experience different issues, please open a new bug report for those.
I'm testing 3.14.4-100.fc19.i686.PAE as of last night. So far it has survived the morning cron jobs, but I wasn't in graphical mode (X), so we'll see what happens tonight and over the next few nights.
3.14 still has the bug. :-( Last night it crashed at the usual time.

May 23 05:58:45 pog kernel: [100620.358862] gkrellm invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 23 05:59:15 pog kernel: [100650.337304] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 23 06:00:48 pog kernel: [100743.497170] smbd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 23 06:00:49 pog kernel: [100744.906158] curl invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 23 06:00:50 pog kernel: [100745.582613] nmbd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 23 06:00:55 pog kernel: [100750.854371] gkrellm invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 23 06:00:57 pog kernel: [100752.461612] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0

The last /var/log/messages entry is May 23 06:04:18 pog kernel: [100953.815572]. Then a hard crash this time (numlock frozen, won't respond on the network, etc.), so I couldn't run any of the tests / debug outputs I had planned. As with the others, bak-find-list was running at the time of the first oom; I'll attach the first oom log entry. This night I did have X running, along with a few graphical apps.

What can I do to get this problem solved? I want to bisect the kernel, but Igor's tools, which let one easily bisect Fedora-customized kernels, are also broken on 32-bit. I have no idea what to do next. Is there another "Fedora" way to bisect kernels? A link to instructions?
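For what it's worth, absent Igor's Fedora-specific tooling, bisecting upstream is the standard git-bisect loop: mark v3.11 good and v3.12 bad, then build, boot, and test at each step. The mechanics are shown here on a throwaway repo so the commands are concrete (on a real kernel tree each good/bad mark would follow an overnight test):

```shell
#!/bin/sh
# Demonstrate the git-bisect loop on a toy 5-commit repo. On a kernel tree
# the start command would be: git bisect start v3.12 v3.11
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email you@example.com
git config user.name "you"
for i in 1 2 3 4 5; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done

git bisect start HEAD HEAD~4   # bad = tip, good = 4 commits back
git bisect good                # current midpoint tested OK; narrow upward
git bisect bad                 # git now prints the first bad commit
git bisect reset               # return to the original branch
```

With ~14,000 commits between two kernel releases, the loop converges in about 14 build-and-boot steps, painful with a once-a-day reproducer, but tractable.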
Igor's tools bug: Bug 1090186
Created attachment 898761 [details] first oom-killer messages output from last night
Saw the same problem occur again on my FC20 32-bit backup server running 3.14.3-200.fc20.i686+PAE. Seems to happen while BackupPC is running and oom-killer is invoked (see log snippet below from /var/log/messages). Following this, system is no longer accessible from the console, but will respond to "ping". Need to add FC20 to this BUG, since this issue is not unique to FC19.

Jun 1 08:01:09 spruce systemd: Starting Session 10963 of user root.
Jun 1 08:01:09 spruce systemd: Started Session 10963 of user root.
Jun 1 08:02:04 spruce systemd: Starting Session 10965 of user root.
Jun 1 08:02:04 spruce systemd: Started Session 10965 of user root.
Jun 1 08:02:12 spruce kernel: [588837.824723] sh invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
Jun 1 08:02:12 spruce kernel: [588837.824731] sh cpuset=/ mems_allowed=0
Jun 1 08:02:12 spruce kernel: [588837.824737] CPU: 1 PID: 31373 Comm: sh Not tainted 3.14.3-200.fc20.i686+PAE #1
Jun 1 08:02:12 spruce kernel: [588837.824740] Hardware name: Dell Inc. Dell DM051 /0HJ054, BIOS A05 03/31/2006
Jun 1 08:02:12 spruce kernel: [588837.824743] c0c90907 098cbbec 00000000 e2c4fd44 c0a091c7 eb458f90 e2c4fda0 c0a06da7
Jun 1 08:02:12 spruce kernel: [588837.824753] c0bc49bc eb459268 003000d0 00000001 00000000 00000000 00000000 00000004
Jun 1 08:02:12 spruce kernel: [588837.824761] ec07ca03 00000015 e2c4fdfc e2c4fd88 00000206 e2c4fda0 003000d0 00000001
Jun 1 08:02:12 spruce kernel: [588837.824770] Call Trace:
Jun 1 08:02:12 spruce kernel: [588837.824784] [<c0a091c7>] dump_stack+0x41/0x52
Jun 1 08:02:12 spruce kernel: [588837.824790] [<c0a06da7>] dump_header.isra.10+0x7e/0x1d8
Jun 1 08:02:12 spruce kernel: [588837.824797] [<c052bc6d>] oom_kill_process+0x1bd/0x300
Jun 1 08:02:12 spruce kernel: [588837.824805] [<c065d55c>] ? security_capable_noaudit+0x1c/0x30
Jun 1 08:02:12 spruce kernel: [588837.824809] [<c052b9b2>] ? oom_badness+0x92/0xf0
Jun 1 08:02:12 spruce kernel: [588837.824814] [<c052c238>] out_of_memory+0x298/0x2d0
Jun 1 08:02:12 spruce kernel: [588837.824819] [<c0530ae9>] __alloc_pages_nodemask+0x829/0x930
Jun 1 08:02:12 spruce kernel: [588837.824827] [<c045928a>] copy_process.part.36+0xfa/0x1510
Jun 1 08:02:12 spruce kernel: [588837.824833] [<c0a12d4f>] ? __do_page_fault+0x23f/0x510
Jun 1 08:02:12 spruce kernel: [588837.824838] [<c045a842>] do_fork+0xb2/0x350
Jun 1 08:02:12 spruce kernel: [588837.824843] [<c06c8d84>] ? _copy_to_user+0x24/0x30
Jun 1 08:02:12 spruce kernel: [588837.824848] [<c045aba5>] SyS_clone+0x25/0x30
Jun 1 08:02:12 spruce kernel: [588837.824853] [<c0a174cd>] sysenter_do_call+0x12/0x28
Jun 1 08:02:12 spruce kernel: [588837.824856] Mem-Info:
Jun 1 08:02:12 spruce kernel: [588837.824859] DMA per-cpu:
Jun 1 08:02:12 spruce kernel: [588837.824862] CPU 0: hi: 0, btch: 1 usd: 0
Jun 1 08:02:12 spruce kernel: [588837.824864] CPU 1: hi: 0, btch: 1 usd: 0
Jun 1 08:02:12 spruce kernel: [588837.824867] Normal per-cpu:
Jun 1 08:02:12 spruce kernel: [588837.824869] CPU 0: hi: 186, btch: 31 usd: 58
Jun 1 08:02:12 spruce kernel: [588837.824872] CPU 1: hi: 186, btch: 31 usd: 156
Jun 1 08:02:12 spruce kernel: [588837.824874] HighMem per-cpu:
Jun 1 08:02:12 spruce kernel: [588837.824876] CPU 0: hi: 186, btch: 31 usd: 21
Jun 1 08:02:12 spruce kernel: [588837.824878] CPU 1: hi: 186, btch: 31 usd: 0
Jun 1 08:02:12 spruce kernel: [588837.824884] active_anon:1064 inactive_anon:41392 isolated_anon:0
Jun 1 08:02:12 spruce kernel: [588837.824884] active_file:6255 inactive_file:193240 isolated_file:0
Jun 1 08:02:12 spruce kernel: [588837.824884] unevictable:0 dirty:1 writeback:61 unstable:0
Jun 1 08:02:12 spruce kernel: [588837.824884] free:314552 slab_reclaimable:9390 slab_unreclaimable:15438
Jun 1 08:02:12 spruce kernel: [588837.824884] mapped:8591 shmem:17096 pagetables:1599 bounce:0
Jun 1 08:02:12 spruce kernel: [588837.824884] free_cma:0
Jun 1 08:02:12 spruce kernel: [588837.824896] DMA free:3372kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:5520kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:344kB slab_unreclaimable:516kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 1 08:02:12 spruce kernel: [588837.824898] lowmem_reserve[]: 0 839 3015 3015
Jun 1 08:02:13 spruce kernel: [588837.824910] Normal free:3676kB min:3692kB low:4612kB high:5536kB active_anon:0kB inactive_anon:12kB active_file:408kB inactive_file:596kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:132kB mapped:4kB shmem:0kB slab_reclaimable:37216kB slab_unreclaimable:61236kB kernel_stack:2032kB pagetables:580kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1823 all_unreclaimable? yes
Jun 1 08:02:13 spruce kernel: [588837.824912] lowmem_reserve[]: 0 0 17412 17412
Jun 1 08:02:13 spruce kernel: [588837.824923] HighMem free:1251160kB min:512kB low:2908kB high:5304kB active_anon:4256kB inactive_anon:165556kB active_file:24612kB inactive_file:772364kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:4kB writeback:112kB mapped:34360kB shmem:68384kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:5816kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 1 08:02:13 spruce kernel: [588837.824925] lowmem_reserve[]: 0 0 0 0
Jun 1 08:02:13 spruce kernel: [588837.824930] DMA: 6*4kB (UM) 5*8kB (UM) 5*16kB (M) 7*32kB (UM) 7*64kB (UM) 4*128kB (UM) 0*256kB 0*512kB 0*1024kB 1*2048kB (R) 0*4096kB = 3376kB
Jun 1 08:02:13 spruce kernel: [588837.824949] Normal: 490*4kB (M) 0*8kB 6*16kB (R) 2*32kB (R) 9*64kB (R) 6*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*204 etc, etc, etc,....
Not sure what the proper convention is to follow here, but should we change the "Version" field to the latest version in which this bug is detected, i.e., "FC20"?
Harry, looking for commonalities between your setup and mine, we share:

- 32-bit PAE
- RAID1

Here are a few other things I am running; let me know if you are too:

- SCSI (aic7xxx)
- nvidia binary driver

For both of us the bug seems to hit during relatively heavy I/O (in both our cases it's during some sort of backup operation). Perhaps the bug is in the storage layer, buffers for dentries or something, some memory leak that only hits in our particular situation. I'd love to bisect but, as I mention above, I have no clue how to do that "nicely" in Fedora without using Igor's instructions. I'm waiting for someone to point me in the right direction. As for F20/F19: leave it on F19 so it stays open against my OS. It doesn't help me if they fix it in F20 and not F19; however, if they fix it in F19 they will also automatically fix it in F20. When F19 disappears, I'll switch it to F20 when I do the upgrade at that time.
Trevor, per your request, here are the specifics of my system -- a stock Dell Dimension 5150 with 2 Samsung 1TB SATA drives. Doesn't appear to be any commonality with your SCSI or video drivers.

[root@spruce proc]# dmesg | grep "scsi\|i915"
[    0.767529] scsi0 : ata_piix
[    0.767721] scsi1 : ata_piix
[    0.919731] scsi2 : ata_piix
[    0.919921] scsi3 : ata_piix
[    0.964208] scsi 0:0:0:0: CD-ROM TSSTcorp DVD-ROM TS-H352C DE02 PQ: 0 ANSI: 5
[    0.974824] sr0: scsi3-mmc drive: 4x/48x cd/rw xa/form2 cdda tray
[    0.975082] sr 0:0:0:0: Attached scsi CD-ROM sr0
[    0.975292] sr 0:0:0:0: Attached scsi generic sg0 type 5
[    1.158452] scsi 2:0:0:0: Direct-Access ATA SAMSUNG HD103SJ 1AJ1 PQ: 0 ANSI: 5
[    1.158790] sd 2:0:0:0: Attached scsi generic sg1 type 0
[    1.159062] scsi 2:0:1:0: Direct-Access ATA SAMSUNG HD103SJ 1AJ1 PQ: 0 ANSI: 5
[    1.159477] sd 2:0:1:0: Attached scsi generic sg2 type 0
[    2.066797] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    2.066800] i915 0000:00:02.0: registered panic notifier
[    2.066815] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[root@spruce proc]#
[root@spruce proc]# lsmod
Module                  Size  Used by
fuse                   80309  1
bnep                   18895  2
bluetooth             355063  5 bnep
6lowpan_iphc           18167  1 bluetooth
cfg80211              434893  0
rfkill                 20934  4 cfg80211,bluetooth
ip6t_rpfilter          12490  1
ip6t_REJECT            12782  2
xt_conntrack           12664  19
ebtable_nat            12695  0
ebtable_broute         12635  0
bridge                 94695  1 ebtable_broute
stp                    12756  1 bridge
llc                    13645  2 stp,bridge
ebtable_filter         12715  0
ebtables               21315  3 ebtable_broute,ebtable_nat,ebtable_filter
ip6table_nat           12871  1
nf_conntrack_ipv6      14186  11
nf_defrag_ipv6         26115  1 nf_conntrack_ipv6
nf_nat_ipv6            13053  1 ip6table_nat
ip6table_mangle        12620  1
ip6table_security      12630  1
ip6table_raw           12603  1
ip6table_filter        12711  1
ip6_tables             17634  5 ip6table_filter,ip6table_mangle,ip6table_security,ip6table_nat,ip6table_raw
iptable_nat            12867  1
nf_conntrack_ipv4      14192  10
nf_defrag_ipv4         12601  1 nf_conntrack_ipv4
nf_nat_ipv4            13039  1 iptable_nat
nf_nat                 20113  4 nf_nat_ipv4,nf_nat_ipv6,ip6table_nat,iptable_nat
nf_conntrack           82581  8 nf_nat,nf_nat_ipv4,nf_nat_ipv6,xt_conntrack,ip6table_nat,iptable_nat,nf_conntrack_ipv4,nf_conntrack_ipv6
iptable_mangle         12615  1
iptable_security       12625  1
iptable_raw            12598  1
snd_hda_codec_idt      48883  1
snd_hda_codec_generic  57111  1 snd_hda_codec_idt
snd_hda_intel          42689  4
snd_hda_codec         111410  3 snd_hda_codec_idt,snd_hda_codec_generic,snd_hda_intel
snd_hwdep              13232  1 snd_hda_codec
snd_seq                54679  0
snd_seq_device         13824  1 snd_seq
iTCO_wdt               13256  0
gpio_ich               13229  0
iTCO_vendor_support    13243  1 iTCO_wdt
dcdbas                 14403  0
ppdev                  17363  0
snd_pcm                87188  2 snd_hda_codec,snd_hda_intel
microcode              19262  0
serio_raw              13189  0
e100                   35909  0
mii                    13311  1 e100
lpc_ich                16864  0
mfd_core               13022  1 lpc_ich
snd_timer              23750  2 snd_pcm,snd_seq
snd                    59277  17 snd_hwdep,snd_timer,snd_hda_codec_idt,snd_pcm,snd_seq,snd_hda_codec_generic,snd_hda_codec,snd_hda_intel,snd_seq_device
parport_pc             27416  0
i2c_i801               17783  0
soundcore              14123  1 snd
parport                39185  2 ppdev,parport_pc
nfsd                  246795  1
auth_rpcgss            48490  1 nfsd
nfs_acl                12653  1 nfsd
lockd                  78065  1 nfsd
sunrpc                237582  5 nfsd,auth_rpcgss,lockd,nfs_acl
raid1                  34828  2
i915                  718106  3
video                  18720  1 i915
i2c_algo_bit           13065  1 i915
drm_kms_helper         45081  1 i915
ata_generic            12790  0
pata_acpi              12886  0
drm                   231021  4 i915,drm_kms_helper
i2c_core               32746  5 drm,i915,i2c_i801,drm_kms_helper,i2c_algo_bit
[root@spruce proc]#
I got some further, useful info this time. I was running 3.11 for over a month with no problems. I decided to try 3.14.8-100.fc19 to see if it fixed anything. It ran fine for 5 days, but then I realized my bak-find-list script wasn't running due to a stuck lockfile. So I unstuck it, and that very night (last night) the whole system oom/crashed right on schedule, about 8 minutes after the script started. This time I was running a stat-capturing script that copied the following /proc files into a timestamped directory every 5 minutes: loadavg, meminfo, pagetypeinfo, slabinfo, swaps, vmstat. So now I have a snapshot of those files at 05:19, just before the script started, and at 05:24, 4 minutes after it started and 4 minutes before the system crashed. Unfortunately I didn't get a copy right before the crash at 05:28; in fact, the first oom-killer didn't happen until 05:25:31, but hopefully the memory pressure was starting to show by 05:24. I'll attach the relevant info and my script separately. I can now reliably make "bad" kernels blow up every single night just by running a simple backup program, and I can now capture snapshots of key system stats files to see what is happening. Surely someone can help take a look at this very serious, blatant kernel bug for me now? Also, I found the following kernel.org thread which seems entirely relevant: http://marc.info/?t=139258120400001&r=1&w=2 There they fixed a bug regarding RAID1 (which I and the other commenter are using) memory leaks. I think someone commented in there that the same bug may be present in other code areas.
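For reference, here is a minimal sketch of the kind of stat-capture script described above (this is not the actual attached script; the destination path and file name are placeholders): it copies a handful of /proc files into a timestamped directory so the memory state leading up to a crash can be reconstructed afterwards.

```shell
# Sketch of a /proc snapshot script; DEST is a placeholder path.
set -eu
DEST=${DEST:-/tmp/proc-snapshots}
FILES="loadavg meminfo pagetypeinfo slabinfo swaps vmstat"

snap="$DEST/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$snap"
for f in $FILES; do
    # cat rather than cp: /proc files report a size of 0, so copying the
    # stream is the unambiguous way to snapshot them (some, like
    # pagetypeinfo, may be root-only, hence the || true)
    cat "/proc/$f" > "$snap/$f" 2>/dev/null || true
done
echo "snapshot written to $snap"
```

Run from cron every few minutes, e.g. `*/5 * * * * root /usr/local/bin/proc-snap` (hypothetical path).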
Created attachment 914750 [details] key /proc files before the crash-causer ran
Created attachment 914751 [details] key /proc files after the crash-causer started
Here is the meat of my bak-find-list script, which is extremely simple:

#!/bin/perl
$dirs="/ /mnt/peecee/DATA";
$Ddest="/data/Bak/FindList";
system "/bin/nice -n19 /usr/bin/ionice -c2 -n7 -t find $dirs -xdev -ls 2>/dev/null > $Ddest/find-list";

I'll note two potentially "weird" things that may be related:

1. One of $dirs (/mnt/peecee) is a CIFS mount from an old XP box. It has maybe 200,000 files; / has 2.2M files. Harry, does one of your jobs deal with files over a CIFS mount?
2. My $Ddest (/data) is an NFS mount from my file server. That's where the final find list gets saved.

Surely this is an obvious kernel bug: if that one simple find command runs on >=3.12, the kernel blows up; if it doesn't, all is well. This bug doesn't exist in 3.11. I'm not sure how much more cut & dried it could be.
Note, I still cannot reproduce this bug on demand at other times of the day by running the bak-find-list script. I tried running it 10 times just now and it won't cause a problem; it's only at 5:2x in the morning that it blows up. Strangely, when I run the script on my unloaded system now it finishes in ~1 min on the first run and a few seconds on subsequent runs (cached). From the logs I can see that at 5:2x in the morning it takes much, much longer to run, probably as the system grinds down into oom hell.

Jul 4 05:20:01 pog CROND[21031]: (root) CMD (nice -19 /usr/local/script/bak-find-list)

ll /data/Bak/FindList/
28426240 Jul 4 05:26 find-list-pog

So it was still running after 6 minutes at 5am. Checking the line counts in the resulting files, I see the 5am run gets about half-way through:

wc:
 421661 4717711 55786568 /data/Tmp/non-5am-test
 213490 2410154 28426240 /data/Bak/FindList/find-list-pog

The 5am one made it to:

3506 4 -rw-r--r-- 1 root root 1199 Feb 15 2013 /usr/share/doc/hsqldb-1.8.1.3/zaurus/images/sqlSelect.gif

and died. There's not much else cronning around that time (from /var/log/cron.log):

Jul 4 05:05:01 pog CROND[20329]: (mailman) CMD (/usr/lib/mailman/cron/gate_news)
Jul 4 05:05:01 pog CROND[20330]: (root) CMD (nice -19 /usr/local/script/bak-mysql)
Jul 4 05:10:01 pog CROND[20631]: (mailman) CMD (/usr/lib/mailman/cron/gate_news)
Jul 4 05:10:01 pog CROND[20632]: (root) CMD (/usr/lib/sa/sa1 1 1)
Jul 4 05:15:01 pog CROND[20832]: (mailman) CMD (/usr/lib/mailman/cron/gate_news)
Jul 4 05:20:01 pog CROND[21030]: (mailman) CMD (/usr/lib/mailman/cron/gate_news)
Jul 4 05:20:01 pog CROND[21032]: (root) CMD (/usr/lib/sa/sa1 1 1)
Jul 4 05:20:01 pog CROND[21031]: (root) CMD (nice -19 /usr/local/script/bak-find-list)
Jul 4 05:25:01 pog CROND[21297]: (mailman) CMD (/usr/lib/mailman/cron/gate_news)
*crash*

The mailman thing takes <1s to run, so that's probably not important.
I'm running my script in an endless while loop right now to see if I can force recreation of the bug on demand (at not-5am). So far it's 20 iterations through and no crash. Each one takes only 15s now that everything is cached. Besides cron stuff I just can't think of anything else that would magically be different at 5:20am each morning. I looked at all crons, the hourly, daily, monthly, the major users' crons, including root and myself, etc.
Further developments. I had this bug hit a couple of times during normal hours on 3.14.8 and newer, and not while my find script was running. That expands the problem a bit. I was doing some big web browsing sessions using both opera and firefox, and they grew quite large and CPU intensive, especially opera, which bogs down when 30+ modern pages are open that make heavy use of js. In both instances the system recovered once oom-killer ran its course. I also captured once-every-minute (instead of my previous every-5-minutes) stats using my stat-copy script. I noticed at these times, while looking at top, that swap use was near zero (a few kB only). I then started to watch closely what 3.11 (the working kernel) does and noticed that after a day or so I'm using at least 200MB of swap; I seem to get up to 300-500MB of swap and sit there on 3.11. This is how linux has usually behaved for me over the years, and it works great. Recall, I have 8GB RAM and 8GB swap. So perhaps the real problem is that for whatever reason >3.11 (i.e. 3.14) doesn't want to swap anything out, fills up the RAM, runs out, and goes nuts with oom-killer. I am *not* tweaking any swappiness sysctl vars at all; as far as I can tell by looking at the /proc/sys/vm files I am using the normal defaults. I have booted into 3.14.13 just now and will try changing swap parameters to swap more aggressively, to try to get back to the 3.11 swap usage patterns. I will in a moment attach the stat capture script results from around the time of the bug hitting. Note, in every case there was plenty of RAM free! Like 2GB free and 5-6GB "available"! If that isn't a buggy kernel, I don't know what is!
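The swap-tuning knob in question is vm.swappiness. A quick sketch of checking and raising it (I am not claiming a higher value fixes the OOM behavior; it is simply the standard control for how eagerly the kernel swaps anonymous pages):

```shell
# Read the current value; the Fedora default of this era is 60.
set -eu
cat /proc/sys/vm/swappiness

# To make the kernel swap more aggressively (as root):
#   sysctl -w vm.swappiness=90
# To persist across reboots:
#   echo 'vm.swappiness = 90' > /etc/sysctl.d/90-swappiness.conf
```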
The /v/l/m entries from the occurrences this time:

messages.3:Jul 10 15:11:43 pog kernel: [524570.539889] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
messages.3:Jul 10 15:11:43 pog kernel: [524570.542446] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
messages.3:Jul 10 15:11:43 pog kernel: [524570.692433] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
messages.3:Jul 10 15:11:43 pog kernel: [524570.695825] Socket Thread invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
messages.3:Jul 10 23:59:54 pog kernel: [556261.912120] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
messages.3:Jul 10 23:59:55 pog kernel: [556262.520679] gkrellm invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
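Grepping /v/l/m by hand gets tedious; here is a small helper (my own sketch, assuming the log format shown above) that tallies oom-killer invocations by process name and allocation order:

```shell
# Tally "invoked oom-killer" lines from a /var/log/messages-style file.
set -eu
oom_summary() {
    grep 'invoked oom-killer' "$1" \
      | sed 's/.*] \([^ ].*\) invoked oom-killer: .*order=\([0-9]*\).*/\1 (order=\2)/' \
      | sort | uniq -c | sort -rn
}

# Demo on a sample in the quoted log format:
cat > /tmp/oom-sample.log <<'EOF'
Jul 10 15:11:43 pog kernel: [524570.539889] apcupsd invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
Jul 10 15:11:43 pog kernel: [524570.695825] Socket Thread invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
Jul 10 23:59:55 pog kernel: [556262.520679] gkrellm invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
EOF
oom_summary /tmp/oom-sample.log
```

Notably, every invocation above is an order=1 (two contiguous pages) allocation, which is consistent with lowmem fragmentation rather than a true out-of-memory condition.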
Created attachment 921943 [details] key /proc files around time of oom-killer invocation
Created attachment 921944 [details] key /proc files around time of oom-killer invocation
Created attachment 921945 [details] key /proc files around time of oom-killer invocation
Created attachment 921946 [details] key /proc files around time of oom-killer invocation
Created attachment 921947 [details] key /proc files around time of oom-killer invocation
Created attachment 921948 [details] key /proc files around time of oom-killer invocation
Created attachment 921949 [details] key /proc files around time of oom-killer invocation
I was in 3.14.13-100.fc19 for all of 12 hours, left it overnight, and it oom-killed into near oblivion again at 06:27 (strange, one hour later than usual). It didn't freeze this time, but killed nearly everything, including X, and even trying to log in to the console as root got oom'ed a few times before I got in. I looked at what was going on and nearly nothing was running, yet oom-killer was still going at it!! Here's top at 13:37 before I rebooted, after hours of oom thrashing:

top - 13:37:28 up 21:53, 1 user, load average: 0.12, 0.07, 0.06
Tasks: 160 total, 1 running, 159 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.1 us, 1.0 sy, 9.8 ni, 84.1 id, 2.9 wa, 0.1 hi, 0.1 si, 0.0 st
KiB Mem: 8285084 total, 4028976 used, 4256108 free, 196 buffers
KiB Swap: 8388604 total, 580 used, 8388024 free, 3179800 cached

  PID USER  PR  NI VIRT RES  SHR  S %CPU %MEM   TIME+ COMMAND
22827 root  20   0 3520 1264  932 R 86.4  0.0  0:00.17 top
   43 root  20   0    0    0    0 S 28.8  0.0 11:36.32 kswapd0
    1 root  20   0 7848 3768 2076 S  0.0  0.0  1:23.83 systemd
    2 root  20   0    0    0    0 S  0.0  0.0  1:03.52 kthreadd
    3 root  20   0    0    0    0 S  0.0  0.0  0:01.43 ksoftirqd/0
    5 root   0 -20    0    0    0 S  0.0  0.0  0:00.00 kworker/0:+
    7 root  20   0    0    0    0 S  0.0  0.0  0:17.77 rcu_sched
    8 root  20   0    0    0    0 S  0.0  0.0  0:00.00 rcu_bh
    9 root  rt   0    0    0    0 S  0.0  0.0  0:01.14 migration/0
   10 root  rt   0    0    0    0 S  0.0  0.0  0:00.09 watchdog/0
   11 root  rt   0    0    0    0 S  0.0  0.0  0:00.09 watchdog/1
   12 root  rt   0    0    0    0 S  0.0  0.0  0:00.83 migration/1
   13 root  20   0    0    0    0 S  0.0  0.0  0:01.82 ksoftirqd/1
   15 root   0 -20    0    0    0 S  0.0  0.0  0:00.00 kworker/1:+
   16 root  rt   0    0    0    0 S  0.0  0.0  0:00.08 watchdog/2
   17 root  rt   0    0    0    0 S  0.0  0.0  0:00.63 migration/2
   18 root  20   0    0    0    0 S  0.0  0.0  0:01.66 ksoftirqd/2

Note kswapd0 keeps high cpu when the system is hitting this bug. Note also the 4GB free RAM, 8GB free swap, and not even much RAM in cached/buffers. There was literally gobs of RAM free!! Please, someone help me with this bug! I don't know what else I can do to help.
I'll attach next the sequence of oom's in /v/l/m from 6:27 until reboot at 13:36. Late in the sequence you can see when I try to login and it oom's my shell! Again, I have all the /proc/files as before, every 60s, if anyone thinks any of them may be helpful I can attach. Since I have no idea (no feedback) as to what is helpful, I won't bother attaching at the moment. I'm back in 3.11 until something new comes along to try. Please help!
Created attachment 924075 [details] /var/log/messages overview of oom-killer invocations, note sh ones near end
To be thorough, I just tested 3.14 while untainted (got rid of the nvidia binary drivers completely, switched back to nouveau). The system crashed again (cold, dead) with the bug at the usual time in the wee hours.

Aug 6 05:49:42 pog kernel: [46747.588519] Cache I/O invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
Aug 6 05:49:42 pog kernel: [46747.588525] Cache I/O cpuset=/ mems_allowed=0
Aug 6 05:49:42 pog kernel: [46747.588528] CPU: 1 PID: 13453 Comm: Cache I/O Not tainted 3.14.13-100.fc19.i686.PAE #1

This shows the version and taintless status. I will attach the full /v/l/m log from the first oom-killer instance. I'm not sure I've seen "Cache I/O invoked oom-killer" before. There were 59 oom-killer invocations between 05:49 and 05:54, then the system froze solid. The bak-find-list script was still running. I was also running at swappiness 90 to try to get more pages to swap to mitigate the oom; it didn't help at all. (Swappiness was also 90 in the last crash I reported; default values for all tests before that.)
Created attachment 924550 [details] /v/l/message details from first oom-killer
After doing more research I am starting to have a hunch this problem is limited to 32-bit PAE. It may not be present on 32-bit non-PAE, which would explain why so few people are seeing this bug. I have changed the summary accordingly. I also have a hunch it is further limited to RAID1 users, but I won't commit to that statement at the moment. I have since posted to the LKML and received zero replies. Since no one is helping with this bug, there's nothing left for me to do except bisect the kernel to figure out exactly which commit caused this. As per previous comments, I have no idea how I will bisect "the Fedora way" without Igor's tool (also buggy on 32-bit), but I'll hack up something so I can bisect the vanilla kernel with Fedora SPEC tweaks. It seems I'll have to bisect between vanilla 3.11 and 3.13, which means a LOT of bisecting, and since I have to wait until I'm asleep at 5-6am for the machine to crash (which really messes up the people who rely on my machine, not just me), it may take a long while.
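The bisect workflow itself can be sketched as follows. Since the real verdicts require building and booting each candidate kernel and waiting for the 5am window, this is a self-contained demonstration on a throwaway repository (the commits and the regression point are made up); on a real kernel tree you would start with `git bisect start v3.13 v3.11` and answer good/bad per boot test.

```shell
# Self-contained git-bisect demo: 8 commits, regression lands at commit 5.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git config user.email bisect@example.com
git config user.name bisect

for i in 1 2 3 4 5 6 7 8; do
    echo "$i" > state
    git add state
    git commit -qm "commit $i"
done

# Mark HEAD bad and the root commit good; git checks out a midpoint.
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)" >/dev/null
while :; do
    # The "test": on the real kernel this is "build, boot, survive the
    # night"; here we just inspect the checked-out state file.
    if [ "$(cat state)" -ge 5 ]; then
        out=$(git bisect bad)
    else
        out=$(git bisect good)
    fi
    case "$out" in *"is the first bad commit"*) break ;; esac
done
printf '%s\n' "$out" | head -n1
git bisect reset >/dev/null
```

With ~2,400 commits between two kernel releases, bisection needs only about log2 of that, roughly a dozen boot tests, though at one overnight test per step that is still weeks.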
Hello, I have the same problem, but in my case it is on RHEL 4.4 (Linux 2.6.9-42.0.10.ELsmp #1 SMP x86_64 GNU/Linux) with RAID5 storage. I read somewhere that it is not common on 64-bit platforms. I think it is because of the I/O; my problems started when I tried to back up a database to a USB device. I'm still looking for a solution.
Hi Ignacio. I'm pretty sure my bug is limited to the 3.x kernel series, UNLESS your 2.6.9-42 has backports of the same patches that are causing my problem. Does anyone know if that might be the case? So far everyone who has reported a me-too (and what I've read elsewhere) seems to indicate it may be related to RAID. However, if you are running 64-bit you might be hitting a completely different bug; I'm pretty sure this one is limited to 32-bit PAE. Also, in my experience USB devices sometimes make systems go crazy via bugs specific to USB. You should try to recreate the bug using a non-USB direct-attached (i.e. SATA) or network-mounted storage device.
Further update: 3.14.17-100.fc19 still has the bug. It survived one night, but then the bug hit at the exact same time, 05:25am, the next night. Complete freeze this time. Non-tainted. I haven't been able to figure out a bisect method yet. I am getting close to giving up and just reinstalling with a 64-bit kernel. If anyone cares about getting this bug solved, please respond with things to test, because once I convert over, I'm never going back to PAE.
F19 will never move off of 3.14.y at this point. It is unlikely this bug will be solved in the F19 release. It wasn't clear from any of the preceding information that you had a machine with a 64-bit CPU in it. If you said this and we missed it, our apologies. It is always recommended to run the 64-bit kernel on a 64-bit capable machine. PAE is somewhat of a hack and hits corner cases in the memory allocation areas that x86_64 avoids. Updating to a 64-bit kernel may or may not improve your situation, but it is certainly recommended.
Thanks for answering. I will be moving to F20 or F21 (when available) soon anyhow, but I stick with PAE since I just do yum upgrades. I can switch to F20 now for PAE testing if that would help; I thought F20 was on the same kernel as F19, but if it's a higher rev, perhaps trying that might help. Yes, my CPU is 64-bit capable, but for historical reasons and compatibility with some older machines I run, I prefer to run 32-bit on this computer. However, switching to 64-bit is not unreasonable at this stage, especially if there is a PAE bug that will never be fixed. (Of course, it'll also be a day wasted on my part switching my runs-every-daemon-under-the-sun box to 64-bit, but c'est la vie.) My main concern at this point is that I am the only one who can reproduce this bug pretty much on demand, and if I go 64-bit and quit this bug, it'll linger on forever, slowly screwing over the few other people who will hit it from time to time. I've read the blogs, etc., regarding PAE being garbage and de facto deprecated, but no one has made that official, giving us users a false sense of "supportedness". Surely if the current and next release of RHEL ship with a PAE kernel option, then Red Hat has not only an interest but an obligation (to paying RHEL customers) to make sure it isn't buggy, assuming anyone actually uses it. The kernel people should make a choice: either deprecate PAE completely, or support it the same as the other variants. Thanks for replying.
The most recent release of RHEL, RHEL 7, does not support 32-bit x86 CPUs at all. It only ships an x86_64 kernel. While Fedora is certainly not RHEL, and RHEL is certainly not Fedora, it would be disingenuous of us to pretend that 32-bit PAE is well supported in Fedora at this point. The upstream developers don't focus on it at all. We do minimal automated testing, but it is very low on the priority list, particularly for things involving memory over-commit, etc. There have been discussions of demoting i686 to a secondary architecture in Fedora as well, but that hasn't really seriously been proposed for consideration yet. It likely will be at some point in the not distant future.
Still happens with the latest kernel, kernel-PAE-3.14.22-100.fc19.i686. It happened once while I was sitting at the computer at 5:25am. I still haven't switched to x86_64 yet. I was checking other similar bug reports and started hunting around for further causes, still wondering why no one else is seeing this. I found another "odd" setting that may be unique to me. On the linux cmdline I have:

scsi_mod.scan=sync rootflags=data=writeback

I doubt the SCSI one is a problem, but perhaps there's some interplay between PAE and writeback? I looked into this because some other similar bugs found problems with barriers in RAID, which is somewhat related. When I get time and can afford another crash, I'll boot into 3.14.* with writeback taken out.
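For anyone wanting to compare setups, the journal mode and boot parameters actually in effect can be read straight from procfs (a quick sketch; nothing beyond standard procfs paths is assumed):

```shell
# What mount options is the root filesystem really using (is data=writeback
# present?), and what was the kernel command line at boot?
set -eu
awk '$2 == "/" { print $4 }' /proc/mounts   # options column for /
cat /proc/cmdline
```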
Bug still exists as of 3.17.7-200.fc20.i686+PAE. I just upgraded to F20 and its newest kernel hoping it had been fixed, but within 24 hours the bug hit (at 5:22am again). I also saw a precursor to the bug within an hour of booting F20, while doing a simple find . -newer on my home dir (83k files). I could tell the whole system bogged down; just switching windows I could see the frames draw slowly until the find was done. In fact, it seemed boggy even 10-20s after the find finished. Weird. Now I am stuck. I need this box to work, and now that F19 is gone I'm forced to do the painful switch to 64-bit, which I'm going to do next. I hope the bug does not exist in 64-bit.
Created attachment 976105 [details] oom data from /v/l/messages
After my last comment, I "upgraded" from F20 32-bit to F21 x86_64. Joyous fun that was. It's been 5 days of uptime, with lots of stuff running, including the 5am find job. No bogginess or oom's or crashes yet! If this stability continues for another week or two, I'll say this bug is fixed by x86_64. Is that even a bugzilla resolution possibility? I suspect people with PAE will continue to see this bug, if they hit the unknown cause I hit. While I was "upgrading" to 64-bit and recreating all my config, I noticed one or two things that *may* be unique to my computer and *may* have something to do with the problem. I hadn't thought to check/test these things, as these are files you set and never go back into until it comes time to wipe/reinstall (which I hadn't done in 10 years!). The following were on my old system and are not *yet* on the new system. After I'm sure x86_64 is stable I'm going to *possibly* add these back in, and we'll see if they are the cause of the bug.

# cat /etc/sysconfig/harddisks
LOOKAHEAD=1
MULTIPLE_IO=16
EIDE_32BIT=3
USE_DMA=1

and my root fs (as well as all other local disks) in fstab was set to:

rw,data=writeback,barrier=0

*If* these have anything to do with it, perhaps the combination of oldstyle UltraWide SCSI + aggressive tuning (and maybe + PAE) triggered the bug after some kernel change after 3.11. This is all just a wild hunch. Anyone else seeing this bug on PAE could try eliminating one or two of these things to see if that solves the problem.
This message is a notice that Fedora 19 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 19. It is Fedora's policy to close all bug reports from releases that are no longer maintained. Approximately 4 (four) weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 19 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06. Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
As per comment #48 this bug still exists in F20 3.17.7-200.fc20.i686+PAE. The more I think about this, the more I am convinced the problem must be either (or both) i686 PAE or the disk optimizations described in comment #50. Probably in some weird combination of those 2 things plus RAID1 (and maybe SCSI). This bug is confirmed to NOT occur in the latest F21 kernel in x86_64 arch without those disk opts. RAID1 is still present. My 64-bit reinstall has been rock solid since install date. I will one day put my disk opts back in, I think. We'll see if that alone triggers it. In the meantime I leave this bug here for posterity and future victims, it will autoclose when F20 goes EOL and I won't reopen.
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs. Fedora 20 has now been rebased to 3.18.7-100.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel. If you have moved on to Fedora 21, and are still experiencing this issue, please change the version to Fedora 21. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.