Bug 712198

Summary: NULL pointer called from shrink_slab
Product: [Fedora] Fedora Reporter: Josh Stone <jistone>
Component: kernelAssignee: Dave Airlie <airlied>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 15CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, uckelman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-13 08:31:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Screenshot of crashed console
none
full dmesg after crash none

Description Josh Stone 2011-06-09 18:44:43 UTC
Created attachment 503970 [details]
Screenshot of crashed console

Description of problem:
After this machine has been running for a while, it crashes with a null RIP.

Version-Release number of selected component (if applicable):
kernel-2.6.38.7-30.fc15.x86_64

How reproducible:
Happens every few days.

Steps to Reproduce:
No specific steps, just seems to occur when total memory usage is near capacity.
  
Actual results:
Crash, see the attached png for backtrace.

Expected results:
No crash.

Additional info:
It may matter that I don't have any swap enabled on this machine.  It has 4GB of RAM though, and I'm not running memory-heavy workloads.  Most of the memory use is buffers/cache.

I'm dealing with this machine remotely, so the only info I have of this crash is a VNC screencap of the KVM's console, attached as png.  It shows that RIP is null, called near ffffffff810e462d, shrink_slab+0x6d.  objdump -S:

                max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask);
ffffffff810e4623:       44 89 f2                mov    %r14d,%edx
ffffffff810e4626:       31 f6                   xor    %esi,%esi
ffffffff810e4628:       48 89 df                mov    %rbx,%rdi
ffffffff810e462b:       ff 13                   callq  *(%rbx)
                delta = (4 * scanned) / shrinker->seeks;
ffffffff810e462d:       48 63 4b 08             movslq 0x8(%rbx),%rcx

So it appears that the ->shrink pointer was NULL when called.

Comment 1 Chuck Ebbert 2011-06-24 11:03:18 UTC
You pretty much have to have some swap enabled or Bad Things will happen.

Comment 2 Josh Stone 2011-06-24 15:04:49 UTC
I don't recall why this machine doesn't have swap.  I think it may have been an oversight when I was shuffling other partitions, I forgot to add it back.  I'll shrink some space for swap next time I get a chance to boot into recovery.

But still, such Bad Things seem like bugs to me...

Comment 3 Josh Stone 2011-06-24 15:56:29 UTC
For now I just added a file-based swap, and I was still able to trigger a failure.  I ran a "grep foobarbaz -R /" to cause a lot of caching without much hard memory use, but it got killed before completion.  Doesn't appear to be OOM kill though -- this is what "free" reported:

             total       used       free     shared    buffers     cached
Mem:       3785704    3690888      94816          0     433760    2892500
-/+ buffers/cache:     364628    3421076
Swap:       524284          0     524284

It was alive enough this time to grab dmesg, which I'll attach in a moment.

Comment 4 Josh Stone 2011-06-24 16:03:42 UTC
Created attachment 509800 [details]
full dmesg after crash

There's a list_add corruption early on that's foreboding:

[   16.400934] list_add corruption. prev->next should be next (ffffffff81a310d0), but was ffffea000421c0f0. (prev=ffff88012ba13590).

The next Oops from kswapd is near those bad addresses:

[  319.316573] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[  319.316643] BUG: unable to handle kernel paging request at ffffea000421c080
[  319.316790] IP: [<ffffea000421c080>] 0xffffea000421c080

Comment 5 Chuck Ebbert 2011-06-30 09:12:53 UTC
(In reply to comment #4)
> Created attachment 509800 [details]
> full dmesg after crash
> 
> There's a list_add corruption early on that's foreboding:
> 
> [   16.400934] list_add corruption. prev->next should be next
> (ffffffff81a310d0), but was ffffea000421c0f0. (prev=ffff88012ba13590).
> 

This may be grasping at straws, but can you look in /proc/modules and see if any of those three addresses is near the address of a loaded module?

Comment 6 Josh Stone 2011-07-05 00:30:49 UTC
(In reply to comment #5)
> This may be grasping at straws, but can you look in /proc/modules and see if
> any of those three addresses is near the address of a loaded module?

The loaded modules are all ffffffffa... but grasping at a different straw, /proc/kallsyms has an exact hit: ffffffff81a310d0 d shrinker_list

Comment 7 Josh Stone 2011-07-06 00:14:38 UTC
That shrinker_list is only written in [un]register_shrinker().  There aren't a whole lot of callers -- sunrpc is the one crashing here.  But I noticed that i915 also calls register_shrinker, and there's this tidbit earlier in boot:

> [    2.734161] [drm] Initialized drm 1.1.0 20060810
> [    2.749517] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [    2.782219] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
> [    2.782221] [drm] Driver supports precise vblank timestamp query.
> [    2.809880] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [    2.809960] [drm:init_ring_common] *ERROR* render ring initialization failed ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [    2.810121] [drm:init_ring_common] *ERROR* failed to set bsd ring head to zero ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [    2.810198] [drm:init_ring_common] *ERROR* bsd ring initialization failed ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [    2.810364] [drm:i915_driver_load] *ERROR* failed to init modeset
> [    2.824071] i915 0000:00:02.0: PCI INT A disabled
> [    2.824077] i915: probe of 0000:00:02.0 failed with error -5

I basically use this as a remote machine only, so as a workaround I tried booting nomodeset.  The above drm errors went away, the list_add corruption went away, and I could no longer reproduce the crash.

This may all be moot now though.  Since this report was originally on kernel 30, I finally did an update to kernel-2.6.38.8-32.fc15.x86_64, and now I can't reproduce it at all.  Even with modeset enabled, while I still get the drm errors, I don't get the list_add corruption, nor do I get the eventual BUG with heavy cache usage.  I hope it hasn't just moved the corruption elsewhere, but for now it seems fine.

Comment 8 Dave Jones 2011-07-20 18:31:30 UTC
*** Bug 723349 has been marked as a duplicate of this bug. ***

Comment 9 Joel Uckelman 2011-07-21 21:06:19 UTC
The bug I reported, Bug 723349, I believe is not a duplicate of this one. I've installed kernel 2.6.38.8-35.fc15, and the problem still persists for me, unlike for the reporter of this bug.

Comment 10 Josh Stone 2011-07-21 21:56:29 UTC
Hmm, I just tried 35, and the problem with the list_add corruption is back.  I guess my success on 32 was just lucky.  35 with nomodeset is ok though, so I'll be running that way for now.

Joel, I'd suggest looking at dmesg right after boot for signs of trouble.  I can see my list_add corruption within seconds of boot, even though it doesn't manifest into the shrink_slab problem until later hammering the disk.

Comment 11 Joel Uckelman 2011-07-22 11:06:10 UTC
(In reply to comment #10)
> Hmm, I just tried 35, and the problem with the list_add corruption is back.  I
> guess my success on 32 was just lucky.  35 with nomodeset is ok though, so I'll
> be running that way for now.
> 
> Joel, I'd suggest looking at dmesg right after boot for signs of trouble.  I
> can see my list_add corruption within seconds of boot, even though it doesn't
> manifest into the shrink_slab problem until later hammering the disk.

The only thing I see immediately after boot in dmesg which looks like an error are these two (which I've copied out by hand):

[    2.007077] [drm:init_ring_common] *ERROR* failed to set render ring head to zer octl ffffffff head ffffffff tail ffffffff start ffffffff
[    2.007137] [drm:init_ring_common] *ERROR* failed to set render ring head to zer octl ffffffff head ffffffff tail ffffffff start ffffffff

I don't have the list_add corruption warning on booting. There's *loads* more error output once I've triggered the problem with disk I/O, however. I'd be willing to copy it out from the screen, or take photos, if that would be useful in tracking down the problem. (Just say so.

Also: I discovered that I can trigger the oops with solely internal disk access. 'md5sum /dev/hda' will set it off within 30 seconds or so. I had been triggering it by copying from a USB drive to the internal drive before, so this eliminates one more moving part.

Comment 12 Joel Uckelman 2011-07-22 11:33:19 UTC
I tried the suggestion to boot with 'nomodeset', I'm no longer able to trigger the bug when I do that. 

It's probably worth pointing out that my hardware also uses the i915 driver.

Comment 13 Dave Jones 2011-07-22 17:12:40 UTC
Ok, this is sounding like the drm code is corrupting memory.

Comment 14 Dave Airlie 2012-02-12 16:14:26 UTC
Does this still happen, look like a hw issue as well, it appears the driver is failing to bind because the hw isn't there properly, so it fails to clean up after itself.

Comment 15 Joel Uckelman 2012-02-12 20:37:39 UTC
This doesn't happen for me with the 3.2.5-3.fc16.x86_64 kernel.