| Summary: | NULL pointer called from shrink_slab | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Josh Stone <jistone> | ||||||
| Component: | kernel | Assignee: | Dave Airlie <airlied> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | unspecified | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | 15 | CC: | gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, uckelman | ||||||
| Target Milestone: | --- | ||||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2012-02-13 08:31:27 UTC | Type: | --- | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Attachments: |
|
||||||||
You pretty much have to have some swap enabled or Bad Things will happen. I don't recall why this machine doesn't have swap. I think it may have been an oversight when I was shuffling other partitions, I forgot to add it back. I'll shrink some space for swap next time I get a chance to boot into recovery. But still, such Bad Things seem like bugs to me... For now I just added a file-based swap, and I was still able to trigger a failure. I ran a "grep foobarbaz -R /" to cause a lot of caching without much hard memory use, but it got killed before completion. Doesn't appear to be OOM kill though -- this is what "free" reported:
total used free shared buffers cached
Mem: 3785704 3690888 94816 0 433760 2892500
-/+ buffers/cache: 364628 3421076
Swap: 524284 0 524284
It was alive enough this time to grab dmesg, which I'll attach in a moment.
Created attachment 509800 [details]
full dmesg after crash
There's a list_add corruption early on that's foreboding:
[ 16.400934] list_add corruption. prev->next should be next (ffffffff81a310d0), but was ffffea000421c0f0. (prev=ffff88012ba13590).
The next Oops from kswapd is near those bad addresses:
[ 319.316573] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 319.316643] BUG: unable to handle kernel paging request at ffffea000421c080
[ 319.316790] IP: [<ffffea000421c080>] 0xffffea000421c080
(In reply to comment #4) > Created attachment 509800 [details] > full dmesg after crash > > There's a list_add corruption early on that's foreboding: > > [ 16.400934] list_add corruption. prev->next should be next > (ffffffff81a310d0), but was ffffea000421c0f0. (prev=ffff88012ba13590). > This may be grasping at straws, but can you look in /proc/modules and see if any of those three addresses is near the address of a loaded module? (In reply to comment #5) > This may be grasping at straws, but can you look in /proc/modules and see if > any of those three addresses is near the address of a loaded module? The loaded modules are all ffffffffa... but grasping at a different straw, /proc/kallsyms has an exact hit: ffffffff81a310d0 d shrinker_list That shrinker_list is only written in [un]register_shrinker(). There aren't a whole lot of callers -- sunrpc is the one crashing here. But I noticed that i915 also calls register_shrinker, and there's this tidbit earlier in boot:
> [ 2.734161] [drm] Initialized drm 1.1.0 20060810
> [ 2.749517] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [ 2.782219] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
> [ 2.782221] [drm] Driver supports precise vblank timestamp query.
> [ 2.809880] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [ 2.809960] [drm:init_ring_common] *ERROR* render ring initialization failed ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [ 2.810121] [drm:init_ring_common] *ERROR* failed to set bsd ring head to zero ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [ 2.810198] [drm:init_ring_common] *ERROR* bsd ring initialization failed ctl ffffffff head ffffffff tail ffffffff start ffffffff
> [ 2.810364] [drm:i915_driver_load] *ERROR* failed to init modeset
> [ 2.824071] i915 0000:00:02.0: PCI INT A disabled
> [ 2.824077] i915: probe of 0000:00:02.0 failed with error -5
I basically use this as a remote machine only, so as a workaround I tried booting nomodeset. The above drm errors went away, the list_add corruption went away, and I could no longer reproduce the crash.
This may all be moot now though. Since this report was originally on kernel 30, I finally did an update to kernel-2.6.38.8-32.fc15.x86_64, and now I can't reproduce it at all. Even with modeset enabled, while I still get the drm errors, I don't get the list_add corruption, nor do I get the eventual BUG with heavy cache usage. I hope it hasn't just moved the corruption elsewhere, but for now it seems fine.
*** Bug 723349 has been marked as a duplicate of this bug. *** The bug I reported, Bug 723349, I believe is not a duplicate of this one. I've installed kernel 2.6.38.8-35.fc15, and the problem still persists for me, unlike for the reporter of this bug. Hmm, I just tried 35, and the problem with the list_add corruption is back. I guess my success on 32 was just lucky. 35 with nomodeset is ok though, so I'll be running that way for now. Joel, I'd suggest looking at dmesg right after boot for signs of trouble. I can see my list_add corruption within seconds of boot, even though it doesn't manifest into the shrink_slab problem until later hammering the disk. (In reply to comment #10) > Hmm, I just tried 35, and the problem with the list_add corruption is back. I > guess my success on 32 was just lucky. 35 with nomodeset is ok though, so I'll > be running that way for now. > > Joel, I'd suggest looking at dmesg right after boot for signs of trouble. I > can see my list_add corruption within seconds of boot, even though it doesn't > manifest into the shrink_slab problem until later hammering the disk. The only thing I see immediately after boot in dmesg which looks like an error are these two (which I've copied out by hand): [ 2.007077] [drm:init_ring_common] *ERROR* failed to set render ring head to zer octl ffffffff head ffffffff tail ffffffff start ffffffff [ 2.007137] [drm:init_ring_common] *ERROR* failed to set render ring head to zer octl ffffffff head ffffffff tail ffffffff start ffffffff I don't have the list_add corruption warning on booting. There's *loads* more error output once I've triggered the problem with disk I/O, however. I'd be willing to copy it out from the screen, or take photos, if that would be useful in tracking down the problem. (Just say so. Also: I discovered that I can trigger the oops with solely internal disk access. 'md5sum /dev/hda' will set it off within 30 seconds or so. I had been triggering it by copying from a USB drive to the internal drive before, so this eliminates one more moving part. I tried the suggestion to boot with 'nomodeset', I'm no longer able to trigger the bug when I do that. It's probably worth pointing out that my hardware also uses the i915 driver. Ok, this is sounding like the drm code is corrupting memory. Does this still happen, look like a hw issue as well, it appears the driver is failing to bind because the hw isn't there properly, so it fails to clean up after itself. This doesn't happen for me with the 3.2.5-3.fc16.x86_64 kernel. |
Created attachment 503970 [details] Screenshot of crashed console Description of problem: After this machine has been running for a while, it crashes with a null RIP. Version-Release number of selected component (if applicable): kernel-2.6.38.7-30.fc15.x86_64 How reproducible: Happens every few days. Steps to Reproduce: No specific steps, just seems to occur when total memory usage is near capacity. Actual results: Crash, see the attached png for backtrace. Expected results: No crash. Additional info: It may matter that I don't have any swap enabled on this machine. It has 4GB of RAM though, and I'm not running memory-heavy workloads. Most of the memory use is buffers/cache. I'm dealing with this machine remotely, so the only info I have of this crash is a VNC screencap of the KVM's console, attached as png. It shows that RIP is null, called near ffffffff810e462d, shrink_slab+0x6d. objdump -S: max_pass = (*shrinker->shrink)(shrinker, 0, gfp_mask); ffffffff810e4623: 44 89 f2 mov %r14d,%edx ffffffff810e4626: 31 f6 xor %esi,%esi ffffffff810e4628: 48 89 df mov %rbx,%rdi ffffffff810e462b: ff 13 callq *(%rbx) delta = (4 * scanned) / shrinker->seeks; ffffffff810e462d: 48 63 4b 08 movslq 0x8(%rbx),%rcx So it appears that the ->shrink pointer was NULL when called.