Bug 433286 - [ext4dev] Unable to handle kernel paging request at 0xffff810055481000
Summary: [ext4dev] Unable to handle kernel paging request at 0xffff810055481000
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Assignee: Eric Sandeen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2008-02-18 12:32 UTC by Joachim Frieben
Modified: 2008-02-21 14:56 UTC (History)
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2008-02-20 21:42:15 UTC
Type: ---
Embargoed:


Attachments
Kernel oops section in "messages" for kernel 2.6.25-0.50.rc2.fc9 (10.91 KB, text/plain)
2008-02-18 12:32 UTC, Joachim Frieben
my oops (3.83 KB, text/plain)
2008-02-19 15:30 UTC, Eric Sandeen

Description Joachim Frieben 2008-02-18 12:32:03 UTC
Description of problem:
It happens frequently that while building some RPM package as an ordinary
user in /home, a kernel oops is reported in the shell. Afterwards, it's
impossible to enter any new command in the shell. New terminal windows
remain blank. Likewise, it's impossible to log in as an ordinary user
after switching to a virtual console.
It's possible, though, to log in as "root", probably because in this case
the home directory is not located in /home which is the only ext4dev
volume on this system.
Furthermore, it's impossible to reboot/shutdown the system by any normal
means. It will acknowledge the request but simply sit there afterwards.
It's actually necessary to cut the power to shut it down.

Version-Release number of selected component (if applicable):
kernel-2.6.25-0.50.rc2.fc9

How reproducible:
Most of the time.

Steps to Reproduce:
1. Build some RPM package as ordinary user in /home.
  
Actual results:
During compilation, the system beeps and reports a kernel oops.

Expected results:
Compilation completes successfully.

Additional info:
Symptoms are similar to those of bug 428329. The currently installed
compiler collection is gcc-4.3.0-0.9. If I remember correctly, this
issue showed up some time earlier this month.

Comment 1 Joachim Frieben 2008-02-18 12:32:03 UTC
Created attachment 295163
Kernel oops section in "messages" for kernel 2.6.25-0.50.rc2.fc9

Comment 2 Jon Stanley 2008-02-18 16:51:56 UTC
Reassigning to Eric.

Comment 3 Eric Sandeen 2008-02-18 17:06:57 UTC
Joachim, thanks for the new bug.  I don't see that this is at all similar to bug
428329... what am I missing?

I'll look into this.

-Eric

Comment 4 Joachim Frieben 2008-02-18 17:45:30 UTC
(In reply to comment #3)
There is this same "BUG: unable to handle kernel paging request at
virtual address .." message and some "ext3" related entries. Maybe
rather generic, so just forget about that part of my bug report.

Comment 5 Eric Sandeen 2008-02-18 18:18:11 UTC
Does it seem to matter which rpm you're rebuilding?

Jarod has been doing mock builds on ext4 w/o trouble, and I just did a quick
test of an e2fsprogs rebuild under 2.6.25-rc1 on ext4dev w/o problems.

Comment 6 Jarod Wilson 2008-02-18 18:33:07 UTC
So... Most of my building on top of ext4 has been under 2.6.24.x rawhide
kernels. This morning, a simple scp of a file onto the ext4 volume, now running
2.6.25-rc1-git2 or so, and I believe I hit the exact same oops.

Comment 7 Jarod Wilson 2008-02-18 18:39:21 UTC
My oops output:
http://people.redhat.com/jwilson/misc/ext4-go-boom.txt

Comment 8 Eric Sandeen 2008-02-19 03:50:12 UTC
Ok, I can hit this too.  On a "vanilla" 2.6.25-rc2 kernel from rawhide.  On my
own 2.6.25-rc1 kernel, seems I don't hit it, testing "vanilla" 2.6.25-rc1 from
rawhide now.

I suppose I'll go look at the oopsing function, too :)

Comment 9 Eric Sandeen 2008-02-19 15:30:34 UTC
Created attachment 295297
my oops

Here's an oops I hit with all the mballoc stuff uninlined; shows a bit more
about how we got there:

ext4dev:ext4_mb_regular_allocator
    ext4dev:ext4_mb_simple_scan_group
	find_next_zero_bit

Comment 10 Eric Sandeen 2008-02-19 23:16:32 UTC
So, this is a little weird.

This is the mballoc code trying to find bits set in a bitmap at the end of a page.

Some test code that does similar things as far as the bitmap testing goes:

	unsigned long *p;
	unsigned long *p2;
	unsigned long bit;

	p = kzalloc(8192, GFP_KERNEL);
	/* set the first 4098 bytes to 1's, so the first 0 bit is 4098 bytes in */
	memset(p, 0xFF, 4098);

	/* p2 points 4 bytes before the end of the first 4k */
	p2 = (unsigned long *)((char *)p + 4092);
	printk("p at %p, p2 at %p\n", p, p2);
	/* search within 16 bits (2 bytes) from p2 for a zero */
	bit = find_next_zero_bit(p2, 16, 0);
	printk("found 0 bit at offset %lu\n", bit);

... and this finds bit 48... which means it has gone into the next page.
We asked to search 16 bits (2 bytes) at 4 bytes from the end of the page,
but it continued into the next.  This (walking off the page) is what was causing
the oops I think.  This is not the behavior I expect from find_next_zero_bit, so
I'm confused here.
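
The numbers above can be checked in userspace. The following is only a sketch, not kernel code: first_zero_bit() and scan_from_p2() are made-up names, the scan is a plain bit-by-bit loop rather than the kernel's find_next_zero_bit(), and it assumes the x86_64 bit layout where bit i lives in byte i / 8.

```c
#include <stddef.h>
#include <string.h>

#define BUF_BYTES 8192u	/* same size as the kzalloc() in the test */

/* Bit-by-bit scan, bit i living in byte i / 8 (little-endian bit order,
 * as on x86_64). Returns the first zero bit below limit_bits, or
 * limit_bits itself if there is none. Never reads past the limit. */
static size_t first_zero_bit(const unsigned char *base, size_t limit_bits)
{
	size_t i;

	for (i = 0; i < limit_bits; i++)
		if (!(base[i / 8] & (1u << (i % 8))))
			return i;
	return limit_bits;
}

/* Rebuild the test buffer and scan from byte 4092 (the test's p2). */
static size_t scan_from_p2(size_t limit_bits)
{
	static unsigned char buf[BUF_BYTES];

	memset(buf, 0, sizeof(buf));
	memset(buf, 0xFF, 4098);	/* bytes 0..4097 all ones */
	return first_zero_bit(buf + 4092, limit_bits);
}
```

With a 16-bit limit, scan_from_p2(16) returns 16, i.e. "no zero found". Only when the limit is widened, e.g. scan_from_p2(64), does it return 48, and bit 48 from byte 4092 is byte 4098, which is past the 4096-byte boundary.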

Comment 11 Chuck Ebbert 2008-02-20 01:57:23 UTC
The semantics of find_next_zero_bit() are not very well defined. It is assuming
it can access at least one unsigned long starting at offset though, that's for sure.
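
To make that assumption concrete, here is a small arithmetic sketch. The helper name read_crosses_page() and the 4096-byte page size are assumptions for illustration, not anything taken from the kernel sources.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES 4096u	/* assumed page size on x86_64 */

/* True if reading `width` bytes starting `offset` bytes into a page
 * also touches the page that follows it. */
static bool read_crosses_page(size_t offset, size_t width)
{
	return offset % PAGE_BYTES + width > PAGE_BYTES;
}
```

The 2 bytes actually requested in comment 10 stay inside the page (read_crosses_page(4092, 2) is false), but one 8-byte unsigned long read from the same offset does not (read_crosses_page(4092, 8) is true).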


Comment 12 Eric Sandeen 2008-02-20 03:17:19 UTC
For what it's worth, the generic implementation seems to behave differently, and
just returns 16 ("no zeros found") like I'd expect.

I'm not sure offhand if ext4 or find_next_zero_bit needs to be fixed here; also
not quite sure why this was working up 'til now.

I'll see if we can work around it in ext4 for starters, at least.

Comment 13 Eric Sandeen 2008-02-20 03:53:27 UTC
Joy.  ext4 used to work around it already, but it was taken out because it
wasn't aesthetically pleasing, or something.
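
For illustration only, one plausible shape for this kind of workaround is to step the address back to a word boundary and shift the bit range to match, so that only aligned words are ever read. This is a userspace sketch under that assumption, not the code that was removed from ext4 and not the patch referred to in this report. word_search(), aligned_find_next_zero_bit() and demo_search() are hypothetical names, and the bit numbering assumes a little-endian machine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WORD_BYTES sizeof(unsigned long)
#define WORD_BITS  (8 * WORD_BYTES)

/* Hypothetical stand-in for a word-at-a-time search: reads only whole
 * words from an aligned base, and only words holding a bit below size. */
static size_t word_search(const unsigned long *base, size_t size, size_t start)
{
	size_t bit;

	for (bit = start; bit < size; bit++)
		if (!((base[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1UL))
			return bit;
	return size;
}

/* Hypothetical wrapper: step addr back to a word boundary, shift the bit
 * range by the same amount, and defensively clamp the result to size. */
static size_t aligned_find_next_zero_bit(const void *addr, size_t size, size_t start)
{
	uintptr_t a = (uintptr_t)addr;
	size_t fix_bytes = a % WORD_BYTES;
	size_t fix_bits = 8 * fix_bytes;
	const unsigned long *base = (const unsigned long *)(a - fix_bytes);
	size_t ret = word_search(base, size + fix_bits, start + fix_bits) - fix_bits;

	return ret > size ? size : ret;
}

/* Same layout as the test in comment 10: search from byte 4092. */
static size_t demo_search(size_t size)
{
	static unsigned long buf[8192 / sizeof(unsigned long)];

	memset(buf, 0, sizeof(buf));
	memset(buf, 0xFF, 4098);	/* bytes 0..4097 all ones */
	return aligned_find_next_zero_bit((const char *)buf + 4092, size, 0);
}
```

Because an aligned word never straddles a page, demo_search(16) only reads the word at bytes 4088..4095 and returns 16, while demo_search(64) is allowed to look further and returns 48.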

Comment 14 Eric Sandeen 2008-02-20 21:42:15 UTC
I've committed a patch to rawhide which should fix this (along with a few other
issues).

Next kernel build should have it...

Thanks for pointing it out,
-Eric

Comment 15 Joachim Frieben 2008-02-21 14:56:34 UTC
kernel-2.6.25-0.61.rc2.git4.fc9 works fine again. Thanks!

