Description of problem:
It happens frequently that while building some RPM package as an ordinary user in /home, a kernel oops is reported in the shell. Afterwards, it's impossible to enter any new command in the shell. New terminal windows remain blank. Likewise, it's impossible to log in as an ordinary user after switching to a virtual console. It is possible, though, to log in as "root", probably because in this case the home directory is not located in /home, which is the only ext4dev volume on this system. Furthermore, it's impossible to reboot or shut down the system by any normal means. It will acknowledge the request but simply sit there afterwards. It's actually necessary to cut the power to shut it down.

Version-Release number of selected component (if applicable):
kernel-2.6.25-0.50.rc2.fc9

How reproducible:
Most of the time.

Steps to Reproduce:
1. Build some RPM package as ordinary user in /home.

Actual results:
During compilation, the system beeps and reports a kernel oops.

Expected results:
Compilation completes successfully.

Additional info:
Symptoms are similar to those of bug 428329. The currently installed compiler collection is gcc-4.3.0-0.9. If I remember correctly, this issue showed up some time earlier this month.
Created attachment 295163
Kernel oops section in "messages" for kernel 2.6.25-0.50.rc2.fc9
Reassigning to Eric.
Joachim, thanks for the new bug. I don't see that this is at all similar to bug 428329... what am I missing? I'll look into this. -Eric
(In reply to comment #3)
There is this same "BUG: unable to handle kernel paging request at virtual address .." message and some "ext3" related entries. Maybe rather generic, so just forget about that part of my bug report.
Does it seem to matter which rpm you're rebuilding? Jarod has been doing mock builds on ext4 w/o trouble, and I just did a quick test of an e2fsprogs rebuild under 2.6.25-rc1 on ext4dev w/o problems.
So... Most of my building on top of ext4 has been under 2.6.24.x rawhide kernels. This morning, a simple scp of a file onto the ext4 volume, now running 2.6.25-rc1-git2 or so, and I believe I hit the exact same oops.
My oops output: http://people.redhat.com/jwilson/misc/ext4-go-boom.txt
Ok, I can hit this too. On a "vanilla" 2.6.25-rc2 kernel from rawhide. On my own 2.6.25-rc1 kernel, seems I don't hit it, testing "vanilla" 2.6.25-rc1 from rawhide now. I suppose I'll go look at the oopsing function, too :)
Created attachment 295297
my oops

Here's an oops I hit with all the mballoc stuff uninlined; shows a bit more about how we got there:

ext4dev:ext4_mb_regular_allocator
ext4dev:ext4_mb_simple_scan_group
find_next_zero_bit
So, this is a little weird. This is the mballoc code trying to find bits set in a bitmap at the end of a page. Some test code that does similar things as far as the bitmap testing goes:

    unsigned long *p;
    unsigned long *p2;
    unsigned long bit;

    p = kzalloc(8192, GFP_KERNEL);
    /* set first 4k to 1's, no 0 'til 4098 bytes in */
    memset(p, 0xFFFFFFFF, 4098);
    p2 = (unsigned long *)((char *)p + 4092);
    printk("p at %p, p2 at %p\n", p, p2);
    /* search within 16 bits (2 bytes) from p2 for a zero */
    bit = find_next_zero_bit(p2, 16, 0);
    printk("found 0 bit at offset %lu\n", bit);

... and this finds bit 48... which means it has gone into the next page. We asked to search 16 bits (2 bytes) at 4 bytes from the end of the page, but it continued into the next. This (walking off the page) is what was causing the oops, I think. This is not the behavior I expect from find_next_zero_bit, so I'm confused here.
The semantics of find_next_zero_bit() are not very well defined. It is assuming it can access at least one unsigned long starting at offset though, that's for sure.
For what it's worth, the generic implementation seems to behave differently, and just returns 16 ("no zeros found") like I'd expect. I'm not sure offhand if ext4 or find_next_zero_bit needs to be fixed here; also not quite sure why this was working up 'til now. I'll see if we can work around it in ext4 for starters, at least.
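To make the two results concrete, here is a minimal userspace sketch of a strictly bounded search over the same buffer layout as the test code above. The names bounded_next_zero_bit and search_from_4092 are hypothetical, and this is not the kernel's find_next_zero_bit() on any architecture. It assumes bit n lives in byte n/8 at position n%8, which matches little-endian unsigned long bit numbering.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in, NOT the kernel implementation.
 * Returns the index of the first zero bit in [offset, size),
 * or size if every bit in that range is set.
 * It never reads a byte beyond the one that holds bit (size - 1).
 */
static size_t bounded_next_zero_bit(const unsigned char *addr,
                                    size_t size, size_t offset)
{
    for (size_t bit = offset; bit < size; bit++) {
        if (!(addr[bit / CHAR_BIT] & (1u << (bit % CHAR_BIT))))
            return bit;
    }
    return size;
}

/*
 * Rebuild the buffer from the test code: 8192 zeroed bytes with the
 * first 4098 bytes set to 0xFF, then search size_bits bits starting
 * at byte 4092. Returns (size_t)-1 if the allocation fails.
 */
static size_t search_from_4092(size_t size_bits)
{
    unsigned char *p = calloc(8192, 1);
    size_t bit;

    if (!p)
        return (size_t)-1;
    memset(p, 0xFF, 4098);
    bit = bounded_next_zero_bit(p + 4092, size_bits, 0);
    free(p);
    return bit;
}
```

With a 16-bit limit, search_from_4092(16) returns 16, the "no zeros found" result. Widening the limit to 64 bits returns 48, because bytes 4092 through 4097 are all ones (6 bytes, 48 bits) and byte 4098, which sits in the next 4096-byte page, is the first one containing a zero. That is the same offset the oopsing kernel reported for a 16-bit request.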
Joy. ext4 used to work around it already, but it was taken out because it wasn't aesthetically pleasing, or something.
I've committed a patch to rawhide which should fix this (along with a few other issues). The next kernel build should have it... Thanks for pointing it out, -Eric
kernel-2.6.25-0.61.rc2.git4.fc9 works fine again. Thanks!