Bug 138100 - Kernel oopses related to block layer
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: All Linux
Priority: medium, Severity: high
Assigned To: Dave Jones
QA Contact: Brian Brock
Reported: 2004-11-04 11:51 EST by Elliot Lee
Modified: 2015-01-04 17:11 EST (History)
Last Closed: 2005-10-06 00:04:46 EDT
Attachments
Log from another oops (7.27 KB, text/plain)
2004-11-04 11:53 EST, Elliot Lee
Description Elliot Lee 2004-11-04 11:51:30 EST
Here's the info on problems Cristian Gafton hit on a Fedora Core 3
install. One is attached as a log. The other is at
http://people.redhat.com/sopwith/install-trace.jpg :) Since the tree he
was installing from was *supposed* to be final, the first thing I'm
interested in knowing is how many users will be impacted by this problem.
And if it has a substantial impact, hopefully a fix will be easy to
create.

The problems occur with kernel-2.6.9-1.667. The .src.rpm is at
http://download.fedora.redhat.com/pub/fedora/linux/core/development/SRPMS/kernel-2.6.9-1.667.src.rpm
The only 'different' thing of note is that these are happening with
selinux OFF, which we haven't tested much with FC3 so far. (Everyone
wants to run SELinux, right? :)
Comment 1 Elliot Lee 2004-11-04 11:53:52 EST
Created attachment 106174 [details]
Log from another oops
Comment 2 Elliot Lee 2004-11-04 11:55:39 EST
From Andrew Morton:
This looks like the one Dave is working on - it appears to be a null
bh->b_this_page in the buffer ring.

Are any non-4k-blocksize devices involved here?
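
For context on the "buffer ring": the buffer heads attached to one page are linked into a circular list through b_this_page, so a walk that starts at any buffer should come back to it. The following is only a user-space model of that invariant, not kernel code. `struct ring_bh`, `ring_length` and `MAX_BUFS_PER_PAGE` are made-up names for illustration, and the bound of 8 assumes a 4096-byte page split into 512-byte blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct buffer_head. */
struct ring_bh {
    struct ring_bh *b_this_page;   /* next buffer on the same page */
    unsigned int b_size;           /* block size in bytes */
};

/* Assumed upper bound: a 4096-byte page split into 512-byte blocks. */
#define MAX_BUFS_PER_PAGE 8

/*
 * Walk the ring starting at head. Returns the number of buffers if the
 * walk gets back to head, or -1 if it hits a NULL link or does not
 * close within MAX_BUFS_PER_PAGE steps.
 */
int ring_length(const struct ring_bh *head)
{
    const struct ring_bh *bh = head;
    int count = 0;

    if (head == NULL)
        return -1;
    do {
        bh = bh->b_this_page;
        count++;
        if (bh == NULL || count > MAX_BUFS_PER_PAGE)
            return -1;
    } while (bh != head);
    return count;
}
```

A ring with a NULL b_this_page link, which is what Andrew suspects here, makes this walk return -1 instead of dereferencing the bad pointer.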
Comment 3 Elliot Lee 2004-11-04 11:56:36 EST
From Andrew Morton:

Cristian Gafton <gafton@redhat.com> wrote:
>
>  > Are any non-4k-blocksize devices involved here?
>
>  Nope, these are standard 120G IDE drives that are chopped up in 6
>  partitions, with RAID1 across each of the identical partition sets.

Any small filesystem will end up with small software blocksize by default.

Are there any small filesystems or blockdevices in use during this
process?  ramdisks?  Anything like that?

A full `df' after a successful boot would tell us.
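
`df' shows which filesystems are mounted and how big they are, but not their block size directly. As a rough sketch of how the block size could be checked per mount point, the POSIX statvfs() call reports it in f_bsize. The helper names `fs_block_size` and `report_block_sizes` are invented for this example, and the list of paths to check is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/*
 * Return the block size statvfs() reports for the filesystem that
 * contains path, or 0 if the call fails.
 */
unsigned long fs_block_size(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return 0;
    return sv.f_bsize;
}

/* Print one line per path, flagging anything that is not 4096 bytes. */
void report_block_sizes(const char *const paths[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned long bs = fs_block_size(paths[i]);

        if (bs == 0)
            printf("%s: statvfs failed\n", paths[i]);
        else
            printf("%s: block size %lu%s\n", paths[i], bs,
                   bs != 4096 ? " (non-4k)" : "");
    }
}
```

Calling report_block_sizes() with the mount points from the install (for example "/" and "/boot") would show whether any of them ended up with a 1k block size.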
Comment 4 Elliot Lee 2004-11-04 11:57:24 EST
From Stephen Tweedie:

There are two different problems being exhibited here.

The screenshot shows a BUG() in submit_bh(): we failed the check at

        BUG_ON(!buffer_mapped(bh));

That's during unmount; unmount race for Al?


The one from the text log is a null pointer deref, but not on
b_this_page.  I just checked the disassembly of that build (btw, Dave,
there's nothing in the oops that tells us it's an i586 build, not i686
--- we added that for RHEL3, and it would be _really_ useful to have it
here too!)

It's actually the initial deref in __find_get_block_slow that oopsed:
        struct inode *bd_inode = bdev->bd_inode;
but bdev is NULL.

The call chain is:

ext3_new_inode():
  bh2 = <bitmap buffer>;
  err = ext3_journal_get_write_access(handle, bh2);
    rc = do_get_write_access(handle, jh, 0, credits);
      journal_cancel_revoke(handle, jh);
        bh2 = __find_get_block(bh->b_bdev, bh->b_blocknr, bh->b_size);

yet when we get here, we've got bh->b_bdev==NULL, bh->b_blocknr==2, and
bh->b_size==0.

Weird.  How reproducible is this?  When do they occur --- always
during cleanup/unmount?
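
To make the failing state concrete, here is a small user-space model of the buffer described above (b_bdev NULL, b_blocknr 2, b_size 0) and a check that would reject it before any lookup dereferences bdev. This is an illustration only, not a proposed kernel patch: the `mock_*` and `lookup_bh` types and `bh_lookup_args_sane` are hypothetical names, and only the three field names are taken from the call chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures in the call chain. */
struct mock_inode { unsigned int i_blkbits; };
struct mock_bdev  { struct mock_inode *bd_inode; };
struct lookup_bh {
    struct mock_bdev *b_bdev;
    unsigned long long b_blocknr;
    unsigned int b_size;
};

/*
 * Return 1 if the buffer carries what a block lookup needs (a device
 * with an inode, and a non-zero block size), 0 otherwise.
 */
int bh_lookup_args_sane(const struct lookup_bh *bh)
{
    if (bh == NULL || bh->b_bdev == NULL)
        return 0;
    if (bh->b_bdev->bd_inode == NULL)
        return 0;
    if (bh->b_size == 0)
        return 0;
    return 1;
}
```

The buffer from the oops fails the first test, which matches the observation that the initial bdev->bd_inode dereference is what faulted.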
Comment 5 Elliot Lee 2004-11-04 12:03:06 EST
Creating a bugzilla report to track all this info...
Comment 6 Elliot Lee 2004-11-04 13:31:18 EST
I did an 'ext3-on-LVM-on-RAID1' install just now without any problems.
The setup is:
/dev/hda: /boot partition plus six RAID partitions
Three RAID-1 arrays (two partitions each, from the six)
LVM - three PV's on the three RAID-1 arrays, five LV's of varying
sizes, including a couple of filesystems with a 1024-byte block size.

So far the two variables that may be in play are whether LVM is in use
and whether SELinux is in use.
Comment 7 Chris Ricker 2004-11-04 13:35:32 EST
For whatever it's worth, I've been doing RAID-1 with SELinux off and
without LVM. I've not been seeing this.... That's on x86_64, though,
not on i386.
Comment 8 Stephen Tweedie 2004-11-04 14:02:35 EST
Cristian adds:

> Are there any small filesystems or blockdevices in use during this 
> process?  ramdisks?  Anything like that?

Yes, there are. The /boot partition is also a RAID1 that is 100M in
size, created with a 1K block size.
Comment 9 Stephen Tweedie 2004-11-04 14:11:47 EST
Can we check whether it's /boot-on-raid1 that's causing the problem
here?  Elliot's last test did not do that, from what I see.
Comment 10 Elliot Lee 2004-11-04 14:16:46 EST
I just did another install without LVM, but with /boot on RAID1. No
problems.

gafton has another install or two going that may provide additional
insights, but it's not sounding like this is a widespread problem.
Comment 11 Stephen Tweedie 2004-11-04 14:35:36 EST
Andrew Morton adds:

OK.  Is it possible to force the mkfs on that partition to use 4k
blocks?  (mkfs -b 4096)?  That'll use a bit more space so you might
need to increase the size a bit.

If that fixes it then perhaps we've hit a snag when the kernel is
switching block sizes on a device.

Maybe.  IIRC we used to have problems where the fs mount code reads
the filesystem using a 1k block size and we then switch to 4k block
size but find we hadn't successfully stripped the pagecache page's 1k
buffers.

But then, all of this should have happened:

        printk("__find_get_block_slow() failed. "
                "block=%llu, b_blocknr=%llu\n",
                (unsigned long long)block,
                (unsigned long long)bh->b_blocknr);
        printk("b_state=0x%08lx, b_size=%u\n", bh->b_state, bh->b_size);
        printk("device blocksize: %d\n", 1 << bd_inode->i_blkbits);

So I dunno, sorry.
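
The arithmetic behind the 1k-to-4k switch can be sketched as follows. The device block size is 1 << i_blkbits, as in the last printk above, so a buffer left over from the 1k phase has a b_size that no longer matches once the device moves to 4k. This is a toy model under the assumption of a 4096-byte page (i386); `MOCK_PAGE_SIZE` and the three function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u   /* assumed page size (i386) */

/* Device block size implied by blkbits, i.e. 1 << bd_inode->i_blkbits. */
unsigned int device_block_size(unsigned int blkbits)
{
    assert(blkbits < 32);      /* keep the shift well defined */
    return 1u << blkbits;
}

/* How many buffers of that block size fit on one page. */
unsigned int buffers_per_page(unsigned int blkbits)
{
    return MOCK_PAGE_SIZE >> blkbits;
}

/* 1 if a buffer of b_size bytes no longer matches the device block size. */
int buffer_is_stale(unsigned int b_size, unsigned int blkbits)
{
    return b_size != device_block_size(blkbits);
}
```

With blkbits of 10 a page holds four 1024-byte buffers; after a switch to blkbits of 12 it holds one 4096-byte buffer, and any remaining 1024-byte buffer is reported as stale.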
Comment 12 Elliot Lee 2004-11-04 16:47:32 EST
Because of an inability to reproduce the problem during several
installs, we've decided to go ahead with the FC3 release as-is. The
bug still needs fixing, of course. :)
Comment 13 Dave Jones 2005-09-28 04:51:26 EDT
Did this bug ever show up again with later kernels?
