Bug 150568 - Fix kjournald oops for U2
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Stephen Tweedie
Brian Brock
Depends On: 146037 147443 150135
Blocks: 158363
Reported: 2005-03-08 09:11 EST by Stephen Tweedie
Modified: 2010-06-07 00:58 EDT (History)
Doc Type: Bug Fix
Last Closed: 2010-06-07 00:58:19 EDT

Attachments
Fix journal_unmap_buffer race (1.36 KB, patch)
2005-03-08 09:14 EST, Stephen Tweedie
Fix journal_put_journal_head() release of in-use buffers (4.94 KB, patch)
2005-03-08 09:17 EST, Stephen Tweedie

Description Stephen Tweedie 2005-03-08 09:11:31 EST
Description of problem:

There is a race condition in ext3 leading to various problems, most frequently
an oops in kjournald, in journal_commit_transaction().  We have a small, low-risk
patch to fix the reproducible case, but also a larger, higher-risk and more
complete patch that should be queued for a later (U2) release once it has passed
appropriate testing.

This bug is open to track the full fix for U2.

Version-Release number of selected component (if applicable):
GA (2.6.9-5.EL)

How reproducible:
Takes time and many CPUs.  

Steps to Reproduce:
Variable; different users have reported different stress loads to reproduce, but
I have not personally been able to reproduce in-house yet.
Comment 1 Stephen Tweedie 2005-03-08 09:14:29 EST
Created attachment 111778
Fix journal_unmap_buffer race

This is the simpler, low-risk patch which fixes the race window that seems to
be reproducible on demand in testing: journal_unmap_buffer() racing with a
buffer refile in journal_commit_transaction().
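
For readers unfamiliar with this class of bug, the following is a minimal userspace sketch of a check-then-act race, not the kernel code and not the attached patch. The `Buffer` class, its `on_list` field, and the helper names are all hypothetical; the "committer" is simulated by a callback run at the point where the race window would be, so the interleaving is deterministic.

```python
import threading


class Buffer:
    """Toy stand-in for a journaled buffer (hypothetical, not a kernel struct)."""

    def __init__(self):
        self.lock = threading.Lock()  # hypothetical per-buffer state lock
        self.on_list = "running"      # which transaction list holds the buffer


def try_refile(buf, new_list):
    """Committer side: move the buffer to another list if the lock is free."""
    if not buf.lock.acquire(blocking=False):
        return False                  # someone else holds the buffer state
    try:
        buf.on_list = new_list
        return True
    finally:
        buf.lock.release()


def unmap_racy(buf, committer):
    """Sample the state without the lock, then act on a possibly stale value."""
    seen = buf.on_list                # unlocked read
    committer()                       # the committer runs inside the window
    return seen == buf.on_list        # False: we would act on stale state


def unmap_locked(buf, committer):
    """Hold the lock across both the check and the action."""
    with buf.lock:
        seen = buf.on_list
        committer()                   # the refile attempt fails: lock is held
        return seen == buf.on_list


racy = Buffer()
racy_ok = unmap_racy(racy, lambda: try_refile(racy, "checkpoint"))

safe = Buffer()
safe_ok = unmap_locked(safe, lambda: try_refile(safe, "checkpoint"))
```

In this model `racy_ok` is false because the refile slipped in between the check and the action, while `safe_ok` is true because the refile could not proceed until the lock was dropped.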
Comment 2 Stephen Tweedie 2005-03-08 09:17:55 EST
Created attachment 111779
Fix journal_put_journal_head() release of in-use buffers

This is the fuller fix proposed to fix this problem both in upstream 2.6 and in
U2.  More testing is needed at this point, though.  The patch includes a
comment describing its operation in more detail.

This patch should not only close the hole seen in testing, but also some
theoretical race windows that in practice are probably only ever going to be
possible in the presence of other IO errors.  But it is definitely a more
robust approach.
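
As a rough illustration of the general idea in the attachment title (not releasing something that is still in use), here is a toy reference-counting sketch. It is an assumption-laden model rather than the patch itself: the `JournalHead` class, its `refcount` and `transaction` fields, and the `get_head`/`put_head` helpers are hypothetical names chosen for this example.

```python
class JournalHead:
    """Toy model of a per-buffer journaling record (illustrative names only)."""

    def __init__(self):
        self.refcount = 0
        self.transaction = None  # set while a transaction still owns the buffer
        self.freed = False


def get_head(jh):
    """Take a reference so the record cannot be released underneath us."""
    jh.refcount += 1
    return jh


def put_head(jh):
    """Drop one reference; release only when nothing else uses the record."""
    if jh.refcount <= 0:
        raise RuntimeError("put without matching get")
    jh.refcount -= 1
    if jh.refcount == 0 and jh.transaction is None:
        jh.freed = True
    return jh.freed


jh = JournalHead()
get_head(jh)
jh.transaction = "T1"            # a transaction still owns the buffer
still_attached = put_head(jh)    # last reference dropped, but still in use

jh.transaction = None            # transaction has let go of the buffer
get_head(jh)
released = put_head(jh)          # now the record can be released
```

The point of the sketch is that dropping the last reference is not sufficient on its own: the release is skipped while the record is still attached to a transaction.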
Comment 3 Stephen Tweedie 2005-04-15 11:10:51 EDT
This is still getting churned about upstream.  The patch in attachment #111779
is now in the upstream 2.6 kernels.  It's not complete, though, and we're likely
to be improving locking further by getting rid of one of the locks involved.

This is all higher-risk change, and is aimed more at long term maintainability,
performance and theoretical correctness than at fixing any known problems.  The
patch in the U1 kernel fixes all situations we've been able to reproduce up to now.
Comment 8 Rick Hester 2005-10-12 10:48:23 EDT
We just hit a panic in ext3 in RHEL4U1. It doesn't have the same footprint
as the original panic that initiated this bz, but I'm wondering
if it is related.  Did this 'fuller' patch find its way into

FAT: invalid media value (0x5f)
EXT3-fs warning: checktime reached, running e2fsck is recommended
EXT3-fs warning (device loop0): dx_probe: Unrecognised inode hash code 240
Assertion failure in dx_probe() at fs/ext3/namei.c:381: "dx_get_limit(entries)
== dx_root_limit(dir, root->info.info_length)"
kernel BUG at fs/ext3/namei.c:381!
pax[9524]: bugcheck! 0 [1]
Modules linked in: cramfs loop md5 ipv6 autofs4 sunrpc ds yenta_socket
pcmcia_core vfat fat dm_mod button ohci_hcd ehci_hcd tg3 sg ext3 jbd cciss
sym53c8xx scsi_transport_spi sd_mod scsi_mod
