Bug 453809

Summary: Problems with jbd error handling
Product: Red Hat Enterprise Linux 5
Reporter: Bryn M. Reeves <bmr>
Component: kernel
Assignee: Josef Bacik <jbacik>
Status: CLOSED CURRENTRELEASE
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: medium
Priority: medium
Version: 5.2
CC: ahecox, dejohnso, tao
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-04-21 15:37:20 UTC

Description Bryn M. Reeves 2008-07-02 17:41:10 UTC
Description of problem:
Hidehiro Kawai discovered some problems with jbd's error handling that can in
rare failure situations lead to file system corruption during journal recovery:

http://lkml.org/lkml/2008/4/18/154

Although upstream is planning to move past the current jbd implementation in a
way that may make these changes irrelevant, users of the current code are still
vulnerable to these problems. Using a very large number of disks increases the
probability that one of these problems will occur.

Version-Release number of selected component (if applicable):
2.6.18-*

How reproducible:
Very difficult; may require hardware with fault injection capabilities. Problems
discovered via code inspection.

Steps to Reproduce:
1. n/a
  
Additional info:
[PATCH 1/4] jbd: strictly check for write errors on data buffers
[PATCH 2/4] jbd: ordered data integrity fix
[PATCH 3/4] jbd: abort when failed to log metadata buffers
[PATCH 4/4] jbd/ext3: fix error handling for checkpoint io

Comment 2 RHEL Program Management 2009-02-16 15:27:14 UTC
Updating PM score.

Comment 3 RHEL Program Management 2009-02-24 17:32:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Debbie Johnson 2009-04-09 16:32:18 UTC
Josef,

What is the status of this BZ?  Will it be going into 5.4?  It is unclear from the comments, and I have a customer who needs this.  I attached the IT to this.

Debbie

Errors they are seeing...

Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 24588 on sdb3
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 24588
Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 24588 on sdb3
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 24588
Mar  4 15:08:47 jb-601 kernel: JBD: Failed to read block at offset 24585
Mar  4 15:08:47 jb-601 kernel: JBD: recovery failed
Mar  4 15:08:47 jb-601 kernel: EXT3-fs: error loading journal.
Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 21516 on sdb4
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 21516
Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 21685 on sdb4
Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 21685
Mar  4 15:08:47 jb-601 kernel: JBD: recovery failed
Mar  4 15:08:47 jb-601 kernel: EXT3-fs: error loading journal.
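
For triage, messages like the ones above can be grouped by device so that it is clear which journals are reporting missing blocks. The following is only a rough sketch, assuming the syslog prefix format shown in this excerpt; the `summarize` helper and the regular expressions are illustrative, not part of any Red Hat tool.

```python
import re
from collections import Counter

# Syslog prefix as it appears in the excerpt:
#   "Mar  4 15:08:47 jb-601 kernel: <message>"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)
# journal_bmap is the only message in the excerpt that names the device.
BMAP_RE = re.compile(
    r"journal_bmap: journal block not found at offset (?P<offset>\d+) on (?P<dev>\S+)"
)


def summarize(lines):
    """Return (missing journal offsets per device, counts of other messages)."""
    missing = {}
    others = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a kernel syslog line in the assumed format
        msg = m.group("msg")
        b = BMAP_RE.match(msg)
        if b:
            missing.setdefault(b.group("dev"), []).append(int(b.group("offset")))
        else:
            # Drop a trailing numeric offset so messages of one type count together.
            others[re.sub(r"\s+\d+$", "", msg)] += 1
    return missing, others


# A few lines copied from the excerpt above.
SAMPLE = [
    "Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 24588 on sdb3",
    "Mar  4 15:08:47 jb-601 kernel: JBD: bad block at offset 24588",
    "Mar  4 15:08:47 jb-601 kernel: JBD: Failed to read block at offset 24585",
    "Mar  4 15:08:47 jb-601 kernel: JBD: recovery failed",
    "Mar  4 15:08:47 jb-601 kernel: EXT3-fs: error loading journal.",
    "Mar  4 15:08:47 jb-601 kernel: journal_bmap: journal block not found at offset 21516 on sdb4",
]

if __name__ == "__main__":
    missing, others = summarize(SAMPLE)
    for dev, offsets in sorted(missing.items()):
        print(dev, offsets)
    for msg, count in others.items():
        print(count, msg)
```

Only the `journal_bmap` lines carry a device name, so the remaining JBD and EXT3-fs messages are counted by type rather than attributed to a partition.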

Comment 5 Josef Bacik 2009-04-09 16:39:49 UTC
I'm pretty sure Hitachi has already posted these, but it looks like they will not fix the problem your customer is having.  These patches simply make sure we always abort the transaction when we are supposed to; your customer seems to be suffering from data corruption.
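
The principle described here (abort rather than commit when a write fails) can be illustrated with a toy model. This is a minimal sketch and not the jbd code or the posted patches: `ToyJournal`, `JournalAborted` and the callable-per-buffer interface are hypothetical names invented for illustration.

```python
class JournalAborted(Exception):
    """Raised when the toy journal refuses to commit (hypothetical name)."""


class ToyJournal:
    """Toy model: a transaction commits only if every buffer write succeeded."""

    def __init__(self):
        self.aborted = False
        self.committed = []  # ids of transactions that were safely committed

    def commit(self, tid, buffer_writes):
        """buffer_writes: iterable of callables, each returning True on success."""
        if self.aborted:
            raise JournalAborted("journal already aborted; refusing transaction %d" % tid)
        # Run every write and keep every result, so no failure is silently dropped.
        results = [write() for write in buffer_writes]
        if not all(results):
            # Recording a commit here would let a later recovery replay a
            # transaction whose blocks never reached the disk.
            self.aborted = True
            raise JournalAborted("write error in transaction %d" % tid)
        self.committed.append(tid)


if __name__ == "__main__":
    journal = ToyJournal()
    journal.commit(1, [lambda: True, lambda: True])
    try:
        journal.commit(2, [lambda: True, lambda: False])  # simulated I/O error
    except JournalAborted as exc:
        print("aborted:", exc)
    print("committed transactions:", journal.committed)
```

Once aborted, the toy journal rejects all later transactions. Note that a guard of this kind only prevents new inconsistencies from being committed; it does not repair a journal that is already damaged.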

Comment 6 Josef Bacik 2009-04-21 15:37:20 UTC
The recovery patches referenced in c1 have largely been accepted already via other bzs.  I'm closing this bz.