Bug 839891

Summary:	e2fsck leaves ext4 journal unfixed
Product:	[Fedora] Fedora	Reporter:	Bojan Smojver <bojan>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED NEXTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	17	CC:	esandeen, gansalmon, itamar, jonathan, josef, kernel-maint, kzak, madhu.chinakonda, oliver, redhat
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2012-08-31 04:29:26 UTC	Type:	Bug

Description Bojan Smojver 2012-07-13 07:13:23 UTC

Description of problem:

A recent power outage left my VM with ext4 errors. I do not have console access to this box, so I created /fsckoptions that contained "-y" and touched /forcefsck, followed by a reboot. This fixed inodes etc. just fine.
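
For readers who want to repeat the step above, it might be sketched roughly as follows. This is only an illustration: the function name request_forced_fsck and its optional root-directory argument are hypothetical, added so the sketch can be tried against a scratch directory first. Only the two file names and the "-y" content come from the report.

```shell
# Sketch: request a forced, non-interactive fsck on the next boot.
# request_forced_fsck and its ROOT argument are hypothetical names;
# the report itself only mentions /fsckoptions ("-y") and /forcefsck.
request_forced_fsck() {
    root="${1:-/}"
    # Strip a trailing slash so "/" becomes "" and paths join cleanly.
    base="${root%/}"
    # Options that the boot scripts are expected to pass to fsck.
    printf '%s\n' '-y' > "$base/fsckoptions" || return 1
    # Flag file that asks for a check even if the fs looks clean.
    touch "$base/forcefsck" || return 1
}
```

Calling request_forced_fsck with no argument (as root) would write to / itself, after which a reboot is needed for the check to run.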

However, on mount of the root file system, I get:
----------------------
[    6.186393] EXT4-fs warning (device dm-0): ext4_clear_journal_err:4102: Filesystem error recorded from previous mount: IO failure
[    6.186441] EXT4-fs warning (device dm-0): ext4_clear_journal_err:4103: Marking fs in need of filesystem check.
----------------------

It does not matter how many times I reboot the system (which then runs e2fsck -f -y). The warning/error persists.
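
A quick way to confirm whether the warning is still present after a reboot might look like the sketch below. The function name count_journal_err_warnings is hypothetical; the pattern is taken from the log lines quoted above and may need adjusting if the kernel message changes.

```shell
# Sketch: count "error recorded from previous mount" warnings in a kernel
# log read from stdin. count_journal_err_warnings is a hypothetical name.
count_journal_err_warnings() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status at 0.
    grep -c 'ext4_clear_journal_err.*Filesystem error recorded from previous mount' || true
}
```

Typical use would be `dmesg | count_journal_err_warnings`; a result of 0 would suggest the recorded journal error has been cleared.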

Version-Release number of selected component (if applicable):
e2fsprogs-1.42.3-2.fc17.i686

How reproducible:
Always.

Steps to Reproduce:
1. Won't be easy - an actual power outage caused my ext4 root FS to go bonkers.
  
Actual results:
The file system works, but every mount produces the warnings.

Expected results:
e2fsck should clear all errors and warnings from the file system.

Additional info:
This is a VM running inside VMware ESX, I believe.

Comment 1 Bojan Smojver 2012-07-13 07:18:26 UTC

Looking at the second entry of the release notes for the latest e2fsprogs version (http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42.4), maybe this was fixed?
------------------
Fixed e2fsck's handling of the journal's s_errno field. E2fsck was not properly propagating the journal's s_errno field to the superblock field; it was not checking this field if the journal had already been replayed, and if the journal *was* being replayed, the "error bit" wasn't getting flushed out to disk.
------------------
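
Since the release note talks about propagating the journal's error into the superblock, one way to look at the superblock side is the "Filesystem state" line printed by dumpe2fs -h. The sketch below only extracts that line. The function name fs_state is hypothetical, and this shows the superblock state only, not the journal's own s_errno field.

```shell
# Sketch: print the superblock's "Filesystem state" value for a device.
# fs_state is a hypothetical helper name; dumpe2fs -h prints only the
# superblock information. This does not show the journal's s_errno.
fs_state() {
    dumpe2fs -h "$1" 2>/dev/null |
        sed -n 's/^Filesystem state:[[:space:]]*//p'
}
```

For example, `fs_state /dev/vg00/lv00` (run as root) would be expected to print something like "clean" or "clean with errors".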

Comment 2 Bojan Smojver 2012-07-13 07:47:53 UTC

(In reply to comment #1)
> Looking at the second entry of the release notes for the latest e2fsprogs
> version (http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42.4),
> maybe this was fixed?
> ------------------
> Fixed e2fsck's handling of the journal's s_errno field. E2fsck was not
> properly propagating the journal's s_errno field to the superblock field; it
> was not checking this field if the journal had already been replayed, and if
> the journal *was* being replayed, the "error bit" wasn't getting flushed out
> to disk.
> ------------------

Nah. Upgrading to e2fsprogs from Rawhide did not help.

Comment 3 Bojan Smojver 2012-07-13 08:00:52 UTC

The complete log relevant to this (just in case it helps):
------------------------------
[    3.696074] dracut: 4 logical volume(s) in volume group "vg00" now active
[    3.812967] EXT4-fs warning (device dm-0): ext4_clear_journal_err:4102: Filesystem error recorded from previous mount: IO failure
[    3.812985] EXT4-fs warning (device dm-0): ext4_clear_journal_err:4103: Marking fs in need of filesystem check.
[    3.813126] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[    3.838787] dracut: Checking ext4: /dev/vg00/lv00
[    3.838804] dracut: issuing e2fsck -f -y /dev/vg00/lv00
[   25.976305] dracut: Pass 1: Checking inodes, blocks, and sizes
[   25.976547] dracut: Pass 2: Checking directory structure
[   25.976786] dracut: Pass 3: Checking directory connectivity
[   25.979912] dracut: Pass 4: Checking reference counts
[   25.980102] dracut: Pass 5: Checking group summary information
[   25.980371] dracut: /: 153962/516096 files (2.5% non-contiguous), 1154240/2064384 blocks
[   25.980914] dracut: Remounting /dev/vg00/lv00 with -o ro
[   26.050506] EXT4-fs warning (device dm-0): ext4_clear_journal_err:4102: Filesystem error recorded from previous mount: IO failure
[   26.050702] EXT4-fs warning (device dm-0): ext4_clear_journal_err:4103: Marking fs in need of filesystem check.
[   26.051473] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[   26.058566] dracut: Mounted root filesystem /dev/mapper/vg00-lv00
[   26.101207] dracut: Switching root
------------------------------

Comment 4 Eric Sandeen 2012-07-13 13:35:25 UTC

Could you provide an "e2image -r" of /dev/vg00/lv00 ?  You can obfuscate names if you like via other cmdline options, and/or provide it to me offline.  That way I can recreate exactly what you're seeing, and investigate quickly.

If you're not comfortable with that it'll take a bit more time to look into.

Thanks,
-Eric

Comment 5 Bojan Smojver 2012-07-15 01:02:58 UTC

Image file sent offline.

Comment 6 Eric Sandeen 2012-07-16 18:43:57 UTC

Ok, if the image you sent got properly copied by e2image, it seems to be in fairly bad shape.  Was it created while the fs was unmounted?

It won't even mount for me, and the kernel's attempt at log replay borks it to the point where it won't even try again:

# blkid bojan-image.img
bojan-image.img: LABEL="/" UUID="1f59e667-4871-4fc9-a30a-6027249112b7" TYPE="ext4" 

# mount -o loop bojan-image.img mnt/
# dmesg | tail

[ 3218.196407] EXT4-fs error (device loop2): ext4_map_blocks:491: inode #8: block 9624: comm mount: lblock 8901 mapped to illegal pblock (length 1)
[ 3218.209410] jbd2_journal_bmap: journal block not found at offset 8901 on loop2-8
[ 3218.216810] JBD2: bad block at offset 8901
[ 3218.221088] JBD2: recovery failed
[ 3218.224409] EXT4-fs (loop2): error loading journal

# mount -o loop bojan-image.img mnt/
mount: you must specify the filesystem type

# blkid bojan-image.img
#

Ugh.

e2fsck's log replay eats it too:

# e2fsck -fy bojan-image.img
e2fsck 1.41.12 (17-May-2010)
e2fsck: Superblock invalid, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open bojan-image.img

The superblock could not be read or does not describe a correct ext2
filesystem.

# e2fsck -fy bojan-image.img
e2fsck 1.41.12 (17-May-2010)
/: recovering journal
e2fsck: Bad magic number in super-block while trying to re-open /
e2fsck: io manager magic bad!

Double ugh.

Comment 7 Bojan Smojver 2012-07-16 21:34:35 UTC

(In reply to comment #6)

> Double ugh.

As you asked in the private e-mail, I did take the image while the root FS was mounted, so this may be causing the problems you are seeing (I have never used e2image before, so just assumed it would be OK).

I do not have console access to this VM, so I cannot put the root FS into read only mode or unmount it (although I do have root on it).

If there is some other way to collect useful info, let me know.

Comment 8 Eric Sandeen 2012-07-20 19:23:58 UTC

There is some risk to it, but you might be able to do 

# fsfreeze -f /; e2image ....; fsfreeze -u /

There's probably some possibility of a deadlock though (and of course e2image would need to write to some other fs).
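
A slightly more defensive version of that one-liner could be sketched as below, so that the filesystem is thawed even if e2image fails. The function name image_frozen_fs and its three arguments are hypothetical; the deadlock risk mentioned above still applies, and the output path has to live on a different filesystem from the frozen one.

```shell
# Sketch: freeze a mounted fs, take a raw e2image, then always thaw.
# image_frozen_fs and its argument names are hypothetical.
#   $1 = mount point to freeze, $2 = block device, $3 = output file
#        (which must be on a different filesystem from $1)
image_frozen_fs() {
    mnt=$1
    dev=$2
    out=$3
    fsfreeze -f "$mnt" || return 1
    # Remember e2image's status, but do not bail out before thawing.
    rc=0
    e2image -r "$dev" "$out" || rc=$?
    fsfreeze -u "$mnt" || return 1
    return "$rc"
}
```

With the names from this bug, that would be something like `image_frozen_fs / /dev/vg00/lv00 /some/other/fs/root.img`, where the output location is a placeholder.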

If that's too risky for you I'll see if I can repro this some other way.

Comment 9 Eric Sandeen 2012-07-20 20:15:52 UTC

OK I think I can repro this now.

-Eric

Comment 10 Eric Sandeen 2012-07-20 20:31:04 UTC

Ok, the upstream fix is working if the fs is unmounted (without it, it's not cleared even for an unmounted fs).

But it's not working for an ro-mounted filesystem.
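
Because the behaviour differs between an unmounted and an ro-mounted filesystem, it can help to check which case applies before running e2fsck. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, options), follows. The function name mount_mode and its optional second argument are hypothetical.

```shell
# Sketch: report whether a mount point is mounted "ro", "rw", or is
# "unmounted". mount_mode is a hypothetical name; the optional second
# argument lets a file other than /proc/mounts be parsed.
mount_mode() {
    awk -v mp="$1" '
        # Keep the options of the last entry for this mount point,
        # since later mounts hide earlier ones (e.g. rootfs on /).
        $2 == mp { opts = $4 }
        END {
            if (opts == "") { print "unmounted"; exit }
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++)
                if (o[i] == "ro") { print "ro"; exit }
            print "rw"
        }' "${2:-/proc/mounts}"
}
```

In the dracut log above, `mount_mode /sysroot` or a similar path would be the relevant query inside the initramfs; the exact mount point there is an assumption.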

Comment 11 Mike 2012-07-21 00:29:23 UTC

I also see these errors on my Fedora 17 x64 machine. I've run fsck on it before and it always seems to come back with this error, but everything on the file system seems to be OK. The only difference I see is that my file system is ext3 and not ext4, but dmesg says it is using the ext4 subsystem:

[ 8979.778909] EXT4-fs (dm-3): mounting ext3 file system using the ext4 subsystem
[ 8979.838400] EXT4-fs warning (device dm-3): ext4_clear_journal_err:4102: Filesystem error recorded from previous mount: IO failure
[ 8979.838403] EXT4-fs warning (device dm-3): ext4_clear_journal_err:4103: Marking fs in need of filesystem check.
[ 8979.839250] EXT4-fs (dm-3): warning: mounting fs with errors, running e2fsck is recommended
[ 8979.840297] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)

Comment 12 Eric Sandeen 2012-07-21 00:32:10 UTC

Yeah, when fsck runs on the ro-mounted fs it doesn't get cleared.

On F17, ext3 is handled by the ext4 driver; that's why you see all the references to ext4.

I can reproduce this pretty easily - I'm not yet sure what the right fix is, but we'll get there.

-Eric

Comment 13 Bojan Smojver 2012-07-21 02:46:37 UTC

(In reply to comment #12)

> I can reproduce this pretty easily - I'm not yet sure what the right fix is,
> but we'll get there.

So, I'm guessing you don't need me to run the fsfreeze thing, right?

Comment 14 Eric Sandeen 2012-07-23 18:19:47 UTC

Right, I don't think I'll need more info from you, thanks.

-Eric

Comment 15 Bojan Smojver 2012-08-12 05:28:00 UTC

Just FYI, still the case with kernel-3.5.1-1.fc17.i686.

Comment 16 Bojan Smojver 2012-08-23 02:14:46 UTC

Fix?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commit;h=d796c52ef0b71a988364f6109aeb63d79c5b116b

Reassigning to kernel.

Comment 17 Eric Sandeen 2012-08-25 02:16:23 UTC

Yeah, that's the fix, thanks.  It cc's stable, so hopefully it'll get picked up in Fedora soon.

-Eric

Comment 18 Bojan Smojver 2012-08-28 00:20:07 UTC

(In reply to comment #17)
> Yeah, that's the fix, thanks.  it cc's stable so hopefully it'll get picked
> up in fedora soon.

Looks like it got merged into 3.5.3.

Comment 19 Bojan Smojver 2012-08-31 04:29:26 UTC

Fixed in kernel-3.5.3-1.fc17.x86_64. Thanks.

Comment 20 Eric Sandeen 2012-08-31 05:12:30 UTC

Thank you for your expert management of this bug ;)