Bug 442106

Summary: on initial reboot, filesystem had errors
Product: [Fedora] Fedora
Reporter: Bill Nottingham <notting>
Component: anaconda
Assignee: Anaconda Maintenance Team <anaconda-maint-list>
Status: CLOSED RAWHIDE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: low
Docs Contact:
Priority: low
Version: rawhide
CC: cpanceac, esandeen, rvokal
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-04-21 21:38:52 UTC
Type: ---
Bug Depends On:    
Bug Blocks: 235706    
Attachments:
Description       Flags
dumpe2fs output   none
dmesg             none

Description Bill Nottingham 2008-04-11 19:03:44 UTC
Description of problem:

So, it fscked and then rebooted.

Have not tried to reproduce yet, will do shortly.

Version-Release number of selected component (if applicable):

F9 PR (x86_64)

Comment 1 Bill Nottingham 2008-04-11 19:31:56 UTC
Created attachment 302162 [details]
dumpe2fs output

Here's the dumpe2fs output.

Comment 2 Bill Nottingham 2008-04-11 19:48:00 UTC
[root@localhost ~]# e2fsck -n -v /dev/mapper/moofoo 
e2fsck 1.40.8 (13-Mar-2008)
/ contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -(577587--577611)
Fix? no


/: ********** WARNING: Filesystem still has errors **********


   87480 inodes used (0.87%)
     517 non-contiguous inodes (0.6%)
         # of inodes with ind/dind/tind blocks: 6144/30/0
 1200262 blocks used (2.99%)
       0 bad blocks
       1 large file

   65955 regular files
    7776 directories
       8 character device files
       0 block device files
       0 fifos
    3499 links
   13732 symbolic links (13697 fast symbolic links)
       0 sockets
--------
   90970 files


Comment 3 Bill Nottingham 2008-04-11 19:52:38 UTC
I have in the log:

EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 202366355, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 1673128691, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 1148093888, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 1338544658, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 3927257361, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 1674848470, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 2292416204, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 2231364692, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks_sb: bit already cleared for block 28802063
....
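
For reference, a rough sketch of why these block numbers get rejected. It assumes a 4096-byte block size (not stated in this comment) and uses the 164587324416-byte volume size quoted in Comment 6; the in_datazone() helper is a hypothetical name that only mirrors the kind of range check ext3_free_blocks performs, it is not kernel code.

```python
import re

BLOCK_SIZE = 4096                # assumption: 4 KiB filesystem blocks
VOLUME_BYTES = 164587324416      # LV size quoted in Comment 6
TOTAL_BLOCKS = VOLUME_BYTES // BLOCK_SIZE


def in_datazone(block, count, first_data_block=0, total_blocks=TOTAL_BLOCKS):
    """Hypothetical helper: True if [block, block + count) lies inside the fs."""
    return first_data_block <= block and block + count <= total_blocks


# First three messages from the log above.
LOG = """
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 202366355, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 1673128691, count = 1
EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 1148093888, count = 1
"""

PATTERN = re.compile(r"block = (\d+), count = (\d+)")
bad = [(int(b), int(c)) for b, c in PATTERN.findall(LOG)
       if not in_datazone(int(b), int(c))]

for block, count in bad:
    print("block %d (count %d) is past the last block %d"
          % (block, count, TOTAL_BLOCKS - 1))
```

Under those assumptions the filesystem has roughly 40 million blocks, so every block number logged here is far outside it. That is what you would expect if the kernel is reading block pointers out of garbage rather than real metadata.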



Comment 4 Bill Nottingham 2008-04-11 19:57:56 UTC
Created attachment 302164 [details]
dmesg

Comment 5 Bill Nottingham 2008-04-11 20:20:52 UTC
This appears to be happening (at least the errors appear) when running mkinitrd
in the post-install step. No, that doesn't make sense to me either.

Comment 6 Eric Sandeen 2008-04-11 21:28:48 UTC
hrm, can't do an x86_64 install right now, so, I tried:

get the livecd iso notting used.
jump through the hoops to get ext3fs.img extracted from it.
under the 2.6.25-0.218.rc8.git7.fc9.x86_64 kernel...
truncate the image to 164587324416 (the size of notting's lvm volume)
mount the image
grow the image
mount up /sys /boot /proc /dev and whatnot, and chroot in
try mkinitrd
... no problems.

Also, the ext3fs.img seems clean before & after the resize.

When I can try a livecd, I'll either try with errors=panic (assuming ext3 can be
remounted that way; need to check) or try replacing ext3 to get more info
on how it got to that error point.

-Eric

Comment 7 Eric Sandeen 2008-04-11 21:30:51 UTC
notting, if you get to it before me, see if you can pop over to a shell and do
mount -o remount,errors=panic on the dm-2 fs early in the install process...

Comment 8 Bill Nottingham 2008-04-14 22:07:34 UTC
Initial error is:

EXT3-fs error (device dm-2): ext3_free_blocks: Freeing blocks not in datazone - block = 202366355, count = 1
Kernel panic - not syncing: EXT3-fs (device dm-2): panic forced after error


Comment 9 Bill Nottingham 2008-04-14 22:08:20 UTC
panic isn't doing a stack trace. Haven't tried to rebuild the kernel for the livecd.

Comment 10 Eric Sandeen 2008-04-14 22:14:46 UTC
yep... yay, ext3_error() :/

rebuilding ext3.ko to provide more info would maybe help.  I could provide that
if you want... put it in the initrd... rebuild/reburn... yuck.

FWIW I can hit this too.  :)

Trying to cook up a way to hit it outside of anaconda so I can debug it more
easily...

so far, no luck.

-Eric

Comment 11 Eric Sandeen 2008-04-15 22:10:14 UTC
What does anaconda actually copy when it copies the fs image to the system root
disk?

I put a pre-resize early return in livecd.py and the un-resized image has the
corruption that notting and I both saw...

however:

[root@localhost]# e2fsck -fn /dev/mapper/live-osimg-min 

this checks clean...

is this what anaconda is copying?

Thanks,
-Eric

Comment 12 Bill Nottingham 2008-04-15 22:15:48 UTC
Believe so, yes.

Comment 13 Eric Sandeen 2008-04-15 22:26:43 UTC
hrm.

Ok will keep trying to narrow it down....

The inode for the corrupted file itself is intact, but the first indirect block
that it points to is full of garbage, even on my not-resized-but-just-copied LV.

The inode metadata matches the original fs image (atime, ctime, etc. are
unchanged), so whatever corruption happened doesn't seem related to operations
on this inode; the indirect block itself just got clobbered.

-Eric

Comment 14 Eric Sandeen 2008-04-15 22:28:29 UTC
in debugfs-speak:

[root@localhost foo]# debugfs /dev/mapper/live-osimg-min
debugfs 1.40.8 (13-Mar-2008)
debugfs:  stat <86555>
Inode: 86555   Type: regular    Mode:  0644   Flags: 0x0   Generation: 2462646136
User:     0   Group:     0   Size: 150359
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 304
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x47ff7b37 -- Fri Apr 11 09:52:39 2008
atime: 0x47ff78ca -- Fri Apr 11 09:42:18 2008
mtime: 0x47ff78ca -- Fri Apr 11 09:42:18 2008
Size of extra inode fields: 4
Extended attributes stored in inode body:
  selinux = "system_u:object_r:modules_dep_t:s0\000" (35)
BLOCKS:
(0-11):577574-577585, (IND):577586, (12-36):577587-577611
TOTAL: 38

debugfs:  quit
[root@localhost foo]# debugfs /dev/mapper/VolGroup00-LogVol00
debugfs 1.40.8 (13-Mar-2008)
debugfs:  stat <86555>
Inode: 86555   Type: regular    Mode:  0644   Flags: 0x0   Generation: 2462646136
User:     0   Group:     0   Size: 150359
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 304
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x47ff7b37 -- Fri Apr 11 09:52:39 2008
atime: 0x47ff78ca -- Fri Apr 11 09:42:18 2008
mtime: 0x47ff78ca -- Fri Apr 11 09:42:18 2008
Size of extra inode fields: 4
Extended attributes stored in inode body:
  selinux = "system_u:object_r:modules_dep_t:s0\000" (35)
BLOCKS:
(0-11):577574-577585, (IND):577586, (12):2943334241, (13):1038652331,

so the indirect block (577586) is just full of junk at this point.
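
To make the comparison concrete, here is a small sketch of how the two copies of that indirect block could be decoded and diffed. An ext2/ext3 indirect block is an array of little-endian 32-bit block numbers; the 4096-byte block size is inferred from the stat output (304 sectors for 38 blocks). decode_indirect() and first_mismatches() are made-up helper names, and the buffers are built by hand from the numbers above. Only the first two junk values are known, so the rest of the "copy" buffer is zero filler, not real data.

```python
import struct

BLOCK_SIZE = 4096   # inferred: 304 sectors * 512 bytes / 38 blocks
NUM_DIRECT = 12     # logical blocks 0-11 live in the inode itself


def decode_indirect(raw):
    """Decode an indirect block into a list of 32-bit block numbers."""
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[:count * 4]))


def first_mismatches(src_raw, dst_raw, limit=5):
    """Return up to `limit` (index, source_value, copy_value) differences."""
    pairs = zip(decode_indirect(src_raw), decode_indirect(dst_raw))
    diffs = [(i, s, d) for i, (s, d) in enumerate(pairs) if s != d]
    return diffs[:limit]


# Live image: logical blocks 12-36 map to 577587-577611, remaining slots unused.
expected = list(range(577587, 577612))
src_raw = struct.pack("<%dI" % len(expected), *expected).ljust(BLOCK_SIZE, b"\0")

# Copied LV: the two junk values debugfs printed; everything else is filler.
junk = [2943334241, 1038652331]
dst_raw = struct.pack("<2I", *junk).ljust(BLOCK_SIZE, b"\0")

for index, want, got in first_mismatches(src_raw, dst_raw, limit=2):
    print("logical block %d: expected %d, found %d"
          % (index + NUM_DIRECT, want, got))
```

On a real system the two buffers would come from reading block 577586 off each device, for example with dd using bs=4096 skip=577586 count=1.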

Comment 15 Bill Nottingham 2008-04-15 22:34:39 UTC
And this block is correct on the 'live' filesystem, theoretically?

Comment 16 Eric Sandeen 2008-04-16 00:07:44 UTC
Because of the way the copy loop gets sizes:

        readamt = 1024 * 1024 * 8 # 8 megs at a time
        size = float(self._getLiveSizeMB() * 1024 * 1024)
        copied = 0
        while copied < size:
            buf = os.read(osfd, readamt)
            written = os.write(rootfd, buf)
            if (written < readamt) and (written < len(buf)):
                raise RuntimeError, "error copying filesystem!"
            copied += written
            progress.set_fraction(pct = copied / size)
            progress.processEvents()

and _getLiveSizeMB does:

        return blkcnt * blksize / 1024 / 1024

This is going to round the copied size down to the nearest megabyte, no?  And
miss the last part of the filesystem... where this corrupt block we saw just
happens to live...
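
A quick model of the arithmetic, as a sketch only. The block size of 4096 is inferred from the debugfs output in Comment 14, and the image size of 577612 blocks is a made-up value chosen so that the image ends just after the last block of inode 86555; the real size of the live image is not given in this report. live_size_mb() mirrors the quoted expression with explicit floor division, and bytes_copied() is a hypothetical stand-in for the quoted loop, modelling a short read at end of device.

```python
MIB = 1024 * 1024
BLOCK_SIZE = 4096        # inferred from debugfs output, not stated outright
IMAGE_BLOCKS = 577612    # hypothetical image size in blocks


def live_size_mb(blkcnt, blksize):
    """Same expression as _getLiveSizeMB, with integer division spelled out."""
    return blkcnt * blksize // 1024 // 1024


def bytes_copied(total_bytes, size_bytes, readamt=8 * MIB):
    """Model the loop: read readamt at a time until copied >= size."""
    copied = 0
    while copied < size_bytes:
        chunk = min(readamt, total_bytes - copied)  # short read at end of device
        if chunk == 0:
            break
        copied += chunk
    return copied


total = IMAGE_BLOCKS * BLOCK_SIZE
size = live_size_mb(IMAGE_BLOCKS, BLOCK_SIZE) * MIB
copied = bytes_copied(total, size)
first_missed_block = copied // BLOCK_SIZE

print("image bytes: %d" % total)
print("bytes copied: %d" % copied)
print("bytes missed: %d" % (total - copied))
print("first block not copied: %d" % first_missed_block)
```

With these numbers the rounded size is 2256 MB, which is an exact multiple of the 8 MB read size, so the loop stops right there and everything from block 577536 onward (including 577586) never gets written. If the rounded size is not a multiple of 8 MB, the last read overshoots and can cover the tail, which might explain why this doesn't show up on every image.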

-Eric


Comment 17 Bill Nottingham 2008-04-16 02:03:36 UTC
Thanks for the debugging!

Fixed in
http://git.fedorahosted.org/git/?p=anaconda.git;a=commit;h=9083f70668bfeb72b5dfea73d5cc68685e057e8b

Comment 18 Eric Sandeen 2008-04-16 03:16:27 UTC
*** Bug 431647 has been marked as a duplicate of this bug. ***

Comment 19 Bill Nottingham 2008-04-21 21:38:52 UTC
Closing, did a test of this patch.