817262 – File system errors left uncorrected, failure to install along side Mac OS, Mactel boot


Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	anaconda
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Assignee:	Anaconda Maintenance Team
QA Contact:	Fedora Extras Quality Assurance

Reported:	2012-04-28 12:05 UTC by Chris Murphy
Modified:	2013-01-10 08:29 UTC
CC List:	11 users
Last Closed:	2012-07-18 17:46:11 UTC
Type:	Bug

Attachments
program.log (284.84 KB, text/plain) 2012-04-28 12:06 UTC, Chris Murphy
anaconda.log (11.00 KB, text/plain) 2012-04-28 12:06 UTC, Chris Murphy
storage.log (436.42 KB, text/plain) 2012-04-28 12:06 UTC, Chris Murphy
storage.state (attempt 2) (44.00 KB, text/plain) 2012-04-28 12:21 UTC, Chris Murphy
program.log (attempt 2) (314.76 KB, application/octet-stream) 2012-04-28 12:22 UTC, Chris Murphy
photo of failed verification (69.51 KB, image/jpeg) 2012-04-28 21:30 UTC, Chris Murphy
dmesg.txt (116.65 KB, text/plain) 2012-05-03 10:47 UTC, Chris Murphy
dd first 1.5MB of lv_root (32.33 KB, application/x-bzip2) 2012-05-03 18:40 UTC, Chris Murphy
dd 4K at 67959 (4.00 KB, application/octet-stream) 2012-05-03 21:02 UTC, Chris Murphy

Description Chris Murphy 2012-04-28 12:05:14 UTC

Description of problem:
On a disk with an existing Mac OS 10.7 installation and free space, the installer fails with a "File system errors left uncorrected" message.


Version-Release number of selected component (if applicable):
Fedora-17.TC2-x86_64-Live-Desktop.iso

How reproducible:
1 for 1

Steps to Reproduce:

Installation of Mac OS 10.7:

1. A single-partition installation produces three partitions: EFI System, Mac OS, Recovery HD. Used JHFSX as the file system. The system installed and rebooted normally.

2. Within Mac OS, used the following command to resize the JHFSX volume:
diskutil resizevolume /dev/disk0s2 20G

This leaves the same three partitions as before (EFI, Mac OS, Recovery), but the last ~480GB of the disk is now unallocated.

Execute Fedora Installation:

3. dd ISO to USB stick from Mac OS command line.

4. Boot from USB stick choosing EFI Boot option.

5. Run Fedora Installer.

6. Choose installation type "Use Free Space" with review of the partition layout; accept the proposed layout.
  
Actual results:
ext4 file system check fails immediately after the file system is created on vg_f17s-lv_root.

Expected results:
Proper creation of the file system, then copying of the live image to the hard drive.

Additional Information:
Mac OS X 10.7.0
MacBook Pro 4,1 (MBP4,1)
Within the past 7 days the internal disk has received an extended SMART scan: no read or sector errors. Within the past 60 days an ATA Secure Erase was issued to the drive.

When I run e2fsck -f, I get hundreds of errors to fix, and I just gave up. As far as I can tell, the file system is completely trashed.
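The installer message appears to correspond to e2fsck's exit status 4, "File system errors left uncorrected". Per the e2fsck(8) man page, the exit status is the sum of several condition codes, so it can be decoded bit by bit. A minimal POSIX shell sketch of that decoding; describe_fsck_status is a hypothetical helper, not part of e2fsprogs or anaconda:

```shell
# describe_fsck_status STATUS
# Decodes an e2fsck exit status. Per e2fsck(8) the status is a sum of
# condition codes, so each bit is tested separately. (Hypothetical helper.)
describe_fsck_status() {
  status=$1
  if [ "$status" -eq 0 ]; then
    echo "no errors"
    return 0
  fi
  out=""
  for bit in 1 2 4 8 16 32 128; do
    if [ $(( status & bit )) -ne 0 ]; then
      case $bit in
        1)   msg="file system errors corrected" ;;
        2)   msg="system should be rebooted" ;;
        4)   msg="file system errors left uncorrected" ;;
        8)   msg="operational error" ;;
        16)  msg="usage or syntax error" ;;
        32)  msg="canceled by user request" ;;
        128) msg="shared library error" ;;
      esac
      # Join multiple conditions with ", "
      out="${out:+$out, }$msg"
    fi
  done
  echo "$out"
}
```

For example, running a read-only forced check such as `e2fsck -fn /dev/vg_f17s/lv_root` and passing `$?` to the function would be expected to report "file system errors left uncorrected" for a file system as damaged as the one described here.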

Comment 1 Chris Murphy 2012-04-28 12:06:02 UTC

Created attachment 580962
program.log

Comment 2 Chris Murphy 2012-04-28 12:06:22 UTC

Created attachment 580963
anaconda.log

Comment 3 Chris Murphy 2012-04-28 12:06:44 UTC

Created attachment 580964
storage.log

Comment 4 Chris Murphy 2012-04-28 12:20:18 UTC

Did the following immediately after saving logs:

Rebooted Mac OS. Fine.
Rebooted off the USB stick, EFI Boot F17 TC2.
Chose Replace Existing Linux Systems and reviewed the partition layout.
Accepted the default layout and wrote the changes to disk.

Right after the ext4 file system is created on vg_f17s-lv_root, but before the live image starts getting copied, I get the same error message. Attaching a second copy of program.log as well as storage.state, which I forgot to attach the first time around.

Comment 5 Chris Murphy 2012-04-28 12:21:23 UTC

Created attachment 580965
storage.state (attempt 2)

Comment 6 Chris Murphy 2012-04-28 12:22:11 UTC

Created attachment 580966
program.log (attempt 2)

Comment 7 Chris Murphy 2012-04-28 12:44:01 UTC

OK, and now 4 failures in 4 attempts: twice with Use Free Space and twice with Replace Existing.

Comment 8 Chris Murphy 2012-04-28 21:05:48 UTC

Does not occur when installing from Live-Desktop ISO burned to DVD-RW media.
Does not occur when installing from DVD ISO burned to DVD-RW media.

Does occur when installing from Live-Desktop ISO to USB stick using dd. 

The GRUB option to "Verify and Boot" Fedora goes by so quickly that it's not at all obvious it has in fact failed. And I can't find a log or console where the results of this verification test are recorded.

By removing rhgb and quiet from the kernel command line and using a moderate shutter speed, I was able to photograph what I could not see. Although blurry, the message plainly reads:

Checking 004.8%
The media check is complete, the result is <cutoff>
It is not recommended to use this media.
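Because the on-screen result is so easy to miss, another way to check a dd'd stick is to read back exactly as many bytes as the ISO contains and compare checksums. A minimal sketch, assuming a Linux shell with GNU coreutils (sha256sum); verify_written_image is a hypothetical helper and the paths are placeholders:

```shell
# verify_written_image IMAGE TARGET
# IMAGE:  the .iso that was written with dd
# TARGET: the device (e.g. a placeholder like /dev/sdX) or file it was written to
# Only the first size-of-IMAGE bytes of TARGET are hashed, because a USB
# stick is normally larger than the image written to it. (Hypothetical helper.)
verify_written_image() {
  image=$1
  target=$2
  size=$(wc -c < "$image" | tr -d ' ')          # byte count of the image
  want=$(sha256sum < "$image" | cut -d' ' -f1)
  got=$(head -c "$size" "$target" | sha256sum | cut -d' ' -f1)
  if [ "$want" = "$got" ]; then
    echo "OK"
    return 0
  fi
  echo "MISMATCH"
  return 1
}
```

For example, `verify_written_image Fedora-17.TC2-x86_64-Live-Desktop.iso /dev/sdX`. If the stick returns read errors, head reports them on stderr and the hashes will not match.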

Comment 9 Chris Murphy 2012-04-28 21:30:38 UTC

Created attachment 581006
photo of failed verification

This probably needs to be reassigned to livecd-tools; it's not an anaconda problem. The media fails verification, but the user is not informed.

Comment 10 Chris Murphy 2012-04-29 22:48:24 UTC

Since the media fails verification, I think this bug should be closed or marked as a duplicate of bug 817419.

Comment 11 Matthew Garrett 2012-04-30 15:23:13 UTC

I suspect that this is unrelated to the media verification failure.

Comment 12 Chris Murphy 2012-04-30 15:40:29 UTC

This bug is not reproducible when the USB stick is produced with livecd-iso-to-disk.

Comment 13 Matthew Garrett 2012-04-30 17:24:40 UTC

Can't reproduce this here - I wrote TC2 to a USB stick, booted it, installed it into 40GB of free space and the install completed successfully.

Comment 14 Chris Murphy 2012-04-30 17:43:21 UTC

I can only test with one Lexar 2GB stick and an MBP4,1[1], but I have re-imaged the stick twice, zeroing it in between, and have reproduced the results 6+ times in a row. It's jury-rigged, but I could dd to a 4GB CF card in a FireWire adapter and see what happens.

[1] The MBP8,2 hangs, so it is not testable. And a Kingston 16GB stick isn't compatible with the MBP4,1: loading the initramfs takes so long that GRUB times out and reboots, and Mac OS X takes 45 minutes to boot.

Comment 15 Eric Sandeen 2012-04-30 17:52:22 UTC

Can you pop over to a shell after a failure, and see what's in dmesg after the failed mount?

Ideally, getting an "e2image -r" of the problematic device might be interesting so I can see just what's corrupted.

Is it an encrypted root?  Does the storage configuration affect it at all?  What if you install to a plain partition?

Comment 16 Chris Murphy 2012-04-30 18:50:23 UTC

I didn't originally capture dmesg so I will need to reimage the USB stick to get it.

I am not configuring an encrypted root on the target disk, however the target disk did previously contain Mac OS X and F17 Beta encrypted roots. The target disk was not wiped, merely repartitioned for this round of testing.

It does not fail with "Use All Space"; it only fails when installing alongside the Mac OS X partitions: EFI System (FAT32), MacOS X (JHFSX, case-sensitive journaled), and Recovery HD (JHFS+, case-insensitive journaled).

Comment 17 Chris Murphy 2012-04-30 20:31:26 UTC

*sigh* OK, now with the same stick imaged a 3rd time, I get a hang at dracut:starting plymouth daemon. So I'm not sure whether there are still problems dd'ing the image to the stick or whether the stick itself is having issues, and I don't have a substitute, as this MBP's firmware (or whatever) is beyond fussy about booting off USB anyway.

dd'ing to a FW400 CF card: boots, installs, no errors.

Comment 18 Chris Murphy 2012-04-30 22:19:45 UTC

Same stick imaged a 4th time: boots, installs, no errors. *shrug*

It's possible the problem is stick-related. It's also possible that the state of the target disk, if that is a factor at all, has changed enough that the problem won't reproduce until I regress all the way back (full disk encryption) and try again.

For grins, I reinstalled Mac OS X as a single partition and resized it, as in steps 1 and 2 of the Description. The problem still does not occur.

Comment 19 Chris Murphy 2012-05-03 10:47:29 UTC

Created attachment 581825
dmesg.txt

Reproduced it, but I don't know exactly which option I'm choosing causes it.

1. Installed from the DVD ISO: "Use All Space" + encryption + XFS for lv_root and lv_home + default packages.
2. Firstboot, launched Firefox, went to a random web site.
3. Rebooted Mac OS X 10.7 from an external disk (or DVD), repartitioned as a single partition, installed the OS, no encryption.
4. Rebooted into the new installation of Mac OS.
5. Resized the file system to 20.4GB.
6. Rebooted off the dd'd USB stick of F17 TC2.
7. Chose the default "Replace Existing" installation type + Review partitions.

Fails after formatting, before the image is copied.

I have a dmesg, will attach.

# e2image -r /dev/mapper/vg_f17s/lv_root e2img_lvroot.img
e2image: Corrupt extent header while iterating over inode 20

The resulting .img file is 0 bytes, so there is nothing to attach.

Comment 20 Eric Sandeen 2012-05-03 14:54:04 UTC

# e2image -r /dev/mapper/vg_f17s/lv_root e2img_lvroot.img
e2image: Corrupt extent header while iterating over inode 20

Perhaps a dd of the first .... meg or so of /dev/mapper/vg_f17s/lv_root might offer a clue as to what's in there.

-Eric

Comment 21 Chris Murphy 2012-05-03 18:40:56 UTC

Created attachment 581928
dd first 1.5MB of lv_root

dd if=/dev/vg_f17s/lv_root of=lv_root.img bs=1536K count=1

Comment 22 Eric Sandeen 2012-05-03 19:22:13 UTC

It appears that the corruption in inode 20 is well beyond this, out at 265 megs or so:

debugfs:  stat <20>       
Inode: 20   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 480441713    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 38453248
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 75104
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4f9b5918:8a1e4594 -- Fri Apr 27 21:42:32 2012
 atime: 0x4f9b5906:670113ec -- Fri Apr 27 21:42:14 2012
 mtime: 0x4f9b5906:50da59c4 -- Fri Apr 27 21:42:14 2012
crtime: 0x4f9b574a:19c2a0e0 -- Fri Apr 27 21:34:50 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "system_u:object_r:rpm_var_lib_t:s0\000" (35)
EXTENTS:
debugfs:

hm, no extents?

debugfs:  dump_extents <20>
Level Entries       Logical          Physical Length Flags
 0/ 1   1/  1     0 -  9387   67959             9388
<no more>

There is another extent node out at physical block 67959, so if you want, you could attach the 67959th 4K block, or poke around with debugfs as above.
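To see why block 67959 sits "out at 265 megs or so", multiply the block number by the block size. A minimal sketch, assuming 4 KiB blocks (the same bs=4K used with dd in this thread); block_offset is a hypothetical helper:

```shell
# block_offset BLOCK [BLOCK_SIZE]
# Prints the byte offset of a file system block. Assumes 4096-byte blocks
# unless a size is given. (Hypothetical helper.)
block_offset() {
  echo $(( $1 * ${2:-4096} ))
}

offset=$(block_offset 67959)
# Integer division by 1048576 (1 MiB) rounds down.
echo "$offset bytes = $(( offset / 1048576 )) MiB"   # prints: 278360064 bytes = 265 MiB
```

With dd, skip=67959 together with bs=4K starts reading at that same byte offset.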
 
I was hoping to see "MAC WAS HERE" in ASCII or something but no luck so far.

Comment 23 Chris Murphy 2012-05-03 21:02:57 UTC

Created attachment 581964
dd 4K at 67959

dd if=/dev/vg_f17s/lv_root of=lv_root_67959.img bs=4K skip=67959 count=1

I'm not debugfs proficient. I can make this computer available by ssh if you want to poke around, just contact me offline.

Comment 24 Eric Sandeen 2012-05-03 23:15:33 UTC

I logged in with the info you gave me; not sure if you were on at the same time, but the /dev/vg_f17s/lv_root device disappeared while I was there.

Looking at dmesg, there were then lots of errors:

[ 7257.140183] SQUASHFS error: xz_dec_run error, data probably corrupt
[ 7257.140189] SQUASHFS error: squashfs_read_data failed to read block 0x25c6ec
[ 7257.140191] SQUASHFS error: Unable to read data cache entry [25c6ec]
[ 7257.140193] SQUASHFS error: Unable to read page, block 25c6ec, size 3e7c
[ 7257.140199] SQUASHFS error: Unable to read data cache entry [25c6ec]
[ 7257.140200] SQUASHFS error: Unable to read page, block 25c6ec, size 3e7c
[ 7257.140205] SQUASHFS error: Unable to read data cache entry [25c6ec]
[ 7257.140207] SQUASHFS error: Unable to read page, block 25c6ec, size 3e7c
[ 7257.140212] SQUASHFS error: Unable to read data cache entry [25c6ec]
[ 7257.140213] SQUASHFS error: Unable to read page, block 25c6ec, size 3e7c
[ 7257.140218] SQUASHFS error: Unable to read data cache entry [25c6ec]
[ 7257.140220] SQUASHFS error: Unable to read page, block 25c6ec, size 3e7c
[ 7257.140224] Buffer I/O error on device loop3, logical block 15904
....

Have you done a memtest on the box?  I hate to ask, but something seems pretty odd here.

Comment 25 Chris Murphy 2012-05-03 23:23:54 UTC

We may have bumped into each other; I decided to see whether the computer is still in a state that reproduces the problem, and it is. But I'm out of it now if you want to do more work.

As mentioned in my email, I remain suspicious of memory corruption when booted in EFI mode on Apple hardware, as this has been a problem manifesting itself in other ways, such as with nouveau, which so far appears to be fixed.

There is no memtest86 for EFI, as 16-bit mode isn't available in EFI mode. But I have run it extensively (days) in CSM-BIOS mode with no errors. And of course Apple supplies an EFI-based tool that does an extended memory test, as well as other hardware tests, and everything comes up clean.

Comment 26 Chris Murphy 2012-05-03 23:26:10 UTC

When you're done, if you want I can reboot and try to do another "Use All Space" and see what happens. In all previous attempts, Use All Space does not reproduce this bug, all other things equal.

Comment 27 Eric Sandeen 2012-05-03 23:49:39 UTC

and 

/dev/loop3: [1794]:3 (/LiveOS/ext3fs.img)

and that's the USB stick, no? Hrm.

If I look at the bad LVM image:

[root@f16s ~]# debugfs /dev/mapper/vg_f16s-lv_root 
debugfs 1.42 (29-Nov-2011)
debugfs:  stat <20>
Inode: 20   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 480441713    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 38453248
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 75104
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4f9b5918:8a1e4594 -- Fri Apr 27 22:42:32 2012
 atime: 0x4f9b5906:670113ec -- Fri Apr 27 22:42:14 2012
 mtime: 0x4f9b5906:50da59c4 -- Fri Apr 27 22:42:14 2012
crtime: 0x4f9b574a:19c2a0e0 -- Fri Apr 27 22:34:50 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "system_u:object_r:rpm_var_lib_t:s0\000" (35)
EXTENTS:
(ETB0):67959

there should be more extents there. On the original image:


[root@f16s ~]# debugfs /mnt/squashfs/LiveOS/ext3fs.img 
debugfs 1.42 (29-Nov-2011)
debugfs:  stat <20>
Inode: 20   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 480441713    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 38453248
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 75104
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x4f9b5918:8a1e4594 -- Fri Apr 27 22:42:32 2012
 atime: 0x4f9b5906:670113ec -- Fri Apr 27 22:42:14 2012
 mtime: 0x4f9b5906:50da59c4 -- Fri Apr 27 22:42:14 2012
crtime: 0x4f9b574a:19c2a0e0 -- Fri Apr 27 22:34:50 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "system_u:object_r:rpm_var_lib_t:s0\000" (35)
EXTENTS:
(ETB0):67959, (0):33280, (1):34304, (2):33281, (3-2047):65536-67580, (2048-4095):69632-71679, (4096-5885):110592-112381, (5887-6143):112383-112639, (6144-8191):159744-161791, (8192-9387):231424-232619

We start out the same, but this extent list is correct, i.e. the extent block at 67959 isn't corrupt.

I dd'd that block out of each, and it's completely different on the system vs. the original image:

# cmp -l orig.img lv_root_67959.img  | wc -l
4085

(I guess we got about 11 "lucky" bytes that matched ;)
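The comparison above (copy the same 4 KiB block out of each image with dd, then count differing bytes with cmp -l, which prints one line per differing byte) can be wrapped as follows. A minimal sketch; count_block_diffs is a hypothetical helper, and both images are assumed to contain the full block:

```shell
# count_block_diffs IMG_A IMG_B BLOCK [BLOCK_SIZE]
# Copies one block out of each image with dd, then counts differing bytes:
# cmp -l prints one line per differing byte. (Hypothetical helper.)
count_block_diffs() {
  img_a=$1; img_b=$2; blk=$3; bs=${4:-4096}
  workdir=$(mktemp -d)
  dd if="$img_a" of="$workdir/a" bs="$bs" skip="$blk" count=1 2>/dev/null
  dd if="$img_b" of="$workdir/b" bs="$bs" skip="$blk" count=1 2>/dev/null
  n=$(cmp -l "$workdir/a" "$workdir/b" | wc -l | tr -d ' ')
  rm -rf "$workdir"
  echo "$n differing, $(( bs - n )) matching"
}
```

For the block at 67959, 4085 differing bytes out of 4096 leaves 4096 - 4085 = 11 matching bytes, the "lucky" ones mentioned above.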

So if I understand correctly, at one point anaconda was doing a mkfs+fsck, and that failed with corruption.

In this case it seems to have gotten past that, but now the image which was copied onto the laptop's storage is _also_ corrupt....

For some reason the image on the hard drive is also lacking a journal.  Random corruption?

And when I look at the USB stick I get IO errors:

# cmp -l /dev/vg_f16s/lv_root /mnt/squashfs/LiveOS/ext3fs.img  | more
cmp:  1073 215 140
      1074  17 131
      1075 243 233
      1083   2   1
      1117  70  74
      1249   0  10
      1401 271 265
   1185537   0 200
   1185538   0 201
   1185544   0  10
   1185545   0 111
/mnt/squashfs/LiveOS/ext3fs.img: Input/output error

...

not sure what's going on here but IO errors from the install medium are not encouraging ....
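In cmp -l output, the first column is the 1-based byte offset in decimal and the other two columns are the differing byte values in octal. A minimal sketch, assuming 4 KiB blocks, that maps those offsets to 0-based block numbers; cmp_blocks is a hypothetical helper:

```shell
# cmp_blocks [BLOCK_SIZE]
# Reads `cmp -l` output on stdin. Each line is: 1-based byte offset
# (decimal), byte from file 1 (octal), byte from file 2 (octal).
# Prints the 0-based block number of each difference; cmp -l lists offsets
# in ascending order, so uniq collapses repeats. Lines that do not start
# with an offset are skipped. (Hypothetical helper.)
cmp_blocks() {
  awk -v bs="${1:-4096}" '$1 ~ /^[0-9]+$/ { print int(($1 - 1) / bs) }' | uniq
}
```

For example, `cmp -l /dev/vg_f16s/lv_root /mnt/squashfs/LiveOS/ext3fs.img | cmp_blocks`. For the offsets listed above, 1073 through 1401 fall in block 0 and 1185537 through 1185545 fall in block 289.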

Comment 28 Chris Murphy 2012-05-04 04:52:12 UTC

Installation type "Use All Space" = same error.

dd /dev/zero'd the first 21G of the target drive and tried again = same error.

Powered off, let the computer sit for 2 hours, tried again with "Use All Space" = same error.

So my best guess is an intermittent/failing USB stick. It's odd that out of 5 or 6 separate dd imaging attempts of this stick, only the first and the most recent manifested this bug.

Comment 29 Jesse Keating 2012-07-18 17:46:11 UTC

We can't really fix broken install sources :/  Looks like you figured out the root cause though.
