1196481 – 22 Alpha TC5 32-bit netinst/boot images fail to boot with many "cp will not create hard link" messages (bad product.img files?)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	lorax
Sub Component:
Version:	22
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Brian Lane
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	AcceptedBlocker
Depends On:
Blocks:	F22AlphaBlocker

Reported:	2015-02-26 03:43 UTC by Chris Murphy
Modified:	2015-02-28 05:18 UTC
CC List:	11 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-02-28 05:18:29 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
photo of failure (371.09 KB, image/jpeg) 2015-02-26 03:43 UTC, Chris Murphy

Description Chris Murphy 2015-02-26 03:43:33 UTC

Created attachment 995447
photo of failure

When booting
http://dl.fedoraproject.org/pub/alt/stage/22_Alpha_TC5/Workstation/i386/os/images/boot.iso

I get dozens of pages of dracut-initqueue cp errors (see the attached photo). The failure is a showstopper: I don't get a prompt or the installer.

This boot.iso was downloaded twice, and dd'd to two different USB sticks, and the media check didn't complain on either.

Comment 1 Chris Murphy 2015-02-26 03:45:27 UTC

Marking as AcceptedBlocker per "Complete failure of any release-blocking TC/RC image to boot at all under any circumstance - "DOA" image (conditional failure is not an automatic blocker)" 
http://fedoraproject.org/wiki/QA:SOP_blocker_bug_process#Automatic_blockers

Comment 2 Chris Murphy 2015-02-26 03:46:27 UTC

Looks like this is dracut-041-10.fc22.i686.rpm, but it hasn't changed since 2/19, so I don't know why I'm seeing this problem today. It's probably not actually a dracut bug; something else may be triggering this.

Comment 3 Adam Williamson 2015-02-26 04:19:58 UTC

'input/output error' usually screams bad media to me...but you did try two different sticks. do you see the same thing in a VM?

Comment 4 Adam Williamson 2015-02-26 04:22:42 UTC

BTW, I'd say one tester is a tad early to invoke 'automatic blocker', it's best to confirm at least one or two others see the same.

Comment 5 Adam Williamson 2015-02-26 04:28:15 UTC

Hum, yup, I see the same in a VM. interesting. let's check some of the other images...

it seems like it thinks it has an updates image to unpack or something?

Comment 6 Adam Williamson 2015-02-26 04:56:37 UTC

note: the 'boot.iso' and 'netinst' images for a given Flavor are the same - that is, https://dl.fedoraproject.org/pub/alt/stage/22_Alpha_TC5/Workstation/i386/os/images/boot.iso is the same thing as https://dl.fedoraproject.org/pub/alt/stage/22_Alpha_TC5/Workstation/i386/iso/Fedora-Workstation-netinst-i386-22_Alpha_TC5.iso , there is no difference between them.

This seems to affect both Server and Workstation i386 boot/netinst, but x86_64 is fine. No idea what the problem is, yet.

Comment 7 Adam Williamson 2015-02-26 05:04:41 UTC

So I have a *tentative* idea here. Compare:

https://dl.fedoraproject.org/pub/alt/stage/22_Alpha_TC5/Server/i386/os/images/
https://dl.fedoraproject.org/pub/alt/stage/22_Alpha_TC5/Server/x86_64/os/images/

note the 12-byte product.img file in the i386 dir. That doesn't look right. The x86_64 one is 1k, and when run through 'unxz', produces what looks like a valid CPIO archive (i.e. anaconda updates image). The i386 one doesn't unxz at all, xz says it's too small to be a valid archive.
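The check described above (does the file decompress, and does the result look like a CPIO archive?) can be sketched in Python with the standard `lzma` module. This is only an illustration, not something lorax or anaconda runs: the function name `check_xz_image` and its return strings are made up here, and the CPIO test assumes the "newc" format magic. For what it's worth, 12 bytes is exactly the size of an xz stream header (6-byte magic, 2 bytes of stream flags, 4-byte CRC32), which would be consistent with a file cut off right after the header, though nothing at this point confirms that.

```python
import lzma

# Size of an xz stream header: magic (6) + stream flags (2) + CRC32 (4).
XZ_STREAM_HEADER_SIZE = 12


def check_xz_image(data: bytes) -> str:
    """Classify a would-be xz-compressed CPIO image (hypothetical helper)."""
    decomp = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    try:
        payload = decomp.decompress(data)
    except lzma.LZMAError:
        # Bad magic or corrupt contents: not an xz stream at all.
        return "not xz"
    if not decomp.eof:
        # Input ended before the end-of-stream marker was reached.
        return "truncated"
    # "newc" CPIO archives start with the ASCII magic 070701 (or 070702
    # for the CRC variant); other CPIO formats are not checked here.
    if payload.startswith((b"070701", b"070702")):
        return "ok: cpio"
    return "ok: not cpio"
```

To use it on a real file, read it in binary mode and pass the bytes, e.g. `check_xz_image(open("product.img", "rb").read())`.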

Assuming this is the flavor branding stuff and it's baked into the boot/netinst images somehow or other, the apparently broken i386 one certainly seems like it could be causing the trouble. CCing the usual suspects.

Comment 8 Adam Williamson 2015-02-26 05:37:45 UTC

yeah, it looks like the product.img file *is* baked into the ISO - if you examine the ISO it's in the /images/ directory. So I'm guessing it's the source of the trouble here. Question now is probably how does it get generated and why's that going wrong. (perhaps anaconda shouldn't die on a bad updates/product image, but that seems like a minor issue.)

OK, so the product.img generation seems to be done by lorax:

https://github.com/rhinstaller/lorax/blob/master/README.product

based on whatever it finds in certain magic locations when it's run. The packages which put stuff in these magic locations appear to be the fedora-productimg-* packages.

Looking at recent changes there, I see a rather suspicious one:

http://pkgs.fedoraproject.org/cgit/fedora-productimg-workstation.git/commit/?id=ec76d8f49a3d8aacc40ca894f8c3174a2f0a8795

and the matching one for Server:

http://pkgs.fedoraproject.org/cgit/fedora-productimg-server.git/commit/?id=09f080c64e429cc363b1f5e6a5b4a2be08a9e301

the problem there is neither of them BuildRequires python-devel, so I don't think the install class files wind up in the right place. However, I don't see how this would cause a completely broken image file to be built for i386 but an apparently okay one to be built for x86_64, so I'm not sure it's the culprit here.

For now, assigning to lorax while I poke at it a bit more...

Comment 9 Adam Williamson 2015-02-26 06:03:34 UTC

The python-devel buildrequires issue is a real one, but I don't think it can be the cause of the 12-byte 32-bit product.img files. It *does* mean the install classes for Server and Workstation won't be used, though. I've filed that separately - https://bugzilla.redhat.com/show_bug.cgi?id=1196504 - and have an update on the way to fix it.

Comment 10 Chris Murphy 2015-02-28 02:37:48 UTC

Server boot.iso TC7 i686 is working for me now. This problem appears to be fixed.

Comment 11 Adam Williamson 2015-02-28 05:18:29 UTC

ah, yes, sorry, I didn't get around to updating this bug with details of our further investigation.

bcl found out that the file got truncated because xz crashed: https://bugzilla.redhat.com/show_bug.cgi?id=1196786 . it turns out that xz crashed because we ask it to do maximum compression using one thread per CPU, and that combination of options turns out to need 10GB(!!) of RAM, which is not possible for a 32-bit executable. so, just on i686, it explodes and we get truncated files.

as a short-term fix we've just made releng's compose scripts include a memory usage limit for xz so it'll never use more than 3700MiB of RAM on any arch. for a medium-term fix, bcl may add a workaround to lorax so it restricts xz's memory use on i686.
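A rough sketch of what such a capped invocation might look like, in Python. The actual releng change is not shown in this bug, so the exact options are an assumption: `-9`, `--threads=0` (one thread per core) and `--memlimit-compress` are real xz options, but `build_xz_command` is a hypothetical helper and the flag combination is only a guess at the intent described above.

```python
from typing import List

# 32-bit processes can address at most 4 GiB = 4096 MiB of memory.
ADDRESS_SPACE_32BIT_MIB = 4096


def build_xz_command(path: str, memlimit_mib: int = 3700) -> List[str]:
    """Build an xz argv with a compression memory cap (hypothetical helper)."""
    if memlimit_mib >= ADDRESS_SPACE_32BIT_MIB:
        # A cap at or above 4 GiB would not protect a 32-bit xz.
        raise ValueError("limit does not fit in a 32-bit address space")
    return [
        "xz",
        "-9",                                      # maximum preset
        "--threads=0",                             # one thread per CPU core
        f"--memlimit-compress={memlimit_mib}MiB",  # cap memory used to compress
        path,
    ]
```

The resulting list could be handed to `subprocess.run(..., check=True)`, so that a crashing xz fails the compose instead of silently leaving a truncated file behind.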

a 'proper' fix would probably be to make xz calculate 4GB as the top limit of 'available' memory when running as a 32-bit executable; then we could use an option that limits memory use to a percentage of 'available' memory on all platforms, and trust that it would DTRT. that's being tracked in #1196786.
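The arithmetic behind that idea can be sketched as follows. This is not xz code: `effective_memlimit` is a hypothetical function that only illustrates clamping 'available' memory to 4 GiB for a 32-bit process before applying a percentage limit.

```python
GIB = 1024 ** 3


def effective_memlimit(physical_bytes: int, percent: int, is_32bit: bool) -> int:
    """Return a memory limit in bytes as a percentage of usable memory.

    Hypothetical illustration: a 32-bit process treats at most 4 GiB as
    'available', however much physical RAM the machine has.
    """
    available = min(physical_bytes, 4 * GIB) if is_32bit else physical_bytes
    return available * percent // 100
```

On a 16 GiB machine with a 50% limit, this gives 8 GiB for a 64-bit process but 2 GiB for a 32-bit one, so the same option would be safe on both.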

given that the short-term fix for this is in the releng scripts it doesn't require any kind of package push, so we can mark this as FIXED for now. hopefully bcl and dgilmore are keeping track of things so we can drop releng's workaround if bcl adjusts lorax, and then adjust the lorax fix when upstream improves.
