Bug 1128341 - aarch64: anaconda in VM dies with "SystemError: Could not determine system architecture." because blivet assumes aarch64 always has EFI
Summary: aarch64: anaconda in VM dies with "SystemError: Could not determine system ar...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: python-blivet
Version: rawhide
Hardware: aarch64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: David Lehman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1166876 (view as bug list)
Depends On:
Blocks: ARM64, F-ExcludeArch-aarch64
 
Reported: 2014-08-09 09:07 UTC by Richard W.M. Jones
Modified: 2016-09-14 15:30 UTC (History)
CC List: 23 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-12-06 21:04:11 UTC



Description Richard W.M. Jones 2014-08-09 09:07:33 UTC
Description of problem:


Starting installer, one moment...
Traceback (most recent call last):
  File "/sbin/anaconda", line 913, in <module>
    from pyanaconda import geoloc
  File "/usr/lib64/python2.7/site-packages/pyanaconda/geoloc.py", line 109, in <module>
    from pyanaconda import network
  File "/usr/lib64/python2.7/site-packages/pyanaconda/network.py", line 40, in <module>
    from blivet.devices import FcoeDiskDevice, iScsiDiskDevice
  File "/usr/lib/python2.7/site-packages/blivet/__init__.py", line 75, in <module>
    from .devices import BTRFSDevice, BTRFSSubVolumeDevice, BTRFSVolumeDevice, DirectoryDevice, FileDevice, LVMLogicalVolumeDevice, LVMThinLogicalVolumeDevice, , NFSDevice, NoDevice, OpticalDevice, PartitionDevice, TmpFSDevice, devicePathToName
  File "/usr/lib/python2.7/site-packages/blivet/devices.py", line 51, in <module>
    from .formats import get_device_format_class, getFormat, DeviceFormat
  File "/usr/lib/python2.7/site-packages/blivet/formats/__init__.py", line 508, in <module>
    collect_device_format_classes()
  File "/usr/lib/python2.7/site-packages/blivet/formats/__init__.py", line 108, in collect_device_format_classes
    globals()[mod_name] = __import__(mod_name, globals(), locals(), [], -1)
  File "/usr/lib/python2.7/site-packages/blivet/formats/biosboot.py", line 26, in <module>
    from .. import platform
  File "/usr/lib/python2.7/site-packages/blivet/platform.py", line 442, in <module>
    platform = getPlatform()
  File "/usr/lib/python2.7/site-packages/blivet/platform.py", line 440, in getPlatform
    raise SystemError("Could not determine system architecture.")
SystemError: Could not determine system architecture.

Note on this machine, uname output is:

Linux mustang.home.annexia.org 3.16.0-0.rc7.git4.1.rwmj4.fc22.aarch64 #1 SMP Sun Aug 3 00:26:40 BST 2014 aarch64 aarch64 aarch64 GNU/Linux

Version-Release number of selected component (if applicable):

Anaconda in Fedora 21

How reproducible:

100%

Steps to Reproduce:
1. Try to install Fedora into an aarch64 virtual machine.

Comment 1 Richard W.M. Jones 2014-08-09 12:43:07 UTC
I was looking through the source of blivet, which is where the
bug happens (not Anaconda), and it seems that blivet assumes
that aarch64 is always EFI-based, i.e. from blivet/platform.py:

...
    elif arch.isEfi():
        if arch.isMactel():
            return MacEFI()
        elif arch.isAARCH64():
            return Aarch64EFI()
        else:
            return EFI()
    elif arch.isX86():
        return X86()
    elif arch.isARM():
        armMachine = arch.getARMMachine()
        if armMachine == "omap":
            return omapARM()
        else:
            return ARM()
    else:
        raise SystemError("Could not determine system architecture.")

I'm installing to a VM, and currently (and for the foreseeable future)
VMs won't have EFI.  We will eventually add it, but it's going to be
quite complicated.

So anyway I believe this is an assumption in Blivet which is only
correct for baremetal, not for VMs.

(By the way I could not find the right component for Blivet, so I'm
leaving this as Anaconda for now).

Comment 2 Richard W.M. Jones 2014-08-09 20:19:07 UTC
Please note I am currently looking at how hard it would be to
get aarch64 VMs to use EFI.  If that can be done relatively
easily, then fixing this bug would not be necessary.  Will get
back on this point.

Comment 3 Cole Robinson 2014-11-21 21:25:05 UTC
*** Bug 1166876 has been marked as a duplicate of this bug. ***

Comment 4 Cole Robinson 2014-11-21 21:27:37 UTC
I hit the same issue (see my duped bug), however I was passing UEFI/AAVMF roms to the VM. I just don't know if they are loaded when qemu boots off -kernel.

This blivet check is just a minor bit, though. I'm guessing that even if we taught it that aarch64 isn't strictly _always_ on uefi, anaconda wouldn't set up the correct post-install boot environment if it doesn't detect UEFI (just a guess). So figuring out what's going on at the qemu/kernel level is probably necessary.

Comment 5 Brian Lane 2014-11-21 22:33:30 UTC
(In reply to Cole Robinson from comment #4)
> I hit the same issue (see my duped bug), however I was passing UEFI/AAVMF
> roms to the VM. I just don't know if they are loaded when qemu boots off
> -kernel.
> 
> This blivet check is just a minor bit though, I'm guessing even if we taught
> it that aarch64 isn't strictly _always_ on uefi, anaconda wouldn't set up
> the correct post install boot environment if it doesn't detect UEFI (just a
> guess). So figuring out what's going on at the qemu/kernel level is probably
> necessary

Is there a qemu/virtinstall/libvirt bug for this? If not, please open one with all the details.

Comment 6 Laszlo Ersek 2014-11-22 09:50:41 UTC
Here's some background on how the '-kernel' qemu option works, for the x86_64 (-M pc) target and the aarch64 (-M virt) target.

(1) For x86_64, the -kernel flag causes qemu to load the kernel image, massage it a bit, and expose it under a number of fw_cfg keys. Then, it is the responsibility of the boot firmware to look for these fw_cfg keys, and to download & dispatch the kernel if it is available.

Both SeaBIOS and OVMF implement this. Differently, of course, but both firmwares handle it.

In qemu, see

pc_memory_init() [hw/i386/pc.c]
  load_linux()
    load_multiboot() [hw/i386/multiboot.c]
      fw_cfg_add_bytes( ... FW_CFG_KERNEL_DATA ... )
    fw_cfg_add_bytes( ... FW_CFG_KERNEL_DATA ... )

load_linux() either calls load_multiboot() or it doesn't, based on the image format; but in either case, FW_CFG_KERNEL_DATA is populated.

In addition, "pc-bios/optionrom/linuxboot.S" in qemu provides a minimal boot loader (basically just an init trampoline) that gets compiled into an option ROM. (See qemu commit 57a46d05.)

When SeaBIOS is used as firmware, it dispatches this minimal option ROM. The option ROM downloads the kernel using fw_cfg, and jumps to it.

When OVMF is used as firmware, the minimal option ROM is ignored. Instead, OVMF looks for the fw_cfg keys in question directly, and downloads the kernel, fixes it up, and jumps to it. See "OvmfPkg/Library/PlatformBdsLib/QemuKernel.c" and "OvmfPkg/Library/LoadLinuxLib".

Note that in the OVMF case, the kernel loaded thusly does run in a full UEFI environment, where the runtime services et al are available.

(2) In case of the aarch64 target, '-kernel' works differently.

machvirt_init() [hw/arm/virt.c]
  arm_load_kernel() [hw/arm/boot.c]
    write_bootloader()

If the -kernel option was not used, then write_bootloader() is not reached, and execution will simply start at address 0, which is where the UEFI binary resides normally.

If -kernel was used, then it is loaded into guest RAM, and a minimal boot loader is generated in qemu dynamically, in ARM machine code (see "bootloader_aarch64" and write_bootloader()). The kernel is placed at loader_start = vbi->memmap[VIRT_MEM].base, which is 0x40000000.

When -kernel is used, the flash contents will simply not be executed (although the flash contents are correct).

Note that an arm64 guest kernel loaded this way cannot be used for installing a guest. Since the UEFI blob is never launched, the guest kernel loaded with '-kernel' will have no access to UEFI runtime services, and -- to name just one thing -- it won't be able to configure boot options with "efibootmgr" for the installed guest. In other words, you can install a UEFI guest only when the installer kernel and Anaconda etc. already run in a full-blown UEFI environment. For this reason "blivet" is right to reject the environment that it finds itself in. (== Cole was right in comment 4.)

The solution to this is to:

- rework the arm code in qemu similarly to the i386 pc code (that is, always jump to the firmware, and expose the kernel to the firmware only as "data" initially, be it through a future fw_cfg mechanism that's appropriate for arm, or place it in guest RAM and reference it from the DTB),

- *and* ArmVirtualizationQemu (== the arm32/arm64 counterpart of OVMF in edk2) needs to reuse (or clone) OVMF's QemuKernel.c and LoadLinuxLib features.

Which translates to:
- no BZ for blivet,
- at least one BZ for (upstream) qemu,
- at least one BZ for (upstream) ArmVirtualizationQemu in edk2.

This feature is not small. What do we need it for? I'm aware of two possible use cases:
- quick guest kernel development cycle facilitated by the '-kernel' option of qemu (you just rebuild the kernel on the host, and boot the guest directly with it, without having to copy it to a guest disk, updating grub.cfg in the guest, and so on)
- guest installation from URL (ie. --location http://...) with virt-install.
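The two boot paths described above for the aarch64 '-M virt' target can be seen in the qemu invocation itself. These command lines are illustrative sketches; the firmware and image file names are assumptions, not taken from this bug:

```shell
# Firmware boot: execution starts at address 0, where the pflash
# mapping holds the AAVMF/edk2 UEFI binary. The guest kernel is then
# started by UEFI and has runtime services (efibootmgr works).
qemu-system-aarch64 -M virt -cpu cortex-a57 -m 2048 \
    -drive if=pflash,format=raw,readonly=on,file=AAVMF_CODE.fd \
    -drive if=pflash,format=raw,file=AAVMF_VARS.fd \
    -drive file=guest.img,if=virtio

# Direct kernel boot: qemu writes a tiny bootloader into guest RAM and
# jumps straight to the kernel; the flash UEFI image is never executed,
# so the guest has no UEFI runtime services -- the situation blivet
# correctly rejects during installation.
qemu-system-aarch64 -M virt -cpu cortex-a57 -m 2048 \
    -kernel vmlinuz -initrd initrd.img \
    -append "console=ttyAMA0" \
    -drive file=guest.img,if=virtio
```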

Comment 7 Richard W.M. Jones 2014-11-22 10:02:02 UTC
(In reply to Laszlo Ersek from comment #6)
> This feature is not small. What do we need it for? I'm aware of two possible
> use cases:
> - quick guest kernel development cycle facilitated by the '-kernel' option
> of qemu (you just rebuild the kernel on the host, and boot the guest
> directly with it, without having to copy it to a guest disk, updating
> grub.cfg in the guest, and so on)
> - guest installation from URL (ie. --location http://...) with virt-install.

On x86, -kernel is used in three places in our "stack" that I'm aware of:

(1) virt-install uses it in order to implement the --location option, which
is what you mention above.

(2) libguestfs uses it to boot /boot/vmlinuz instead of having to build a
disk image containing the host kernel.  As you mentioned above, but it's
not a development option; it's how we work.

(3) It's a valid way to configure guest VMs, using the <kernel> directive
in libvirt.  This is sometimes used as a way to get around bootloader
problems in the guest, eg. if grub is broken in the guest or you can't
install a bootloader for some reason, it's convenient to pull out the
guest kernel [virt-builder --get-kernel], modify the libvirt config,
and have a working guest again.

Of these only (1) impacts blivet / anaconda / installation of VMs.  (1) is
seriously useful, but I didn't realize it would be so awkward to implement
on ARM ..
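Use case (3) corresponds to libvirt's direct kernel boot elements inside the domain XML's <os> block. This fragment is a sketch; the file paths and command line are assumptions for illustration:

```xml
<os>
  <type arch='aarch64' machine='virt'>hvm</type>
  <!-- Direct kernel boot: bypasses the guest's own bootloader,
       which is useful when grub inside the guest image is broken
       or no bootloader could be installed. -->
  <kernel>/var/lib/libvirt/boot/vmlinuz</kernel>
  <initrd>/var/lib/libvirt/boot/initrd.img</initrd>
  <cmdline>console=ttyAMA0 root=/dev/vda3</cmdline>
</os>
```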

Comment 8 David Lehman 2014-12-02 21:11:40 UTC
Fixing this in blivet will only get you so far because the platform class used for aarch64 is a subclass of the EFI platform class. It will require either making the VMs look more realistic, or having someone implement a blivet platform (and anaconda bootloader) class to match this vm-only reality.

Comment 9 David Lehman 2014-12-02 21:12:26 UTC
Personally, I don't think this bug should be assigned to blivet, but I don't know where else to assign it.

Comment 10 Cole Robinson 2014-12-02 22:51:27 UTC
Laszlo has done some work towards fixing it in qemu and edk2/aavmf, so let's move it to qemu

Comment 11 Laszlo Ersek 2014-12-06 21:04:11 UTC
So, after analyzing this bug to death, I'm moving it back to python-blivet and closing it as NOTABUG at once. "python-blivet" is not at fault.

Re comment 10, I prefer not to simply redirect the BZ to another component. We have covered a lot of ground in the comments above, and the objective has significantly diverged from the original bug report (see comment #0).

Virtual UEFI firmware is now available for aarch64 guests. If issues remain that block Fedora 21 guest installation:

https://fedoraproject.org/wiki/Architectures/AArch64/Install_with_QEMU

then those are related to:

- shim (trying to load "grubx64.efi") eg. in case of PXE installation,
- bad installer media / lorax (no bootable ElTorito image), eg. in case of
  virtio-scsi ISO installation,
- qemu (failure to combine guest UEFI with -kernel / -initrd / -append boot),
  eg. in case of URL installation (virt-install --location),
- and potentially other packages.

Let's let this python-blivet BZ die in peace.

Regarding URL installation specifically, please open a brand new BZ for the qemu component, with the following title:

  qemu-system-aarch64: support "-drive if=pflash" and "-kernel" simultaneously

Please open such a BZ for *each* Product that wishes to track this feature.

(The most recent upstream posting for this feature is at
<http://thread.gmane.org/gmane.comp.emulators.qemu/309428>. The text of the cover letter can be reused as comment#0 for these new qemu BZs)

As for libguestfs (comment #7 points (2) and (3)), those use cases don't need guest UEFI; they already work nicely with the traditional "-kernel" option of qemu (no guest UEFI needs to be passed with "-drive if=pflash").

Comment 14 Richard W.M. Jones 2015-02-04 14:58:05 UTC
Is there an actual resolution to this bug? It's still happening
even though I've passed the UEFI ROMs to the guest.

Comment 15 Richard W.M. Jones 2015-02-04 15:43:25 UTC
(In reply to Richard W.M. Jones from comment #14)
> Is there an actual resolution to the bug, because it's still happening
> even though I've passed the UEFI ROMs to the guest.

It turns out the version of qemu in Rawhide isn't new enough
to activate UEFI in a guest.  I've backported Laszlo's UEFI
patches to qemu and will update it shortly.

Comment 16 Aaron Sowry 2016-09-13 10:11:25 UTC
> Let's leave this python-blivet BZ die in peace.

Apologies for resurrecting such an old bug, but...

Why is it correct to assume aarch64 should always use UEFI? This is not always the case.

Comment 17 Peter Robinson 2016-09-13 10:17:20 UTC
> Why is it correct to assume aarch64 should always use UEFI? This is not
> always the case.

Well that depends:
* blivet now (as of F-24+) has support for msdos as well as GPT partition tables.
* u-boot supports (as of 2016.05) uEFI boot and uEFI services emulation, and that's the way we intend to support Fedora on various SBBs
* VMs on aarch64 will always use the tianocore uEFI firmware so in the context of this bug it's fixed

Comment 18 Aaron Sowry 2016-09-13 10:41:55 UTC
> * u-boot supports (as of 2016.05) uEFI boot and uEFI services emulation and that's the way we intend on supporting Fedora on various SBBs

Hmm, I did not know that. Still doesn't stop non-EFI aarch64 from being a valid platform, though. But if Fedora's policy is to only support EFI-based aarch64 architectures then at least I know that now, thanks for the response.

Comment 19 Peter Robinson 2016-09-13 10:46:11 UTC
> Hmm, I did not know that. Still doesn't stop non-EFI aarch64 from being a
> valid platform, though. But if Fedora's policy is to only support EFI-based
> aarch64 architectures then at least I know that now, thanks for the response.

Support and work are two different things. We have limited resources, so we need to focus, especially on low-level things like boot paths, on what gives us the largest number of supported platforms for the effort (i.e. the best bang for our buck). We don't explicitly exclude anything, and if people wish to provide patches for other methods I'll happily review them.

Comment 20 Aaron Sowry 2016-09-13 10:53:47 UTC
Alright, I'll have a look into the different options. Cheers.

Comment 21 Andrew Jones 2016-09-14 15:30:31 UTC
This bug just crossed my path and reminded me of bug 1267667, which I wrote long, long ago but has never received any comments...

