Bug 1347291 - Booting from Windows 10 entry ends with 'relocation failed' error
Summary: Booting from Windows 10 entry ends with 'relocation failed' error
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: grub2   
(Show other bugs)
Version: 25
Hardware: Unspecified Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Peter Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://fedoraproject.org/wiki/Common...
Keywords: CommonBugs, Patch, Reopened
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-06-16 12:57 UTC by Petr Schindler
Modified: 2017-04-05 00:25 UTC (History)
27 users (show)

Fixed In Version: grub2-2.02-0.36.fc25 grub2-2.02-0.37.fc25
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-04-05 00:25:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
relocation failed error with debug activated (9.54 MB, application/x-xz)
2016-06-27 16:11 UTC, Juan Orti
no flags Details
screenshots of debug with Asus P8P67-Pro motherboard (8.12 MB, application/x-gzip)
2016-08-13 03:38 UTC, Andy Wang
no flags Details
Relocation Failed error with debug information Intel Board (855.97 KB, image/jpeg)
2016-08-29 11:28 UTC, veldur
no flags Details
dbg01-e7240 (288.55 KB, image/png)
2016-11-20 17:54 UTC, Bogdan Costescu
no flags Details
[PATCH] efi/chainloader: fix wrong sanity check in relocate_coff() (1.45 KB, patch)
2016-11-21 19:26 UTC, Laszlo Ersek
no flags Details | Diff
[PATCH] efi/chainloader: truncate overlong relocation section (4.38 KB, patch)
2016-11-23 06:18 UTC, Laszlo Ersek
no flags Details | Diff


External Trackers
Tracker ID Priority Status Summary Last Updated
Red Hat Bugzilla 1320273 None None None Never

Description Petr Schindler 2016-06-16 12:57:14 UTC
Description of problem:
I installed Windows 10 on my computer (in UEFI). During installation I left some space for Fedora installation. Then I installed Fedora to empty space (also in UEFI). After reboot there is Windows entry in grub, but when I try to boot from it it only shows message: error: relocation failed. Windows can be booted from boot menu with windows efi entry.

In windows grub entry there is:
chainload /EFI/Microsoft/Boot/bootmgfw.efi which seems to be right.

Version-Release number of selected component (if applicable):
grub2-2.02-0.34.fc24.x86_64
I used Workstation Live RC 1.2

How reproducible:
We tested this twice and had the same problem both times

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I propose this as a blocker as it violates the criterion: "The installer must be able to install into free space alongside an existing clean Windows installation and install a bootloader which can boot into both Windows and Fedora."

Comment 1 Petr Schindler 2016-06-16 12:58:27 UTC
I forgot to mention that this happens only on one of two computers where I tested this scenario. On the second machine there I got bug 1347273

Comment 2 Adam Williamson 2016-06-16 15:27:53 UTC
I saw something similar on my bare metal test box, but it worked fine in a KVM, and we do have users reporting success in https://bugzilla.redhat.com/show_bug.cgi?id=1320273 (i.e. they didn't hit this after that was fixed), so it does seem like this works at least *some* of the time.

On my test box the first attempt to boot Windows just immediately loops back to grub (the screen flashes briefly, basically), the *second* and subsequent attempts show the "relocation failed" error. If I try to boot Fedora after that, I see a "relocation failed" error, but then Fedora boots.

The installs on that box were to a hardware RAID set, I guess I can re-test on a regular disk.

Comment 3 Adam Williamson 2016-06-16 15:56:21 UTC
Sebastian Grill (CCed) reports something similar in the comments to 1320273:

"With grub2-2.02-0.34.fc24 i get a 'relocation failed' error upon trying to boot Win 10."

Comment 4 Pete Walter 2016-06-16 17:19:00 UTC
I'm getting the exact same error here with RC 1.2 when trying to boot a Windows 10 that's set up for dual boot.

Comment 5 Striker Leggette 2016-06-16 17:39:09 UTC
I know this was reported against W10, but I attempted F24RC1-2 with W2k12 dual boot in KVM and couldn't reproduce.

Comment 6 Geoffrey Marr 2016-06-16 18:43:14 UTC
Discussed during the 2016-06-16 Go/No-Go #2 meeting: [1]

The decision was made to not classify this bug as a blocker as there have been several reports of this working OK (50/50) and the firmware boot selector can still be used to boot the correct OS.

[1] https://meetbot.fedoraproject.org/fedora-meeting-2/2016-06-16/f24-final-go_no_go-meeting.2016-06-16-17.00.txt

Comment 7 Sebastian Grill 2016-06-17 06:28:50 UTC
Thanks for the CC. If there's any info i can provide that is of interest to you, please let me know.

Comment 8 Adam Williamson 2016-06-17 06:31:52 UTC
sebastian: could you let us know what system you're testing on - i.e. the model? If it's a self-build, what's the motherboard? Thanks!

Comment 9 Sebastian Grill 2016-06-17 06:51:45 UTC
Yes it's self built.

Output of sudo lshw -short minus some HIDs below.

H/W path           Device      Class          Description
=========================================================
                               system         System Product Name (To be filled 
/0                             bus            P8Z68-V GEN3
/0/0                           memory         64KiB BIOS
/0/4                           processor      Intel(R) Core(TM) i5-2500K CPU @ 3
/0/4/5                         memory         256KiB L1 cache
/0/4/6                         memory         1MiB L2 cache
/0/4/7                         memory         6MiB L3 cache
/0/24                          memory         16GiB System Memory
/0/24/0                        memory         4GiB DIMM DDR3 Synchronous 1600 MH
/0/24/1                        memory         4GiB DIMM DDR3 Synchronous 1600 MH
/0/24/2                        memory         4GiB DIMM DDR3 Synchronous 1600 MH
/0/24/3                        memory         4GiB DIMM DDR3 Synchronous 1600 MH
/0/100                         bridge         2nd Generation Core Processor Fami
/0/100/1                       bridge         Xeon E3-1200/2nd Generation Core P
/0/100/1/0                     display        Pitcairn XT [Radeon HD 7870 GHz Ed
/0/100/1/0.1                   multimedia     Cape Verde/Pitcairn HDMI Audio [Ra
/0/100/16                      communication  6 Series/C200 Series Chipset Famil
/0/100/19          eno1        network        82579V Gigabit Network Connection
/0/100/1a                      bus            6 Series/C200 Series Chipset Famil
/0/100/1a/1        usb1        bus            EHCI Host Controller
/0/100/1a/1/1                  bus            Integrated Rate Matching Hub
/0/100/1b                      multimedia     6 Series/C200 Series Chipset Famil
/0/100/1c                      bridge         6 Series/C200 Series Chipset Famil
/0/100/1c.1                    bridge         6 Series/C200 Series Chipset Famil
/0/100/1c.1/0                  bus            ASM1042 SuperSpeed USB Host Contro
/0/100/1c.1/0/0    usb4        bus            xHCI Host Controller
/0/100/1c.1/0/1    usb3        bus            xHCI Host Controller
/0/100/1c.2                    bridge         6 Series/C200 Series Chipset Famil
/0/100/1c.3                    bridge         6 Series/C200 Series Chipset Famil
/0/100/1c.3/0                  storage        JMB362 SATA Controller
/0/100/1c.4                    bridge         6 Series/C200 Series Chipset Famil
/0/100/1c.4/0                  bus            ASM1042 SuperSpeed USB Host Contro
/0/100/1c.4/0/0    usb6        bus            xHCI Host Controller
/0/100/1c.4/0/0/2              bus            USB3.0 Hub
/0/100/1c.4/0/1    usb5        bus            xHCI Host Controller
/0/100/1c.4/0/1/2              bus            USB2.0 Hub
/0/100/1c.6                    bridge         82801 PCI Bridge
/0/100/1c.6/0                  bridge         ASM1083/1085 PCIe to PCI Bridge
/0/100/1c.7                    bridge         6 Series/C200 Series Chipset Famil
/0/100/1d                      bus            6 Series/C200 Series Chipset Famil
/0/100/1d/1        usb2        bus            EHCI Host Controller
/0/100/1d/1/1                  bus            Integrated Rate Matching Hub
/0/100/1f                      bridge         Z68 Express Chipset Family LPC Con
/0/100/1f.2                    storage        6 Series/C200 Series Chipset Famil
/0/100/1f.3                    bus            6 Series/C200 Series Chipset Famil
/0/1               scsi0       storage        
/0/1/0.0.0         /dev/sda    disk           128GB M4-CT128M4SSD2
/0/1/0.0.0/2       /dev/sda2   volume         199MiB Windows FAT volume
/0/1/0.0.0/3       /dev/sda3   volume         500MiB EXT4 volume
/0/1/0.0.0/4       /dev/sda4   volume         118GiB LVM Physical Volume
/0/2               scsi1       storage        
/0/2/0.0.0         /dev/sdb    disk           500GB Samsung SSD 850
/0/2/0.0.0/1       /dev/sdb1   volume         127MiB reserved partition
/0/2/0.0.0/2       /dev/sdb2   volume         464GiB Windows NTFS volume
/0/2/0.0.0/3       /dev/sdb3   volume         259MiB Windows FAT volume
/0/2/0.0.0/4       /dev/sdb4   volume         449MiB Windows NTFS volume
/0/2/0.0.0/5       /dev/sdb5   volume         449MiB Windows NTFS volume
/0/3               scsi2       storage        
/0/3/0.0.0         /dev/sdc    disk           6001GB WDC WD60EZRX-00M
/0/3/0.0.0/1       /dev/sdc1   volume         5589GiB Windows NTFS volume
/0/5               scsi4       storage        
/0/5/0.0.0         /dev/cdrom  disk           BD-RE  BH10LS38
/1                             power          To Be Filled By O.E.M.
/2                             power          To Be Filled By O.E.M.
/3                 virbr0      network        Ethernet interface
/4                 virbr0-nic  network        Ethernet interface

Comment 10 Sebastian Grill 2016-06-17 09:11:48 UTC
Sorry, i just noticed the MB make is a bit not-so-obvious in that report. It's an ASUS P8Z68-V/GEN3.

Comment 11 Juan Orti 2016-06-18 18:13:22 UTC
I also get this error with a motherboard Asus P8Z68-V LE, with latest firmware version 4102.

Comment 12 Kamil Páral 2016-06-21 12:53:45 UTC
I tested this on hardware from comment 0. It seems to affect only UEFI, in BIOS mode I see no error. Booting directly from UEFI menu works, and downgrading grub2* to grub2-2.02-0.25.fc23 (F23 version) also fixes the problem.

Comment 13 Dmitry Burstein 2016-06-24 17:42:42 UTC
I have the same problem on Asus P8Z68 Deluxe BIOS 3603 (the last one). Can boot with F8. Upgraded Fedora with dnf system upgrade.

Comment 14 Dmitry Burstein 2016-06-24 17:46:20 UTC
(In reply to Dmitry Burstein from comment #13)
> I have the same problem on Asus P8Z68 Deluxe BIOS 3603 (the last one). Can
> boot with F8. Upgraded Fedora with dnf system upgrade.

Forgot to mention I have Win 8.1, not 10.

Comment 15 Andy Wang 2016-06-25 04:19:10 UTC
I'm seeing the same problem with windows 10 on an Asus P8P67 PRO

Comment 16 Peter Jones 2016-06-27 15:47:49 UTC
Actually, can somebody add this to the top of their config file and then post a log of the failure:

set debug=secureboot,chain,linuxefi

Thanks.

Comment 17 Juan Orti 2016-06-27 16:11 UTC
Created attachment 1172975 [details]
relocation failed error with debug activated

Comment 18 Sorin Mihai Oprea 2016-06-30 11:44:10 UTC
Issue confirmed with Asus G74Sx.

Comment 19 Andy Wang 2016-08-13 03:38 UTC
Created attachment 1190596 [details]
screenshots of debug with Asus P8P67-Pro motherboard

Attached screenshots from the debug on my Asus P8P67 Pro motherboard.

Comment 20 Dmitry Burstein 2016-08-27 20:40:52 UTC
I noticed that if in the boot menu I scroll down to the Windows line, press "c" to enter the command line mode and upon entering it immediately type "exit", the computer boots into Windows with no problem.
I'm using it as an alternative workaround to engaging EFI's own boot sequence, but may be it will give any idea about the nature of the bug.

Comment 21 veldur 2016-08-29 11:28 UTC
Created attachment 1195332 [details]
Relocation Failed error with debug information Intel Board

Mine too getting same error in Grub2 fedora24 when booting windows entries. For me, EFI Boot Manager itself got replace with Grub2 entries. So I can't able to boot windows using EFI Boot section too.

efibootmgr
Timeout: 1 seconds
BootOrder: 0005,0006,0002,0000,0001,0003,0004,0007
Boot0000* CD/DVD Drive
Boot0001* Hard Drive
Boot0002* Fedora
Boot0003* Network Card
Boot0004* UEFI: ST500DM002-******
Boot0005* Fedora
Boot0006* Linux
Boot0007* UEFI: ST500DM002-******

Comment 22 Seth Jennings 2016-10-06 02:53:48 UTC
Happening on an ASRock Z75 Pro3 motherboard.  Reverted to f23 grub-efi as a workaround.

Comment 23 Seth Jennings 2016-10-06 02:59:59 UTC
Sorry for the noise but might have some additional information:

I've isolated to -26 of grub2 which is:

* Fri Mar 04 2016 Peter Jones <pjones@redhat.com> - 2.02-0.26
- Rebased to newer upstream (grub-2.02-beta3) for fedora-24

The error I get with -26 from my UEFI is "error: out of memory".  At some later version, between -26 and -34, the error isn't displayed but I just get a blank screen after selecting Windows from the grub menu.

Comment 24 Andy Wang 2016-10-08 01:16:11 UTC
I'm not sure if this is a universal solution (I don't understand the UEFI boot failover logic) but I converged the workaround in comment 20 to a more "permanent" solution.

In /boot/efi/EFI/fedora I created a custom.cfg:
menuentry "Windows 10 workaround" {
  exit
}

That entry causes the grub boot loader to quit, and at least with my motherboard, the UEFI firmware continues to the next EFI boot loader, which I've configured to be the Windows 10 boot loader, and it boots windows.

Comment 25 Laszlo Ersek 2016-10-12 19:20:08 UTC
I grepped grub, shim, and upstream edk2 for "relocation failed". No hits. I also grepped their git histories for the message (-S). No hits again.

Based on comment 2 (--> the message is printed both when booting Windows and when booting Fedora, under the "right" circumstances), I must think the message comes from a subset of proprietary UEFI firmwares. Presumably, this family of firmware requires, correctly or incorrectly, such things from the image to be launched that grub's chain-loader does not satisfy.

Comment 26 Adam Williamson 2016-10-12 23:03:11 UTC
Fedora's grub build is not entirely a stock one. The 'relocation failed' error comes from this patch:

http://pkgs.fedoraproject.org/cgit/rpms/grub2.git/tree/0070-Add-secureboot-support-on-efi-chainloader.patch

Comment 27 Laszlo Ersek 2016-10-12 23:59:06 UTC
Thanks. Thus, "relocation failed" is printed after relocate_coff() fails (== returns a value different from GRUB_EFI_SUCCESS).

Looking at the return statements in relocate_coff(), each one that returns a failure code is preceded by a specific grub_error() message. However, in the attachment from comment 21 ("relocation_failed_error_with_debug.jpg"), none of those messages can be seen.

That's because grub_error() formats the message into the global variable "grub_errmsg", and whatever error was formatted by relocate_coff() itself is promptly overwritten by the "relocation failed" message in handle_image(), before "grub_errmsg" is printed anywhere. The return status of relocate_coff() is also not logged (grub_errno 18 is just GRUB_ERR_BAD_ARGUMENT).

Comment 28 Juan Orti 2016-10-14 07:32:05 UTC
This also happens in F25.

grub2-2.02-0.34.fc24.x86_64
grub2-efi-2.02-0.34.fc24.x86_64
shim-0.8-10.x86_64

Motherboard: Asus P8Z68-V LE, firmware version 4102

Comment 29 Bogdan Costescu 2016-11-07 16:39:46 UTC
I landed here through links from ask.fedoraproject.org, as the title does not match. I use Windows 8.1 along F24, and I get a black screen rather than an error message, however this seems to match comment #23. Hardware is a Dell Latitude E7240 laptop, which came with Windows 8.1 preinstalled. Originally I installed F22 alongside Windows; this was then upgraded to F23 and later to F24. The last upgrade broke booting into Windows. As suggested above, downgrading grub to the 2.02-0.25 version from F23 makes things work again.

Apart from the 2 workarounds (downgrade grub, or make it finish and pass on the next EFI boot entry) I don't see much progress here. Is there any interest in making this work for F24 and later? Or is progress tracked somewhere else? What are the disadvantages of switching to the F23 grub version?

Comment 30 Alexander Korsunsky 2016-11-18 15:03:26 UTC
I also have the problem of a blank screen after chainloading. It doesn't matter if I use the grub menu entry or select the bootloader manually over the grub command line.

I am trying to boot into Windows 7, and I am running on a Lenovo ThinkPad x230.

Is there any information that could help?

(In reply to Geoffrey Marr from comment #6)
> Discussed during the 2016-06-16 Go/No-Go #2 meeting: [1]
> 
> The decision was made to not classify this bug as a blocker as there have
> been several reports of this working OK (50/50) and the firmware boot
> selector can still be used to boot the correct OS.

That's an... "interesting" decision. Forcing users to either a) meddle with their UEFI to get into the OS of their choice or b) downgrading packages to a non-broken version? Which benefits does the F24 and F25 version of Grub even have over the one that's not broken?

Comment 31 Adam Williamson 2016-11-18 16:57:35 UTC
It's not 'meddling' with UEFI. The UEFI boot manager is specifically intended for allowing you to pick what to boot. Choosing a boot option isn't 'meddling'.

Comment 33 Laszlo Ersek 2016-11-18 20:23:33 UTC
I think we'll have to ping-pong debug this. That is, I prepare a debug-patched grub package, then adventurous users that experience the symptoms and are willing to help install it, and come back with the debug output. Based on the debug output, we try to figure out what went wrong, refine the debug patches, goto 10.

So,

(1) Please install grub2 from <http://koji.fedoraproject.org/koji/taskinfo?taskID=16514783>. This is a scratch build with my debug patches. You are welcome to review the patches in the SRPM (they are numbered starting with 11000) and/or rebuild the package yourself.

For installing this, you might have to run "dnf downgrade", as the "dbg01" counter that I put in the NVR happens to make "grub2-2.02-0.34.dbg01.fc24" sort before "grub2-2.02-0.34.fc24". For the further debug builds, I'll keep bumping the dbg counter, so dnf upgrade should work.

This is in no way supported, signed, or anything like that. If it breaks your machine, you get to keep both pieces. You've been warned.

I tested the debug patches in an OVMF VM where I installed Windows 10 + Fedora 24, for dual booting. While I cannot reproduce the subject symptoms, the debug patches seem to pass the smoke test (they produce a bunch of info messages and allow booting Windows 10).

(2) I don't know how quickly Koji removes scratch build output files, so response time from affected users is essential. I could expose the build(s) more permanently in my people.redhat.com webspace, but I'd like to avoid that if possible.

(3) Getting the output in a text file (by capturing serial port output) would be vastly preferable to screen photos. If you have the gear for this (i.e., your machine has an actual UART, you have a null-modem cable, and another box to read the data, or some serial-to-network kit), please set

GRUB_TERMINAL_OUTPUT="console serial"

in the "/etc/default/grub" file, then run

# grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg 

(4) Reboot. You will have to disable Secure Boot, or authorize this grub binary in MokManager. (I set up Secure Boot in OVMF just out of principle, without a second thought, and was pleasantly surprised to see shim / MokManager catch the unsigned binary.)

(5) No need to change grub debug settings; chainloader messages will be spewed to the console (screen, and if enabled, serial port) automatically. Please capture the log (pictures of text file) leading up to the failure, and attach it to this BZ.

Thanks.

Comment 35 Daniel 2016-11-20 15:32:42 UTC
I just upgraded from a fully updated F24 to F25. After the upgrade I’m getting the “relocation failed” error when choosing Windows 10 from the grub2 boot menu.

Comment 36 Sebastian Grill 2016-11-20 15:36:34 UTC
This is a workaround until this is fixed:

# dnf downgrade grub2* --releasever=23

Comment 37 Bogdan Costescu 2016-11-20 17:53:14 UTC
Thanks Laszlo! Here my pong :)

I attach the screen capture from my Dell E7240 with your dgb01, after choosing to bootstrap the Windows (8.1 in my case) chainloader. The output goes on and on, after pressing a key, the address after "entry" is then incremented by 2. Once in a while, a "reloc_base" appears with the address also incremented by 2 w.r.t. the previous address and the "n" incremented by 1 w.r.t. the previous "reloc_base" line. The address immediately after "reloc_base" is no longer incremented by 2, but by a multiple of 2.

Comment 38 Bogdan Costescu 2016-11-20 17:54 UTC
Created attachment 1222252 [details]
dbg01-e7240

Comment 39 Laszlo Ersek 2016-11-21 02:10:00 UTC
Bogdan, thanks for the screenshot. Can you please upload the full output?
The file you uploaded doesn't contain the debug messages that lead directly
to the error. What I'd need is the last one or two pages right before the
failure.

--*--

Anyway, analyzing this debug log forced me to review the code line by line,
and I think I found the bug. Well, let me put it like this: I think to have
found *a* bug that looks consistent with the symptoms (that is, it *could*
cause this issue), but I can't guarantee in advance that it's *the* bug.

Okay, so here goes.

The handle_image() function is called with the EFI binary loaded into
memory. The input parameter "handle_image:data" points to the image
allocated in memory that gets passed in. The handle_image() function calls
gBS->AllocatePool(), and creates a *section-wise* copy of the original
image. The copy is aligned appropriately, it has the right size, and it is
pointed-to by the variable "buffer_aligned".

The handle_image() function iterates over sections in the original memory
area (that is, "handle_image:data"), investigates each section, and
optionally copies the section to the right place in the target area; that
is, into "handle_image:buffer_aligned". Notably, the reloc section (which
contains the list of lists of relocations) is not necessarily copied (it can
be marked as discardable in the image file), and Bogdan's log file fragment
actually says "Discarding section" for the reloc section.

After this is done, the relocate_coff() function is called, with the
following parameters:

  argument                    --> parameter           role
  --------                        ---------           --------
  handle_image:data           --> relocate_coff:orig  ORIGINAL
  handle_image:buffer_aligned --> relocate_coff:data  COPY

Note the confusing variable names. In the caller function, "data" stands for
the *original* image. In the callee, "data" stands for the section-wise
copied, to-be-relocated image.

This logic comes from the following downstream-only commit:

    Add secureboot support on efi chainloader

    Expand the chainloader to be able to verify the image by means of shim
    lock protocol. The PE/COFF image is loaded and relocated by the
    chainloader instead of calling LoadImage and StartImage UEFI boot
    Service as they require positive verification result from keys enrolled
    in KEK or DB. The shim will use MOK in addition to firmware enrolled
    keys to verify the image.

    The chainloader module could be used to load other UEFI bootloaders,
    such as xen.efi, and could be signed by any of MOK, KEK or DB.

    Based on https://build.opensuse.org/package/view_file/openSUSE:Factory/grub2/grub2-secureboot-chainloader.patch

If you open the URL at the last line, you notice a difference: the SUSE
version of handle_image() never drops sections. It copies all sections.

This results in a difference for relocate_coff(): in the SUSE patch,
relocate_coff() can work entirely within the copied (target) area, because
the "reloc" section will *always* be in that area. So, in the SUSE patch,
only the copy is passed to relocate_coff():

  argument                    --> parameter           role
  ---------------------------     ------------------  ----
  handle_image:buffer_aligned --> relocate_coff:data  COPY

In the RHEL version however, the reloc section may or may not be present in
the copy, hence we can only parse the reloc entries in the *original*. Which
is why in our version of the patch, relocate_coff() has parameters "orig"
and "data".


Let's talk a bit about the relocation section. The relocation section
apparently consists of sequences like this:

  -------------------
  uint32_t rva
  uint32_t size
  uint16_t entry[0]
  uint16_t entry[1]
  uint16_t entry[2]
  uint16_t entry[3]
  ...
  uint16_t entry[N]
  -------------------
  uint32_t rva
  uint32_t size
  uint16_t entry[0]
  uint16_t entry[1]
  uint16_t entry[2]
  uint16_t entry[3]
  ...
  uint16_t entry[M]
  -------------------
  uint32_t rva
  uint32_t size
  uint16_t entry[0]
  uint16_t entry[1]
  uint16_t entry[2]
  uint16_t entry[3]
  ...
  uint16_t entry[K]

and so on. The "size" field describes the size of that sequence, including
the "rva" and "size" fields themselves, and the number of entries (marked
above as N, M, K) in each sequence can be derived from the "size" field in
that sequence.

The parsing code in relocate_coff() has sanity checks. Each time it starts
working on a new sequence, it determines the end of the sequence -- in order
to see where the entries will end, for that sequence. The purported end is
subjected to a wraparound check. Let me quote the SUSE original:

> static grub_efi_status_t
> relocate_coff (pe_coff_loader_image_context_t *context, void *data)
>   [...]
>   grub_efi_uint64_t size = context->image_size;
>   void *image_end = (char *)data + size;
>   [...]
>   reloc_base = image_address (data, size, context->reloc_dir->rva);
>   [...]
>   while (reloc_base < reloc_base_end)
>     {
>       reloc = (grub_uint16_t *)((char*)reloc_base + sizeof (struct grub_pe32_data_directory));
>       reloc_end = (grub_uint16_t *)((char*)reloc_base + reloc_base->size);
>
>       if ((void *)reloc_end < data || (void *)reloc_end > image_end)
>         {
>           grub_error (GRUB_ERR_BAD_ARGUMENT, "Reloc table overflows binary");
>           return GRUB_EFI_UNSUPPORTED;
>         }

"image_end" stands for the end of the *copied* image, and "data" stands for
the beginning of the same. So the check intends to ensure that the end of
this reloc sequence will be within the copied image file.

This check makes sense because the copied image fill is guaranteed, in the
SUSE version, to contain the reloc section.

(BTW, there's an off-by one in the code above: reloc_end is an exclusive
boundary, while "data" is an inclusive one, so even their equality should be
refused. If reloc_end == data, then the actual relocation entries are
strictly before "data", which should be rejected. Anyway, I digress.)

Okay, let's see how the same looks in the RHEL code:

> static grub_efi_status_t
> relocate_coff (pe_coff_loader_image_context_t *context,
> 	       struct grub_pe32_section_table *section,
> 	       void *orig, void *data)
>   [...]
>   grub_efi_uint64_t size = context->image_size;
>   void *image_end = (char *)orig + size;
>   [...]
>   reloc_base = image_address (orig, size, section->raw_data_offset);
>   [...]
>   while (reloc_base < reloc_base_end)
>     {
>       grub_uint16_t *entry;
>       reloc = (struct grub_pe32_fixup_block *)((char*)reloc_base);
>       [...]
>       reloc_end = (struct grub_pe32_fixup_block *)
> 	((char *)reloc_base + reloc_base->size);
>
>       if ((void *)reloc_end < data || (void *)reloc_end > image_end)
>         {
>           grub_error (GRUB_ERR_BAD_ARGUMENT, "Reloc entry %d overflows binary",
> 		      n);
>           grub_dprintf ("chain", "Reloc entry %d overflows binary\n", n);
>           return GRUB_EFI_UNSUPPORTED;
>         }

Remember, in the RHEL version, we must iterate over the reloc sequences in
the *original* image. Therefore we correctly calculate "image_end" as "orig
+ size" (translated from SUSE's "data + size"). But... in the actual
containment check, the first condition is still

  (void *)reloc_end < data

which is bogus: here "data" stands for the *copied* image, which is entirely
irrelevant for parsing the reloc entries.

In other words, this is an omission in the port from SUSE's patch. This
instance of "data" should have been replaced with "orig". It didn't cause a
compilation error because the "data" parameter was also preserved. (In such
cases it's best to rename the preexistent parameter as well; it helps with
auditing all uses.)


So why does it cause an invalid error on some firmware, and no error on
other firmware? The reason is that we're effectively comparing the addresses
of separate, unrelated memory allocations (i.e., the base address of the
copied image, and an address that falls into the original image).

On most firmware / with most actual memory maps, the AllocatePool() boot
service allocates the copied image at a lower address than the original
image (the allocator works top-down), and then "reloc_end", which points
into the original image, will be higher than the base address of the copied
image. (*)

In other cases however, the copy happens to be placed at a higher address
(the pool allocator might find a gap higher up that it can fill well with
the copy), and then the incorrect check fires, even for valid EFI images.
(We don't get to see the error message "... overflows binary" because of the
separate grub bug I identified in comment 27.)

(*) Now, looking at Bogdan's log file fragment, we can see that the subject
bug does *not* fire in his case.  From comment 37, the tentative location
for the copied reloc section (if it wasn't about to be discarded) is
0xD5469000..0xD546995F, while the original reloc section starts higher, at
0xD55F4800. However, that's understandable: in comment 29 Bogdan wrote, "I
get a black screen rather than an error message" -- his symptoms are
different! So, my take is that Bogdan is affected by an independent bug.


Alright, let's go back to the other attachments, and try to verify the idea:

- Comment 17 does not contain enough information for saying either way.

- Ditto for comment 19.

- However, comment 21 contains enough information, and it confirms the
  analysis. The tentative location for the copied reloc section is
  0xC967000..0xC96781F. The original reloc section starts *lower*, at
  0xC80D400. This triggers the bug.

The fix should be simple (again, this is *not* for the independent bug that
Bogdan is witnessing):

> diff --git a/grub-core/loader/efi/chainloader.c b/grub-core/loader/efi/chainloader.c
> index da6554c1813e..24a1eb63b5e6 100644
> --- a/grub-core/loader/efi/chainloader.c
> +++ b/grub-core/loader/efi/chainloader.c
> @@ -402,7 +402,7 @@ relocate_coff (pe_coff_loader_image_context_t *context,
>        reloc_end = (struct grub_pe32_fixup_block *)
>         ((char *)reloc_base + reloc_base->size);
>
> -      if ((void *)reloc_end < data || (void *)reloc_end > image_end)
> +      if ((void *)reloc_end < orig || (void *)reloc_end > image_end)
>          {
>            grub_error (GRUB_ERR_BAD_ARGUMENT, "Reloc entry %d overflows binary",
>                       n);

I'll try to provide a scratch build for testing tomorrow.

Comment 40 Bogdan Costescu 2016-11-21 14:16:49 UTC
I do not get anything else before the attached output, or at least I cannot see any other output and can't attach a serial console to try to capture more. Would it be possible to produce another build which pauses at the point the interesting data is printed? Alternatively, I have experience with building SRPMs, so if you can at least provide a patch doing this, I can rebuild the two binary packages myself.

Comment 41 Laszlo Ersek 2016-11-21 14:26:46 UTC
(In reply to Bogdan Costescu from comment #40)
> I do not get anything else before the attached output,

What I would be interested in is *after* the output you uploaded. The screenshot in comment 38 is from the start of the relocation.

> or at least I cannot see any other output

That's interesting. Normally when the debug messages fill the screen and --MORE-- appears at the bottom, you just need to hit Enter or Space to proceed. Grub will continue until another screenful of debug messages has been produced.

Do you mean that the relocation hangs for you after the first screenful of messages?

> and can't attach a serial console to try to capture
> more. Would it be possible to produce another build which pauses at the
> point the interesting data is printed?

It shouldn't be necessary; the debug message printing has built-in paging functionality.

Comment 42 Laszlo Ersek 2016-11-21 14:53:45 UTC
(In reply to Laszlo Ersek from comment #39)

> The fix should be simple (again, this is *not* for the independent bug that
> Bogdan is witnessing):
> 
> > diff --git a/grub-core/loader/efi/chainloader.c b/grub-core/loader/efi/chainloader.c
> > index da6554c1813e..24a1eb63b5e6 100644
> > --- a/grub-core/loader/efi/chainloader.c
> > +++ b/grub-core/loader/efi/chainloader.c
> > @@ -402,7 +402,7 @@ relocate_coff (pe_coff_loader_image_context_t *context,
> >        reloc_end = (struct grub_pe32_fixup_block *)
> >         ((char *)reloc_base + reloc_base->size);
> >
> > -      if ((void *)reloc_end < data || (void *)reloc_end > image_end)
> > +      if ((void *)reloc_end < orig || (void *)reloc_end > image_end)
> >          {
> >            grub_error (GRUB_ERR_BAD_ARGUMENT, "Reloc entry %d overflows binary",
> >                       n);
> 
> I'll try to provide a scratch build for testing tomorrow.

http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935

Juan Orti, Andy Wang, veldur, can you guys please test this build? Thanks.

Comment 43 Bogdan Costescu 2016-11-21 16:42:12 UTC
By "The output goes on and on," in comment #37 I meant pagination with "MORE". So indeed I get the expected output. What I didn't know was that it will eventually come to something else and I stopped (ctrl-alt-del) after allowing it to print 4-5 further screens... So after installing the dbg02 from comment #42, I allowed it to go further. After more than 20 screens, it reaches the "reloc_base_end" address, then comes to "booting via entry point", after which Windows does start. Yey!!! There seems to be indeed a single bug, both for those with some error message and for those with a black screen (like me). Many thanks for tracking it down!

So I guess the next step would be a dbg03 with no debugging output or some release candidate for the official package? :)

Comment 44 Juan Orti 2016-11-21 17:57:45 UTC
(In reply to Laszlo Ersek from comment #42)
> (In reply to Laszlo Ersek from comment #39)
> 
> > The fix should be simple (again, this is *not* for the independent bug that
> > Bogdan is witnessing):
> > 
> > > diff --git a/grub-core/loader/efi/chainloader.c b/grub-core/loader/efi/chainloader.c
> > > index da6554c1813e..24a1eb63b5e6 100644
> > > --- a/grub-core/loader/efi/chainloader.c
> > > +++ b/grub-core/loader/efi/chainloader.c
> > > @@ -402,7 +402,7 @@ relocate_coff (pe_coff_loader_image_context_t *context,
> > >        reloc_end = (struct grub_pe32_fixup_block *)
> > >         ((char *)reloc_base + reloc_base->size);
> > >
> > > -      if ((void *)reloc_end < data || (void *)reloc_end > image_end)
> > > +      if ((void *)reloc_end < orig || (void *)reloc_end > image_end)
> > >          {
> > >            grub_error (GRUB_ERR_BAD_ARGUMENT, "Reloc entry %d overflows binary",
> > >                       n);
> > 
> > I'll try to provide a scratch build for testing tomorrow.
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935
> 
> Juan Orti, Andy Wang, veldur, can you guys please test this build? Thanks.

With this build I can boot Windows 10 from grub. Thank you!

Are you interested in the gazillon photos I've taken?

Comment 45 Laszlo Ersek 2016-11-21 19:08:01 UTC
Kudos to you guys for testing the dbg02 build! No more photos / logs are necessary then.

Regarding an official build: since I don't maintain grub2, I couldn't push the patch to dist-git. And without pushing the patch to dist-git, I must not build a non-scratch package in Koji. So, we'll have to follow the traditional route in this part: I'll attach the patch to the BZ, Peter will review it, and hopefully apply it. Then he can submit an official Koji build too.

Thanks guys again for your help!

Comment 46 Laszlo Ersek 2016-11-21 19:26 UTC
Created attachment 1222471 [details]
[PATCH] efi/chainloader: fix wrong sanity check in relocate_coff()

In relocate_coff(), the relocation entries are parsed from the original
image (not the section-wise copied image). The original image is
pointed-to by the "orig" pointer. The current check

  (void *)reloc_end < data

compares the addresses of independent memory allocations. "data" is a typo
here, it should be "orig".

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1347291
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
Tested-by: Bogdan Costescu <bcostescu@gmail.com>
Tested-by: Juan Orti <j.orti.alcaine@gmail.com>
---
 grub-core/loader/efi/chainloader.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 47 Laszlo Ersek 2016-11-21 19:45:40 UTC
I pinged Peter on IRC. It appears that he's not going to be available this
week.

From <https://admin.fedoraproject.org/pkgdb/package/rpms/grub2/>:

> Package Administrator(s)
>
> These users can help you if you need commit privileges for this package.
>
>     bcl (Fedora devel, Fedora 25, Fedora 24, Fedora 23)
>
>     pjones (Fedora devel, Fedora 25, Fedora 24, Fedora 23, Fedora EPEL 6,
>     Fedora EPEL 5)
>
>     ausil (Fedora devel, Fedora 25, Fedora 24, Fedora 23)

Brian, Dennis, can you guys please review

- the analysis in comment 39,
- the patch in comment 46,
- the testing feedback in comment 43 / 44,

and apply the patch and submit an official Koji build? Thank you.

Comment 48 Adam Williamson 2016-11-21 20:48:28 UTC
I'm a provenpackager, if no-one else gets to it first I will try to take care of this.

Comment 49 Adam Williamson 2016-11-21 20:53:38 UTC
well, crap, I can do a build, but I'm sure not qualified to review a patch like this (I do python, man, not this kinda stuff).

Tell you one more who might be willing, though - mjg59. Any chance you could take a look at this, Matthew? Sorry to take the liberty, but we're trying to make things right for people dual-boot installing F25. If you're willing and you say it's good I can fire a koji build and stuff.

Comment 50 Matthew Garrett 2016-11-21 21:23:28 UTC
Analysis and patch sound fine to me.

Comment 51 Laszlo Ersek 2016-11-21 21:27:29 UTC
Thanks Matthew! :)

Comment 52 Adam Williamson 2016-11-22 02:45:04 UTC
Thanks a lot Matthew! I'll do a build.

Comment 53 Adam Williamson 2016-11-22 03:28:13 UTC
Sigh, so this turns out to be a bit complicated and I don't want to do it without Peter, really.

The f24 grub2 branch has several changes that neither master (Rawhide) nor f25 have, first of all. I'd really want to check with pjones if those need to come forward to f25 and rawhide, and whether any reconciliation is needed with this patch. And also, build on ppc64 on master is failing and I'm really not sure what the correct resolution for that is. The immediate issue is that the ppc64 build BuildRequires /usr/lib/crt1.o and nothing provides that (any more?), but I'm really not sure what the story or the correct resolution is there. So I'm kinda unwilling to touch this unless I screw something up, unfortunately.

We could perhaps do a cleaned-up scratch build for affected folks to use in the mean time.

Comment 54 Dan Horák 2016-11-22 07:57:17 UTC
(In reply to Adam Williamson from comment #53)
> Sigh, so this turns out to be a bit complicated and I don't want to do it
> without Peter, really.
> 
> The f24 grub2 branch has several changes that neither master (Rawhide) nor
> f25 have, first of all. I'd really want to check with pjones if those need
> to come forward to f25 and rawhide, and whether any reconciliation is needed
> with this patch. And also, build on ppc64 on master is failing and I'm
> really not sure what the correct resolution for that is. The immediate issue
> is that the ppc64 build BuildRequires /usr/lib/crt1.o and nothing provides
> that (any more?), but I'm really not sure what the story or the correct
> resolution is there. So I'm kinda unwilling to touch this unless I screw
> something up, unfortunately.
> 
> We could perhaps do a cleaned-up scratch build for affected folks to use in
> the mean time.

/usr/lib/crt1.o is provided by the koji internal glibc32 package, that workarounds multilib in koji and allows to build gcc with both 32 and 64 backends on 64-bit arches => an update of glibc32 in primary koji instance is needed after the merger with ppc koji.

Comment 55 Laszlo Ersek 2016-11-22 08:52:29 UTC
(In reply to Adam Williamson from comment #53)
> Sigh, so this turns out to be a bit complicated and I don't want to do it
> without Peter, really.
> 
> The f24 grub2 branch has several changes that neither master (Rawhide) nor
> f25 have, first of all.

Can you please elaborate? In my (anonymous) clone of Fedora's grub2 dist-git repo, I see the f24 branch standing currently at commit 9a4054fb1c5af194030d44f4e0a2c008afb0323e ("Update ppc64 configure invocation"). Both the authorship date and the commit date of this commit are 2016-06-10.

Furthermore, this commit updates the Release field in the "grub2.spec" file from "0.33%{?dist}" to "0.34%{?dist}".

Looking in Koji, the latest official build for Fedora 24 is "grub2-2.02-0.34.fc24":
<http://koji.fedoraproject.org/koji/buildinfo?buildID=771914>
which completed on "Fri, 10 Jun 2016 18:33:04 UTC".

Therefore it seems to me that the current f24 HEAD in dist-git matches the latest build from Koji, which also matches the NVR reported in this issue (see comment 0). The patch in comment 46 simply goes on top.

IOW, it shouldn't matter if there are differences between the f24 and f25 branches in dist-git, as long as the basis for the new patch (the HEAD of each branch) matches the latest Brew build for the corresponding Fedora release. Let's mark this requirement with (*).

The patch is really small (a one-liner), and the file being patched ("grub-core/loader/efi/chainloader.c") is identical between the above-mentioned f24 HEAD, and the current f25 HEAD. (The current f25 HEAD is 9d15b4d492f89ba926dd9de9418c6d0df0a794c9, "Update to be newer than f24's branch".) This means that the patch is equally correct for f24 and f25.

However... I do see your point about Fedora 25. Namely, the latest official build in Koji is "grub2-2.02-0.30.fc25", at <http://koji.fedoraproject.org/koji/buildinfo?buildID=752004>, which does *not* correspond to the current f25 HEAD. (It corresponds to the parent of the current f25 HEAD, 336bf36497b1bdaac168d87a7e79802e8da4d333, "Revert 27e66193, which was replaced by upstream's 49426e9fd".)

Meaning that the requirement I marked with (*) above is not satisfied for Fedora 25 -- the patch would apply and work just fine on top of the f25 dist-git HEAD, but that basis *itself* has never been built in Koji & released to users. In other words, the f25 branch for grub2 is in an un-updateable state.

Regarding Rawhide (F26):
- the file being patched is identical (with dist-git master at 0ea125de5aa1a96e5a35d5d45255e691574fdfd0, which you just committed), so that would be fine,
- however the official build status is even worse than that of grub2/F25: there is *no* successful F26 build available in Koji.

> I'd really want to check with pjones if those need
> to come forward to f25 and rawhide, and whether any reconciliation is needed
> with this patch.

According to the above, source code reconciliation is not needed at this point for the patch, against F25 and F26. Instead, the problem is that it is currently impossible to do *targeted* F25 and F26 builds for this patch.

> And also, build on ppc64 on master is failing and I'm
> really not sure what the correct resolution for that is. The immediate issue
> is that the ppc64 build BuildRequires /usr/lib/crt1.o and nothing provides
> that (any more?), but I'm really not sure what the story or the correct
> resolution is there. So I'm kinda unwilling to touch this unless I screw
> something up, unfortunately.
> 
> We could perhaps do a cleaned-up scratch build for affected folks to use in
> the mean time.

For Fedora 24 (where everything is in sync), I think it is safe to do an official build.

For Fedora 25 and Fedora 26, the patch is safe to add to dist-git. However, we shouldn't submit official Koji builds, as those builds would present a lot more changes to users than just this fix. This is something that Peter will have to sort out indeed.

I don't like the idea of a cleaned-up scratch build for several reasons:
- for Fedora 24, it is unjustified (an official build will do just fine)
- for Fedora 25 and Rawhide, it would expose as-yet unreleased bits
- worst of all, Koji would remove the files after a while, and we'd have to
  find external storage for them.

I suggest to go ahead with the F24 build (for which this BZ was originally reported, until it got bumped to F25 in comment 28).

Dist-git and Koji are already out of sync for F25 and F26, so I agree not to build packages for those.

Anyhow, I defer to your judgement. Thanks for your help!

Comment 56 Dan Horák 2016-11-22 12:20:18 UTC
I have prepared a new glibc32 build, so in a while new grub2 builds should succeed also on ppc64.

Comment 57 Kamil Páral 2016-11-22 13:41:10 UTC
(In reply to Laszlo Ersek from comment #42)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935

I can confirm this build fixes Window boot on system from comment 0. Great job, Laszlo, thanks a lot!

Comment 58 Alexander Korsunsky 2016-11-22 14:02:46 UTC
(In reply to Kamil Páral from comment #57)
> (In reply to Laszlo Ersek from comment #42)
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935
> 
> I can confirm this build fixes Window boot on system from comment 0. Great
> job, Laszlo, thanks a lot!

Not for me :(


I now get several pages of output from loader/efi/chainloader.c, printing all kinds of sections. When I jumpt through those pages, the last message is 

> booting via entry point

After which, nothing happens.


If required, I can upload (camera) screenshots of the output, but is stuff that is printed somewhere as a text file?

Comment 59 Adam Williamson 2016-11-22 16:35:38 UTC
Laszlo: the '-34' on master and f25 branches is not in fact at all the same as the '-34' on f24. the two diverge starting at 2ae5c1e . From that point on, f24 has:

9a4054f Update ppc64 configure invocation
35f33ea Two bug fixes...
2572c16 Revert 27e66193, which was replaced by upstream's 49426e9fd
8768d23 Fix ppc64 build failure on fedora-24
af00b6d Pull TPM updates from mjg59.
0fa4f49 Fix aarch64 build problem.
9bc6276 Rebased to newer upstream for fedora-24
e229c7b Update debuginfo too
133f7ea Minor fixup to our bzrignore fix.
83affdb make the sb thing work

while master has (ignoring my commit and revert):

9d15b4d Update to be newer than f24's branch.
336bf36 Revert 27e66193, which was replaced by upstream's 49426e9fd
1713515 Rebased to newer f24 code
0ac23e2 Pull TPM updates from mjg59.
dcaf8f6 switch to pkgconfig() style BR
bec00f1 fix summary-ended-with-dot warning from rpmlint

Possibly 1713515 brought them back in sync again, but even since then, f24 has "Two bug fixes..." and "Update ppc64 configure invocation", which master and f25 do not.

If it were my package I'd do a big merge commit to bring f24, f25 and master all back into line and then put your patch on top of that. But I don't want to do that without pjones' signoff.

Comment 60 Laszlo Ersek 2016-11-22 16:52:53 UTC
(In reply to Alexander Korsunsky from comment #58)
> (In reply to Kamil Páral from comment #57)
> > (In reply to Laszlo Ersek from comment #42)
> > > http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935
> > 
> > I can confirm this build fixes Window boot on system from comment 0. Great
> > job, Laszlo, thanks a lot!
> 
> Not for me :(
> 
> 
> I now get several pages of output from loader/efi/chainloader.c, printing
> all kinds of sections. When I jumpt through those pages, the last message is 
> 
> > booting via entry point
> 
> After which, nothing happens.

If you reach the entry point of the relocated binary, then it sounds like a different bug. Can you check the open bugs for grub2? It may have been reported earlier. If not, reporting it as a separate BZ might make sense.

> If required, I can upload (camera) screenshots of the output, but is stuff
> that is printed somewhere as a text file?

No, grub2 does not save the log messages to a file (grub2 generally does not write to the disk, very justifiedly, apart from the grub environment block). In order to capture messages like this, you need a (virtual) serial connection and another (virtual) machine.

Comment 61 Bogdan Costescu 2016-11-22 17:29:21 UTC
(In reply to Alexander Korsunsky from comment #58)
> (In reply to Kamil Páral from comment #57)
> > (In reply to Laszlo Ersek from comment #42)
> > > http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935
> > 
> > I can confirm this build fixes Window boot on system from comment 0. Great
> > job, Laszlo, thanks a lot!
> 
> Not for me :(
> 
> 
> I now get several pages of output from loader/efi/chainloader.c, printing
> all kinds of sections. When I jumpt through those pages, the last message is 
> 
> > booting via entry point
> 
> After which, nothing happens.

Are you sure to use the dbg02 package, as mentioned in comment #42?

Comment 62 Bogdan Costescu 2016-11-22 17:32:03 UTC
(In reply to Laszlo Ersek from comment #60)
> > I now get several pages of output from loader/efi/chainloader.c, printing
> > all kinds of sections. When I jumpt through those pages, the last message is 
> > 
> > > booting via entry point
> > 
> > After which, nothing happens.
> 
> If you reach the entry point of the relocated binary, then it sounds like a
> different bug. Can you check the open bugs for grub2? It may have been
> reported earlier. If not, reporting it as a separate BZ might make sense.

As I mentioned in comment #43, I also see "booting via entry point" as the last printed message. The difference with respect to Alexander's report is that for me it goes further and immediately starts booting Windows.

Comment 63 Laszlo Ersek 2016-11-22 18:29:24 UTC
Okay. Maybe the entry point itself is translated incorrectly. Alexander, can you please attach your log? Perhaps I can find something in it.

For example I can see a path through the code that would discard the section that contains the entry point, if the section in question were marked as discardable. This would arguably be a bug in the image being relocated.

Comment 64 Adam Williamson 2016-11-22 20:29:04 UTC
So I'm definitely not trying to reconcile the pile of crazy that's the grub2 package build without pjones around. Instead, I've just done some scratch builds for now, all on top of whatever else is currently on the relevant branch, just with Laszlo's patch added. Here's F25:

http://koji.fedoraproject.org/koji/taskinfo?taskID=16567423

F24:

http://koji.fedoraproject.org/koji/taskinfo?taskID=16567714

Note these won't be signed for Secure Boot purposes, you won't be able to use them with SB enabled, I believe. But if you're running into this issue and you don't have SB enabled, those builds may help.

Comment 65 Laszlo Ersek 2016-11-23 02:59:01 UTC
(In reply to Alexander Korsunsky from comment #58)
> (In reply to Kamil Páral from comment #57)
> > (In reply to Laszlo Ersek from comment #42)
> > > http://koji.fedoraproject.org/koji/taskinfo?taskID=16553935
> > 
> > I can confirm this build fixes Window boot on system from comment 0. Great
> > job, Laszlo, thanks a lot!
> 
> Not for me :(
> 
> 
> I now get several pages of output from loader/efi/chainloader.c, printing
> all kinds of sections. When I jumpt through those pages, the last message is 
> 
> > booting via entry point
> 
> After which, nothing happens.

Good news: I can reproduce this in an OVMF virtual machine.

The serial log contains:

> loader/efi/chainloader.c:775: loader/efi/chainloader.c:775: booting via entry point
> booting via entry point
> !!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000000 !!!!
> RIP  - 000000005C40401A, CS  - 0000000000000038, RFLAGS - 0000000000210203
> ExceptionData - 0000000000000000
> RAX  - 0000000000000304, RCX - 0000000000001010, RDX - 00000000000024D0
> RBX  - 0000000000002404, RSP - 000000007F2DA2C8, RBP - 0000000000000000
> RSI  - 000000001004B230, RDI - 00000018B44B4253
> R8   - 0000000000002510, R9  - 0000000000080FC8, R10 - 000000001000A0E4
> R11  - 000000007D1EB900, R12 - 0000000010024B34, R13 - 000000001004B230
> R14  - 000000007D1EBB7C, R15 - 0000000000000000
> DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
> GS   - 0000000000000030, SS  - 0000000000000030
> CR0  - 0000000080000033, CR2 - 00000018B44B4253, CR3 - 000000007F279000
> CR4  - 0000000000000668, CR8 - 0000000000000000
> DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
> DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
> GDTR - 000000007F266A98 0000000000000047, LDTR - 0000000000000000
> IDTR - 000000007EB31018 0000000000000FFF,   TR - 0000000000000000
> FXSAVE_STATE - 000000007F2D9F20
> !!!! Can't find image information. !!!!

The latter part comes from the edk2 exception handler that is built into OVMF. The presence of this register dump in the serial log means that we crash still within the UEFI boot mgr application.

Let me look some more at the log.

Comment 66 Laszlo Ersek 2016-11-23 03:00:15 UTC
Forgot to say in comment 65 -- this is a Windows 7 guest, based on Alexander's comment 30.

Comment 67 Laszlo Ersek 2016-11-23 06:18 UTC
Created attachment 1222983 [details]
[PATCH] efi/chainloader: truncate overlong relocation section

The UEFI Windows 7 boot loader ("EFI/Microsoft/Boot/bootmgfw.efi", SHA1
31b410e029bba87d2068c65a80b88882f9f8ea25) has inconsistent headers.

Compare:

> The Data Directory
> ...
> Entry 5 00000000000d9000 00000574 Base Relocation Directory [.reloc]

Versus:

> Sections:
> Idx Name      Size      VMA               LMA               File off ...
> ...
>  10 .reloc    00000e22  00000000100d9000  00000000100d9000  000a1800 ...

That is, the size reported by the RelocDir entry (0x574) is smaller than
the virtual size of the .reloc section (0xe22).

Quoting the grub2 debug log for the same:

> chainloader.c:595: reloc_dir: 0xd9000 reloc_size: 0x00000574
> chainloader.c:603: reloc_base: 0x7d208000 reloc_base_end: 0x7d208573
> ...
> chainloader.c:620: Section 10 ".reloc" at 0x7d208000..0x7d208e21
> chainloader.c:661:  section is not reloc section?
> chainloader.c:663:  rds: 0x00001000, vs: 00000e22
> chainloader.c:664:  base: 0x7d208000 end: 0x7d208e21
> chainloader.c:666:  reloc_base: 0x7d208000 reloc_base_end: 0x7d208573
> chainloader.c:671:  Section characteristics are 42000040
> chainloader.c:673:  Section virtual size: 00000e22
> chainloader.c:675:  Section raw_data size: 00001000
> chainloader.c:678:  Discarding section

After hexdumping "bootmgfw.efi" and manually walking its relocation blocks
(yes, really), I determined that the (smaller) RelocDir value is correct.
The remaining area that extends up to the .reloc section size (== 0xe22 -
0x574 == 0x8ae bytes) exists as zero padding in the file.

This zero padding shouldn't be passed to relocate_coff() for parsing. In
order to cope with it, split the handling of .reloc sections into the
following branches:

- original case (equal size): original behavior (--> relocation
  attempted),

- overlong .reloc section (longer than reported by RelocDir): truncate the
  section to the RelocDir size for the purposes of relocate_coff(), and
  attempt relocation,

- .reloc section is too short, or other checks fail: original behavior
  (--> relocation not attempted).

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1347291
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
 grub-core/loader/efi/chainloader.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

------------

This is the patch for the issue reported by Alexander. It is really a different issue -- the bug is actually in the Windows 7 UEFI boot loader binary; it is malformed. The patch makes grub2 cope with it, which can be considered a "feature", rather than a bugfix.

The build below includes both patches (plus the debug log patches).

http://koji.fedoraproject.org/koji/taskinfo?taskID=16573505

Comment 68 Bogdan Costescu 2016-11-23 07:57:13 UTC
I can confirm that dbg03 from comment #67 still works for me (Windows 8.1).

Comment 69 Kamil Páral 2016-11-23 12:24:03 UTC
(In reply to Adam Williamson from comment #64)
> Here's F25:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=16567423

Works for the machine from comment 0.

(In reply to Laszlo Ersek from comment #67)
> The build below includes both patches (plus the debug log patches).
> http://koji.fedoraproject.org/koji/taskinfo?taskID=16573505

Works for the machine from comment 0.

Comment 70 Laszlo Ersek 2016-11-23 14:43:24 UTC
Thank you guys for the regression testing. Alexander, does the build in comment 67 make your problem go away? Thanks.

Comment 71 Alexander Korsunsky 2016-11-23 15:08:33 UTC
(In reply to Laszlo Ersek from comment #63)
> Okay. Maybe the entry point itself is translated incorrectly. Alexander, can
> you please attach your log? Perhaps I can find something in it.
Here are the pages of output:

http://i.imgur.com/fsZGaQ1.jpg
http://i.imgur.com/dwEP6Mp.jpg
http://i.imgur.com/CdCcv5O.jpg
http://i.imgur.com/cSaa2yY.jpg


(In reply to Bogdan Costescu from comment #61)
> Are you sure to use the dbg02 package, as mentioned in comment #42?
Yes, I'm sure:

$ rpm -qa | grep grub2
grub2-efi-modules-2.02-0.34.dbg02.fc24.x86_64
grub2-tools-2.02-0.34.dbg02.fc24.x86_64
grub2-2.02-0.34.dbg02.fc24.x86_64
grub2-efi-2.02-0.34.dbg02.fc24.x86_64
grub2-starfield-theme-2.02-0.34.dbg02.fc24.x86_64


(In reply to Adam Williamson from comment #64)
> So I'm definitely not trying to reconcile the pile of crazy that's the grub2
> package build without pjones around. Instead, I've just done some scratch
> builds for now, all on top of whatever else is currently on the relevant
> branch, just with Laszlo's patch added. 
> F24:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=16567714

This one doesn NOT work for me. Still hangs with a single underscore on the screen after selecting the Win7 Boot entry.


(In reply to Laszlo Ersek from comment #67)
> This is the patch for the issue reported by Alexander. It is really a
> different issue -- the bug is actually in the Windows 7 UEFI boot loader
> binary; it is malformed. The patch makes grub2 cope with it, which can be
> considered a "feature", rather than a bugfix.
> 
> The build below includes both patches (plus the debug log patches).
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=16573505

This one works for me! (After skipping through dozens of debug output pages ;) )

Comment 72 Alexander Korsunsky 2016-11-23 15:31:29 UTC
(In reply to Laszlo Ersek from comment #70)
> Thank you guys for the regression testing. Alexander, does the build in
> comment 67 make your problem go away? Thanks.

Yes! Thanks Laslzo for the great work.

I will keep testing whatever contains this patch until it lands in the fedora-updates repo.

Comment 73 Laszlo Ersek 2016-11-23 20:20:33 UTC
Side point: please take care to spell my name correctly. I'm kinda touchy on that subject. Use the clipboard when in doubt. (For example, I actually used the clipboard to copy your name into my comments :))

Regarding the second patch (comment 67):

- Thank you Alexander for testing it :)

- Since we've decided to wait for Peter's return before we commit / officially
  build anything, I think we shouldn burden Matt with another review request.
  I think Peter can review the second patch when he returns.

- Adam, re: comment 64, do you think you could provide scratch builds for F24
  and F25, for all arches, with the patches from comment 46 and comment 67 only
  (that is, none of my debug patches)? Thank you.

Comment 74 Adam Williamson 2016-11-23 20:37:05 UTC
Yep, I can do that. Thanks for all the work. Sorry we can't get real builds done till Peter is back, but the whole grub2 package setup is a bit...special.

Comment 75 Dmitry Burstein 2016-11-25 17:03:03 UTC
Can confirm the package from comment 64 solved the problem from comment 14 on F25.

Comment 76 Adam Williamson 2016-11-30 22:27:15 UTC
Updated scratch builds (sorry for the delay) with both of Laszlo's patches but no debug stuff:

Fedora 25 - http://koji.fedoraproject.org/koji/taskinfo?taskID=16685670
Fedora 24 - http://koji.fedoraproject.org/koji/taskinfo?taskID=16685693

Again, these won't work for Secure Boot.

I'm hoping pjones will be around to do some official builds soon.

Comment 77 Andy Wang 2016-12-01 05:42:05 UTC
The latest builds work fine for me on my p8p67 pro.  Thanks for the great work on this.

Comment 78 Bogdan Costescu 2016-12-01 09:39:32 UTC
I can confirm that the latest build from comment #76 works for me on the hardware from comment #29. The laptop is now running F25, so here I have tested the F25 package.

Comment 79 Fedora Update System 2016-12-01 22:37:31 UTC
grub2-2.02-0.36.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-ee4d91769e

Comment 80 Adam Williamson 2016-12-01 22:55:44 UTC
Thanks Peter! The update (and the F26 build) seem to be missing Laszlo's second patch ("truncate overlong relocation section"), are you still checking that one?

Comment 81 Fedora Update System 2016-12-03 04:35:45 UTC
grub2-2.02-0.36.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-ee4d91769e

Comment 82 Sebastian Grill 2016-12-03 09:33:33 UTC
I just tried grub2-2.02-0.36.fc25 and booting Windows 10 works on the hardware from comment 9. 

Thank you for your commitment.

Comment 83 Fedora Update System 2016-12-04 02:28:19 UTC
grub2-2.02-0.36.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

Comment 84 Fedora Update System 2016-12-05 14:48:05 UTC
grub2-2.02-0.37.fc25 has been submitted as an update to Fedora 25. https://bodhi.fedoraproject.org/updates/FEDORA-2016-543df7ea8c

Comment 85 Flo H. 2016-12-05 18:58:44 UTC
Will this patch make it to fc24 repo?

Comment 86 Adam Williamson 2016-12-05 19:42:08 UTC
Setting back to ON_QA, as 0.36 had only Laszlo's *first* patch, 0.37 has his second. Can folks who needed Laszlo's second patch please test and provide karma for 0.37?

I'll ask pjones again if he can do an f24 update too.

Comment 87 Alexander Korsunsky 2016-12-05 19:47:30 UTC
(In reply to Adam Williamson from comment #86)
> Setting back to ON_QA, as 0.36 had only Laszlo's *first* patch, 0.37 has his
> second. Can folks who needed Laszlo's second patch please test and provide
> karma for 0.37?

I just tested, and while 0.36 did not fix my Win7 boot issue, 0.37 did. Thanks for the fixes!

Comment 88 Adam Williamson 2016-12-05 19:49:13 UTC
alex: thanks a lot! if you have a Fedora (FAS) account it's best to log in to Bodhi before giving feedback, as it only 'counts' feedback from logged-in people for voting purposes.

Comment 89 Fedora Update System 2016-12-06 03:25:31 UTC
grub2-2.02-0.37.fc25 has been pushed to the Fedora 25 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-543df7ea8c

Comment 90 Kamil Páral 2016-12-06 09:25:11 UTC
(In reply to Fedora Update System from comment #89)
> grub2-2.02-0.37.fc25 has been pushed to the Fedora 25 testing repository. If

This fixes Win10 booting on the machine from comment 0.

Comment 91 Fedora Update System 2016-12-07 05:24:48 UTC
grub2-2.02-0.37.fc25 has been pushed to the Fedora 25 stable repository. If problems still persist, please make note of it in this bug report.

Comment 92 Fedora Update System 2016-12-07 14:35:28 UTC
grub2-2.02-0.35.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-a098b75b13

Comment 93 Fedora Update System 2016-12-08 04:52:57 UTC
grub2-2.02-0.35.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a098b75b13

Comment 94 Fedora Update System 2016-12-08 15:56:14 UTC
grub2-2.02-0.38.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-a098b75b13

Comment 95 Fedora Update System 2016-12-08 19:22:57 UTC
grub2-2.02-0.38.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-a098b75b13

Comment 96 Fedora Update System 2016-12-11 21:54:10 UTC
grub2-2.02-0.38.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.

Comment 97 niemand 2016-12-13 05:45:23 UTC
Hello Laszlo (Ersek),

I inspected your thoughts in your comment #39:
https://bugzilla.redhat.com/show_bug.cgi?id=1347291#c39

And you are Hell of the coder and observer! :-)

I have question to you, though... Here:

Why this should be done as you did it, and fixed RHEL the way you did it (from your comment #39)???

RHEL implementation:

  argument                    --> parameter           role
  --------                        ---------           --------
  handle_image:data           --> relocate_coff:orig  ORIGINAL
  handle_image:buffer_aligned --> relocate_coff:data  COPY

So, in the SUSE patch, only the copy is passed to relocate_coff():

  argument                    --> parameter           role
  ---------------------------     ------------------  ----
  handle_image:buffer_aligned --> relocate_coff:data  COPY

What prevents you to change code in RHEL, to actually match SUSE code and to pass ONLY relocatable copy of boot-loader (seems much simpler and much less error prone)???

What/why are real differences in these two implementations??? Why RHEL needs original and copy sections, and SUSE ONLY the copy to work from?

Thank you,
_nobody_

Comment 98 Laszlo Ersek 2017-01-03 01:00:43 UTC
(In reply to niemand from comment #97)

> What prevents you to change code in RHEL, to actually match SUSE code and to
> pass ONLY relocatable copy of boot-loader (seems much simpler and much less
> error prone)???
> 
> What/why are real differences in these two implementations??? Why RHEL needs
> original and copy sections, and SUSE ONLY the copy to work from?

My understanding -- which has become quite vague at this point -- is that the original SUSE patch was simpler, but it also copied stuff that was unnecessary for execution. In other words, it wasted memory & execution time. The RHEL variant of the patch intended to optimize this logic (as far as I understand anyway, again), and in that diversion from the SUSE original, the first issue was introduced.

Comment 99 Adam Williamson 2017-04-05 00:25:11 UTC
I'm pretty sure this can be closed at this point.


Note You need to log in before you can comment on or make changes to this bug.