Bug 870366 - Can't boot system after reboot the machine with uefi which run provision
Status: CLOSED NOTABUG
Product: Beaker
Classification: Community
Component: lab controller
Version: 0.9
Hardware: Unspecified Unspecified
Priority: urgent
Severity: urgent
Assigned To: beaker-dev-list
QA Contact: tools-bugs
Keywords: Provisioning, TestBlocker
Reported: 2012-10-26 05:23 EDT by Zhongqiang Dou
Modified: 2015-05-11 18:20 EDT (History)

Doc Type: Bug Fix
Last Closed: 2014-03-25 23:02:10 EDT
Type: Bug
Comment 1 Dan Callaghan 2012-10-30 19:19:05 EDT
Can you give us more info about what Beaker is doing wrong here? Is there something wrong with the grub config we are generating for it?
Comment 2 Zhongqiang Dou 2012-10-31 05:01:07 EDT
(In reply to comment #1)
> Can you give us more info about what Beaker is doing wrong here? Is there
> something wrong with the grub config we are generating for it?

Please refer to the following bug:
https://bugzilla.redhat.com/show_bug.cgi?id=794543
Comment 3 Dan Callaghan 2012-10-31 18:13:11 EDT
Bug 794543 is about the workaround to flip the boot order back to booting from network first (since Anaconda changes it during install to boot from hard disk), and it was fixed months ago. Are you saying that you are seeing the same problem? That after installation is finished the boot order is wrong, i.e. booting from hard disk instead of from network?
Comment 4 Michael Gregg 2012-10-31 22:17:11 EDT
This bug sounds related. 

https://bugzilla.redhat.com/show_bug.cgi?id=670266

I've gotten a better result on my IBM machines with a different grub image on my beaker server. I'll follow up with more info if requested.
Comment 5 Zhongqiang Dou 2012-10-31 22:31:14 EDT
(In reply to comment #3)
> Bug 794543 is about the workaround to flip the boot order back to booting
> from network first (since Anaconda changes it during install to boot from
> hard disk), and it was fixed months ago. Are you saying that you are seeing
> the same problem? That after installation is finished the boot order is
> wrong, i.e. booting from hard disk instead of from network?

Yes, we hit the same problem as Bug 794543.
More precisely: the boot order is correct right after OS installation and the machine boots into the OS as normal, but after we reboot the machine manually, the first boot entry has changed to PXE network boot. The machine can't boot into the OS; it only boots into the PXE menu list.
Comment 6 Michael Gregg 2012-10-31 22:38:52 EDT
Could your problem be solved by creating an efidefault that automatically chains to localboot?
Comment 7 Zhongqiang Dou 2012-11-01 03:25:45 EDT
(In reply to comment #6)
> could your problem be solved by creating a efidefault that automatically
> chains to localboot?

Hi Michael, there is already an efidefault in /var/lib/tftpboot/grub on the lab controller, and it contains the PXE boot entries. I don't think it is a good idea to edit the file manually to add a localboot entry. That is not a normal operation.

When I create a job for the machine from the Beaker site and wait for its status to change to Running, the file "0A425629" is generated to drive the PXE installation. The actions shown in Beaker are "Provision", "clear_logs", "configure_netboot", "reboot".
The content of the file is:
[root@lab-02 grub]# cat 0A425629
default 0
timeout 10
title Beaker scheduled job for ibm-x3850x5-01.rhts.eng.nay.redhat.com
    root (nd)
    kernel /images/ibm-x3850x5-01.rhts.eng.nay.redhat.com/kernel console=ttyS0,115200n81 ks=http://beaker.engineering.redhat.com/kickstart/132647 ksdevice=00:21:5E:11:65:52 netboot_method=efigrub
    initrd /images/ibm-x3850x5-01.rhts.eng.nay.redhat.com/initrd
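
The per-host file name appears to be the client's IPv4 address written as eight uppercase hex digits: the PXE client IP reported elsewhere in this bug, 10.66.86.41, maps to 0A425629. A minimal sketch of that mapping, assuming this naming convention is what the lab controller's GRUB netboot setup uses (`ip_to_hex` is a hypothetical helper, not part of Beaker):

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to the
# eight-digit uppercase hex form seen in the per-host config file name.
# Assumes four decimal octets without leading zeros.
ip_to_hex() {
    old_ifs=$IFS
    IFS=.
    # Unquoted on purpose: split the address on "." into $1..$4.
    set -- $1
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_hex 10.66.86.41   # prints 0A425629
```

This can help when checking by hand whether a host-specific file exists under /var/lib/tftpboot/grub for a given system.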

After the OS installation, the machine is rebooted by Anaconda and boots into the OS. The file "0A425629" is removed from the lab controller by the "clear_netboot" action.

Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel 2.6.32-279.el6.x86_64 on an x86_64

ibm-x3850x5-01.rhts.eng.nay.redhat.com login:

When I then reboot the machine, it can only boot into the PXE menu list:
    GNU GRUB  version 0.97  (432K lower / 1894380K upper memory)

 +-------------------------------------------------------------------------+
 | Fedora-17_nfs-PAE-i386                                                  |  
 | Fedora-17_nfs-i386                                                      |
 | Fedora-17_nfs-x86_64                                                    |
 | Fedora-rawhide-20120622_nfs-17-PAE-i386                                 |
 | Fedora-rawhide-20120622_nfs-17-i386                                     |
 | Fedora-rawhide-20120622_nfs-17-x86_64                                   |
 | Fedora-rawhide-20120623_nfs-17-PAE-i386                                 |
 | Fedora-rawhide-20120623_nfs-17-i386                                     |
 | Fedora-rawhide-20120623_nfs-17-x86_64                                   |
 | Fedora-rawhide-20120624_nfs-17-PAE-i386                                 |
 | Fedora-rawhide-20120624_nfs-17-i386                                     |
 | Fedora-rawhide-20120624_nfs-17-x86_64                                   | v
 +-------------------------------------------------------------------------+
      Use the ^ and v keys to select which entry is highlighted.
      Press enter to boot the selected OS, 'e' to edit the
      commands before booting, 'a' to modify the kernel arguments
      before booting, or 'c' for a command-line.

Could you confirm whether this procedure is right? If I want the host to boot from the local disk, how do I do that?
Comment 8 Dan Callaghan 2012-11-01 03:30:21 EDT
(In reply to comment #7)
> (In reply to comment #6)
> > could your problem be solved by creating a efidefault that automatically
> > chains to localboot?
> 
> Hi Michael, there is the efidefault in /var/lib/tftpboot/grub on lab
> controller, and the content is the entries of pxe boot. I don't think it is
> good idea to edit the file to add localboot entry manually. It is not normal
> operation.

What is this file, and who created it? I assume it's equivalent to pxelinux.cfg/default for PXELINUX? (My knowledge of grub netbooting is very limited...)

Beaker doesn't manage that file, but it sounds like it probably should. It needs to default to local boot when there is no host-specific config. That is how Beaker automates systems.
Comment 9 Dan Callaghan 2012-11-01 03:35:43 EDT
(In reply to comment #7)
> Could you explain whether is the procedure right? If I want the host to boot
> from loca disk, how to do it?

Sorry, for clarity I should add that everything you described here is exactly how things are supposed to work in Beaker -- except for the last part of course, where it gets stuck at the GRUB menu instead of booting from hard disk :-)

I suspect if you remove the grub/efidefault file, GRUB will default to local boot, which is what we are expecting. That's why I asked where grub/efidefault has come from.
Comment 10 Zhongqiang Dou 2012-11-01 03:58:47 EDT
(In reply to comment #9)
> (In reply to comment #7)
> > Could you explain whether is the procedure right? If I want the host to boot
> > from loca disk, how to do it?
> 
> Sorry, for clarity I should add that everything you described here is
> exactly how things are supposed to work in Beaker -- except for the last
> part of course, where it gets stuck at the GRUB menu instead of booting from
> hard disk :-)
> 
> I suspect if you remove the grub/efidefault file, GRUB will default to local
> boot, which is what we are expecting. That's why I asked where
> grub/efidefault has come from.

I removed the efidefault file and rebooted the machine. GRUB then found no boot files and dropped to its prompt:
    GNU GRUB  version 0.97  (432K lower / 1894380K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]

grub>
Comment 11 Michael Gregg 2012-11-01 11:37:51 EDT
So it sounds like a workaround for this bug will be to manually add a localboot option to the efidefault config file.

Long term, Beaker is likely going to need to manage the localboot default option in the efidefault file.
Comment 12 Dan Callaghan 2012-11-01 18:00:25 EDT
(In reply to comment #11)
> Long term, beaker is likley going to need to manage the localboot default
> option in the efideault file.

Okay, that's fine. It sounds like GRUB behaves the same as PXELINUX (it does nothing if no default config is found), which is the reason why Beaker has to make sure pxelinux.cfg/default exists. We can do the same for grub/efidefault.

It would be nice if anyone could point me to some docs about this particular version of GRUB and how it behaves (and which config format it expects), because as far as I can tell it is completely undocumented, like most of the boot loaders we have to support in Beaker :-(
Comment 13 Michael Gregg 2012-11-01 18:13:35 EDT
I'd love to point you to relevant docs, but I have been somewhat hard pressed to come up with an exact spec on how it all works.

Grub was not originally intended to be used on a network like this. As such, nearly all of the documentation that I have found assumes that you'll be loading things from a hard drive. 

I found a set of examples over on a CentOS page that looks relevant:

http://wiki.centos.org/HowTos/PXE/PXE_Setup/Menus

It suggests raising the timeout to a non-zero value and adding an option like this at the top of the file:

ONTIMEOUT local

I'll post more documentation if I find it.
Comment 14 Michael Gregg 2012-11-01 19:03:23 EDT
That last comment was wrong. 

The documentation was for syslinux, not grub. 

I have tested the following on my machines. This config does seem to make my machines time out and chain to the next boot device. 

The first 5 lines of my efidefault:

default=0
timeout=10

title exit-grub
    quit


Please try it there and see if it works for you.
Comment 15 Zhongqiang Dou 2012-11-02 03:03:10 EDT
(In reply to comment #14)
> That last comment was wrong. 
> 
> The documentation was for syslinux, not grub. 
> 
> I have tested the following on my machines. This config does seem to make my
> machines time out and chain to the next boot device. 
> 
> The first 5 lines of my efidefault:
> 
> default=0
> timeout=10
> 
> title exit-grub
>     quit
> 
> 
> Please, try it there, see if it works for you.
I added these lines. It quits from GRUB but then hangs on the network boot.

**Quit the Grub**
 GNU GRUB  version 0.97  (432K lower / 1894380K upper memory)

 +-------------------------------------------------------------------------+
 | exit-grub                                                               |  
 | Fedora-17_nfs-PAE-i386                                                  |
 | Fedora-17_nfs-i386                                                      |
 | Fedora-17_nfs-x86_64                                                    |
 | Fedora-rawhide-20120622_nfs-17-PAE-i386                                 |
 | Fedora-rawhide-20120622_nfs-17-i386                                     |
 | Fedora-rawhide-20120622_nfs-17-x86_64                                   |
 | Fedora-rawhide-20120623_nfs-17-PAE-i386                                 |
 | Fedora-rawhide-20120623_nfs-17-i386                                     |
 | Fedora-rawhide-20120623_nfs-17-x86_64                                   |
 | Fedora-rawhide-20120624_nfs-17-PAE-i386                                 |
 | Fedora-rawhide-20120624_nfs-17-i386                                     | v
 +-------------------------------------------------------------------------+
      Use the ^ and v keys to select which entry is highlighted.
      Press enter to boot the selected OS, 'e' to edit the
      commands before booting, 'a' to modify the kernel arguments
      before booting, or 'c' for a command-line.

   The highlighted entry will be booted automatically in 1 seconds.     

**Hang on network boot**
Broadcom UNDI PXE-2.1 v5.2.4
Copyright (C) 2000-2009 Broadcom Corporation
Copyright (C) 1997-2000 Intel Corporation
All rights reserved.

CLIENT MAC ADDR: 00 21 5E 11 65 52  GUID: 9098F91B 2584 DF11 A195 00215E116552
CLIENT IP: 10.66.86.41  MASK: 255.255.254.0  DHCP IP: 10.66.78.111
GATEWAY IP: 10.66.87.254
Comment 16 Nick Coghlan 2012-11-05 20:36:55 EST
If you can figure out what settings we need to put into grub/efidefault, then we should be able to address this as Dan described above.
Comment 17 Scott Poore 2012-11-05 20:51:24 EST
I'm still looking for those.

In the meantime, I've got a couple of questions:

Is it possible to change the boot order after the Anaconda reboot to put the disk first, and confirm that it boots normally? Or does EFI not work like that, so that it overrides the order regardless and hits the network anyway?

Is it possible to boot with a rescue disk and confirm that the GRUB menu is NOT coming from the local disk?

These may be long-shot guesses, but I'm not sure yet what is happening.

Thanks,
Scott
Comment 23 Qixiang Wan 2012-12-06 02:54:53 EST
Note for users who need to use such UEFI systems in Beaker (so far, all the affected systems we have found are IBM X series servers):

Beaker can provision such a system via UEFI PXE boot successfully, but after the /distribution/install task, rebooting the system can cause it to fail to boot from the local drive. This is because Beaker makes the following changes in the post script:

[1] change BootOrder to boot from PXE first, then the installed OS
[2] change BootNext to the installed OS

So the first reboot, immediately after distro installation, boots the installed OS from the local drive successfully because of the BootNext change. Any later reboot makes the system try to boot from PXE first, and the boot manager then can't continue to the next entry in BootOrder, for reasons we haven't figured out.

Users can log in to the system after provisioning and change the BootNext value to the installed OS on the local drive before any reboot. (Please don't change BootOrder to move PXE out of the first position, otherwise Beaker may be unable to provision this system.)

If you need to run a task (other than /distribution/install) which requires a reboot, you can run a /distribution/command task before it to change the BootNext value, or just update your task to do so before every reboot.
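
As a rough sketch of the BootNext change described above: the helper below picks a boot entry by its label from `efibootmgr` output and requests it for the next boot only. The function names and the default label "Red Hat Enterprise Linux" are assumptions for illustration; check the actual label in your own `efibootmgr` listing.

```shell
# Print the four-digit number of the first boot entry whose label contains
# the given text (case-insensitive). Reads `efibootmgr` output on stdin.
boot_entry_by_label() {
    grep -E '^Boot[0-9A-Fa-f]{4}[* ]' | grep -i -- "$1" | head -n 1 | cut -c5-8
}

# Hypothetical wrapper: request a one-time boot of the installed OS.
# Must run as root on a UEFI-booted system; BootOrder is left untouched.
set_bootnext_to_os() {
    entry=$(efibootmgr | boot_entry_by_label "${1:-Red Hat Enterprise Linux}")
    [ -n "$entry" ] || { echo "no matching boot entry" >&2; return 1; }
    efibootmgr -n "$entry"
}
```

Calling `set_bootnext_to_os` before each reboot would leave PXE first in BootOrder, so Beaker keeps control for later provisioning.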

In fact, since users can change BootOrder at any time after logging in or in the boot manager, there is no guarantee that PXE is the first entry in BootOrder when Beaker provisions the system. One possible solution is to make no changes in the post script and instead force PXE boot for the next boot only, via IPMI, before the job is scheduled to run, like:

ipmitool -I lanplus -H $IMM  -U $USER -P $PASSWD chassis bootdev pxe options=efiboot

However, this is only applicable to systems which have IPMI support.
Comment 24 Lingzhu Xiang 2012-12-10 02:33:38 EST
Let me try to tell the story again:

After POST, UEFI firmware runs a boot manager that will look up a list of boot
entries stored in flash memory, and try them all one by one until one boots.
This is the UEFI boot process defined by the specification. Apparently the top
boot entry decides control of the entire boot process; the rest are just
fallbacks.

Beaker needs to control the boot method by keeping PXE as the top boot entry. So after
distro installation makes itself the top entry, Beaker takes it back (rhts_post, or
[1] and [2] in comment #23). When not provisioning a new distro, PXE doesn't boot
anything (grub quit command) and lets the boot manager fall back to booting the local
distro.

Now, on IBM System x servers, after PXE quits, the boot manager does not fall
back to booting the intended distro. Instead, it bails out of the entire UEFI boot
process and skips into legacy boot.


How reproducible:
Qixiang has reproduced it on IBM x3650 M3.
I have reproduced it on IBM x3850 X5. The x3850 X5 firmware was indeed broken in
comment #1, but the reproduction was unaffected.

Qixiang has other data points on the reproduction. Can you put them here?


Steps to Reproduce:
1. Configure TFTP server to provide grub.efi for PXE boot.
2. Prepare the disk with EFI bootloader (grub.efi, grubx64.efi, etc.)
3. Edit the boot entries so that PXE is first and the local bootloader is second.
4. Boot the machine until it enters grub.
5. Execute "quit" command.


Actual results:
It prints "please wait, initializing legacy usb devices", enters legacy boot,
and does not boot the local bootloader.


Expected results:
The local bootloader is loaded.


Additional info:

This is a firmware bug. Immediate ramifications:

- If beaker keeps control of boot, the installed distro can't boot.
- If the installed distro is able to boot, beaker won't have control of boot
  and can't provision new distros.

Firmware bugs are usually prohibitively hard to fix. Unfortunately, this one
is no exception. Qixiang and I have spent a week frobbing the IBM firmware,
but it didn't get better.

Now there are several kinds of workarounds for IBM System x:

1. Beaker doesn't grab the control after provisioning. Reinstate beaker's
   control when returning.
   Problem: the control may be often not reinstated and beaker loses control
   forever.
2. Beaker keeps the control. Let the user request one-time boot option with
   BootNext (efibootmgr) before every reboot to escape from being stuck. 
   Problem: accidents happen before the user can do anything then the user loses 
   the distro forever.
3. Beaker keeps the control. Beaker directly boots/chainloads the local distro 
   in PXE after provisioning, instead of relying on falling back.
   Problem: no consistent policy.
4. Beaker keeps the control. PXE sets the one-time boot option and reboots once.
   Problem: extra implementation, an extra reboot wasted.
5. Beaker keeps the control. When not provisioning, prevent the boot manager from
   loading PXE bootloader at all (by causing network error, or providing invalid
   bootloader) and let it fall back normally.
   Problem: how to switch PXE bootloader with beaker API?

The 1st and 2nd are addressed in comment #23. The 5th comes from my observation that when UEFI PXE is fed pxelinux.0, it prints "Succeed to download NBP files. Boot Failed. PXE Network" and proceeds with the next boot entry.

Thoughts about these workarounds?
Comment 25 Dan Callaghan 2012-12-12 20:31:25 EST
(In reply to comment #24)
> 3. Beaker keeps the control. Beaker directly boots/chainloads the local
> distro 
>    in PXE after provisioning, instead of relying on falling back.
>    Problem: no consistent policy.

What you mean here is changing "quit" to "localboot" in the efidefault file, is that right? Is there any downside to this? It sounds like the right solution.

> 5. Beaker keeps the control. When not provisioning, prevent the boot manager
> from
>    loading PXE bootloader at all (by causing network error, or providing
> invalid
>    bootloader) and let it fall back normally.
>    Problem: how to switch PXE bootloader with beaker API?

I guess you mean that when Beaker wants the system to boot from disk, it has to somehow change the boot filename in DHCP to some nonsense value, so that the firmware boots from disk instead? This won't work since Beaker doesn't manage DHCP config (it's managed by hand by the administrator).
Comment 26 Lingzhu Xiang 2012-12-12 23:19:34 EST
(In reply to comment #25)
> > 3. Beaker keeps the control. Beaker directly boots/chainloads the local
> 
> What you mean here is changing "quit" to "localboot" in the efidefault file,
> is that right? Is there any downside to this? It sounds like the right
> solution.

"localboot" was opposed in comment 7, so I didn't think of it. I thought about "quit" as suggested in comment 14. Qixiang can weigh in here.

> > 5. Beaker keeps the control. When not provisioning, prevent the boot manager
> > from
> 
> I guess you mean that when Beaker wants the system to boot from disk, it has
> to somehow change the boot filename in DHCP to some nonsense value, so that
> the firmware boots from disk instead? This won't work since Beaker doesn't
> manage DHCP config (it's managed by hand by the administrator).

I mean Beaker can change the content of the bootloader file through TFTP server access. But this is still no more than a hack. 


I have a new hypothesis: the default "PXE Network" is probably actually a vendor-specific dual boot option which terminates the UEFI boot process.

In some previous experiments I didn't distinguish two types of boot entries. IBM machines have a default boot option named "PXE Network", and I can manually add PXE boot options. The two look exactly the same. 

                                 Start Options
-------------------------------------------------------------------------------

  [PXE Network]                                      | Device Path :  
  PXE Network UEFI Only                              | PciRoot(0x0)/Pci(0x1,0x0
  PXE Network UEFI Only 1                            | )/Pci(0x0,0x0)/MAC(E41F1
  CD/DVD Rom                                         | 36BE5C4,0x0)/IPv4(0.0.0. 
                                                     | 0,UDP,DHCP,0.0.0.0)

The device path conforms to UEFI spec.

I added two "PXE Network UEFI Only" boot options and tried them with different orders.

  Load File
  [PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(E41F1
  36BE5C4,0x0)/IPv4(0.0.0.0,UDP,DHCP,0.0.0.0)]

The first time
--------------

PXE Network UEFI Only
PXE Network UEFI Only 1
CD/DVD Rom
PXE Network

Boot process:
>>Start PXE over IPv4.
  Station IP address is 10.66.86.49
  Server IP address is 10.66.78.118
  NBP filename is pxelinux-402.0
  NBP filesize is 26828 Bytes
 Downloading NBP file...

  Succeed to download NBP file.
Boot Failed. PXE Network UEFI Only

 Downloading NBP file...

  Succeed to download NBP file.
Boot Failed. PXE Network UEFI Only 1
Boot Failed. CD/DVD Rom
UEFI PXE PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(E41F136BE5C4,0x0)/IPv4(0.0.0.0,UDP,DHCP,0.0.0.0)

 Downloading NBP file...

  Succeed to download NBP file.
Please wait, initializing legacy usb devices...Done
(enters legacy boot)

The second time
---------------

PXE Network
CD/DVD Rom
PXE Network UEFI Only
PXE Network UEFI Only 1

Boot process:
UEFI PXE PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(E41F136BE5C4,0x0)/IPv4(0.0.0.0,UDP,DHCP,0.0.0.0)

>>Start PXE over IPv4.
  Station IP address is 10.66.86.49
  Server IP address is 10.66.78.118
  NBP filename is pxelinux-402.0
  NBP filesize is 26828 Bytes
 Downloading NBP file...

  Succeed to download NBP file.
Please wait, initializing legacy usb devices...Done
(enters legacy boot)

Strings from ">>Start PXE over IPv4." to "Succeed to download NBP file." can be found in UefiPxeBcDxe from EDK2. "Boot Failed. XXXX" is a standard message of the EDK2 boot manager from BdsDxe/Strings.uni.

I think what can be done now is to avoid the default "PXE Network" entry for pure UEFI PXE boot and see how it turns out.
Comment 27 Qixiang Wan 2012-12-13 01:04:24 EST
(In reply to comment #26)
> [snip]
> "localboot" was opposed in comment 7, so I didn't think of it. I thought
> about "quit" as suggested in comment 14. Qixiang can weigh in here.

Theoretically, quit from efi should be the proper solution, but as we see, it doesn't work for these IBM X series systems. 

> 
> > > 5. Beaker keeps the control. When not provisioning, prevent the boot manager from loading PXE bootloader at all (by causing network error, or providing invalid bootloader) and let it fall back normally.
> > 
> > I guess you mean that when Beaker wants the system to boot from disk, it has
> > to somehow change the boot filename in DHCP to some nonsense value, so that
> > the firmware boots from disk instead? This won't work since Beaker doesn't
> > manage DHCP config (it's managed by hand by the administrator).
> 
> I mean Beaker can change the content of the bootloader file through TFTP
> server access. But this is still no more than a hack. 
> 

This would prevent Beaker from provisioning other systems in the meantime if we replaced the bootloader with an invalid binary or something else, because the same bootloader is used by all systems at the same time.

> 
> I have a new hypothesis: the default "PXE Network" is probably actually a
> vendor-specific dual boot option which terminates UEFI boot process.
> 
> [snip]
> 
> PXE Network UEFI Only
> PXE Network UEFI Only 1
> CD/DVD Rom
> PXE Network
> 

What's the result with this case:
---
PXE Network UEFI Only
PXE Network UEFI Only 1
Red Hat Enterprise Linux (Installed OS)
CD/DVD Rom
PXE Network
---

Can the system continue on to boot the locally installed OS after the first 2 entries? If that doesn't work either, the problem still exists.

BTW, additional comment for comment 23:

> User can login the system after provision, and change the BootNext value to installed OS on local drive before any reboot. (Please don't change BootOrder to move the PXE out from the first entry in BootOrder, otherwise it can prevent beaker to provision this system).

> If you need to run a task (other than /distribution/install) which requires a reboot, you can run a /distribution/command task before it to change the BootNext value) or just update your task to do it before every reboot.

Actually, users can just use rhts-reboot manually or in their tasks and don't need to care about the BootOrder, because rhts-reboot is aware of it: it changes BootNext to the value of BootCurrent before rebooting:

efibootmgr -n $(efibootmgr -v | grep BootCurrent | awk '{ print $2}')
Comment 28 Lingzhu Xiang 2012-12-13 01:18:21 EST
(In reply to comment #27)
> What's the result with this case:
> ---
> PXE Network UEFI Only
> PXE Network UEFI Only 1
> Red Hat Enterprise Linux (Installed OS)
> CD/DVD Rom
> PXE Network
> ---
> 
> Can the system continue to boot local installed os after the first 2
> entries? If it doesn't work either, the problem still exist.

Yes, it will proceed with proper UEFI boot process and boot the local distro (Fedora instead of RHEL):

>>Start PXE over IPv4.
  Station IP address is 10.66.86.49
  Server IP address is 10.66.78.118
  NBP filename is pxelinux-402.0
  NBP filesize is 26828 Bytes
 Downloading NBP file...

  Succeed to download NBP file.
Boot Failed. PXE Network UEFI Only

 Downloading NBP file...

  Succeed to download NBP file.
Boot Failed. PXE Network UEFI Only 1
Welcome to GRUB!

error: file `/EFI/fedora/locale/en.mo.gz' not found.
[    0.000000] Initializing cgroup subsys cpuset
...
Comment 29 Qixiang Wan 2012-12-13 02:03:25 EST
(In reply to comment #28)
> (In reply to comment #27)
> > What's the result with this case:
> > ---
> > PXE Network UEFI Only
> > PXE Network UEFI Only 1
> > Red Hat Enterprise Linux (Installed OS)
> > CD/DVD Rom
> > PXE Network
> > ---
> > 
> > Can the system continue to boot local installed os after the first 2
> > entries? If it doesn't work either, the problem still exist.
> 
> Yes, it will proceed with proper UEFI boot process and boot the local distro
> (Fedora instead of RHEL):
> 

Excellent! If the combination of "manually added PXE entry" + "quit from EFI" works as we expect, then I think we don't need to do anything on the Beaker side. Before using these systems in Beaker with UEFI PXE boot, we or the sysadmin can simply delete the "PXE Network" entry that ships with the system by default in the UEFI boot manager and manually add an equivalent entry [1] like the one above. Beaker should then work well with these systems without any change.

[1] The PXE boot entry must have the string "netboot" or "pxe" (case-insensitive) in its name for Beaker to find the right entry:
PXE_SLOT=$(/usr/sbin/efibootmgr -v | grep -Ei '(netboot|pxe)' |cut -c5-8)

However, I don't have access to any UEFI boot system right now. I'll try to get one and test this next week.
Comment 30 Qixiang Wan 2012-12-17 03:28:30 EST
According to Lingzhu's findings in comment 26 and comment 28, the "PXE Network" entry shipped by default on IBM X series systems is probably a vendor-specific dual boot option. And per http://www.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5082107 , UEFI boot is not possible once a legacy transition has occurred. This can explain why it can't continue to boot the next entry in the EFI BootOrder, though I doubt they're doing the right thing, because as that document says:

"A legacy transition will be caused by a Preboot eXecution Environment (PXE) setting of "Both" or "Legacy Only" when processing the PXE Boot option"

but in our experiments, the legacy transition also happens when we disable legacy support for the NIC:
(F1 -> Setup -> System Settings -> Network -> Network Boot Configuration -> [Nic Address] -> Pxe Mode [UEFI support])

Anyway, we have a workaround for these IBM systems now: the sysadmin can add a new PXE boot entry in the boot manager before using UEFI PXE boot for the system. One thing that needs to be addressed is that Beaker identifies the PXE boot entry in the EFI BootOrder with:

PXE_SLOT=$(/usr/sbin/efibootmgr -v | grep -Ei '(netboot|pxe)' |cut -c5-8)

so the new entry's name should contain the string "netboot" or "pxe". Now we have a problem: two entries will be found. What we can do here is:
[1] change Beaker to use a different regex pattern that can distinguish them, and have the sysadmin add the new entry with a matching name
[2] have the sysadmin remove the default "PXE Network" entry from the boot manager and add the new entry with the same name.

I think [2] is better than [1], because it doesn't make sense for the sysadmin to change every EFI system to add a non-default PXE boot entry, including systems such as the IA64 ones which already work well in Beaker.
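
To illustrate the ambiguity, here is a small sketch that applies the same pattern as the PXE_SLOT command to `efibootmgr -v` output and checks whether exactly one entry matches. The function names are hypothetical and this is not what Beaker itself runs; it is only a way for a sysadmin to verify a system after editing its boot entries.

```shell
# List the numbers of all boot entries whose line mentions "netboot" or
# "pxe" (case-insensitive), one per line. Reads `efibootmgr -v` output
# on stdin and uses the same grep/cut pattern as the PXE_SLOT command.
pxe_slots() {
    grep -Ei '(netboot|pxe)' | cut -c5-8
}

# Hypothetical check: succeed only when exactly one entry matches,
# so that PXE_SLOT would hold a single boot number.
pxe_slot_is_unique() {
    [ "$(pxe_slots | grep -c .)" -eq 1 ]
}
```

For example, `efibootmgr -v | pxe_slot_is_unique || echo "ambiguous PXE entries"` would flag a system that still has both the default entry and a manually added one.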

Besides that, the sysadmin needs to place an efidefault file on the TFTP server with content like the following:

------
default=0
timeout=5

title localboot from the first drive 
   root (hd0,0)
   configfile /efi/redhat/grub.conf
------
This means the workaround can only be applied when we're using RHEL or Fedora and the OS is installed on the first disk.

I got an IBM system today and tried the "quit from grub" case, but it doesn't work for us: it returns to the boot manager, which can't continue to the next entry without user interaction.
Comment 31 Qixiang Wan 2012-12-17 03:53:34 EST
I was just being stupid: why do we need to care about the weird default PXE boot entry when we have a working efidefault (though it only partly covers the general case) which can boot the locally installed OS?

So what we need to do is just place an efidefault file on the TFTP server, with something like:
------
default=0
timeout=5

title localboot from the first drive 
   root (hd0,0)
   configfile /efi/redhat/grub.conf
------

Ignore my comment 30 if you don't want to see the silly words.
Comment 32 Lingzhu Xiang 2012-12-17 04:17:38 EST
(In reply to comment #30)
> [1] change in beaker to use a different regax pattern which can be
> distinguished,and sysadm add the new entry with that name

For IBM firmware menu, the exact step is to "Add Boot Option" and select one like

Load File
  [PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/MAC(E41F1
  36BE5C4,0x0)/IPv4(0.0.0.0,UDP,DHCP,0.0.0.0)]

Then it will ask for a label to be input. There is always a default "PXE Network". Confusing.

> title localboot from the first drive 
>    root (hd0,0)
>    configfile /efi/redhat/grub.conf
> ------
> which means this workaround can only be applied to the situation that we're
> using RHEL or Fedora, and the OS is installed on the first disk.

This won't work with Fedora and RHEL7 with grub2. You mentioned that. Let me think if we can try to quit harder.

> I get an IBM system today and tried the "quit from grub" case, but it
> doesn't work for us, because it will get back to the boot manager, and can't
> continue to boot the next entry without interactive.

Yes that is really how it is supposed to work, but I didn't think much about that:

UEFI specification, 3.1.1 Boot Manager Programming:

    If the boot via Boot#### returns with a status of EFI_SUCCESS the boot
    manager will stop processing the BootOrder variable and present a boot
    manager menu to the user. If a boot via Boot#### returns a status other
    than EFI_SUCCESS, the boot has failed and the next Boot#### in the BootOrder 
    variable will be tried until all possibilities are exhausted.

And grub's "quit" or grub2's "exit" returns GRUB_EFI_SUCCESS. So it should return to a boot manager menu.

If we want to boot the next option, grub should return "a status other than EFI_SUCCESS". Now it doesn't have a command to return that. But it should be trivial to implement. Open a bug for grub[2]?
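
The rule quoted from the specification can be pictured with a toy model. This is only an illustration of the decision logic, not firmware code; the labels and outcome names are made up for the example.

```shell
# Toy model of the BootOrder rule quoted above; this is not firmware code.
# Each argument is "label=outcome", where outcome is one of:
#   boots    the image takes over the machine and never returns
#   success  the image returns EFI_SUCCESS (e.g. GRUB "quit")
#   error    the image returns any other status
process_boot_order() {
    for entry in "$@"; do
        label=${entry%%=*}
        outcome=${entry#*=}
        case $outcome in
            boots)
                echo "$label: booted"
                return 0
                ;;
            success)
                echo "$label: EFI_SUCCESS, stop and show boot manager menu"
                return 0
                ;;
            *)
                echo "$label: failed, trying next entry"
                ;;
        esac
    done
    echo "all entries exhausted"
}

# GRUB quits with success: the installed OS is never tried.
process_boot_order "PXE=success" "Installed OS=boots"
# PXE fails instead: the boot manager falls through to the installed OS.
process_boot_order "PXE=error" "Installed OS=boots"
```

The second call corresponds to what a GRUB command returning a non-success status would achieve.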
Comment 34 Nick Coghlan 2013-09-29 20:21:05 EDT
Is there still a Beaker bug here? Or has this been determined to be a Grub2 bug?

Or else a documentation bug to instruct users to use rhts-reboot in their tasks rather than plain reboot?
Comment 35 Dan Callaghan 2013-11-18 21:11:49 EST
The other problem here might be that the BootCurrent variable is missing. I have observed that on the x Series systems which were recently made available on beaker-devel. If the BootCurrent variable does not exist then rhts-reboot will not correctly set BootNext before rebooting, and the system will end up booting from the network again.
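
A defensive variant of the BootNext step could check for BootCurrent first. The sketch below is not rhts-reboot's actual code; the function names are hypothetical, and it only shows one way to detect the missing variable before calling `efibootmgr -n`.

```shell
# Print the BootCurrent entry number from `efibootmgr` output on stdin.
# Exits non-zero and prints nothing if the variable is absent.
boot_current() {
    awk '/^BootCurrent:/ { print $2; found = 1 } END { exit !found }'
}

# Hypothetical guard before a reboot: only set BootNext when BootCurrent
# exists, and report the problem otherwise instead of failing silently.
set_bootnext_to_current() {
    if current=$(efibootmgr | boot_current); then
        efibootmgr -n "$current"
    else
        echo "BootCurrent is missing; BootNext not set" >&2
        return 1
    fi
}
```

With a check like this, a task would at least find out that the next reboot is going to hit the network instead of the installed OS.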
Comment 41 Dan Callaghan 2014-03-25 23:02:10 EDT
This bz has turned into a bit of a mess so I would like to close it.

Any remaining issues with netbooting where the system fails to talk to the DHCP server or the TFTP server should be followed up with Eng Ops or the firmware vendor.

Beaker should now always set the BootNext EFI variable correctly, as described here:

https://beaker-project.org/docs/architecture-guide/provisioning-process.html#boot-order

If you are still seeing issues with Beaker setting the BootNext variable correctly, please file a new bug with details.

Beaker no longer relies on being able to quit from EFI GRUB so the issues with that (as described in comment 24 and elsewhere) do not apply.
