1569586 – Continuous PXE Boot Loop

Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Satellite
Classification:	Red Hat
Component:	Documentation
Version:	6.3.0
Hardware:	All
OS:	All
Priority:	unspecified
Severity:	urgent
Target Milestone:	Unspecified
Assignee:	Stephen Wadeley
QA Contact:	Melanie Corr

Reported:	2018-04-19 14:28 UTC by Dustin Scott
Modified:	2023-12-15 16:04 UTC
CC List:	7 users
Last Closed:	2018-06-27 09:04:47 UTC

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1467925	0	unspecified	CLOSED	Host provisioning is in infinite loop w/ 6.3	2021-02-22 00:41:40 UTC
Red Hat Bugzilla	1596042	0	unspecified	CLOSED	[DOCS] document change in endpoint to end installation successfully	2021-02-22 00:41:40 UTC

Description Dustin Scott 2018-04-19 14:28:50 UTC

Description of problem:

VMware VMs get provisioned with the following settings:

bios.bootOrder = "ethernet0,hdd"

This means that VMs attempt to boot from the network on every boot. If you PXE boot a VM, it tries the network first every time and enters a never-ending PXE boot loop, since it never falls through to the HDD.

This should be flipped to:

bios.bootOrder = "hdd,ethernet0"

The above will attempt to boot from HDD first, and if there is nothing to boot from, will attempt a network boot via PXE next. Of course, after the system is built via PXE, we will have a boot device that we can boot from, which breaks the PXE loop.

Manually editing the VMX file to the above does fix the issue. This should not be considered a workaround however, as PXE is supposed to be a fully automated process without manual intervention. Rather, the manual edit should be considered a troubleshooting mechanism.
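
For troubleshooting, the manual VMX edit described above can be scripted. The following is a minimal Ruby sketch, assuming the .vmx content has already been read into a string; the helper name set_vmx_boot_order and the sample VMX text are illustrative, not part of Satellite or VMware tooling.

```ruby
# Sketch only: rewrite the bios.bootOrder entry in the text of a .vmx file.
# The helper name and the sample content below are illustrative assumptions.
BOOT_ORDER_LINE = /\Abios\.bootOrder\s*=/

def set_vmx_boot_order(vmx_text, order)
  entry = "bios.bootOrder = \"#{order.join(',')}\""
  lines = vmx_text.lines.map(&:chomp)
  if lines.any? { |line| line.match?(BOOT_ORDER_LINE) }
    # Replace the existing entry in place, leaving every other line untouched.
    lines.map! { |line| line.match?(BOOT_ORDER_LINE) ? entry : line }
  else
    # No entry yet: append one.
    lines << entry
  end
  lines.join("\n") + "\n"
end

sample = <<~VMX
  firmware = "bios"
  bios.bootOrder = "ethernet0,hdd"
  ethernet0.virtualDev = "vmxnet3"
VMX

puts set_vmx_boot_order(sample, %w[hdd ethernet0])
```

The function only transforms text; reading and writing the actual file on the datastore is left out on purpose.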

Found an upstream bug that should fix this issue:

http://projects.theforeman.org/issues/20878

Version-Release number of selected component (if applicable):

6.3.0

How reproducible:

100%

Steps to Reproduce:
1. Configure PXE boot with Satellite 6.3.0 as the target
2. Configure a VMware compute resource (tested on VMware 6.0U3)
3. PXE boot with the following specific settings:
a) Firmware: BIOS (EFI does not work without reconfig)
b) Virtual HW Version: 11 (6.0) (only tested with this HW version)
c) E1000/VMXNET3 (reproduces with both)
d) LSI Logic Parallel/VMware Paravirtual (reproduces with both)

Actual results:

VM will never be provisioned, and will enter a continuous PXE boot loop.

Expected results:

VM is provisioned. After the first PXE build, the VM will boot from the HDD.

Additional info:

Comment 1 Dustin Scott 2018-04-19 15:45:54 UTC

Actually, I think the attached Foreman bug is related to a 'clone' (template provision) and not a 'create' (PXE provision). However, I did find this:

/usr/share/foreman/app/models/compute_resources/foreman/model/vmware.rb

    def vm_instance_defaults
      super.merge(
        :memory_mb  => 768,
        :interfaces => [new_interface],
        :volumes    => [new_volume],
        :scsi_controller => { :type => scsi_controller_default_type },
        :datacenter => datacenter,
        :firmware => 'automatic',
        :boot_order => ['network', 'disk']
      )
    end


The boot order default should be reversed to ['disk', 'network'] for a PXE build. I am now confirming that this fixes the issue.

Comment 2 Dustin Scott 2018-04-19 15:58:46 UTC

Confirmed.  Switching the :boot_order fixed the issue.  I believe the :boot_order option set to ['network', 'disk'] is the safer approach and will get around this problem.

Comment 3 Dustin Scott 2018-04-19 15:59:30 UTC

Ugh.  Ignore the above comment.  I meant to say:

I believe the :boot_order option set to ['disk', 'network'] is the safer approach and will get around this problem.

Comment 6 Marek Hulan 2018-04-20 18:01:29 UTC

I think booting from network is needed for network-based provisioning. If you need to rebuild the host, you just change the TFTP config and restart the VM. A PXE loop happens if, at the end of provisioning, the Anaconda callback does not inform Satellite that the VM is built. Please provide the Satellite production.log from the time when the kickstart runs %post.

Comment 7 Dustin Scott 2018-04-20 20:47:03 UTC

Marek,

You are correct that booting from network is necessary for PXE based builds, however, booting from the network as the preferred device is not necessary.

For the first time boot up, if the order is set as ['disk','network'], the system will attempt to boot from disk first.  Of course, there is no boot device on the disk on first boot, so it will boot from network.


Are you saying that the callback to Satellite that the system was built will prevent this boot loop?  What about the Satellite 6.2 customers who migrated their scripts, such as myself?  That will leave them in a state where they will be in this boot loop until they fix the discrepancy between 6.2 and 6.3...e.g.:

6.3 Callback:

wget -q -O /dev/null --no-check-certificate <%= foreman_url('built') %>

6.2 Callback:

wget -q -O /dev/null --no-check-certificate <%= foreman_url %>


I did find that I had to change this in my provisioning template in order to finally show the system as 'built' and not 'pending', but I have a feeling people will be migrating scripts and might not catch this one.

Overall, I feel setting ['disk','network'] is a much, much safer approach here without changing functionality.

-Dustin

Comment 8 Dustin Scott 2018-04-20 20:54:32 UTC

Disclaimer from above:

I started with Satellite 6.2.4, and have been updating ever since.  The 6.3.0 is built as a fresh instance with ported post-scripts/provisioning templates.

wget -q -O /dev/null --no-check-certificate <%= foreman_url %>


^^^ that may have come from an earlier version of satellite, but I don't think it's unrealistic that customers might hit this situation.

Comment 9 Marek Hulan 2018-04-23 11:23:25 UTC

> Are you saying that the callback to Satellite that the system was built will prevent this boot loop?

Yes, as you figured out, in 6.3 it's required to pass the type to foreman_url as an argument; otherwise it produces a different URL. I understand the change can be painful if you're using customized templates. I'm afraid we failed to document this upgrade step. Unfortunately, we can't upgrade customized templates automatically, as foreman_url is still valid and in other contexts can be used without the 'built' argument (or with a different one). With the new URL, the PXE configuration is updated as desired, and once the system is provisioned, PXE points booting to the local disk. Therefore, even if network is before disk, the host continues booting from the disk.
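
The interaction between boot order and the 'built' callback can be modeled in a few lines. The following is a toy Ruby sketch, not Foreman code; the function name, the result symbols, and the single pxe_entry state are assumptions made for illustration.

```ruby
# Toy model of one boot attempt; not Foreman code. All names are illustrative.
# order:       boot devices in priority order, e.g. %w[network disk]
# disk_has_os: true once an installation has written a bootable system
# pxe_entry:   what the PXE server serves this host, :installer or :local_boot
def boot_result(order, disk_has_os:, pxe_entry:)
  order.each do |device|
    case device
    when 'disk'
      return :boots_installed_os if disk_has_os
    when 'network'
      return :runs_installer if pxe_entry == :installer
      # The PXE entry chains to the local disk instead of the installer.
      return :boots_installed_os if disk_has_os
    end
  end
  :no_boot_device
end

# Network first, callback reached Satellite, PXE entry switched to local boot.
p boot_result(%w[network disk], disk_has_os: true, pxe_entry: :local_boot)
# Network first, callback never arrived, so the installer is served again.
p boot_result(%w[network disk], disk_has_os: true, pxe_entry: :installer)
# Disk first, callback never arrived.
p boot_result(%w[disk network], disk_has_os: true, pxe_entry: :installer)
# Disk first, very first boot with an empty disk.
p boot_result(%w[disk network], disk_has_os: false, pxe_entry: :installer)
```

In this model, network-first depends entirely on the callback switching the PXE entry, while disk-first boots the installed system either way but can no longer be rebuilt just by changing the PXE entry.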

I'm moving this BZ to documentation so that it's documented as a 6.3 upgrade step.

Comment 10 Dustin Scott 2018-05-01 13:09:26 UTC

Is there an underlying reason to boot in the order ['network', 'disk'] that I am missing here?

I still feel that it is much safer to boot via ['disk', 'network'] and that it is a much cleaner solution than relying upon an API call at the end of a build process to break us out of the boot loop.  Understood that the API call is still necessary to inform Satellite that we are built, but from a safety standpoint, wouldn't it be really bad if that API call never happened (networking issue) and these systems continue to reboot and build via PXE?

Again, there might be something I'm missing that is driving the network device to be the priority boot device, but I'm not seeing it.  Just trying to clarify, as I don't think [ 'network', 'disk' ] is a proper solution from what I've seen thus far.

Comment 11 Stephen Wadeley 2018-05-04 18:42:32 UTC

Hello

Info on the foreman_url was added to Release Notes recently


https://access.redhat.com/documentation/en-us/red_hat_satellite/6.3/html/release_notes/release_information#known_issues

Comment 12 Stephen Wadeley 2018-05-07 14:36:22 UTC

Hello Dustin

Can you help to check where we should update the guides?

The "Upgrading and Updating Red Hat Satellite" guide has:

2.5.1. Upgrading Discovery[1]



The "Managing Hosts" guide has:
6.1. Network Configuration for PXE-based Discovery[2]


The "Provisioning Guide" has:

3.5. Creating Provisioning Templates[3]

Thank you


[1] https://access.redhat.com/documentation/en-us/red_hat_satellite/6.3/html/upgrading_and_updating_red_hat_satellite/upgrading_red_hat_satellite#upgrading_discovery_parent

[2] https://access.redhat.com/documentation/en-us/red_hat_satellite/6.3/html/managing_hosts/chap-red_hat_satellite-managing_hosts-discovering_bare_metal_hosts_on_satellite#sect-Red_Hat_Satellite-Managing_Hosts-Discovering_Bare_metal_Hosts_on_Satellite-Network_Configuration_for_PXE-based_Discovery

[3] https://access.redhat.com/documentation/en-us/red_hat_satellite/6.3/html/provisioning_guide/configuring_provisioning_resources#Configuring_Provisioning_Resources-Creating_Provisioning_Templates

Comment 13 Dustin Scott 2018-05-18 13:27:07 UTC

I believe this to be the correct section to provide the appropriate documentation:

2.5. Post-Upgrade Tasks



However, the verbiage below indicates 'optional' where this is NOT optional for anyone using PXE:

Some of the procedures in this section are optional.



For the record, I still think there needs to be other conversations as to whether [ 'network', 'disk' ] is the correct order for foreman setting the boot order in VMware.  I do not believe this is safe, or necessary at all.  This started out as a code bug and quickly turned to a documentation bug, but I believe the doc to simply be a band-aid for a larger problem.

Comment 14 Dustin Scott 2018-05-18 13:41:25 UTC

After some discussions internally, I think both ['network','disk'] and ['disk','network'] have their own valid use cases.

Based on that, is there a way we can make this a configurable option?  I'm thinking this may need to be as granular as the host/compute profile level.
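
One way to picture such an option is a precedence lookup: a host-level value overrides a compute-profile value, which overrides the current default. The Ruby sketch below is hypothetical; effective_boot_order, host_setting, and profile_setting are made-up names, not existing Foreman or Satellite settings.

```ruby
# Hypothetical sketch: none of these setting names exist in Foreman or Satellite.
# The default mirrors the :boot_order value quoted from vmware.rb in comment 1.
DEFAULT_BOOT_ORDER = %w[network disk].freeze
# Only the two devices discussed in this bug are modeled here.
VALID_DEVICES = %w[network disk].freeze

def effective_boot_order(host_setting: nil, profile_setting: nil)
  # Most specific setting wins: host, then compute profile, then the default.
  order = host_setting || profile_setting || DEFAULT_BOOT_ORDER
  unknown = order - VALID_DEVICES
  raise ArgumentError, "unknown boot devices: #{unknown.join(', ')}" unless unknown.empty?
  order
end

p effective_boot_order
p effective_boot_order(profile_setting: %w[disk network])
p effective_boot_order(host_setting: %w[network disk], profile_setting: %w[disk network])
```

Keeping the existing default as the fallback means hosts without an explicit setting would behave exactly as they do today.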

Comment 16 Alex Mayberry 2018-05-18 13:52:53 UTC

If it is of any value here, I have worked in situations where [network, disk] was desired, so that re-imaging was always easily managed via profiles without the tedious need to modify boot orders manually each time.

For the counter-point, I would offer the example of a scenario where you have a provisioning network that requires alternative settings for the initial build, and need to have the network boot option removed once the initial build is completed.

I agree with Dustin, this behaviour should be configurable.

Comment 17 Stephen Wadeley 2018-06-11 19:28:07 UTC

Hello

The template can be edited to change the boot order, so I should document how to do that.


I presume this is not specific to VMware VMs? 

Thank you

Comment 18 Marek Hulan 2018-06-19 11:37:29 UTC

I think we can't change the VMware VM boot order in the template. This bug now mixes two things: documenting the foreman_url to foreman_url('built') change for Satellite 6.3, and an RFE to make the VMware boot order configurable. I suggest the latter is opened as a separate BZ, as this one was already used for the documentation change.

Comment 19 Lukas Zapletal 2018-06-19 12:09:52 UTC

Dustin/Stephen,

The Foreman (Satellite 6) PXE workflow was designed with net-hdd boot order from the very beginning. We believe this provides the best flexibility across all environments we support; it's not just PXE in VMs, we support much more.

We have the boot order currently hardcoded in our codebase, and it's not just a matter of changing one line; that will not work. We'd need to create a brand new orchestration step, and after a VM exits build mode, we'd need to change the boot order. There is a big challenge though: most virtualization platforms do not allow you to change anything in BIOS while the VM is running, so we'd need to shut down the instance, make the change, and turn it back on. The question is timing: when to do this? We do not want to terminate firstboot, because that's when configuration management takes over and does its initial run.

You can file an RFE, but I am pessimistic about this being picked up anytime soon. I'd rather suggest digging on the other side (VMware/RHEV) and finding a way to change the boot order after the first restart; maybe they provide some hooks or ways to do this. Most virtualization platforms do provide automatic removal of PXE boot after the first successful reboot: libvirt, oVirt, RHEV and VMware do this automatically when you create an ad-hoc VM. Maybe there is a way to configure the same behavior for a VM from a template.

Wrap-up: Boot order is hardcoded to net-hdd when doing PXE and hdd only when doing VM clone from template or image-based provisioning. Boot loop was a different (known) issue - upgrade to 6.3. Feel free to provide the above in our documentation.

Comment 20 Stephen Wadeley 2018-06-27 06:54:43 UTC

(In reply to Marek Hulan from comment #18)
> I think we can't change the VMware VM boot order in the template. This bug
> now mixes two things: documenting the foreman_url to foreman_url('built')
> change for Satellite 6.3, and an RFE to make the VMware boot order
> configurable. I suggest the latter is opened as a separate BZ, as this one
> was already used for the documentation change.

OK, I will add a section to the Upgrade Guide[1] about the need to check custom provisioning templates for <%= foreman_url %> and change it to <%= foreman_url('built') %>

As per comment 11, this change is mentioned in the release notes, but they do not tell you to change custom templates.

The change was mentioned in this comment: 
https://bugzilla.redhat.com/show_bug.cgi?id=1552093#c4
(the release notes link to that bug)

[1] 2.5.6. Updating Templates
https://access.redhat.com/documentation/en-us/red_hat_satellite/6.3/html/upgrading_and_updating_red_hat_satellite/upgrading_red_hat_satellite#updating_templates
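
Custom templates that still use the bare call can be located and rewritten with a short script. The following is a minimal Ruby sketch, assuming the template text is already loaded into a string; update_built_callback and BARE_FOREMAN_URL are illustrative names. Because a bare foreman_url can still be valid in other contexts (comment 9), treat each match as a candidate to review rather than applying the rewrite blindly.

```ruby
# Sketch only: rewrite bare foreman_url calls in ERB template text.
# Matches "<%= foreman_url %>" with optional whitespace and leaves calls
# that already pass an argument untouched.
BARE_FOREMAN_URL = /<%=\s*foreman_url\s*%>/

def update_built_callback(template_text)
  template_text.gsub(BARE_FOREMAN_URL, "<%= foreman_url('built') %>")
end

# The 6.2-style callback line quoted in comment 7.
old_line = "wget -q -O /dev/null --no-check-certificate <%= foreman_url %>"
puts update_built_callback(old_line)
```

Running the function over an already updated template returns it unchanged, so the check can be repeated safely.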

Comment 26 Stephen Wadeley 2018-06-27 09:04:47 UTC

Hello

These changes are now live on the Customer Portal

2.5.6.1. Finish Templates

https://access.redhat.com/documentation/en-us/red_hat_satellite/6.3/html/upgrading_and_updating_red_hat_satellite/upgrading_red_hat_satellite#updating_templates

Thank you
