Bug 1257722 - [RFE][Beaker] Bring back rhtsbooter
Status: NEW
Product: Beaker
Classification: Community
Component: general
Version: 19
Hardware: Unspecified Unspecified
Priority: high Severity: high
Assigned To: beaker-dev-list
QA Contact: tools-bugs
Keywords: FutureFeature
Depends On:
Blocks:
Reported: 2015-08-27 15:04 EDT by Jeff Burke
Modified: 2016-03-07 09:03 EST

Doc Type: Enhancement
Type: Bug

Attachments
Client side script (7.01 KB, text/plain)
2015-08-27 15:04 EDT, Jeff Burke
beaker booter command (3.04 KB, text/x-python)
2015-09-02 14:15 EDT, Bill Peck

Description Jeff Burke 2015-08-27 15:04:27 EDT
Created attachment 1067856
Client side script

Description of problem:
 Systems fail to netboot. When they do, they boot from the hard drive into the older OS. If the previous install was used for a job, then the client-side harness (BEAH or Restraint) is still on the machine.

 We should be able to contact the LC and verify whether we are running the correct recipe. If not, we should be able to download the vmlinuz, initrd and ks.cfg to the local /boot directory.

 Then, using grubby, we update grub(2) with a new stanza that boots the vmlinuz and initrd using the ks.cfg we just downloaded, so that the recipe can continue.
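
 A rough sketch of the grubby step described above. This is an illustration only and is not taken from the attached script: the function name, the example paths, and the stanza title are hypothetical, and the kickstart argument is passed in by the caller because its exact form depends on the release being installed. The grubby options used (--add-kernel, --initrd, --title, --args, --make-default) are standard ones.

```python
def build_grubby_command(kernel_path, initrd_path, ks_arg, title="beaker-reinstall"):
    """Return the grubby argv that adds an installer stanza and makes it default.

    All values are supplied by the caller; the default title is a
    hypothetical name chosen for this sketch.
    """
    return [
        "grubby",
        "--add-kernel=" + kernel_path,   # vmlinuz downloaded to /boot
        "--initrd=" + initrd_path,       # initrd downloaded to /boot
        "--title=" + title,
        "--args=" + ks_arg,              # kickstart argument for the installer
        "--make-default",                # boot this stanza on the next reboot
    ]
```

 The function only builds the argument list; the caller would hand it to something like subprocess.check_call() once the files are in place.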

 I have attached the original rhtsbooter script that used to be there.

Thanks,
Jeff
Comment 1 John Feeney 2015-08-30 21:08:35 EDT
This problem is significantly affecting QE's ability to test RHEL7.2. QE is 
reporting a high number of failures on AArch64 systems, and a failure cannot be 
rectified once it happens.

Thus, I am setting the Priority to Urgent; otherwise RHEL may be delayed.
Comment 2 Roman Joost 2015-09-01 01:50:07 EDT
Dear Jeff,

thanks for your bug report. Given the potential negative side effects of the attached script, would it be possible to mention what problems you're facing with the current situation if the systems fail? Does it take too long to re-create the system? Is the result output not satisfactory?

We'd be interested in improving the current state instead of using the script. Perhaps there are things we're not aware of.

Kind Regards,
Róman
Comment 4 Dan Callaghan 2015-09-01 02:17:22 EDT
The whole idea of this "rhtsbooter" script is to try and work around a system which has failed to netboot for some reason, and to kick off the installation by hacking up whatever boot loader is on disk instead.

But it's not a very good workaround, because it means that the system will not actually boot over the network even though it should have. Nowadays in Beaker we have switchable boot loader support: for example, for ppc64 we use yaboot on RHEL <= 7.0 but grub2 on RHEL >= 7.1. That means we could end up booting with the wrong boot loader because the previous recipe was using yaboot but the next one should be using grub2.

In situations like that, the script will either fail to work (and probably make an even bigger mess of things than before it started) or it might actually get the next recipe started but cause some surprising problems, or invalidate the results, because it didn't actually boot with the netboot loader that it was supposed to be using.
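
To make the mismatch concrete, here is a minimal sketch that encodes only the ppc64 rule quoted above (yaboot on RHEL <= 7.0, grub2 on RHEL >= 7.1). The function names are hypothetical and this is not Beaker's actual boot loader selection code; other arches are treated as unknown.

```python
def expected_netboot_loader(arch, rhel_version):
    """Return the netboot loader for ppc64 per the rule above, else None.

    rhel_version is a (major, minor) tuple, e.g. (7, 1).
    """
    if arch != "ppc64":
        return None  # this sketch knows nothing about other arches
    return "grub2" if rhel_version >= (7, 1) else "yaboot"


def same_loader_expected(arch, previous_version, next_version):
    """True only when both recipes are known to use the same loader."""
    prev_loader = expected_netboot_loader(arch, previous_version)
    next_loader = expected_netboot_loader(arch, next_version)
    return prev_loader is not None and prev_loader == next_loader
```

A disk-based fallback going from a RHEL 6.7 recipe to a RHEL 7.2 recipe on ppc64 would fail this check, which is exactly the yaboot/grub2 case described above.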

So I think we are better off fixing up the underlying issues with boot order being wrong, rather than trying to paper over it with a workaround like this rhtsbooter script.
Comment 5 Jeff Burke 2015-09-01 07:53:26 EDT
(In reply to Dan Callaghan from comment #4)
> The whole idea of this "rhtsbooter" script is to try and work around a
> system which has failed to netboot for some reason, and to kick off the
> installation by hacking up whatever boot loader is on disk instead.
> 
> But it's not a very good workaround, because it means that the system will
> not actually boot over the network even though it should have. Nowadays in
> Beaker we have switchable boot loader support: for example, for ppc64 we use
> yaboot on RHEL <= 7.0 but grub2 on RHEL >= 7.1. That means we could end up
> booting with the wrong boot loader because the previous recipe was using
> yaboot but the next one should be using grub2.
>
Hi Dan, 
 Although you may not think this is a very good workaround, it was an attempt to do something to help alleviate the problems we are facing in the lab. Good points about the switchable boot loader on ppc64. We should be able to solve that. I agree it may not boot from PXE, but the install will still run over the network.
>
>In situations like that, the script will either fail to work (and probably
> make an even bigger mess of things than before it started) or it might
> actually get the next recipe started but cause some surprising problems, or
> invalidate the results, because it didn't actually boot with the netboot
> loader that it was supposed to be using.
> 
 The situation you mention will happen. But depending on how the workaround is done, we can help mitigate the side effects. I do not want to dismiss this as an option because of one arch and one scenario. Let's talk through all of the options. Honestly, I am not happy about asking for this.
>
> So I think we are better off fixing up the underlying issues with boot order
> being wrong, rather than trying to paper over it with a workaround like this
> rhtsbooter script.
This was not meant to be an either/or request. I agree, and we should continue to try to get these issues fixed. However, you know how difficult it has been to get some of these things addressed by developers. In parallel, I would like to have a workaround in place to help alleviate this. I believe that we have some scenarios that will just happen in a test/lab environment.

I don't have a list of all of the RHEL BZs that we think play a direct or indirect role in these failures. If we can get that together, I will also help try to drive those to resolution.

Regards,
Jeff
Comment 6 Bill Peck 2015-09-02 14:15:40 EDT
Created attachment 1069564
beaker booter command

Hi Dan,

I've written a Python version which uses the Beaker lab controller's get_installation_for_system. I'm not thrilled with the parsing I needed to do to get the active recipe id, but it does seem to work, and this version uses the new-kernel-pkg command, which should be much more robust than the original rhts-booter command.
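
For readers without the attachment, a minimal sketch of what such a lookup might look like. This is not the attached code: the function name is hypothetical, and the key names read from the reply (kernel_path, initrd_path, kernel_options) are assumptions about what get_installation_for_system returns, so check them against the lab controller before relying on them.

```python
def fetch_installation(proxy, fqdn):
    """Ask the lab controller for the installation details of fqdn.

    proxy is any object exposing get_installation_for_system(fqdn),
    for example an XML-RPC ServerProxy pointed at the lab controller.
    """
    info = proxy.get_installation_for_system(fqdn)
    # Assumed key names; adjust to whatever the lab controller really returns.
    return {
        "kernel": info.get("kernel_path"),
        "initrd": info.get("initrd_path"),
        "options": info.get("kernel_options", ""),
    }
```

Passing the proxy in as a parameter keeps the function testable without a live lab controller.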

In a perfect world this will never be needed, but I think we would rather have some attempt at an install than a failure.

Worst case, we could add an option to skip this (if the job really should fail if it can't netboot).

Thoughts?
Comment 7 Dan Callaghan 2015-09-02 19:32:19 EDT
Yes, opting out will be important, since there are some people who want to use Beaker specifically for testing netboot. Another case where this will break is with the (not yet implemented) Secure Boot support, which relies on booting the right shim loader for the specific distro in use: bug 1156201.

More importantly, we should make sure that the recipe results in Beaker clearly show that the system didn't actually netboot but booted from disk as a workaround. That will help to reduce the chance of netboot problems going undetected.
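
The two requirements above (an opt-out, and a visible record that the system booted from disk) could be combined roughly like this. It is a sketch only: the function name and the message wording are hypothetical, and how the message would actually reach the recipe results is left out.

```python
def plan_fallback(netboot_failed, opt_out):
    """Decide whether to fall back to a disk-based install and what to report.

    Returns (attempt_fallback, message); message is None when there is
    nothing to record in the recipe results.
    """
    if not netboot_failed:
        return (False, None)
    if opt_out:
        # The recipe asked to fail rather than work around a netboot problem.
        return (False, "netboot failed; disk fallback disabled for this recipe")
    return (True, "WARNING: system did not netboot; installer was started from disk as a workaround")
```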
Comment 8 Dan Callaghan 2015-09-02 19:33:20 EDT
Thanks for the tip about new-kernel-pkg as well. It looks like that is only RHEL6+ though... but I guess that will be good enough for this purpose.
Comment 9 Dan Callaghan 2015-09-03 00:37:32 EDT
BTW we are not aware of any chronic boot order problems, aside from the AMD Seattle issue which is being worked on.

So we are treating this as just a "nice to have" RFE, not an urgent fix.
Comment 10 Jeff Burke 2015-09-03 07:45:31 EDT
(In reply to Dan Callaghan from comment #9)
> BTW we are not aware of any chronic boot order problems, aside from the AMD
> Seattle issue which is being worked on.
> 
> So we are treating this as just a "nice to have" RFE, not an urgent fix.

Hi Dan,
 We do have several other problems that plague the environment. This happens often with several UEFI systems. We also continue to work with PnT ops on lab infrastructure anomalies, but we still see it happen on other systems as well.

 You are correct that the Seattle issue is being worked on, and I am not sure that this RFE would even fix the current Seattle issue, since those systems are dropping into firmware, not even booting locally. But if they can get to at least boot locally, this would help.

 How do I get this raised from "nice to have" to being included?

Regards,
Jeff
Comment 12 Roman Joost 2015-09-13 20:57:05 EDT
As agreed with Dilip Soman, there is not much we can do from the Beaker side. One thing we can do here is put a more aggressive broken-system detection mechanism in place. Dan filed an RT:

https://engineering.redhat.com/rt/Ticket/Display.html?id=370562

Taking that into account, I think we can lower the priority of this ticket to medium since solving this bug will not provide an immediate relief to the issues regarding the broken firmware of the AArch64 systems. We'll prioritise this bug as a regular RFE for our roadmap.
Comment 13 Jeff Burke 2015-10-26 10:33:40 EDT
Hi Roman,
 I wanted to follow up on this request. Do we have a schedule for if/when this will be included?

 Currently, when a system gets installed, the boot order is changed to boot from disk. If the system then fails to finish the recipe properly, it remains broken until someone manually fixes it. The issue with manually fixing it is that the netboot menu that the user is presented with has no escape: there is no default timeout after which it boots to local disk, nor is there an option to boot from local disk.

 As of right now, if someone reserves a UEFI machine and lets the WD take the system back or cancels the reservesys, the system is in a broken state until someone manually fixes the bootnext entry and sets it to network. Having this feature would help in those cases.
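
 For reference, the manual bootnext fix could be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, and matching on "pxe" or "network" in the entry label is only a guess, since boot entry labels differ between vendors. The --bootnext option of efibootmgr is a real option.

```python
import re

# Matches entry lines such as "Boot0003* UEFI: PXE IPv4 ..." in efibootmgr output.
ENTRY_RE = re.compile(r"^Boot([0-9A-Fa-f]{4})(\*?)\s+(.*)$")


def find_network_entry(efibootmgr_output, keywords=("pxe", "network")):
    """Return the number of the first entry whose label looks like netboot."""
    for line in efibootmgr_output.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match and any(k in match.group(3).lower() for k in keywords):
            return match.group(1)
    return None


def bootnext_command(entry):
    """Return the argv that sets BootNext to the given entry number."""
    return ["efibootmgr", "--bootnext", entry]
```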

Regards,
Jeff
Comment 15 Jeff Burke 2016-03-02 08:16:16 EST
Hi guys, 
 Just checking in on this request. I have a feeling that it is going to become an issue again the closer we get to 7.3.

Regards,
Jeff
Comment 16 PaulB 2016-03-02 14:19:19 EST
All,
fyi...
It's my understanding that this would resolve the issue with 
the aarch64 "seattle" systems losing the boot order if a developer
runs return2beaker. Following a return2beaker on a seattle host the 
boot order is left to boot to hard-drive, rather than pxe.

sample reference RT:
 https://engineering.redhat.com/rt/Ticket/Display.html?id=393721

Best,
-pbunyan
Comment 17 Jeff Burke 2016-03-02 14:31:24 EST
(In reply to PaulB from comment #16)
> All,
> fyi...
> It's my understanding that this would resolve the issue with 
> the aarch64 "seattle" systems losing the boot order if a developer
> runs return2beaker. Following a return2beaker on a seattle host the 
> boot order is left to boot to hard-drive, rather than pxe.
> 
> sample reference RT:
>  https://engineering.redhat.com/rt/Ticket/Display.html?id=393721
> 
> Best,
> -pbunyan

Hi Paul,
 I don't think this solves the problem. This is a "workaround" to help alleviate the problems that come along with a system not PXE/Netbooting properly.

 I had a different opinion than Rich about return2beaker.sh. Can you please set up a reproducer and verify that this is in fact what is happening?

Thank you,
Jeff
Comment 18 PaulB 2016-03-07 09:03:11 EST
(In reply to Jeff Burke from comment #17)
> (In reply to PaulB from comment #16)
> > All,
> > fyi...
> > It's my understanding that this would resolve the issue with 
> > the aarch64 "seattle" systems losing the boot order if a developer
> > runs return2beaker. Following a return2beaker on a seattle host the 
> > boot order is left to boot to hard-drive, rather than pxe.
> > 
> > sample reference RT:
> >  https://engineering.redhat.com/rt/Ticket/Display.html?id=393721
> > 
> > Best,
> > -pbunyan
> 
> Hi Paul,
>  I don't think this solves the problem. This is a "workaround" to help
> alleviate the problems that come along with a system not PXE/Netbooting
> properly.
> 
>  I had a different opinion than Rich about return2beaker.sh. Can
> you please set up a reproducer and verify that this is in fact what is happening?
> 
> Thank you,
> Jeff

Jeff,
I will speak with RichF regarding the intermittent issue with the ARM
system boot order. I won't dirty this BZ any further...

-pbunyan
