Bug 1680659

Summary: Ironic iDrac driver not setting UEFI BIOS Boot Order
Product: Red Hat OpenStack
Reporter: Christopher Brown <chris.brown>
Component: openstack-ironic
Assignee: Chris Dearborn <christopher_dearborn>
Status: CLOSED CURRENTRELEASE
QA Contact: mlammon
Severity: medium
Priority: medium
Version: 13.0 (Queens)
CC: bfournie, chris.brown, christopher_dearborn, ggrimaux, mburns
Keywords: Triaged, ZStream
Hardware: x86_64
OS: Linux
Last Closed: 2019-06-04 13:06:37 UTC
Type: Bug

Description Christopher Brown 2019-02-25 13:26:44 UTC
Description of problem:

With 14th-generation Dell PowerEdge servers using iDRAC 9 out-of-band management, Ironic is unable to set the boot order in UEFI mode, so we are forced to use Legacy BIOS.

Version-Release number of selected component (if applicable):

(undercloud) [stack@undercloud ~]$ rpm -qa | grep -i ironic
openstack-ironic-conductor-10.1.6-4.el7ost.noarch
python2-ironicclient-2.2.1-1.el7ost.noarch
python-ironic-lib-2.12.1-2.el7ost.noarch
openstack-ironic-common-10.1.6-4.el7ost.noarch
openstack-ironic-staging-drivers-0.9.1-1.el7ost.noarch
openstack-ironic-inspector-7.2.1-5.el7ost.noarch
openstack-ironic-api-10.1.6-4.el7ost.noarch
python-ironic-inspector-client-3.1.1-2.el7ost.noarch
puppet-ironic-12.4.0-4.el7ost.noarch
python2-ironic-neutron-agent-1.0.0-1.el7ost.noarch

How reproducible:

Always

Steps to Reproduce:
1. Deploy the undercloud with UEFI options enabled.
2. Attempt to deploy. This works the first time because there is no operating system on the disk. Subsequent attempts fail because the PXE NIC is not set as the first boot device.

The upshot of this is that OSP 13 cannot deploy to the latest Dell hardware with UEFI enabled.

Comment 1 Chris Dearborn 2019-02-27 20:22:40 UTC
Using UEFI requires some setup before deploying OSP. Can you verify that the UEFI configuration is set up correctly? To do this:
1. Log in to the iDRAC GUI.
2. Navigate to Configuration->BIOS Settings->Network Settings.
3. Verify that "PXE Device 1" is set to Enabled and PXE Devices 2-4 are set to Disabled.
4. Under "PXE Device 1 Settings", verify that Interface is set to the correct PXE NIC port.

When the overcloud is deployed, Ironic sets the server to boot once from the configured PXE NIC port, so if things are configured as above, there should be only one NIC port for it to boot from.

Comment 2 Christopher Brown 2019-02-28 07:07:42 UTC
Hi Chris,

(In reply to Chris Dearborn from comment #1)
> Using UEFI requires a little setup prior to deploying OSP.  Can you verify
> that the UEFI configuration is setup correctly?  To do this:
> 1. login to the iDRAC GUI
> 2. navigate to Configuration->BIOS Settings->Network Settings
> 3. Verify that "PXE Device 1" is set to Enabled and PXE Device 2-4 are set
> to Disabled
> 4. Under "PXE Device 1 Settings", verify Interface is set to the correct PXE
> NIC port
> 
> When the overcloud is deployed, ironic sets the server to boot 1 time from
> the configured PXE NIC port, so if things are configured as above, there
> should be only 1 NIC port for it to boot from.

Yes, this was set both manually and through the management software. We also tried:

openstack baremetal node boot device set

This temporarily set the correct boot device; however, the issue seems to be that "something" sets the boot device back to the hard disk. For example, Ironic queues a job to change the boot order. The job runs and the node reboots, but then another job is queued that sets it back to the hard disk. So rather than failing to PXE boot, the node never attempts to PXE boot at all.

Comment 3 Chris Dearborn 2019-02-28 15:55:25 UTC
FYI, when Ironic changes the boot order, it does not change it permanently; instead, it sets the node to PXE boot one time only. As a result, you won't see a change in the persistent boot order.
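To make the one-time versus persistent distinction concrete, here is a toy Python model. It is purely illustrative: `ToyNode`, `set_boot_device` and `boot` are hypothetical names, not Ironic or iDRAC APIs, and it assumes a stored boot order that lists the local disk first. A one-time PXE override is consumed by the next boot, after which the node falls back to the disk, and the stored order never changes; only a persistent change (analogous to the `--persistent` option of `openstack baremetal node boot device set`) rewrites the stored order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical model of a BMC's boot settings (not Ironic or iDRAC code)."""
    # Persistent (stored) boot order, highest priority first; assumed disk-first.
    boot_order: List[str] = field(default_factory=lambda: ["disk", "pxe"])
    # One-time override, consumed by the next boot.
    next_boot: Optional[str] = None

    def set_boot_device(self, device: str, persistent: bool = False) -> None:
        if persistent:
            # Move the device to the front of the stored boot order.
            self.boot_order = [device] + [d for d in self.boot_order if d != device]
            self.next_boot = None
        else:
            # Affects only the very next boot; the stored order is untouched.
            self.next_boot = device

    def boot(self) -> str:
        """Return the device used for this boot, consuming any one-time override."""
        if self.next_boot is not None:
            device, self.next_boot = self.next_boot, None
            return device
        return self.boot_order[0]


node = ToyNode()
node.set_boot_device("pxe")          # one-time override
print(node.boot(), node.boot_order)  # pxe ['disk', 'pxe']
print(node.boot())                   # disk
```

In this model, inspecting `boot_order` after a one-time override shows no change, which mirrors why the persistent boot order in the iDRAC GUI looks untouched.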

For clarity, is the issue that you're seeing that the overcloud nodes are trying to boot from the local disk when they should be PXE booting?

Also for clarity, you are NOT seeing an issue where it is trying to PXE boot from the wrong NIC port?

Comment 4 Christopher Brown 2019-02-28 19:27:26 UTC
Hi Chris,

(In reply to Chris Dearborn from comment #3)
> FYI, when ironic changes the boot order, it does not change it permanently,
> but instead changes it to pxe boot 1 time only.  As a result, you won't see
> the permanent boot order change.

Yep, got that. The --persistent option is misleading, but that's another matter.

> For clarity, is the issue that you're seeing that the overcloud nodes are
> trying to boot from the local disk when they should be PXE booting?

Yes.

> Also for clarity, you are NOT seeing an issue where it is trying to PXE boot
> from the wrong NIC port?

Correct. PXE boot is never attempted.

As soon as we switch to Legacy BIOS, everything works.

We tested the redfish driver; however, iDRAC 9 does not implement ForceRestart from the Redfish specification, so that driver is unusable with hardware running iDRAC 9. Separate issue, though!

Comment 5 Chris Dearborn 2019-03-06 16:46:19 UTC
Hey Chris,

Can you check to see what BIOS version you are running on your overcloud nodes?

In my testing, it appears that a regression was introduced in the BIOS firmware starting with the 1.6.11 release.  I've tested with the 1.4.9 and 1.5.6 releases, and both of those seem to work fine, so if you want a quick workaround, I would recommend falling back to the 1.5.6 release.

I am continuing to investigate possible workarounds and am also following up with folks on the firmware side.


Thanks,

Chris

Comment 6 Christopher Brown 2019-03-06 17:45:55 UTC
Hi Chris,

(In reply to Chris Dearborn from comment #5)
> Hey Chris,
> 
> Can you check to see what BIOS version you are running on your overcloud
> nodes?

1.6.12 

> 
> In my testing, it appears that a regression was introduced in the BIOS
> firmware starting with the 1.6.11 release.  I've tested with the 1.4.9 and
> 1.5.6 releases, and both of those seem to work fine, so if you want a quick
> workaround, I would recommend falling back to the 1.5.6 release.

Thanks. For the moment, we have had to work around this by using Legacy BIOS.
 
> I am continuing to investigate possible workarounds and am also following up
> with folks on the firmware side.

Please do keep me updated. We have more deployment windows coming up in the next month, so we could potentially test any fixes then.

Comment 7 Bob Fournier 2019-03-13 12:17:35 UTC
*** Bug 1680927 has been marked as a duplicate of this bug. ***

Comment 8 Chris Dearborn 2019-03-15 21:06:28 UTC
Hey folks,

This issue has been fixed in the new BIOS version, which is due to ship around mid-April.

Comment 9 Christopher Brown 2019-03-28 09:48:30 UTC
Hi Chris,

(In reply to Chris Dearborn from comment #8)
> Hey folks,
> 
> this issue has been fixed in the new BIOS version that is due to be shipped
> around mid-April.

Is the fix in 3.30.30.30, or are we still waiting on this release?

Thanks

Comment 10 Chris Dearborn 2019-03-28 15:27:25 UTC
Hey Chris,

No, the fix isn't in 3.30.30.30.  That's the latest version number for the Lifecycle Controller firmware, but the issue is actually in the BIOS firmware.  The current release of the BIOS firmware for 14G is 1.6.13, and the bug is still present in that version.  I can't supply the exact version number that the fix will be in because it is still changing internally as testing continues, but it will be in the BIOS firmware release that immediately follows 1.6.13, and it should ship in mid-April.

Comment 11 Chris Dearborn 2019-05-20 14:22:16 UTC
This issue should be fixed in BIOS version 2.1.8.

Comment 12 Bob Fournier 2019-06-04 12:56:49 UTC
Hi Chris Brown, I see the case is closed, and Chris D. has indicated the BIOS version that contains the fix. Can we close this bug, or will you be able to test the updated BIOS?

Comment 13 Christopher Brown 2019-06-04 13:05:09 UTC
Hi Bob,

(In reply to Bob Fournier from comment #12)
> Hi Chris Brown - I see the case is closed and Chris D. has indicated the
> BIOS version it is fixed in.  Can we close this bug or will you be able to
> test the updated BIOS?

Thanks for following up. We are unable to test the updated BIOS, but please feel free to close this bug, as it is not an issue with a Red Hat product.

An updated iDRAC firmware release shipped that enabled us to use the redfish driver, which in turn allowed us to switch to UEFI.

Comment 14 Bob Fournier 2019-06-04 13:06:37 UTC
Thanks Chris.