Bug 1888143 - [Ironic][Grub]HW deployment fails for servers with QLogic FastLinQ PLAN EP QL41212 NICs
Summary: [Ironic][Grub]HW deployment fails for servers with QLogic FastLinQ PLAN EP QL...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-python-agent
Version: 16.1 (Train)
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Julia Kreger
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1970422
Reported: 2020-10-14 07:54 UTC by Alex Stupnikov
Modified: 2021-12-21 08:54 UTC (History)

Fixed In Version: openstack-ironic-python-agent-5.0.5-2.20210611024818.el8ost.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1970422 (view as bug list)
Environment:
Last Closed: 2021-09-15 07:09:53 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack Storyboard | 2008386 | 0 | None | None | None | 2020-11-24 16:35:18 UTC
OpenStack gerrit | 764016 | 0 | None | NEW | Option to enable bootloader config failure bypass | 2020-12-03 09:31:21 UTC
OpenStack gerrit | 781864 | 0 | None | NEW | Option to enable bootloader config failure bypass | 2021-03-19 19:20:11 UTC
Red Hat Issue Tracker | OSP-187 | 0 | None | None | None | 2021-12-21 08:54:37 UTC
Red Hat Product Errata | RHEA-2021:3483 | 0 | None | None | None | 2021-09-15 07:10:12 UTC

Description Alex Stupnikov 2020-10-14 07:54:54 UTC
Description of problem:

The customer is trying to deploy an overcloud on nodes with Mellanox and QLogic NICs. The "Mellanox" nodes deploy without issues, but deployment on the "QLogic" nodes fails with the following error:

Oct 06 05:48:48 HOSTNAME ironic-python-agent[3322]: 2020-10-06 05:48:47.822 3322 ERROR ironic_python_agent.extensions.image [-] Installing GRUB2 boot loader to device /dev/sda failed with Unexpected error while running command.
                                                              Command: chroot /tmp/tmp9svts23f /bin/sh -c "grub2-install /dev/sda"
                                                              Exit code: 1
                                                              Stdout: 'Skipping unreadable variable "Boot0000": Input/output error\n'
                                                              Stderr: 'Installing for x86_64-efi platform.\nCould not prepare Boot variable: Input/output error\ngrub2-install: error: efibootmgr failed to register the boot entry: Input/output error.\n'.: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.


The customer fixed the problem by patching the Ironic Python Agent ramdisk (this is a lab, not production) with the following commands:

mkdir tempipa
cd tempipa
zcat ../ironic-python-agent.initramfs | cpio -i
sed -i 's/"%(bin)s-install %(dev)s"/"%(bin)s-install --no-nvram %(dev)s"/g' usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py
find . | cpio -oc | gzip > ../ironic-python-agent.initramfs
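
The sed expression above rewrites the command template that the agent formats before running the GRUB installer. A minimal Python sketch of the effect, assuming the template is expanded with a dictionary of `bin` and `dev` values, as the quoted pattern and the log line suggest (the helper name `render_install_command` is hypothetical and is not part of ironic-python-agent):

```python
# Illustrative only: mirrors the string substitution performed by the sed
# command above; this is not the actual ironic-python-agent source.
ORIGINAL_TEMPLATE = "%(bin)s-install %(dev)s"
PATCHED_TEMPLATE = "%(bin)s-install --no-nvram %(dev)s"


def render_install_command(template, binary_name, device):
    """Expand a printf-style template into the shell command string.

    Hypothetical helper; the names are assumptions made for this sketch.
    """
    return template % {"bin": binary_name, "dev": device}


if __name__ == "__main__":
    # Values taken from the log above: grub2 and /dev/sda.
    print(render_install_command(ORIGINAL_TEMPLATE, "grub2", "/dev/sda"))
    print(render_install_command(PATCHED_TEMPLATE, "grub2", "/dev/sda"))
```

With `--no-nvram`, grub2-install still installs the boot loader files but does not try to update the firmware boot variables, which is the step that fails in the log.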


So this looks like a possible GRUB bug to me, but I am also wondering whether there is something to improve in Ironic. If it is a GRUB bug, could you help me report it, or do I need to work with the appropriate support group? The customer provided an sosreport from an affected "QLogic" node.


Version-Release number of selected component (if applicable):
New RHOSP 16.1 deployment

Comment 9 Julia Kreger 2020-11-24 16:35:19 UTC
I discussed this with the upstream community, and the consensus seems to be that the card is trying to be helpful: it populates UEFI NVRAM table entries for each device offered by the adapter, so that booting from a device mapped through the card is an immediate option for diskless systems. In this case, doing so ultimately overflows the table. While the behavior is well-intentioned, it seems we need to simply ignore the failure. Except we can't do that across the board. So the consensus seems to be to allow an operator override, which could be set per node, to ignore bootloader installation/setup failures, as this impacts all operations with the UEFI NVRAM table. Since this is really a feature, stable policy basically prohibits us from backporting it upstream. That being said, I suspect it is safe to assume this will have landed in time for OSP17.

Please let me know if there are any questions or concerns. For now moving to ON_DEV as this change is in upstream review.
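
The override described in this comment could look roughly like the following sketch. It is not the upstream patch: the function name `install_bootloader`, the `ignore_failure` flag, and the injected `run_install` callable are all hypothetical, chosen only to illustrate tolerating a bootloader setup failure for a node whose operator has opted in.

```python
import logging
import subprocess

LOG = logging.getLogger(__name__)


def install_bootloader(run_install, ignore_failure=False):
    """Run the bootloader install step, optionally tolerating failure.

    run_install: hypothetical callable that performs the install and raises
        subprocess.CalledProcessError on a non-zero exit code.
    ignore_failure: hypothetical per-node operator override.
    Returns True if the install succeeded, False if a failure was ignored.
    """
    try:
        run_install()
    except subprocess.CalledProcessError as exc:
        if not ignore_failure:
            # Default behaviour: the deployment fails, as in the report.
            raise
        LOG.warning("Ignoring bootloader install failure (exit code %s)",
                    exc.returncode)
        return False
    return True


def failing_install():
    # Stand-in for grub2-install exiting with code 1, as in the log.
    raise subprocess.CalledProcessError(1, ["grub2-install", "/dev/sda"])


if __name__ == "__main__":
    print(install_bootloader(failing_install, ignore_failure=True))
```

Keeping the default at `ignore_failure=False` reflects the point that the failure cannot be ignored across the board; only nodes explicitly flagged by the operator would continue past it.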

Comment 10 Alex Stupnikov 2020-11-28 14:17:03 UTC
Julia, thank you very much for all the support, I really appreciate it.

As you said, it looks like it will not be possible to backport the change to stable OpenStack branches, so I have a final question: what are our options for the affected customer? Can we cherry-pick the patch to a downstream branch or provide some kind of supported workaround? Or should we redirect his request and tell him to ask the hardware vendor for help?

Comment 11 Julia Kreger 2020-11-30 19:51:46 UTC
It would help if they were to pull in their hardware vendor. Maybe there is a setting in the card firmware to disable the behavior. We may eventually be able to backport the change as a supported workaround for the hardware's behavior. That said, this still requires the patch to merge upstream first, and I think we'll need QE to sign off, because there may not be an "easy" path for them to test the workaround.

Comment 12 Alex Stupnikov 2020-12-02 15:30:26 UTC
Thank you, Julia. I will share your findings with the customer and recommend that he ask his HW vendor to take a second look. I appreciate your guidance!

Comment 25 errata-xmlrpc 2021-09-15 07:09:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

