Description of problem:

A customer is trying to deploy an overcloud on nodes with Mellanox and QLogic NICs. Deployment succeeds on the "Mellanox" nodes, but fails on the "QLogic" nodes with the following error:

Oct 06 05:48:48 HOSTNAME ironic-python-agent[3322]: 2020-10-06 05:48:47.822 3322 ERROR ironic_python_agent.extensions.image [-] Installing GRUB2 boot loader to device /dev/sda failed with Unexpected error while running command.
Command: chroot /tmp/tmp9svts23f /bin/sh -c "grub2-install /dev/sda"
Exit code: 1
Stdout: 'Skipping unreadable variable "Boot0000": Input/output error\n'
Stderr: 'Installing for x86_64-efi platform.\nCould not prepare Boot variable: Input/output error\ngrub2-install: error: efibootmgr failed to register the boot entry: Input/output error.\n'.: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

The customer worked around the problem (this is a lab, not production) by patching the Ironic Python Agent image with the following commands:

  mkdir tempipa
  cd tempipa
  zcat ../ironic-python-agent.initramfs | cpio -i
  sed -i 's/"%(bin)s-install %(dev)s"/"%(bin)s-install --no-nvram %(dev)s"/g' usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py
  find . | cpio -oc | gzip > ../ironic-python-agent.initramfs

This looks like a possible GRUB bug to me, but I am also wondering whether there is something to improve inside Ironic. If it is a GRUB bug, could you help me report it, or do I need to work with the appropriate support group?

The customer provided an sosreport from an affected "QLogic" node.

Version-Release number of selected component (if applicable):
New RHOSP 16.1 deployment
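The customer's workaround above can be sketched as a small script. This is only a hedged reconstruction of the same zcat/cpio/sed/cpio/gzip sequence, not a supported tool; the function names are mine, and the Python 3.6 site-packages path is taken from the report and may differ between IPA builds.

```shell
#!/bin/sh
# Sketch of the customer's lab workaround: unpack the IPA ramdisk,
# add --no-nvram to the grub2-install invocation so GRUB skips
# registering a UEFI NVRAM boot entry, then repack the image.
set -eu

patch_grub_cmd() {
    # Rewrite IPA's grub2-install command template (stdin -> stdout),
    # mirroring the sed expression from the report.
    sed 's/"%(bin)s-install %(dev)s"/"%(bin)s-install --no-nvram %(dev)s"/g'
}

repack_ipa() {
    # $1: path to ironic-python-agent.initramfs; writes $1.patched
    # (the report's commands overwrite the image in place instead).
    img=$(readlink -f "$1")
    tmp=$(mktemp -d)
    ( cd "$tmp" && zcat "$img" | cpio -id --quiet )
    py="$tmp/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py"
    patch_grub_cmd < "$py" > "$py.new" && mv "$py.new" "$py"
    ( cd "$tmp" && find . | cpio -oc --quiet | gzip ) > "$img.patched"
    rm -rf "$tmp"
}
```

Usage would be `repack_ipa ironic-python-agent.initramfs`, after which the patched image can be put in place of the one served to the affected nodes.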
I discussed this with the upstream community, and the consensus is that the card is attempting to be helpful: it injects a UEFI NVRAM boot entry for each device the adapter offers, so that booting from a device mapped through the card is an immediate option for diskless systems, and in this case it ultimately overflows the table. While helpful, we seem to need to simply ignore that a failure occurred. Except we can't do that across the board, so the consensus is to add an operator override setting, configurable per node, to ignore bootloader installation/setup failures, since this behavior impacts all operations that touch the UEFI NVRAM table.

Since this is really a feature, stable policy basically prohibits us from backporting it upstream. That being said, I suspect it is safe to assume it will have landed in time for OSP17.

Please let me know if there are any questions or concerns. For now, moving to ON_DEV as this change is in upstream review.
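To make the shape of the proposed per-node override concrete, such a setting could be applied with something like the following. This is purely hypothetical: the property name `bootloader_ignore_failure` is an illustration of the idea, not the name the upstream patch actually uses.

```shell
# Hypothetical illustration only -- the driver_info key below is an
# assumption for the sake of the example, not the actual setting
# introduced by the upstream change:
openstack baremetal node set "$NODE_UUID" \
  --driver-info bootloader_ignore_failure=true
```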
Julia, thank you very much for all the support; I really appreciate it. As you said, it looks like it will not be possible to backport the change to the stable OpenStack branches, so I have a final question: what are our options for the affected customer? Can we cherry-pick the patch to a downstream branch or provide some kind of supported workaround? Or should we redirect his request and tell him to ask his hardware vendor for help?
It would help if they pulled in their hardware vendor; maybe there is a setting in the card firmware to disable this behavior. We may eventually be able to backport the change as a supported workaround for the hardware's behavior. That being said, it still requires the patch to merge upstream first, and I think we will need QE to sign off, because there may not be an "easy" path for them to test the workaround.
Thank you Julia, I will share your findings with the customer and recommend that he ask his HW vendor to take a second look. I appreciate your guidance!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2021:3483