Bug 1597210 - Rebase ipxe to latest upstream
Summary: Rebase ipxe to latest upstream
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ipxe
Version: 7.5
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 7.7
Assignee: Neil Horman
QA Contact: Erico Nunes
URL:
Whiteboard:
Duplicates: 1395512 1478203 1638280 1679416
Depends On:
Blocks: 1614004 1634838 1649833 1652518 1683386
 
Reported: 2018-07-02 09:09 UTC by Adam Huffman
Modified: 2022-03-13 15:11 UTC (History)

Fixed In Version: ipxe-20180825-1.git133f4c.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1634838 1652518 1683386
Environment:
Last Closed: 2019-08-06 12:40:03 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2019:2059 | 0 | None | None | None | 2019-08-06 12:40:12 UTC

Description Adam Huffman 2018-07-02 09:09:14 UTC
Description of problem:
The current version fails to boot on HPE Gen10 servers.

Version-Release number of selected component (if applicable):
ipxe-bootimgs-20170123-1.git4e85b27.el7_4.1.noarch

How reproducible:
Every time

Steps to Reproduce:
1. Install TripleO with iPXE
2. Attempt to introspect a node
3.

Actual results:
Could not open net0: Not enough space (http://ipxe.org/312c4089)
Could not open net1: Not enough space (http://ipxe.org/312c4089)
Could not open net2: Not enough space (http://ipxe.org/312c4089)
Could not open net3: Not enough space (http://ipxe.org/312c4089)
nodnic_port_allocate_dbr_dma: doorbell record dma allocation error ( Status = -1)

nodnic_port_allocate_ring_db_dma: failed to allocate doorbell record dma ( Status = -1)
nodnic_port_rx_pi_dma_alloc: rx doorbell dma allocation error (Status = -1)
Nodnic_port_create_qp: receive db dma error (Status = -1)
Could not open net4: Error 0x00000001 (http:ipxe.org/00000001)
Could not open net5: Not enough space (http://ipxe.org/31136089)
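
For triage, the failures above can be grouped by interface and iPXE error code. The following is a minimal Python sketch and not part of the original report: the function name `parse_open_failures`, the `SAMPLE` constant, and the regular expression are assumptions made for illustration. The pattern tolerates the missing slashes in the net4 URL as it was pasted.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   Could not open net0: Not enough space (http://ipxe.org/312c4089)
# The "//" is optional because one pasted line reads "http:ipxe.org/...".
OPEN_FAILURE = re.compile(
    r"Could not open (?P<iface>net\d+): (?P<reason>.+?) "
    r"\(http:(?://)?ipxe\.org/(?P<code>[0-9a-fA-F]{8})\)"
)

# The "Could not open" lines copied from the report above.
SAMPLE = """\
Could not open net0: Not enough space (http://ipxe.org/312c4089)
Could not open net1: Not enough space (http://ipxe.org/312c4089)
Could not open net2: Not enough space (http://ipxe.org/312c4089)
Could not open net3: Not enough space (http://ipxe.org/312c4089)
Could not open net4: Error 0x00000001 (http:ipxe.org/00000001)
Could not open net5: Not enough space (http://ipxe.org/31136089)
"""


def parse_open_failures(log_text):
    """Return {error_code: [(interface, reason), ...]} for each failed open."""
    failures = defaultdict(list)
    for line in log_text.splitlines():
        match = OPEN_FAILURE.search(line)
        if match:
            code = match["code"].lower()
            failures[code].append((match["iface"], match["reason"]))
    return dict(failures)


if __name__ == "__main__":
    for code, entries in sorted(parse_open_failures(SAMPLE).items()):
        interfaces = ", ".join(iface for iface, _ in entries)
        print(f"{code}: {interfaces}")
```

Grouping this way makes it easier to see that net0 through net3 share one error code, while net4 and net5 each report a different one.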

Expected results:
iPXE loads the introspection images and also the deployment images and the node boots

Additional info:
When I built manually from upstream Git, iPXE booted normally and was able to load images on the node.

Comment 2 Neil Horman 2018-07-02 11:08:52 UTC
Is there a specific upstream commit that you believe fixes this error?  It's far too late in the 7.6 development cycle to consider a complete rebase.  If you can point out a specific commit, we can look at the possibility of backporting it.  Alternatively, if you can show that using the latest upstream ipxe image results in an operational system, we can look at a wholesale update for 7.7.

Comment 3 Adam Huffman 2018-07-02 16:30:33 UTC
I can't give you a specific commit at the moment, I'm afraid.
When I contacted the iPXE developers, they advised using the latest available, so that's what I did, and I haven't yet had time to go back and narrow it down.

I can confirm that using the latest upstream iPXE image gives an operational system.

Comment 4 Neil Horman 2018-07-02 17:47:12 UTC
Ok, well, that's a step in the right direction.  On a positive note, there have been a very limited number of commits to the mellanox infiniband driver (I believe you are using mlx5_nodnic, correct?).  In fact there may be only one, so let me try backporting that commit to see if it works for you.  If so, this may still need to wait for 7.7, but at least we will know what we are working with.

Comment 5 Neil Horman 2018-07-02 18:07:23 UTC
For the record, 1ff1eebcf7a93a237a1b91ea5d9dcc5b5f1a13bf is the upstream mlx5 commit.  I've got a build going with it here:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16968929


And I've uploaded it here:
http://people.redhat.com/nhorman/rpms/ipxe.tbz2

Could you please try that rpm out and let me know if it fixes your allocation issue?

Comment 6 Adam Huffman 2018-07-03 12:28:10 UTC
There is a Mellanox card in these nodes. However, it's not being used for PXE.
For that we're just using the onboard 1Gb NICs.

I've just tried with that build and while it does work for the first part of boot, it very soon crashes.

Thanks a lot for taking a look at this, and I quite understand about the need to wait. Just wanted to make sure this was captured somewhere, given all the frustration it caused me...

Comment 7 Neil Horman 2018-07-04 10:08:38 UTC
Well, wait a moment.  The error that you are reporting is definitely originating from the mellanox card in its attempt to initialize (the notes on the associated url are out of date).

Can you provide the details of the crash that results (the backtrace, ideally)?  We need to either figure out what's going on here, or disable the mellanox driver.

Comment 8 Adam Huffman 2018-07-04 13:46:56 UTC
If you have instructions for grabbing the backtrace, I'd be happy to do so.

As it is, it's printed in red text that keeps flicking between different screens, so almost impossible to capture via remote console.

Comment 9 Neil Horman 2018-07-05 17:18:43 UTC
The easiest way is via a serial console (either a virtual one or a physical device, possibly connected via a usb dongle).  Once you connect via a serial port, you can use whatever terminal emulator you choose to record the session.

Comment 10 Adam Huffman 2018-07-06 06:30:29 UTC
Obviously, yes - clearly not thinking straight earlier.

Hopefully over a serial console it will be shown normally.

Will try and do that today.

Comment 11 Neil Horman 2018-07-06 15:47:52 UTC
thanks, let me know.

Comment 12 Adam Huffman 2018-07-08 22:01:28 UTC
Here you go, I think this is everything:

X64 Exception Type 0C - Stack Fault Exception

RCX=0000000000000001 DX=0000000000000001 R8=000000009D4D3D6C R9=000000009533FF98
RSP=000000009D4D3D20 BP=6500720061007700 AX=65007200610076FF BX=0000000000000000
R10=000000009D4F3B60 11=4D1AB44318A031AB 12=FFFFFFFFFFFFFFFF 13=000000009D4F27E0
R14=000000009908F290 15=0000000095387898 SI=0000000000000000 DI=6500720061007700
CR2=0000000000000000 CR3=000000009D434000 CR0=80000013 CR4=00000668 CR8=00000000
CS=00000038 DS=00000030 SS=00000030 ES=00000030 RFLAGS=00010203
MSR: 0x1D9 = 00004801, 0x345=000033C5, 0x1C9=0000001A

LBRs From              To                From              To
01h  0000000098FE945A->0000000096AE3174  0000000098FE93FC->0000000098FE944C
03h  0000000098FEF10C->0000000098FE93DC  0000000098FEF1F6->0000000098FEF0FA
05h  000000009903C9F8->0000000098FEF1F3  0000000098FFFA61->000000009903C9F0
07h  0000000098FFFA2F->0000000098FFFA57  0000000098FFF9F6->0000000098FFFA23
09h  0000000098FFFA6E->0000000098FFF9E6  000000009903C9EB->0000000098FFFA6C
0Bh  0000000099001EAB->000000009903C9E8  0000000099040A95->0000000099001EA8
0Dh  0000000099040A89->0000000099040A90  0000000099001EA3->0000000099040A86
0Fh  0000000099001EF6->0000000099001EA0  0000000099001E92->0000000099001EF4

CALL ImageBase        ImageName+Offset
00h  0000000098FCC000 ipxe.efi+01D45Ah

CALL ImageBase        ImageName+Offset
 
STACK   00h      04h      08h      0Ch      10h      14h      18h      1Ch
RSP+00h 94298D20 9570FCD8 9570FC98 9908F2A8 9908F1E8 9908F290 9908F2A8 9908F1E8
RSP+20h 00000000 9533FF98 990779C0 00000001 00000000 00000000 00000000 00000000
RSP+40h 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000001
RSP+60h 00000008 00000001 00000000 00000000 00000000 9D4D4280 9D4F27E0 99098550
RSP+80h 00000000 95387898 00000000 99039F9E 99078E30 99098550 9D4D3F90 94286018
RSP+A0h 000001E8 99078DA0 000000AA 99079DD0 00000001 FFFFFFE0 00000000 00000000
RSP+C0h 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
RSP+E0h 00000000 9D50C380 9D50C380 00000008 FFFFFFE0 00000000 9D4D41F8 00000004

Comment 13 Neil Horman 2018-07-09 13:26:42 UTC
hmm, there wasn't a stack trace leading up to that?

Comment 14 Adam Huffman 2018-07-09 16:24:15 UTC
It kept cycling through what looked like the same text, so there may have been a slightly different one.

If I have time I'll trigger it again.

Comment 15 Adam Huffman 2018-07-09 16:48:48 UTC
Just grabbed it again:

X64 Exception Type 0C - Stack Fault Exception

RCX=0000000000000001 DX=0000000000000001 R8=000000009D4D3D6C R9=0000000095342C18
RSP=000000009D4D3D20 BP=6500720061007700 AX=65007200610076FF BX=0000000000000000
R10=000000009D4F3B60 11=4D1AB44318A031AB 12=FFFFFFFFFFFFFFFF 13=000000009D4F27E0
R14=000000009908F290 15=0000000095384718 SI=0000000000000000 DI=6500720061007700
CR2=0000000000000000 CR3=000000009D434000 CR0=80000013 CR4=00000668 CR8=00000000
CS=00000038 DS=00000030 SS=00000030 ES=00000030 RFLAGS=00010203
MSR: 0x1D9 = 00004801, 0x345=000033C5, 0x1C9=00000002

LBRs From              To                From              To
01h  0000000098FE945A->0000000096AE3174  0000000098FE93FC->0000000098FE944C
03h  0000000099040D7B->000000009903C9E0  0000000099040ABF->0000000099040D75
05h  0000000099040AAE->0000000099040ABA  0000000099040D70->0000000099040AAB
07h  000000009903C9DB->0000000099040D4F  0000000098FEF1EE->000000009903C9D4
09h  0000000098FEF1C2->0000000098FEF1DA  0000000098FEF19B->0000000098FEF1B9
0Bh  0000000098FEF181->0000000098FEF192  000000009900B2AA->0000000098FEF13C
0Dh  0000000098FE4FA5->000000009900B29F  000000009903A490->0000000098FE4F91
0Fh  0000000099039EFA->000000009903A474  0000000096AE317F->000000009A06DCA0

CALL ImageBase        ImageName+Offset
00h  0000000098FCC000 ipxe.efi+01D45Ah

CALL ImageBase        ImageName+Offset

STACK   00h      04h      08h      0Ch      10h      14h      18h      1Ch
RSP+00h 9428ED20 9570BE58 9570BE18 9908F2A8 9908F1E8 9908F290 9908F2A8 9908F1E8
RSP+20h 00000000 95342C18 990779C0 00000001 00000000 00000000 00000000 00000000
RSP+40h 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000001
RSP+60h 00000008 00000001 00000000 00000000 00000000 9D4D4280 9D4F27E0 99098550
RSP+80h 00000000 95384718 00000000 99039F9E 99078E30 99098550 9D4D3F90 94283018
RSP+A0h 000002C6 99078DA0 0000009D 99079DD0 00000001 FFFFFFE0 00000000 00000000
RSP+C0h 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
RSP+E0h 00000000 9D50C380 9D50C380 00000008 FFFFFFE0 00000000 9D4D41F8 00000004


X64 Exception Type 0C - Stack Fault Exception

RCX=0000000000000001 DX=0000000000000001 R8=000000009D4D3D6C R9=0000000095342C18
RSP=000000009D4D3D20 BP=6500720061007700 AX=65007200610076FF BX=0000000000000000
R10=000000009D4F3B60 11=4D1AB44318A031AB 12=FFFFFFFFFFFFFFFF 13=000000009D4F27E0
R14=000000009908F290 15=0000000095384718 SI=0000000000000000 DI=6500720061007700
CR2=0000000000000000 CR3=000000009D434000 CR0=80000013 CR4=00000668 CR8=00000000
CS=00000038 DS=00000030 SS=00000030 ES=00000030 RFLAGS=00010203
MSR: 0x1D9 = 00004801, 0x345=000033C5, 0x1C9=00000002

LBRs From              To                From              To
01h  0000000098FE945A->0000000096AE3174  0000000098FE93FC->0000000098FE944C
03h  0000000099040D7B->000000009903C9E0  0000000099040ABF->0000000099040D75
05h  0000000099040AAE->0000000099040ABA  0000000099040D70->0000000099040AAB
07h  000000009903C9DB->0000000099040D4F  0000000098FEF1EE->000000009903C9D4
09h  0000000098FEF1C2->0000000098FEF1DA  0000000098FEF19B->0000000098FEF1B9
0Bh  0000000098FEF181->0000000098FEF192  000000009900B2AA->0000000098FEF13C
0Dh  0000000098FE4FA5->000000009900B29F  000000009903A490->0000000098FE4F91
0Fh  0000000099039EFA->000000009903A474  0000000096AE317F->000000009A06DCA0

CALL ImageBase        ImageName+Offset
00h  0000000098FCC000 ipxe.efi+01D45Ah

Comment 16 Neil Horman 2018-07-11 10:45:05 UTC
Hmm, unfortunately that doesn't help much.  All it really tells me is that somehow our stack got corrupted.  Can you tell me anything more about how far into the pxe boot it got prior to crashing?  Perhaps we can try some guess and check here to see if we can identify the bad module, and edit it out of the build.
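
One thing the dump does allow is a consistency check: adding the offset in the CALL row to the image base should give an address inside ipxe.efi, and that address can be compared with the branch records. The sketch below is illustrative only and was not posted in the thread. The function names are hypothetical, and the two input rows are copied from the dump in comment 12.

```python
import re

# Rows copied from the dump in comment 12.
CALL_LINE = "00h  0000000098FCC000 ipxe.efi+01D45Ah"
LBR_LINE = (
    "01h  0000000098FE945A->0000000096AE3174  "
    "0000000098FE93FC->0000000098FE944C"
)


def call_entry_address(call_line):
    """Return (image_name, absolute_address) for one CALL table row."""
    match = re.match(
        r"\s*[0-9A-F]+h\s+([0-9A-F]{16})\s+(\S+)\+([0-9A-F]+)h", call_line
    )
    if match is None:
        raise ValueError(f"unrecognised CALL row: {call_line!r}")
    base, name, offset = match.groups()
    return name, int(base, 16) + int(offset, 16)


def lbr_branches(lbr_line):
    """Return the (from, to) address pairs found in one LBR row."""
    pairs = re.findall(r"([0-9A-F]{16})->([0-9A-F]{16})", lbr_line)
    return [(int(src, 16), int(dst, 16)) for src, dst in pairs]


name, address = call_entry_address(CALL_LINE)
first_from, first_to = lbr_branches(LBR_LINE)[0]
print(f"{name} address: {address:016X}")
print("matches first LBR source:", address == first_from)
```

With these inputs the computed address is 0000000098FE945A, which equals the source of the first branch record. That only locates the last recorded branch within ipxe.efi; mapping the offset to a function would need the symbols for that exact build.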

Comment 17 Adam Huffman 2018-07-11 10:50:04 UTC
It had got as far as requesting boot images from the iPXE server and was loading those.
Hard to say whether it had finished loading the larger image.

Comment 18 Neil Horman 2018-07-11 13:54:39 UTC
what nic are you using for the pxe request?  make and model?

Comment 19 Adam Huffman 2018-07-12 17:23:09 UTC
02:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

Comment 20 Neil Horman 2018-07-13 18:11:21 UTC
Ok, so nothing exotic.  Let me try taking the mellanox driver out of the build....

Comment 21 Neil Horman 2018-07-16 11:47:34 UTC
http://people.redhat.com/nhorman/rpms/ipxe.tar

Please give that build a try, if you would.
Thanks!

Comment 22 Adam Huffman 2018-07-26 14:22:12 UTC
Just checking - that appears to have an older NVR than the currently installed package.

Here's what I currently have:

ipxe-bootimgs-20170123-3.git4e85b27.el7.noarch

while what's in the tarball is:

ipxe-bootimgs-20170123-2.1.git4e85b27.el7.noarch.rpm

Is that what you intended?

Comment 23 Neil Horman 2018-07-26 18:04:26 UTC
yeah, sorry about that.  I bumped the version number when I gave you the first build, but forgot to do so with this one.  Just uninstall the current package and install the 2.1 package; it will be fine for testing.
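
The ordering noticed in comment 22 follows from how release strings are compared segment by segment. Below is a minimal Python sketch of that idea. It is a simplified reimplementation for illustration, not the rpm library's own code: it ignores epochs and the special '~' and '^' characters, and the names `split_segments` and `compare_labels` are hypothetical.

```python
import re


def split_segments(label):
    """Split a version or release string into digit runs and letter runs."""
    return re.findall(r"[0-9]+|[A-Za-z]+", label)


def compare_labels(a, b):
    """Return 1 if a is newer, -1 if b is newer, 0 if they compare equal.

    Simplified segment-wise comparison (no '~' or '^' handling).
    """
    seg_a, seg_b = split_segments(a), split_segments(b)
    for x, y in zip(seg_a, seg_b):
        if x.isdigit() and y.isdigit():
            x_key, y_key = int(x), int(y)  # numeric segments compare as integers
        elif x.isdigit() != y.isdigit():
            # A numeric segment is treated as newer than an alphabetic one.
            return 1 if x.isdigit() else -1
        else:
            x_key, y_key = x, y  # alphabetic segments compare as strings
        if x_key != y_key:
            return 1 if x_key > y_key else -1
    # All shared segments are equal: the label with segments left over is newer.
    return (len(seg_a) > len(seg_b)) - (len(seg_a) < len(seg_b))


# Release fields from comment 22; the version field (20170123) is identical.
installed_release = "3.git4e85b27.el7"
test_release = "2.1.git4e85b27.el7"
print(compare_labels(installed_release, test_release))
```

Because the first segment of the installed release (3) is greater than the first segment of the test build's release (2), the installed package sorts as newer, which is why the test package has to be installed by removing the current one first rather than as an upgrade.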

Comment 24 Adam Huffman 2018-07-27 14:59:57 UTC
In the end I manually extracted ipxe.efi from the RPMs, and in fact booting did complete with this version.

Comment 25 Neil Horman 2018-07-27 17:02:22 UTC
ok, so something about the mellanox driver is causing the system to crash.  I'll try another build, re-adding it and limiting its functionality, to see if we can creep up on the problem.

Comment 26 Neil Horman 2018-07-27 17:59:13 UTC
hey, out of curiosity, how much ram does your system have?

Comment 27 Adam Huffman 2018-07-30 14:43:34 UTC
Some of them have 192GB, the others 384GB.

Comment 30 Neil Horman 2018-07-31 17:01:25 UTC
hey, so I'm really having trouble getting any visibility on this issue beyond determining that we are running out of memory on the system (which I know seems odd for a system as large as this, but pxe environments are limited to a small area).  As such, I'm taking another pass at just upgrading the infiniband stack, which I may be able to get accepted for 7.6 (no guarantee, but this is probably the fastest path to a fix that we can put into 7.6 at this stage).  I've uploaded a set of ipxe roms here:
http://people.redhat.com/nhorman/rpms/ipxe.tbz2

which has only the infiniband stack updated to the latest upstream.  Could you please give it a shot and let me know how well that works for you?

Comment 31 Adam Huffman 2018-08-03 17:46:24 UTC
Yes, it does boot with that build.

Comment 32 Neil Horman 2018-08-03 19:28:46 UTC
Interesting, ok.  I'm not sure ipxe can get on the approved list for a 7.6 update at this point, but I'll see if I can request it.

Comment 33 Neil Horman 2018-08-06 10:43:19 UTC
ok, looks like we won't be able to get this approved for 7.6, but I'll see about pushing to have this included in the 7.7 update

Comment 34 Neil Horman 2018-08-17 13:38:14 UTC
*** Bug 1395512 has been marked as a duplicate of this bug. ***

Comment 36 Neil Horman 2018-11-05 12:34:13 UTC
*** Bug 1638280 has been marked as a duplicate of this bug. ***

Comment 37 Edu Alcaniz 2018-11-07 09:24:37 UTC
Thanks for the information, but we also have some Huawei servers with these NICs:

NetXtreme BCM5719 Gigabit Ethernet PCIe
vendor/device id : 14e4:1657

They have a problem using the images to deploy the overcloud nodes.

Can we work on a solution, hotfix, or workaround for the current 7.5.z or 7.6.z?

Thanks
Edu Alcaniz

Comment 42 Neil Horman 2018-11-14 22:03:17 UTC
Based on that, it still seems like it's got to have something to do with pci ids, but we've already decided that we're going to rebase to upstream, so I'll look for pci id differences as I do the rebase.

Comment 43 Edu Alcaniz 2018-11-15 13:17:51 UTC
We tried an installation using the RHEL 7.4 and 7.5 DVDs, so it's not an issue with the driver as such, as the card is detected (and the test with 7.6 is no longer needed).

Therefore, if the driver from RHEL 7.4+ is included in the ipxe.efi image, the issue would most probably be resolved.

Comment 44 Edu Alcaniz 2018-11-15 13:18:54 UTC
So we are now sure that the driver is ok. We just need to: 1) create ipxe.efi using the current driver, and 2) include this new ipxe.efi in the OSP installation image.

Comment 45 Neil Horman 2018-11-19 12:21:14 UTC
That's my understanding, yes.  To be clear, though: because this is a wholesale update, it's not eligible for z-stream inclusion, so I'm clearing the flags you set.

Comment 46 Edu Alcaniz 2018-11-19 12:22:44 UTC
Thanks Neil, please update me if you need anything else and with any update.

Comment 47 Miroslav Rezanina 2018-11-19 15:54:01 UTC
The driver should already be included in ipxe.efi. We do not diverge from the upstream build, and the version we use supports the device.

Comment 55 Neil Horman 2019-01-14 16:31:57 UTC
*** Bug 1478203 has been marked as a duplicate of this bug. ***

Comment 59 Steve Almy 2019-02-26 14:46:02 UTC
*** Bug 1679416 has been marked as a duplicate of this bug. ***

Comment 65 errata-xmlrpc 2019-08-06 12:40:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2059

