Bug 1597210
Summary: | Rebase ipxe to latest upstream | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Adam Huffman <bloch> |
Component: | ipxe | Assignee: | Neil Horman <nhorman> |
ipxe sub component: | ipxe-bootimgs | QA Contact: | Erico Nunes <ernunes> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | urgent | | |
Priority: | urgent | CC: | augol, bhu, bloch, chayang, ealcaniz, emcnabb, ernunes, knoha, mjenner, mrezanin, nhorman, pveiga, salmy, toneata, tumeya, vivpatil |
Version: | 7.5 | Keywords: | FutureFeature, Rebase, ZStream |
Target Milestone: | rc | | |
Target Release: | 7.7 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | ipxe-20180825-1.git133f4c.el7 | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | | |
: | 1634838 1652518 1683386 | Environment: | |
Last Closed: | 2019-08-06 12:40:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1614004, 1634838, 1649833, 1652518, 1683386 | | |
Description
Adam Huffman
2018-07-02 09:09:14 UTC
Is there a specific upstream commit that you believe fixes this error? It's far too late in the 7.6 development cycle to consider a complete rebase. If you can point out a specific commit, we can look at the possibility of backporting, or if you can show that using the latest upstream iPXE image results in an operational system, we can look at a wholesale update for 7.7.

I can't give you a specific commit at the moment, I'm afraid. When I contacted the iPXE developers, they advised using the latest available, so that's what I did, and I haven't yet had time to go back and narrow it down. I can confirm that using the latest upstream iPXE image gives an operational system.

Ok, well, that's a step in the right direction. On a positive note, there have been a very limited number of commits to the Mellanox InfiniBand driver (I believe you are using mlx5_nodnic, correct?). In fact there may be only one, so let me try backporting that commit to see if it works for you. If so, this may still need to wait for 7.7, but at least we will know what we are working with.

1ff1eebcf7a93a237a1b91ea5d9dcc5b5f1a13bf is, for the record, the upstream mlx5 commit. I've got a build going with it here:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=16968929
And I've uploaded it here:
http://people.redhat.com/nhorman/rpms/ipxe.tbz2
Could you please try that rpm out and let me know if it fixes your allocation issue?

There is a Mellanox card in these nodes. However, it's not being used for PXE. For that we're just using the onboard 1Gb NICs. I've just tried with that build, and while it does work for the first part of boot, it very soon crashes. Thanks a lot for taking a look at this, and I quite understand about the need to wait. Just wanted to make sure this was captured somewhere, given all the frustration it caused me...

Well, wait a moment.
The error that you are reporting is definitely originating from the Mellanox card in its attempt to initialize (the notes on the associated URL are out of date). Can you provide the details of the crash that results (the backtrace ideally)? We need to either figure out what's going on here, or disable the Mellanox driver.

If you have instructions for grabbing the backtrace, I'd be happy to do so. As it is, it's printed in red text that keeps flicking between different screens, so it's almost impossible to capture via remote console.

The easiest way is via a serial console (either a virtual one, or a physical device, possibly connected via a USB dongle). Once you connect via a serial port, you can use whatever terminal emulator you choose to record the session.

Obviously, yes - clearly not thinking straight earlier. Hopefully over a serial console it will be shown normally. Will try and do that today.

Thanks, let me know.

Here you go, I think this is everything:

X64 Exception Type 0C - Stack Fault Exception
RCX=0000000000000001 DX=0000000000000001 R8=000000009D4D3D6C R9=000000009533FF98
RSP=000000009D4D3D20 BP=6500720061007700 AX=65007200610076FF BX=0000000000000000
R10=000000009D4F3B60 11=4D1AB44318A031AB 12=FFFFFFFFFFFFFFFF 13=000000009D4F27E0
R14=000000009908F290 15=0000000095387898 SI=0000000000000000 DI=6500720061007700
CR2=0000000000000000 CR3=000000009D434000 CR0=80000013 CR4=00000668 CR8=00000000
CS=00000038 DS=00000030 SS=00000030 ES=00000030 RFLAGS=00010203
MSR: 0x1D9 = 00004801, 0x345=000033C5, 0x1C9=0000001A
LBRs From To From To
01h 0000000098FE945A->0000000096AE3174 0000000098FE93FC->0000000098FE944C
03h 0000000098FEF10C->0000000098FE93DC 0000000098FEF1F6->0000000098FEF0FA
05h 000000009903C9F8->0000000098FEF1F3 0000000098FFFA61->000000009903C9F0
07h 0000000098FFFA2F->0000000098FFFA57 0000000098FFF9F6->0000000098FFFA23
09h 0000000098FFFA6E->0000000098FFF9E6 000000009903C9EB->0000000098FFFA6C
0Bh 0000000099001EAB->000000009903C9E8 0000000099040A95->0000000099001EA8
0Dh 0000000099040A89->0000000099040A90 0000000099001EA3->0000000099040A86
0Fh 0000000099001EF6->0000000099001EA0 0000000099001E92->0000000099001EF4
CALL ImageBase ImageName+Offset
00h 0000000098FCC000 ipxe.efi+01D45Ah
CALL ImageBase ImageName+Offset
STACK 00h 04h 08h 0Ch 10h 14h 18h 1Ch
RSP+00h 94298D20 9570FCD8 9570FC98 9908F2A8 9908F1E8 9908F290 9908F2A8 9908F1E8
RSP+20h 00000000 9533FF98 990779C0 00000001 00000000 00000000 00000000 00000000
RSP+40h 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000001
RSP+60h 00000008 00000001 00000000 00000000 00000000 9D4D4280 9D4F27E0 99098550
RSP+80h 00000000 95387898 00000000 99039F9E 99078E30 99098550 9D4D3F90 94286018
RSP+A0h 000001E8 99078DA0 000000AA 99079DD0 00000001 FFFFFFE0 00000000 00000000
RSP+C0h 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
RSP+E0h 00000000 9D50C380 9D50C380 00000008 FFFFFFE0 00000000 9D4D41F8 00000004

Hmm, there wasn't a stack trace leading up to that?

It kept cycling through what looked like the same text, so there may have been a slightly different one. If I have time I'll trigger it again.
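The capture above reports an image base for ipxe.efi alongside raw branch addresses, so the "ImageName+Offset" figure can be reproduced by subtraction. The following is a minimal sketch of that arithmetic, not a tool used in this bug. It assumes the LBR rows are "from->to" pairs of 16 hex digits; the names `IMAGE_BASE`, `describe`, and `lbr_offsets` are made up for illustration. The image size is not in the dump, so an address above the base is only a candidate offset.

```python
import re

# Value copied from the capture: "00h 0000000098FCC000 ipxe.efi+01D45Ah".
IMAGE_BASE = 0x98FCC000

# First two LBR rows from the same capture.
LBR_TEXT = """
01h 0000000098FE945A->0000000096AE3174 0000000098FE93FC->0000000098FE944C
03h 0000000098FEF10C->0000000098FE93DC 0000000098FEF1F6->0000000098FEF0FA
"""

# Assumption: each branch record is written as <16 hex digits>-><16 hex digits>.
BRANCH = re.compile(r"([0-9A-Fa-f]{16})->([0-9A-Fa-f]{16})")


def describe(addr, base=IMAGE_BASE):
    """Render an address as an image-relative offset when it is at or above the base."""
    if addr < base:
        # Definitely not inside the image; probably firmware or another module.
        return f"0x{addr:08X} (below image base)"
    # Candidate only: the dump does not say where the image ends.
    return f"ipxe.efi+{addr - base:06X}h"


def lbr_offsets(text, base=IMAGE_BASE):
    """Return (from, to) descriptions for every branch record found in text."""
    return [
        (describe(int(src, 16), base), describe(int(dst, 16), base))
        for src, dst in BRANCH.findall(text)
    ]


if __name__ == "__main__":
    for src, dst in lbr_offsets(LBR_TEXT):
        print(f"{src} -> {dst}")
```

The most recent branch source, 0x98FE945A, maps to ipxe.efi+01D45Ah, which matches the CALL line in the capture. Its target, 0x96AE3174, lies below the image base.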
Just grabbed it again:

X64 Exception Type 0C - Stack Fault Exception
RCX=0000000000000001 DX=0000000000000001 R8=000000009D4D3D6C R9=0000000095342C18
RSP=000000009D4D3D20 BP=6500720061007700 AX=65007200610076FF BX=0000000000000000
R10=000000009D4F3B60 11=4D1AB44318A031AB 12=FFFFFFFFFFFFFFFF 13=000000009D4F27E0
R14=000000009908F290 15=0000000095384718 SI=0000000000000000 DI=6500720061007700
CR2=0000000000000000 CR3=000000009D434000 CR0=80000013 CR4=00000668 CR8=00000000
CS=00000038 DS=00000030 SS=00000030 ES=00000030 RFLAGS=00010203
MSR: 0x1D9 = 00004801, 0x345=000033C5, 0x1C9=00000002
LBRs From To From To
01h 0000000098FE945A->0000000096AE3174 0000000098FE93FC->0000000098FE944C
03h 0000000099040D7B->000000009903C9E0 0000000099040ABF->0000000099040D75
05h 0000000099040AAE->0000000099040ABA 0000000099040D70->0000000099040AAB
07h 000000009903C9DB->0000000099040D4F 0000000098FEF1EE->000000009903C9D4
09h 0000000098FEF1C2->0000000098FEF1DA 0000000098FEF19B->0000000098FEF1B9
0Bh 0000000098FEF181->0000000098FEF192 000000009900B2AA->0000000098FEF13C
0Dh 0000000098FE4FA5->000000009900B29F 000000009903A490->0000000098FE4F91
0Fh 0000000099039EFA->000000009903A474 0000000096AE317F->000000009A06DCA0
CALL ImageBase ImageName+Offset
00h 0000000098FCC000 ipxe.efi+01D45Ah
CALL ImageBase ImageName+Offset
STACK 00h 04h 08h 0Ch 10h 14h 18h 1Ch
RSP+00h 9428ED20 9570BE58 9570BE18 9908F2A8 9908F1E8 9908F290 9908F2A8 9908F1E8
RSP+20h 00000000 95342C18 990779C0 00000001 00000000 00000000 00000000 00000000
RSP+40h 00000000 00000000 00000000 00000000 00000000 00000000 00000001 00000001
RSP+60h 00000008 00000001 00000000 00000000 00000000 9D4D4280 9D4F27E0 99098550
RSP+80h 00000000 95384718 00000000 99039F9E 99078E30 99098550 9D4D3F90 94283018
RSP+A0h 000002C6 99078DA0 0000009D 99079DD0 00000001 FFFFFFE0 00000000 00000000
RSP+C0h 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
RSP+E0h 00000000 9D50C380 9D50C380 00000008 FFFFFFE0 00000000 9D4D41F8 00000004

X64 Exception Type 0C - Stack Fault Exception
RCX=0000000000000001 DX=0000000000000001 R8=000000009D4D3D6C R9=0000000095342C18
RSP=000000009D4D3D20 BP=6500720061007700 AX=65007200610076FF BX=0000000000000000
R10=000000009D4F3B60 11=4D1AB44318A031AB 12=FFFFFFFFFFFFFFFF 13=000000009D4F27E0
R14=000000009908F290 15=0000000095384718 SI=0000000000000000 DI=6500720061007700
CR2=0000000000000000 CR3=000000009D434000 CR0=80000013 CR4=00000668 CR8=00000000
CS=00000038 DS=00000030 SS=00000030 ES=00000030 RFLAGS=00010203
MSR: 0x1D9 = 00004801, 0x345=000033C5, 0x1C9=00000002
LBRs From To From To
01h 0000000098FE945A->0000000096AE3174 0000000098FE93FC->0000000098FE944C
03h 0000000099040D7B->000000009903C9E0 0000000099040ABF->0000000099040D75
05h 0000000099040AAE->0000000099040ABA 0000000099040D70->0000000099040AAB
07h 000000009903C9DB->0000000099040D4F 0000000098FEF1EE->000000009903C9D4
09h 0000000098FEF1C2->0000000098FEF1DA 0000000098FEF19B->0000000098FEF1B9
0Bh 0000000098FEF181->0000000098FEF192 000000009900B2AA->0000000098FEF13C
0Dh 0000000098FE4FA5->000000009900B29F 000000009903A490->0000000098FE4F91
0Fh 0000000099039EFA->000000009903A474 0000000096AE317F->000000009A06DCA0
CALL ImageBase ImageName+Offset
00h 0000000098FCC000 ipxe.efi+01D45Ah

Hmm, unfortunately that doesn't help much. All it really tells me is that somehow our stack got corrupted. Can you tell me anything more about how far into the PXE boot it got prior to crashing? Perhaps we can try some guess and check here to see if we can identify the bad module, and edit it out of the build.

It had got as far as requesting boot images from the iPXE server and was loading those. Hard to say whether it had finished loading the larger image.

What NIC are you using for the PXE request? Make and model?

02:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)

Ok, so nothing exotic. Let me try taking the Mellanox driver out of the build....
http://people.redhat.com/nhorman/rpms/ipxe.tar
Please give that build a try, if you would.

Thanks! Just checking - that appears to have an older NVR than the currently installed package. Here's what I currently have:
ipxe-bootimgs-20170123-3.git4e85b27.el7.noarch
while what's in the tarball is:
ipxe-bootimgs-20170123-2.1.git4e85b27.el7.noarch.rpm
Is that what you intended?

Yeah, sorry about that. I bumped the version number when I gave you the first build, but forgot to do so with this one. Just uninstall and install the 2.1 package, it will be fine for testing.

In the end I manually extracted ipxe.efi from the RPMs, and in fact booting did complete with this version.

Ok, so something about the Mellanox driver is causing the system to crash. I'll try another build, re-adding it and limiting its functionality, to see if we can creep up on the problem.

Hey, out of curiosity, how much RAM does your system have?

Some of them have 192GB, the others 384GB.

Hey, so I'm really having trouble getting any visibility on this issue beyond just determining that we are running out of memory on the system (which I know seems odd for a system as large as this, but PXE environments are limited to a small area). As such, I'm taking another pass at just upgrading the InfiniBand stack, which I may be able to get accepted for 7.6 (no guarantee, but this is probably the fastest path to a fix that we can put into 7.6 at this stage). I've uploaded a set of iPXE ROMs here:
http://people.redhat.com/nhorman/rpms/ipxe.tbz2
which has only the InfiniBand stack updated to the latest upstream. Could you please give it a shot and let me know how well that works for you?

Yes, it does boot with that build.

Interesting, ok. I'm not sure ipxe can get on the approved list for a 7.6 update at this point, but I'll see if I can request it.
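The NVR question earlier in this exchange (release 3.git4e85b27.el7 versus 2.1.git4e85b27.el7) comes down to comparing the release strings segment by segment. The sketch below is a simplified approximation of RPM's label comparison written for illustration, not the real rpmvercmp. It ignores epochs and the special handling of `~` and `^`, and it assumes the package string ends in an architecture suffix. The helper names `split_nvr`, `compare_labels`, and `compare_nvr` are hypothetical.

```python
import re


def split_nvr(package):
    """Split 'name-version-release.arch[.rpm]' into (name, version, release)."""
    if package.endswith(".rpm"):
        package = package[: -len(".rpm")]
    # Assumption: the last dot-separated field is the architecture (e.g. noarch).
    nvr, _arch = package.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release


def _segments(label):
    # Runs of digits or runs of letters; separators such as '.' are dropped.
    return re.findall(r"[0-9]+|[A-Za-z]+", label)


def compare_labels(a, b):
    """Return 1 if a is newer than b, -1 if older, 0 if equal (simplified)."""
    sa, sb = _segments(a), _segments(b)
    for x, y in zip(sa, sb):
        x_num, y_num = x.isdigit(), y.isdigit()
        if x_num and y_num:
            xi, yi = int(x), int(y)
            if xi != yi:
                return 1 if xi > yi else -1
        elif x_num != y_num:
            # A numeric segment is treated as newer than an alphabetic one.
            return 1 if x_num else -1
        elif x != y:
            return 1 if x > y else -1
    if len(sa) == len(sb):
        return 0
    # All shared segments matched; the label with segments left over is newer.
    return 1 if len(sa) > len(sb) else -1


def compare_nvr(pkg_a, pkg_b):
    """Compare two packages of the same name by version, then release."""
    _, ver_a, rel_a = split_nvr(pkg_a)
    _, ver_b, rel_b = split_nvr(pkg_b)
    return compare_labels(ver_a, ver_b) or compare_labels(rel_a, rel_b)


if __name__ == "__main__":
    installed = "ipxe-bootimgs-20170123-3.git4e85b27.el7.noarch"
    test_build = "ipxe-bootimgs-20170123-2.1.git4e85b27.el7.noarch.rpm"
    print(compare_nvr(installed, test_build))
```

Under these assumptions the installed package compares as newer, because the first release segment is 3 against 2. That is consistent with the advice above to uninstall it before installing the 2.1 test build.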
Ok, looks like we won't be able to get this approved for 7.6, but I'll see about pushing to have this included in the 7.7 update.

*** Bug 1395512 has been marked as a duplicate of this bug. ***

*** Bug 1638280 has been marked as a duplicate of this bug. ***

Thanks for the information, but we have some other Huawei servers with NetXtreme BCM5719 Gigabit Ethernet PCIe NICs (vendor/device ID 14e4:1657) that have a problem using the images to deploy the overcloud nodes. Can we work on a solution, hotfix, or workaround for the current 7.5.z or 7.6.z?
Thanks,
Edu Alcaniz

Based on that, it still seems like it's got to have something to do with PCI IDs, but we've already decided that we're going to rebase to upstream, so I'll take a look for PCI ID differences as I do the rebase.

We tried an installation using the RHEL 7.4 and 7.5 DVDs, so it's not an issue with the driver as such, as the card is detected (and the test with 7.6 is not needed anymore). Therefore, if the driver from RHEL 7.4+ is included in the ipxe.efi image, the issue would most probably be resolved. So we are sure now that the driver is OK; we just need to:
1) create ipxe.efi using the current driver, and
2) include this new ipxe.efi in the OSP installation image

That's my understanding, yes. To be clear, though, since you set the flags: because this is a wholesale update, it's not eligible for z-stream inclusion, so I'm clearing those flags.

Thanks Neil, please update me if you need anything else and with any update.

The driver should already be included in ipxe.efi. We do not diverge from the upstream build, and the version we use supports the device.

*** Bug 1478203 has been marked as a duplicate of this bug. ***

*** Bug 1679416 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2019:2059
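One way to carry out the "look for PCI ID differences" step mentioned in the comments is to compare the ID tables of the same driver in two source trees. The sketch below assumes the IDs are declared with iPXE's `PCI_ROM(vendor, device, ...)` macro and that the vendor and device are written as 4-digit hex literals. The two sample tables are invented for illustration and are not copied from either the RHEL or the upstream sources; only 14e4:1657 comes from this report.

```python
import re

# Assumption: entries look like PCI_ROM(0x14e4, 0x1657, "name", "desc", 0).
PCI_ROM = re.compile(
    r"PCI_ROM\s*\(\s*0x([0-9A-Fa-f]{4})\s*,\s*0x([0-9A-Fa-f]{4})"
)


def pci_ids(source_text):
    """Collect 'vendor:device' strings from every PCI_ROM entry in the text."""
    return {f"{v.lower()}:{d.lower()}" for v, d in PCI_ROM.findall(source_text)}


def diff_ids(old_text, new_text):
    """Return (added, removed) ID sets when moving from old_text to new_text."""
    old, new = pci_ids(old_text), pci_ids(new_text)
    return new - old, old - new


# Hypothetical tables. The 0x1234 entries are placeholders, not real devices.
OLD_TABLE = """
PCI_ROM(0x14e4, 0x1657, "14e4-1657", "14e4-1657", 0),
PCI_ROM(0x1234, 0x0001, "1234-0001", "placeholder", 0),
"""
NEW_TABLE = """
PCI_ROM(0x14e4, 0x1657, "14e4-1657", "14e4-1657", 0),
PCI_ROM(0x1234, 0x0001, "1234-0001", "placeholder", 0),
PCI_ROM(0x1234, 0x0002, "1234-0002", "placeholder", 0),
"""

if __name__ == "__main__":
    added, removed = diff_ids(OLD_TABLE, NEW_TABLE)
    print("added:", sorted(added))
    print("removed:", sorted(removed))
    print("14e4:1657 in old table:", "14e4:1657" in pci_ids(OLD_TABLE))
```

In practice the two inputs would be the contents of the relevant driver source file from each tree. An ID present in both tables, as 14e4:1657 is in this made-up sample, would point away from a missing PCI ID as the cause.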