Bug 1322056
Summary: | ipxe freeze during HTTP download with last RPM | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Gonéri Le Bouder <goneri>
Component: | ipxe | Assignee: | Ladi Prosek <lprosek>
Status: | CLOSED ERRATA | QA Contact: | Raviv Bar-Tal <rbartal>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | 7.0 | CC: | apevec, gael_rehault, goneri, jguiditt, knoel, lhh, lmartins, mburns, mrezanin, rbartal, srevivo, weliao
Target Milestone: | pre-dev-freeze | |
Target Release: | 7.3 | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | ipxe-20160127-4.git6366fa7a.el7 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1336473 (view as bug list) | Environment: |
Last Closed: | 2016-11-04 00:39:08 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |

Description
Gonéri Le Bouder
2016-03-29 16:43:50 UTC
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.

https://review.openstack.org/#/c/310283/ resolves the introspection issue with iPXE.

I backported the patch for Liberty: https://review.openstack.org/#/c/315255/

Gonéri, thanks for the patches. I think we need to do a little cleanup on this BZ to get it targeted at the right places. Here is what I propose (and I can do this if you like). First, your patches are against puppet, not the ipxe component, even though they fix an issue _related_ to ipxe, so the first step is to reassign to openstack-puppet-modules. Next, we need to make 2 clones of this BZ against osp 7 and 8 (liberty and mitaka), so they can be properly assessed for possible inclusion. Let me know if you see any issues with this; otherwise, I will do these steps in a little while today.

I don't (yet) plan to fix Kilo/OSP7.

ok, I clone the issue and reassign to openstack-puppet-modules.

Gonéri, can you please elaborate on what's believed to be broken in iPXE and the best way to reproduce it?

iPXE hangs as soon as something unexpected happens on the network. For example, if a packet is broken because of a wrong MTU, iPXE won't request a resend but will just freeze. This behavior is documented, and the --timeout parameter exists to avoid the freeze. http://lists.ipxe.org/pipermail/ipxe-devel/2014-October/003829.html

We now use this parameter in OSP8 to avoid the issue, but to do so we had to upgrade the iPXE RPM. The one from RHEL7 is too old and does not support the parameter. A better long-term solution is probably to fix the TCP stack of iPXE: it should request a resend instead of freezing.

Thank you! Yes, the HTTP/TCP implementation seems to be pretty bad.

iPXE> imgfetch http://speedtest.reliableservers.com/100MBtest.bin

usually freezes within a few MB and rarely runs to completion on my VM.

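As a rough illustration of why a timeout turns a permanent freeze into a recoverable failure, the idea can be sketched in Python. This is not iPXE script and does not reproduce the syntax of iPXE's --timeout parameter; the function names `download` and `fetch_with_retry`, the default values, and the retry count are all assumptions made for the example.

```python
import urllib.request


def download(url, timeout_s=10.0, chunk_size=64 * 1024):
    """Read `url` fully; a socket that stays silent for `timeout_s` raises."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        data = bytearray()
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                return bytes(data)
            data.extend(chunk)


def fetch_with_retry(fetch, url, attempts=3, timeout_s=10.0):
    """Call fetch(url, timeout_s) until it succeeds or attempts run out.

    `fetch` is injectable so the retry logic can be exercised without a
    network; pass `download` for a real transfer.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return fetch(url, timeout_s)
        except OSError as exc:  # covers socket timeouts and URLError
            last_error = exc
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Without the timeout, a stalled read would block forever, which matches the freeze described in this report; with it, the caller gets an error it can act on, for example by restarting the transfer.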
(In reply to Gonéri Le Bouder from comment #0)
> This is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1310778,
> the main difference is the increased number of times the problem occurs.

One thing I noticed in my testing is that if I let the VM use the default MAC address (52:54:00:12:34:56 for virtio-net), the chances of the connection freezing are extremely high. There's likely more than one endpoint claiming this MAC within the local network infrastructure and the switches get confused. This is basically equivalent to pulling the cable as described by the ipxe-devel thread.

This may be a long shot, but is it possible that the reason why the problem occurs more often is MAC address conflicts? Or maybe just reusing MAC addresses aggressively while there's still stray traffic associated with the previous physical location of the MAC?

If I break the MTU (e.g. 1550 for the server, 1400 for the client) and so send too large a frame to the iPXE client, it will also freeze. This time, the problem is at the TCP level.

(In reply to Gonéri Le Bouder from comment #13)
> If I break the MTU (e.g. 1550 for the server, 1400 for the client) and so
> send too large a frame to the iPXE client, it will also freeze. This time, the
> problem is at the TCP level.

Not sure I'm following. If the client sits on a lower MTU network than the server, it is the responsibility of the infra between them to do fragmentation as needed. Are you saying that iPXE doesn't correctly compute the link MTU? Can you please elaborate?

I have prototyped TCP keepalive in iPXE and the results on my test setup are promising - no more freezes. How difficult would it be for you to give it a try?

No, sorry if I was not clear enough. This is just a way to generate a broken TCP frame. In this case, the frame checksum is broken and iPXE should request a resend and maybe end up with a timeout. I can test an iPXE build on my environment if you want.

Created attachment 1162513 [details]
iPXE with TCP keepalive
Thanks - I'm assuming that 1af41000.rom (attached) is all you need, please let me know otherwise.
Can I get an undionly.kpxe file instead? OpenStack Ironic chainloads from the regular PXE client to iPXE with:

tag:!ipxe,option:bootfile-name,undionly.kpxe

Created attachment 1162836 [details]
iPXE with TCP keepalive (undionly)
Sure thing, attaching undionly.kpxe.
I did about 90 successful boots with the patched iPXE. The patch seems to improve the situation.

(In reply to Gonéri Le Bouder from comment #19)
> I did about 90 successful boots with the patched iPXE. The patch seems to
> improve the situation.

Thank you. May I ask what the confidence level of this statement is? Would you expect it to fail in 1/20 cases as mentioned in the description?

I will pursue getting the TCP keepalive functionality into iPXE. At the same time, though, I suspect that there may be underlying networking problems on your end that are probably worth a closer look. The patch simply sends a TCP keepalive packet every 5 seconds if the connection stalls. This makes sure that the L2/L3 path to the server is "refreshed". Switch tables, NAT tables, or whatever else is in the way and needs to see a client->server packet before server->client traffic starts working could be the culprit.

I did something like 200 iPXE boots of the 9-node platform during the night. So far so good.

The platform has been rebooting happily since yesterday without any issue.

Excellent, thanks for testing it! I'll work on getting the fix in.

Created attachment 1167467 [details]
iPXE with final upstream TCP keepalive (undionly)

I am attaching iPXE with the final fix committed upstream by Michael Brown: http://git.ipxe.org/ipxe.git/commitdiff/188789e

Gonéri, could you please give this one a try also? It is slightly different from the prototype so it warrants retesting. Thanks!

Moving to POST and adding Lucas to cc.

Hi Ladi, I gave the new ROM a try. It works fine as expected, but I don't have enough time to do advanced testing.

(In reply to Gonéri Le Bouder from comment #27)
> Hi Ladi,
>
> I gave the new ROM a try. It works fine as expected, but I don't have
> enough time to do advanced testing.

Thank you!
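The keepalive behavior described in this thread (a probe every 5 seconds while the connection is stalled) can be modeled with a small timer. The following Python sketch is only a simplified model of that description, not the upstream iPXE code, which is written in C; the class name `StallKeepalive` and its methods are hypothetical, and only the 5-second interval comes from the comment above.

```python
KEEPALIVE_INTERVAL_S = 5.0  # interval quoted in the thread


class StallKeepalive:
    """Decide when a stalled connection should emit a keepalive probe."""

    def __init__(self, now, interval_s=KEEPALIVE_INTERVAL_S):
        self.interval_s = interval_s
        self.next_probe = now + interval_s

    def on_data(self, now):
        """Data arrived from the server, so push the next probe back."""
        self.next_probe = now + self.interval_s

    def poll(self, now):
        """Return True if a keepalive should be sent at time `now`."""
        if now >= self.next_probe:
            # Re-arm so probes repeat every interval while the stall lasts.
            self.next_probe = now + self.interval_s
            return True
        return False
```

In this model a caller would invoke `poll()` from its main loop and send a client->server packet whenever it returns True, which is the traffic that, per the comment above, may be needed to refresh switch or NAT state along the path.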
Fix included in ipxe-20160127-4.git6366fa7a.el7

There are fewer freezes with the new ipxe images, and the --timeout parameter restarts the few remaining freezes, which can happen because of many kinds of network issues, even in virt and dedicated-switch environments.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-2214.html