Created attachment 1129398
screenshot

Description of problem:
I have already seen several freezes during the HTTP download of the kernel/initrd with the ipxe from the ipxe rpm, including the one from BZ1267030. When this happens, the Ironic introspection fails with a timeout. Today I managed to reproduce the issue with the latest upstream ipxe (git f468f12b1eca15e703aa2a79f1c82969c04c2322) on an OVB environment. IMO, ipxe should be able to raise a timeout in such a case. I attached a screenshot.

Version-Release number of selected component (if applicable):

How reproducible:
Boot a VM and have it fetch the kernel/initrd from an HTTP share.

Actual results:
99% of the cases will end up with a success.

Expected results:

Additional info:
According to http://lists.ipxe.org/pipermail/ipxe-devel/2014-October/003829.html, this is the purpose of the --timeout parameter, but I don't see this argument in the configuration file generated by Ironic.

-------------------
#!ipxe

dhcp

goto deploy

:deploy
kernel http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_kernel selinux=0 disk=cciss/c0d0,sda,hda,vda iscsi_target_iqn=iqn.2008-10.org.openstack:bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_id=bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_key=2Q1KKOTSH42OS9FYPYYHF0DT564PDWCU ironic_api_url=http://192.0.2.240:6385 troubleshoot=0 text nofb nomodeset vga=normal boot_option=local ip=${ip}:${next-server}:${gateway}:${netmask} BOOTIF=${mac} ipa-api-url=http://192.0.2.240:6385 ipa-driver-name=pxe_ipmitool coreos.configdrive=0
initrd http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_ramdisk
boot

:boot_partition
kernel http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/kernel root={{ ROOT }} ro text nofb nomodeset vga=normal
initrd http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/ramdisk
boot

:boot_whole_disk
kernel chain.c32
append mbr:{{ DISK_IDENTIFIER }}
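For illustration, here is a minimal sketch of how the deploy section might look with --timeout applied. It is not what Ironic generates. The 60000 value (the timeout is given in milliseconds), the :retry label, and the 5 second pause are assumptions chosen for the example, and the kernel arguments are omitted for brevity.

```
#!ipxe

dhcp

:deploy
# --timeout is in milliseconds; 60000 is an arbitrary example value
kernel --timeout 60000 http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_kernel || goto retry
initrd --timeout 60000 http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_ramdisk || goto retry
boot

# hypothetical label: drop partially fetched images and start over
:retry
echo Download failed or timed out, retrying
imgfree
sleep 5
goto deploy
```

With this layout a stalled download should fail after the timeout instead of hanging forever, and the script retries rather than leaving the node frozen. Note that the sketch retries indefinitely; a real template would probably want to cap the number of attempts.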
If you're using OVB you'll likely need to adjust the undercloud MTU settings. In my environment I do it with:

mtu=$(ifconfig eth0 |egrep -o "mtu [^ ]*")
ifconfig eth1 $mtu
ifconfig eth2 $mtu

This comment may not be helpful if what you actually want is for iPXE to have better timeout handling.
Thanks Steve. Indeed this seems to be the root of the problem here.
https://review.openstack.org/283893 I pushed a review to add a default timeout to reduce the impact of this kind of problem.
I opened https://review.openstack.org/#/c/288041 to be able to adjust the MTU through the undercloud.conf
Chris, I'm pretty sure this is not a blocker in your case. The problem happens only when the MTU is < 1500. We had the issue once with Gael but it was because of a misconfiguration.
Ok, I just had a freeze during the agent.ramdisk download. This time the MTU was ok (1400). I use the upstream ipxe image, not the one provided by Red Hat.
The final patch to get the --timeout merged is here: https://review.openstack.org/#/c/294787 and is blocked by: https://bugs.launchpad.net/ironic/+bug/1567449
This is where we stand with this: the bug went away for a while and reappeared in the Beta9/RC/GA releases.

In https://bugzilla.redhat.com/show_bug.cgi?id=1322056, Goneri highlighted the fact that the ipxe packages got updated, so we tried a workaround (using both beta9 and GA): installing the previous packages and locking them prior to installing the undercloud.

[osp_admin@director ~]$ rpm -qa | grep ipxe
ipxe-bootimgs-20150821-1.git4e03af8e.el7.noarch
ipxe-roms-qemu-20150821-1.git4e03af8e.el7.noarch

Unfortunately, that does not prevent the freezes from happening for me (I ran into it in 2 out of 3 deployments today with the above in place), so there might be something else at play besides those package versions.

Also, to clarify, this bug can occur in 2 different steps of the deployment process:
1 - during the bulk introspection of the nodes
2 - during the overcloud deployment
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1220