This is the instack-undercloud portion of the fix.

+++ This bug was initially created as a clone of Bug #1310778 +++

Description of problem:
I have already had some freezes during the HTTP download of the kernel/initrd with the iPXE from the ipxe RPM, including the one from BZ1267030. In this case the Ironic introspection fails with a timeout.

Today I managed to reproduce the issue with the latest upstream iPXE (git f468f12b1eca15e703aa2a79f1c82969c04c2322) in an OVB environment.

IMO, iPXE should be able to raise a timeout in such a case.

I attached a screenshot.

Version-Release number of selected component (if applicable):

How reproducible:
Boot a VM and have it fetch the kernel/initrd from an HTTP share.

Actual results:
99% of the cases end up with a success.

Expected results:

Additional info:

--- Additional comment from Gonéri Le Bouder on 2016-02-22 11:26:40 EST ---

According to http://lists.ipxe.org/pipermail/ipxe-devel/2014-October/003829.html, this is the purpose of the --timeout parameter, but I don't see this argument in the configuration file generated by Ironic:
-------------------
#!ipxe

dhcp

goto deploy

:deploy
kernel http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_kernel selinux=0 disk=cciss/c0d0,sda,hda,vda iscsi_target_iqn=iqn.2008-10.org.openstack:bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_id=bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_key=2Q1KKOTSH42OS9FYPYYHF0DT564PDWCU ironic_api_url=http://192.0.2.240:6385 troubleshoot=0 text nofb nomodeset vga=normal boot_option=local ip=${ip}:${next-server}:${gateway}:${netmask} BOOTIF=${mac} ipa-api-url=http://192.0.2.240:6385 ipa-driver-name=pxe_ipmitool coreos.configdrive=0
initrd http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_ramdisk
boot

:boot_partition
kernel http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/kernel root={{ ROOT }} ro text nofb nomodeset vga=normal
initrd http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/ramdisk
boot

:boot_whole_disk
kernel chain.c32
append mbr:{{ DISK_IDENTIFIER }}

--- Additional comment from Steve Baker on 2016-02-22 22:28:07 EST ---

If you're using OVB, you'll likely need to adjust the undercloud MTU settings. In my environment I do it with:

mtu=$(ifconfig eth0 | egrep -o "mtu [^ ]*")
ifconfig eth1 $mtu
ifconfig eth2 $mtu

This comment may not be helpful if you actually want iPXE to have better timeout handling.

--- Additional comment from Gonéri Le Bouder on 2016-02-23 10:35:52 EST ---

Thanks Steve. Indeed, this seems to be the root of the problem here.

--- Additional comment from Gonéri Le Bouder on 2016-02-23 20:11:23 EST ---

I pushed a review to add a default timeout, to reduce the impact of this kind of problem: https://review.openstack.org/283893

--- Additional comment from Gonéri Le Bouder on 2016-03-03 13:15:51 EST ---

I opened https://review.openstack.org/#/c/288041 to make the MTU adjustable through undercloud.conf.

--- Additional comment from Gonéri Le Bouder on 2016-03-05 11:24:38 EST ---

Chris, I'm pretty sure this is not a blocker in your case.
The problem happens only when the MTU is < 1500. We had the issue once with Gael, but it was because of a misconfiguration.

--- Additional comment from Gonéri Le Bouder on 2016-03-12 13:22:40 EST ---

OK, I just had a freeze during the agent.ramdisk download. This time the MTU was OK (1400). I use the upstream iPXE image, not the one provided by Red Hat.

--- Additional comment from Gonéri Le Bouder on 2016-04-07 10:07:17 EDT ---

The final patch to get --timeout merged is here: https://review.openstack.org/#/c/294787
It is blocked by: https://bugs.launchpad.net/ironic/+bug/1567449

--- Additional comment from Gael Rehault on 2016-04-14 09:07:05 EDT ---

This is where we stand: this bug went away for a while, and reappeared in the Beta9/RC/GA releases.

In https://bugzilla.redhat.com/show_bug.cgi?id=1322056, Goneri pointed out that the ipxe packages had been updated, so we tried a workaround (using both Beta9 and GA): installing the previous packages and locking them before installing the undercloud.

[osp_admin@director ~]$ rpm -qa | grep ipxe
ipxe-bootimgs-20150821-1.git4e03af8e.el7.noarch
ipxe-roms-qemu-20150821-1.git4e03af8e.el7.noarch

Unfortunately, that does not prevent the freezes from happening for me (I ran into it in 2 out of 3 deployments today with the above in place), so there may be something else at play besides those package versions.

Also, to clarify, this bug can occur at two different steps of the deployment process:
1 - during the bulk introspection of the nodes
2 - during the overcloud deployment
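For reference, the `--timeout` option discussed in this thread is accepted by iPXE's image download commands (`kernel`, `initrd`, `imgfetch`); per the iPXE command documentation its value is in milliseconds, and a download that stalls then fails with an error instead of hanging. The fragment below is a hand-edited sketch of the `:deploy` section of the Ironic-generated script quoted earlier, not the template produced by the merged patches. The 60000 ms value and the `:retry` label are illustrative assumptions.

```ipxe
#!ipxe

dhcp

goto deploy

:deploy
# --timeout is in milliseconds; 60000 (60 s) is an arbitrary example value.
# On failure, "|| goto retry" jumps to the retry label instead of aborting the script.
kernel --timeout 60000 http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_kernel selinux=0 disk=cciss/c0d0,sda,hda,vda iscsi_target_iqn=iqn.2008-10.org.openstack:bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_id=bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_key=2Q1KKOTSH42OS9FYPYYHF0DT564PDWCU ironic_api_url=http://192.0.2.240:6385 troubleshoot=0 text nofb nomodeset vga=normal boot_option=local ip=${ip}:${next-server}:${gateway}:${netmask} BOOTIF=${mac} ipa-api-url=http://192.0.2.240:6385 ipa-driver-name=pxe_ipmitool coreos.configdrive=0 || goto retry
initrd --timeout 60000 http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_ramdisk || goto retry
boot

:retry
# Hypothetical retry label (not part of the Ironic-generated script):
# free any partially downloaded images, then start the downloads again.
imgfree
goto deploy
```

Without the `|| goto retry` handling, a failed download simply terminates the iPXE script. The retry loop in this sketch retries indefinitely, so a real template would probably want to bound it or fall through to an error path.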
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1229