Red Hat Bugzilla – Bug 1302143
Bulk introspection fails - some of the nodes don't complete PXE boot
Last modified: 2016-10-03 11:23:00 EDT
Description of problem:
For about two weeks CI has had jobs fail due to bulk introspection not succeeding. This happens intermittently on both virt and baremetal installs - in both director 7 and 8.
Looking at the console of a failing baremetal node, it looks as if PXE boot hangs. If introspection is restarted, most often it succeeds. CI is getting around this problem by using single node-by-node introspection, checking each node after introspection and rerunning introspection if the first attempt produced a timeout error.
Version-Release number of selected component (if applicable):
Happens in both director 7 and 8 installs
How reproducible:
Intermittent - when a full set of CI jobs is run for a poodle or puddle, typically one or two out of the 5 to 7 jobs will fail with a bulk introspection timeout.
Steps to Reproduce:
1. Install openstack director (version 7 or 8)
2. Run bulk introspection: 'openstack baremetal introspection bulk start'
3. Watch the consoles of the nodes. Sometimes one or more of the nodes will hang in PXE boot.
Actual results:
Bulk introspection does not try to rerun. It outputs a message that introspection for xxx node did not complete successfully.
Expected results:
Bulk introspection should not fail intermittently. If it does, should there be an option to initiate a rerun?
Additional info:
Using single-node introspection and retrying failed nodes seems to keep CI ticking over.
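For reference, the CI workaround described above (introspect one node at a time, check the result, rerun on failure) can be sketched roughly as follows. This is only an illustration, not the actual CI code: the function name introspect_with_retry, the max_attempts parameter and the introspect callable are all hypothetical. In a real job the callable would presumably wrap the single-node introspection command and report whether it finished without a timeout error.

```python
from typing import Callable, Dict, Iterable, Optional


def introspect_with_retry(
    nodes: Iterable[str],
    introspect: Callable[[str], bool],
    max_attempts: int = 2,
) -> Dict[str, Optional[int]]:
    """Introspect nodes one at a time, rerunning any node that fails.

    `introspect` is a hypothetical callable that runs single-node
    introspection and returns True on success, False on timeout/error.
    Returns a mapping of node -> attempt number that succeeded,
    or None if every attempt for that node failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    results: Dict[str, Optional[int]] = {}
    for node in nodes:
        results[node] = None
        for attempt in range(1, max_attempts + 1):
            if introspect(node):
                results[node] = attempt
                break  # this node is done, move on to the next one
    return results


# Stand-in for a real node: "node-2" hangs in PXE boot on its first attempt.
_calls: Dict[str, int] = {}


def fake_introspect(node: str) -> bool:
    _calls[node] = _calls.get(node, 0) + 1
    return not (node == "node-2" and _calls[node] == 1)


demo = introspect_with_retry(["node-1", "node-2"], fake_introspect)
print(demo)
```

Nodes that still map to None after all attempts are the ones that would need manual attention (console check, ROM version, and so on).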
Could you please attach a screen log from the ironic-inspector-dnsmasq unit?
This might be a duplicate of Bug #1301659.
This bug did not make the OSP 8.0 release. It is being deferred to OSP 10.
I think this is a different issue - we see the hang while getting agent.ramdisk.
Got it. Then it might be an iPXE ROM problem (bug 1308611). Is it on VMs or BMs? We can try testing the latest packaged or upstream iPXE ROM. For VMs it involves updating the ROM on the virtual host system.
Dmitry, this is a problem almost exclusively on baremetal machines.
We see it very rarely, if ever anymore, on VMs.
Will the latest packaged iPXE ROM also assist for issues related to BMs?
We've got a new ROM package for OSPd8. Please check with it, if you haven't already.
(In reply to Dmitry Tantsur from comment #10)
> We've got a new ROM package for OSPd8. Please check with it, if you
> haven't already.
I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs:
NB: the git hash should match the iPXE version hash displayed when chainloading.
Hi Ronelle! Does this issue still happen to you even with the updated iPXE package? If so, could you please take a screenshot of the machine booting, so that we can see the version of the iPXE ROM?
I see this happen occasionally still but far less than before. So far, I have not seen it happen with RDO master. I will keep watch.
Just as additional information: all problems of this kind were solved on my side using the workaround described here: https://bugzilla.redhat.com/show_bug.cgi?id=1324422
Hi! I assume this bug is fixed by shipping a newer iPXE ROM. Please let us know if it still happens even when using it.