Bug 1302143 - Bulk introspection fails - some of the nodes don't complete PXE boot
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-discoverd
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: RHOS Maint
QA Contact: Raviv Bar-Tal
Keywords: Automation, AutomationBlocker
Reported: 2016-01-26 17:21 EST by Ronelle Landy
Modified: 2016-10-03 11:23 EDT (History)

Doc Type: Bug Fix
Last Closed: 2016-10-03 11:23:00 EDT
Type: Bug
Description Ronelle Landy 2016-01-26 17:21:15 EST
Description of problem:

For about two weeks, CI jobs have been failing because bulk introspection does not succeed. This happens intermittently on both virt and baremetal installs, with both director 7 and 8.

Looking at the console of a failing baremetal node, PXE boot appears to hang. If introspection is restarted, it most often succeeds. CI works around the problem by introspecting node by node, checking each node after introspection, and rerunning introspection if the first attempt produced a timeout error.

Version-Release number of selected component (if applicable):

Happens in both director 7 and 8 installs

How reproducible:

Intermittent - when a full set of CI jobs is run for a poodle or puddle, typically one or two out of the 5 to 7 jobs will fail with a bulk introspection timeout. 

Steps to Reproduce:
1. Install openstack director (version 7 or 8)
2. Run bulk introspection: 'openstack baremetal introspection bulk start'
3. Watch the consoles of the nodes. Sometimes one or more of the nodes will hang in PXE boot.

Actual results:

Bulk introspection does not try to rerun. It outputs a message that introspection for xxx node did not complete successfully.

Expected results:

Bulk introspection should not fail intermittently. If it does, should there be an option to initiate a rerun?

Additional info:

Using single node introspection with retrying failed nodes seems to keep CI ticking over.
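
For reference, the retry logic of that workaround can be sketched roughly as below. This is only an illustration, not the actual CI code: introspect_with_retry, start and status are hypothetical names, and status is assumed to return a dict with a "finished" flag and an "error" string. In a real job, start and status would presumably wrap the per-node commands 'openstack baremetal introspection start <uuid>' and 'openstack baremetal introspection status <uuid>'.

```python
import time


def introspect_with_retry(node_ids, start, status, max_attempts=3,
                          timeout=3600, poll_interval=30,
                          sleep=time.sleep, clock=time.monotonic):
    """Introspect nodes one at a time, rerunning any node that fails.

    start(node_id) begins introspection of one node; status(node_id)
    returns a dict with "finished" (bool) and "error" (str or None).
    Both callables are hypothetical and supplied by the caller.
    Returns {node_id: attempts_used}; raises RuntimeError when a node
    still fails after max_attempts.
    """
    attempts_used = {}
    for node_id in node_ids:
        for attempt in range(1, max_attempts + 1):
            start(node_id)
            if _wait(node_id, status, timeout, poll_interval, sleep, clock):
                attempts_used[node_id] = attempt
                break
        else:
            # The inner loop never hit "break": every attempt failed.
            raise RuntimeError(
                "introspection of %s failed after %d attempts"
                % (node_id, max_attempts))
    return attempts_used


def _wait(node_id, status, timeout, poll_interval, sleep, clock):
    """Poll one node; True on success, False on error or timeout."""
    deadline = clock() + timeout
    while clock() < deadline:
        state = status(node_id)
        if state.get("finished"):
            return not state.get("error")
        sleep(poll_interval)
    return False
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting; the default timeout and poll interval are arbitrary placeholders, not values taken from CI.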
Comment 2 mkovacik 2016-02-29 07:45:12 EST

Could you please attach the screen log from the ironic-inspector-dnsmasq unit?
This might be a duplicate of Bug #1301659.

Comment 6 Mike Burns 2016-04-07 17:07:13 EDT
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.
Comment 7 Ronelle Landy 2016-04-12 09:36:13 EDT
I think this is a different issue - we see the hang while getting agent.ramdisk
Comment 8 Dmitry Tantsur 2016-04-18 04:09:22 EDT
Got it. Then it might be an iPXE ROM problem (bug 1308611). Is it on VMs or BMs? We can try testing the latest packaged or upstream iPXE ROM. For VMs it involves updating the ROM on the virtual host system.
Comment 9 Ronelle Landy 2016-04-18 10:44:57 EDT
Dmitry, this is a problem almost exclusively on baremetal machines.
We see it very rarely, if ever anymore, on VMs.

Will the latest packaged iPXE ROM also assist for issues related to BMs?
Comment 10 Dmitry Tantsur 2016-04-18 11:13:31 EDT
We've got a new ROM package for OSPd8. Please check with it, if you haven't already.
Comment 11 Dan Yocum 2016-04-21 10:44:15 EDT
(In reply to Dmitry Tantsur from comment #10)
> We've got a new ROM package for OSPd8. Please check with it, if you
> haven't already.

I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs:


NB: the git hash should match the ipxe version hash displayed when chainloading.
Comment 12 Dmitry Tantsur 2016-05-09 08:30:53 EDT
Hi Ronelle! Does this issue still happen to you even with the updated iPXE package? If so, could you please take a screenshot of the machine booting, so that we can see the version of the iPXE ROM?
Comment 13 Ronelle Landy 2016-06-10 09:13:04 EDT
Hi Dmitry,

I see this happen occasionally still but far less than before. So far, I have not seen it happen with RDO master. I will keep watch.
Comment 14 Raoul Scarazzini 2016-06-10 10:02:52 EDT
Just as additional information: all problems of this kind were solved on my side using the workaround described here: https://bugzilla.redhat.com/show_bug.cgi?id=1324422
Comment 15 Dmitry Tantsur 2016-10-03 11:23:00 EDT
Hi! I assume this bug is fixed by shipping a newer iPXE ROM. Please let us know if it still happens even when using it.
