Bug 1302143

Summary: Bulk introspection fails - some of the nodes don't complete PXE boot
Product: Red Hat OpenStack
Reporter: Ronelle Landy <rlandy>
Component: openstack-ironic-discoverd
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED WORKSFORME
QA Contact: Raviv Bar-Tal <rbartal>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: unspecified
CC: apevec, dtantsur, dyocum, lhh, mburns, mkovacik, rbartal, rhel-osp-director-maint, rlandy, rscarazz, slinaber, whayutin
Target Milestone: ---
Keywords: Automation, AutomationBlocker
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-03 15:23:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ronelle Landy 2016-01-26 22:21:15 UTC
Description of problem:

For about two weeks, CI jobs have been failing because bulk introspection does not succeed. This happens intermittently on both virt and baremetal installs, in both director 7 and 8.

Looking at the console of a failing baremetal node, it looks as if PXE boot hangs. If introspection is restarted, it most often succeeds. CI works around this problem by introspecting node by node, checking each node after introspection, and rerunning introspection if the first attempt produced a timeout error.

Version-Release number of selected component (if applicable):

Happens in both director 7 and 8 installs

How reproducible:

Intermittent - when a full set of CI jobs is run for a poodle or puddle, typically one or two out of the 5 to 7 jobs will fail with a bulk introspection timeout. 

Steps to Reproduce:
1. Install OpenStack director (version 7 or 8).
2. Run bulk introspection: 'openstack baremetal introspection bulk start'
3. Watch the consoles of the nodes. Sometimes one or more of the nodes will hang in PXE boot.

Actual results:

Bulk introspection does not try to rerun. It outputs a message that introspection for xxx node did not complete successfully.

Expected results:

Bulk introspection should not fail intermittently. If it does, should there be an option to initiate a rerun?

Additional info:

Using single node introspection with retrying failed nodes seems to keep CI ticking over.
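
The workaround described above (introspect one node at a time, check the result, rerun on failure) can be sketched roughly as follows. This is only an illustration of the retry logic, not the actual CI code: the `start` and `succeeded` callables are hypothetical hooks that the caller would wire to whatever starts introspection for a single node and reports whether it finished without a timeout.

```python
import time
from typing import Callable, Dict, Iterable


def introspect_with_retry(
    node_ids: Iterable[str],
    start: Callable[[str], None],
    succeeded: Callable[[str], bool],
    max_attempts: int = 2,
    delay_seconds: float = 0.0,
) -> Dict[str, bool]:
    """Introspect nodes one at a time, rerunning any node that fails.

    `start` and `succeeded` are hypothetical hooks (assumptions, not part
    of any real client API): `start` kicks off introspection for one node
    and blocks until it ends, `succeeded` reports whether it completed.
    """
    results: Dict[str, bool] = {}
    for node_id in node_ids:
        ok = False
        for attempt in range(1, max_attempts + 1):
            start(node_id)
            ok = succeeded(node_id)
            if ok:
                break
            # Pause before the rerun, but not after the final attempt.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
        results[node_id] = ok
    return results
```

Nodes that still map to False after `max_attempts` are the ones that would need a look at the console.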

Comment 2 mkovacik 2016-02-29 12:45:12 UTC
Ronelle,

could you please attach the screen log from the ironic-inspector-dnsmasq unit?
This might be a duplicate of Bug #1301659.

Thanks,
milan

Comment 6 Mike Burns 2016-04-07 21:07:13 UTC
This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 7 Ronelle Landy 2016-04-12 13:36:13 UTC
I think this is a different issue: we see the hang while getting agent.ramdisk.

Comment 8 Dmitry Tantsur 2016-04-18 08:09:22 UTC
Got it. Then it might be an iPXE ROM problem (bug 1308611). Is it on VMs or BMs? We can try testing the latest packaged or upstream iPXE ROM. For VMs it involves updating the ROM on the virtual host system.

Comment 9 Ronelle Landy 2016-04-18 14:44:57 UTC
Dmitry, this is a problem almost exclusively on baremetal machines.
We see it very rarely, if ever anymore, on VMs.

Will the latest packaged iPXE ROM also help with the issues on BMs?

Comment 10 Dmitry Tantsur 2016-04-18 15:13:31 UTC
We've got a new ROM package for OSPd8. Please check with it, if you haven't already.

Comment 11 Dan Yocum 2016-04-21 14:44:15 UTC
(In reply to Dmitry Tantsur from comment #10)
> We've got a new ROM package for OSPd8. Please check with it, if you
> haven't already.

I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs:

ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch

NB: the git hash should match the ipxe version hash displayed when chainloading.
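
A minimal sketch of that check, assuming the hash shown on screen may be an abbreviated form of the one embedded in the package name (so a prefix match is used rather than strict equality). The helper names are illustrative only, and the displayed hash has to be read off the console by hand.

```python
import re
from typing import Optional


def package_git_hash(package_nvr: str) -> Optional[str]:
    """Extract the hex hash that follows '.git' in a package name."""
    match = re.search(r"\.git([0-9a-f]+)", package_nvr)
    return match.group(1) if match else None


def hashes_match(package_nvr: str, displayed_hash: str) -> bool:
    """Return True if the displayed hash and the package hash agree.

    Assumption: either value may be a shortened form of the other,
    so one only needs to be a prefix of the other.
    """
    pkg_hash = package_git_hash(package_nvr)
    shown = displayed_hash.strip().lower()
    if not pkg_hash or not shown:
        return False
    return pkg_hash.startswith(shown) or shown.startswith(pkg_hash)


pkg = "ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch"
print(package_git_hash(pkg))       # 6366fa7a
print(hashes_match(pkg, "6366f"))  # True
```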

Comment 12 Dmitry Tantsur 2016-05-09 12:30:53 UTC
Hi Ronelle! Does this issue still happen to you even with the updated iPXE package? If so, could you please take a screenshot of the machine booting, so that we can see the version of the iPXE ROM?

Comment 13 Ronelle Landy 2016-06-10 13:13:04 UTC
Hi Dmitry,

I still see this happen occasionally, but far less often than before. So far, I have not seen it happen with RDO master. I will keep watch.

Comment 14 Raoul Scarazzini 2016-06-10 14:02:52 UTC
Just as additional information: all problems of this kind were solved on my side using the workaround described here: https://bugzilla.redhat.com/show_bug.cgi?id=1324422

Comment 15 Dmitry Tantsur 2016-10-03 15:23:00 UTC
Hi! I assume this bug is fixed by shipping a newer iPXE ROM. Please let us know if it still happens even when using it.