Bug 1302892 - Discovery and Provisioning process is brittle, node issues that require manual intervention can end with deployments not recognizing the node that was just provisioned
Status: NEW
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: Installation - RHCI
Version: 1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Assigned To: Derek Whatley
QA Contact: Dave Johnson
Docs Contact: Dan Macpherson
Keywords: Triaged
Depends On:
Blocks:
Reported: 2016-01-28 16:56 EST by Matt Reid
Modified: 2016-07-28 13:46 EDT

Doc Type: Bug Fix
Type: Bug


Attachments
Error Message Displayed on Installation Progress screen (46.97 KB, image/png)
2016-01-28 17:16 EST, Matt Reid
Description Matt Reid 2016-01-28 16:56:29 EST
Description of problem:
In a recent deployment of RHCI trying to set up RHEV + CFME, I got myself into trouble while provisioning the two RHEV nodes, and despite our efforts to get back on track, we were unable to.

The blades in our environment are old and need manual rebooting and PXE selection during the discovery and provisioning process. While the unified installer was waiting for the hypervisor system to deploy (I needed to reboot it and ensure it PXE booted off the right NIC), I accidentally rebooted the engine system, and it was re-registered with Satellite. Discovered Hosts then showed two entries for it: one with the name I had set in the unified installer, and one named only by its MAC address. After the hypervisor was successfully deployed, I started the process for the engine, but when it rebooted to be provisioned, it was blocked from doing so: its MAC and IP addresses were recognized as duplicates of another system, and the system just looped, complaining about that.

We ended up deleting both entries from Discovered Hosts and manually provisioning the engine system, giving it the same name as originally specified in the unified installer. After the system was provisioned and online, we thought we were back on track, but the unified installer never recognized it as the system it was waiting for, and eventually timed out with the error "Couldn't find Host::Base with id=5".

It would be great if there were something we could do to make this process more robust. Having to throw out the deployment and start over hurts. I don't know if there's something we could do to recognize duplicate hosts so that they don't break the deployment, or some way to manually indicate that the node it was looking for had issues and that this other node is the one we now want it to use, or something smarter. I imagine issues like this would be less likely in a more modern environment, but mistakes happen, and it would be nice to have more leeway to get back on track when troubleshooting is needed.

Version-Release number of selected component (if applicable):
1-22 build
Comment 2 Matt Reid 2016-01-28 17:16 EST
Created attachment 1119302
Error Message Displayed on Installation Progress screen
Comment 3 Jason Montleon 2016-01-28 17:21:33 EST
The root of the problem is that rebooting a discovered host after it has been renamed on the RHEV configuration pages, but before it is converted to a managed host, causes a duplicate discovered host entry with a conflicting IP and MAC address. This will generally also cause the conversion from discovered host to managed host to fail.

The simplest strategy is probably to not rename the discovered host until we convert it to a managed host.
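
To illustrate the conflict described in this comment, here is a minimal Python sketch of how duplicate discovered-host entries could be spotted by MAC address. It is not the Satellite or RHCI implementation: the DiscoveredHost class, the find_duplicates helper, and the sample names and addresses are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscoveredHost:
    """Hypothetical stand-in for a Discovered Hosts entry; not a real Satellite model."""
    name: str
    mac: str
    ip: str


def normalize_mac(mac: str) -> str:
    # Compare MACs case-insensitively and treat '-' and ':' separators alike.
    return mac.lower().replace("-", ":")


def find_duplicates(hosts):
    """Group entries by MAC address and return only groups with more than one entry."""
    by_mac = defaultdict(list)
    for host in hosts:
        by_mac[normalize_mac(host.mac)].append(host)
    return {mac: group for mac, group in by_mac.items() if len(group) > 1}


# Illustrative data: the renamed engine entry plus the entry created by the reboot.
hosts = [
    DiscoveredHost("rhev-engine", "00:1A:2B:3C:4D:5E", "192.0.2.10"),
    DiscoveredHost("mac001a2b3c4d5e", "00:1a:2b:3c:4d:5e", "192.0.2.10"),
    DiscoveredHost("rhev-hypervisor", "00:1a:2b:3c:4d:5f", "192.0.2.11"),
]
duplicates = find_duplicates(hosts)
```

In this sample, the two engine entries land in the same group, which is the situation that blocked provisioning in the description.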
Comment 9 John Matthews 2016-07-28 11:26:55 EDT
We are deferring this to post-GA.

Once we come back to working this we would like the fix to ensure that no changes are made to the state of Satellite until the deploy button is clicked.

That is, let's not make any name changes to the discovered hosts during the UI selection process. We want to queue the changes and execute them only after Deploy is clicked.
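
A rough Python sketch of the queueing idea, under stated assumptions: the DeferredChanges class and its methods are hypothetical names, and the satellite_names dictionary only simulates host names stored in Satellite. It is not the planned fix, just one way the "queue now, apply on deploy" behavior could look.

```python
class DeferredChanges:
    """Collects rename requests during UI selection and applies them only on deploy."""

    def __init__(self):
        self._pending = {}  # host id -> requested name; a later request overwrites an earlier one

    def queue_rename(self, host_id, new_name):
        # Record the intent only; no external state is touched here.
        self._pending[host_id] = new_name

    def apply(self, rename_fn):
        """Call rename_fn(host_id, new_name) for each queued rename, then clear the queue."""
        applied = []
        for host_id, new_name in self._pending.items():
            rename_fn(host_id, new_name)
            applied.append((host_id, new_name))
        self._pending.clear()
        return applied


# Simulated Satellite state (hypothetical ids and names).
satellite_names = {4: "mac001a2b3c4d5e", 5: "mac001a2b3c4d5f"}

changes = DeferredChanges()
changes.queue_rename(5, "rhev-engine")        # user picks a name in the UI
names_before_deploy = dict(satellite_names)   # snapshot: nothing has changed yet

applied = changes.apply(satellite_names.__setitem__)  # "deploy" clicked
```

With this shape, a host that reboots during UI selection is still known to Satellite under its original discovered name, so it should not register a second, conflicting entry.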
