Bug 1302892 - Discovery and Provisioning process is brittle, node issues that require manual intervention can end with deployments not recognizing the node that was just provisioned
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Quickstart Cloud Installer
Classification: Red Hat
Component: Installation - RHCI
Version: 1.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: Derek Whatley
QA Contact: Dave Johnson
Dan Macpherson
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-01-28 21:56 UTC by Matt Reid
Modified: 2020-05-08 20:53 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-08 20:53:57 UTC
Target Upstream Version:
Embargoed:


Attachments
Error Message Displayed on Installation Progress screen (46.97 KB, image/png)
2016-01-28 22:16 UTC, Matt Reid

Description Matt Reid 2016-01-28 21:56:29 UTC
Description of problem:
In a recent deployment of RHCI trying to set up RHEV + CFME, I got into trouble while provisioning the two RHEV nodes, and despite our efforts to get back on track, we were unable to recover.

The blades in our environment are old and need manual rebooting and PXE selection during the discovery and provisioning process. While the unified installer was waiting for the hypervisor system to deploy (I needed to reboot it and ensure it PXE booted off the right NIC), I accidentally rebooted the engine system, and it was re-registered with Satellite. Discovered Hosts then showed two entries for it: one with the name I had set in the unified installer, and one that was just the MAC address. After the hypervisor was successfully deployed, I started the process for the engine, but when it rebooted to be provisioned, it was blocked from doing so: its MAC and IP addresses were recognized as duplicates of another system's, and the system just looped, complaining about that.

We ended up deleting both entries from Discovered Hosts and manually provisioning the engine system, giving it the same name as originally specified in the unified installer. After the system was provisioned and online, we thought we were back on track, but the unified installer never recognized it as the system it was waiting for, and eventually timed out with the error "Couldn't find Host::Base with id=5".

It would be great if there were something we could do to make this process more robust. Having to throw out the deployment and start over again hurts. I don't know if there's something we could do to recognize duplicate hosts so they don't break the deployment, or some way to manually indicate that the node it was looking for had issues and that this other node is the one we now want it to use, or something smarter. I imagine running into issues like this would be less likely in a more modern environment, but mistakes happen and it would be nice if we had more leeway to get back on track when troubleshooting has to happen.

Version-Release number of selected component (if applicable):
1-22 build

Comment 2 Matt Reid 2016-01-28 22:16:00 UTC
Created attachment 1119302
Error Message Displayed on Installation Progress screen

Comment 3 Jason Montleon 2016-01-28 22:21:33 UTC
The root of the problem is that rebooting a discovered host after it has been renamed on the RHEV configuration pages, but before it has been converted to a managed host, creates a duplicate discovered host entry with a conflicting IP and MAC address. This will generally also cause the conversion of the host from discovered to managed to fail.

The simplest strategy is probably to not rename the discovered host until we convert it to a managed host.
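
For illustration only, the duplicate-entry scenario above can be modeled with a small in-memory registry. This is a sketch under assumptions, not Satellite or Foreman code: `DiscoveredHostRegistry`, `register`, and `rename` are hypothetical names, and the MAC and IP values are made up. It shows one possible way to recognize a re-registering node by its MAC address so that an accidental reboot refreshes the existing entry instead of adding a conflicting one.

```python
from dataclasses import dataclass


@dataclass
class DiscoveredHost:
    name: str
    mac: str
    ip: str


class DiscoveredHostRegistry:
    """Hypothetical in-memory stand-in for the Discovered Hosts list."""

    def __init__(self):
        self._by_mac = {}

    def register(self, mac, ip):
        """Called each time a node boots into discovery (assumed behavior)."""
        key = mac.lower()
        existing = self._by_mac.get(key)
        if existing is not None:
            # Same NIC seen again (e.g. an accidental reboot): refresh the
            # existing entry and keep its name instead of adding a duplicate.
            existing.ip = ip
            return existing
        # First sighting: default name derived from the MAC address.
        host = DiscoveredHost(name="mac" + key.replace(":", ""), mac=key, ip=ip)
        self._by_mac[key] = host
        return host

    def rename(self, mac, new_name):
        self._by_mac[mac.lower()].name = new_name

    def hosts(self):
        return list(self._by_mac.values())


registry = DiscoveredHostRegistry()
registry.register("52:54:00:AA:BB:CC", "192.0.2.10")
registry.rename("52:54:00:AA:BB:CC", "rhev-engine")

# Accidental reboot: the node registers itself a second time.
again = registry.register("52:54:00:aa:bb:cc", "192.0.2.10")
```

In this model the second registration returns the already-renamed entry, so there is still exactly one host for that MAC address. Whether the real discovery workflow can be made to behave this way is an open question for the fix.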

Comment 9 John Matthews 2016-07-28 15:26:55 UTC
We are deferring this to post-GA.

When we come back to this, we would like the fix to ensure that no changes are made to the state of Satellite until the deploy button is clicked.

That is, let's not make any name changes to the discovered hosts during the UI selection process. We want to queue the changes and execute them only after deploy is clicked.
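
A minimal sketch of the queue-then-execute idea, for illustration only. Everything here is an assumption rather than the installer's actual design: `PendingChanges`, `queue_rename`, and `apply` are hypothetical names, and a plain dict stands in for Satellite's host records.

```python
class PendingChanges:
    """Hypothetical queue of host renames collected during UI selection."""

    def __init__(self):
        self._queue = []

    def queue_rename(self, host_id, new_name):
        # Record the intent only; the host store is not touched here.
        self._queue.append((host_id, new_name))

    def apply(self, host_names):
        """Run every queued rename against host_names (id -> name)."""
        # Validate first so a missing host does not leave a partial update.
        for host_id, _ in self._queue:
            if host_id not in host_names:
                raise KeyError(f"no host with id={host_id}")
        for host_id, new_name in self._queue:
            host_names[host_id] = new_name
        self._queue.clear()


# Stand-in for Satellite's state: discovered host id -> current name.
satellite_hosts = {4: "mac525400aabbcc", 5: "mac525400ddeeff"}

pending = PendingChanges()
pending.queue_rename(4, "rhev-hypervisor")
pending.queue_rename(5, "rhev-engine")
names_before_deploy = dict(satellite_hosts)  # still the discovery names

pending.apply(satellite_hosts)  # the user clicks Deploy
```

Until `apply` runs, a host that reboots would still be registered under its original discovery name, so in this model the selection step cannot produce a renamed entry for the reboot to conflict with.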

