Bug 1228862 - Can `openstack undercloud install` have a --force-clean option so an error doesn't require restarting?
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: instack-undercloud
Version: Director
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: y1
Target Release: 7.0 (Kilo)
Assignee: James Slagle
QA Contact: Toure Dunnon
URL:
Whiteboard:
Keywords: Triaged, ZStream
Duplicates: 1266201
Depends On:
Blocks: 1191185 1243520
Reported: 2015-06-05 23:44 UTC by John Fulton
Modified: 2015-10-08 12:08 UTC (History)

Feature: 

The undercloud installation now reapplies changes from undercloud.conf on subsequent runs. Changed values in undercloud.conf are reapplied as needed so that the undercloud is configured as specified in undercloud.conf.

Note that the undercloud installer should not be rerun if an overcloud deployment has already been completed or is in progress. Further, the undercloud installer will intentionally fail to continue if the Neutron network is in use by a current or previous overcloud deployment.

Reason: 

The feature allows the installer to be rerun so that desired configuration changes can be applied to the undercloud, for example changes needed because of previous errors or changed requirements.

Result:
Clone Of:
Last Closed: 2015-10-08 12:08:52 UTC




External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2015:1862 | normal | SHIPPED_LIVE | Moderate: Red Hat Enterprise Linux OpenStack Platform 7 director update | 2015-10-08 16:05:50 UTC

Description John Fulton 2015-06-05 23:44:24 UTC
When `openstack undercloud install` is run a second time, it fails with the error below.

[2015/06/05 02:48:42 PM] [WARNING] DEPRECATED: falling back to /var/run/os-collect-config/os_config_files.json
+ NETWORK_GATEWAY=172.16.173.1
+ METADATA_SERVER=10.10.8.1
+ PHYSICAL_NETWORK=ctlplane
++ mktemp
+ NETWORK_JSON=/tmp/tmp.R3PuH2T5UD
+ jq .
+ setup-neutron -n /tmp/tmp.R3PuH2T5UD
/usr/lib/python2.7/site-packages/novaclient/v1_1/__init__.py:30: UserWarning: Module novaclient.v1_1 is deprecated (taken as a basis for novaclient.v2). The preferable way to get client class or object you can find in novaclient.client module.
  warnings.warn("Module novaclient.v1_1 is deprecated (taken as a basis for "
2015-06-05 14:48:43 - root - ERROR - Unexpected error during command execution
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/os_cloud_config/cmd/setup_neutron.py", line 77, in main
    keystone_client=keystone_client)
  File "/usr/lib/python2.7/site-packages/os_cloud_config/neutron.py", line 46, in initialize_neutron
    net = _create_net(neutron_client, network_desc, network_type, admin_tenant)
  File "/usr/lib/python2.7/site-packages/os_cloud_config/neutron.py", line 95, in _create_net
    return neutron.create_network({'network': network})
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 102, in with_params
    ret = self.function(instance, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 571, in create_network
    return self.post(self.networks_path, body=body)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 298, in post
    headers=headers, params=params)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 211, in do_request
    self._handle_fault_response(status_code, replybody)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 185, in _handle_fault_response
    exception_handler_v20(status_code, des_error_body)
  File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 70, in exception_handler_v20
    status_code=status_code)
Conflict: Unable to create the flat network. Physical network ctlplane is in use.
[2015-06-05 14:48:43,419] (os-refresh-config) [ERROR] during post-configure phase. [Command '['dib-run-parts', '/usr/libexec/os-refresh-config/post-configure.d']' returned non-zero exit status 1]

[2015-06-05 14:48:43,420] (os-refresh-config) [ERROR] Aborting...
ERROR: openstack Command 'instack-install-undercloud' returned non-zero exit status 1

Comment 3 John Fulton 2015-06-05 23:51:24 UTC
I saw the same error on the rdo-list [1]. If the ctlplane network is in use because neutron ports were allocated on the assigned subnet during a second run of instack-install-undercloud after a failed deployment [2], could instack-install-undercloud have a --force-clean option so that re-kicking the box is not necessary? Is there a simpler workaround than restarting clean?

  John

[1] https://www.redhat.com/archives/rdo-list/2015-April/msg00169.html
[2] In this case the install failed the first time for the reasons documented in bz1228438.

Comment 4 John Fulton 2015-06-09 00:08:19 UTC
Is deleting the ctlplane network (as shown by `neutron net-list`) via `neutron net-delete` on the undercloud node an alternative workaround that's easier than re-starting clean?

Comment 7 chris alfonso 2015-06-30 17:40:53 UTC
Keith, would you mind looking into this to see if you want the feature added for A1? John, James will add info for a workaround.

Comment 8 John Fulton 2015-07-16 16:36:08 UTC
Any update on the workaround James?

Comment 9 James Slagle 2015-07-16 17:52:03 UTC
delete the subnet first:

neutron subnet-list
neutron subnet-delete <subnet-uuid>
neutron net-delete ctlplane

that should get you around this issue.
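
For anyone scripting the workaround above, the ordering could be sketched in Python roughly as follows. This is only an illustration: the helper name `ctlplane_cleanup_commands` is hypothetical, the UUID is a placeholder, and the sketch only builds the `neutron` command lines from this comment without running anything.

```python
def ctlplane_cleanup_commands(subnet_ids, network="ctlplane"):
    """Return the neutron CLI calls in the order given in the workaround.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    # The workaround deletes the subnet(s) first, then the network itself.
    commands = [["neutron", "subnet-delete", subnet_id] for subnet_id in subnet_ids]
    commands.append(["neutron", "net-delete", network])
    return commands


if __name__ == "__main__":
    # "<subnet-uuid>" is a placeholder; take the real value from `neutron subnet-list`.
    for argv in ctlplane_cleanup_commands(["<subnet-uuid>"]):
        print(" ".join(argv))
```

Building the argument lists separately from executing them makes it easy to review the exact commands before passing each one to something like `subprocess.run` on the undercloud node.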

Comment 10 James Slagle 2015-08-26 21:04:07 UTC
What is the behavior of --force-clean that is being requested? Should it completely uninstall/reinstall the undercloud? (That would obviously lose any ability to manage a deployed overcloud.)

Comment 12 John Fulton 2015-08-31 20:39:49 UTC
The desired behavior of the --force-clean option that is being
requested is that it should completely uninstall/reinstall the
undercloud. Thus, if the physical network ctlplane was in use, 
it would do something like the following: 

neutron subnet-list
neutron subnet-delete <subnet-uuid>
neutron net-delete ctlplane

But it would also remove other changes that came about from 
installing the undercloud which might get in the way when 
`openstack undercloud install` is run on top of a partial
installation. 

The issue is that it is time consuming for someone installing an
undercloud to have to re-install RHEL, just to be sure anything 
that the undercloud put in place is removed. 

You said "[you] would obviously lose any ability to manage a deployed
overcloud". If the undercloud isn't yet installed, why would we have
an overcloud to lose the ability to manage? This use case would apply
to a situation where the overcloud doesn't yet exist.

Comment 14 James Slagle 2015-09-09 14:10:13 UTC
The goal of the installer is to be re-runnable without having to manually clean anything up. The specific issue that caused the error message about the neutron network being in use has been fixed (bug 1228438).

In 7.1 we can also account for the neutron network possibly being in use because of some other unexpected failure from a previous install attempt (that work is tracked by this bug).

We can do without the need for a --force-clean option that uninstalls and cleans up any system changes that were made by the installer. To accomplish the use case that was driving the request of that option, we will document how to make a change to the undercloud.conf configuration file and rerun the installer so that the requested changes are applied.

If further testing of the documented solution uncovers additional bugs with this method, we can address those as needed.

The recommended solution for a true --force-clean option that will completely reset the system to a pristine state is to run the undercloud in a VM and use VM snapshots to roll back to a known good state.

Comment 15 Graeme Gillies 2015-09-10 05:22:02 UTC
Please note that there is another use case when this occurs.

When doing backups and restoration of the undercloud, you restore the configuration files and databases, before finally running openstack undercloud install so that it will recreate the ovs bridges etc.

Overall what needs to happen is two things

* openstack undercloud install by default needs to be fully idempotent. No matter what, I should be able to run it multiple times and it shouldn't fail (it should detect everything is done/correct and not do anything, only making the changes it needs to do). This would solve the case where you are restoring the undercloud from backup (deleting the ctlplane network in this case is not an option as you want the ctlplane network to have the same uuid in neutron as it always has, not changed).

* openstack cli needs a --force-cleanup or something similar in the cases where you do want it to trash all existing data and completely start from a clean slate (for whatever reason)
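
To make the first point concrete, here is a minimal sketch of a "create only if missing" step, which is one common way to make such an operation safe to repeat. It is not the os-cloud-config code from the traceback in the description. `FakeNeutron` is a hypothetical in-memory stand-in whose `list_networks` and `create_network` methods are assumed to mirror the call shape of python-neutronclient, and `ensure_network` is an illustrative name.

```python
class FakeNeutron:
    """In-memory stand-in for a Neutron client, used only for this sketch."""

    def __init__(self):
        self._networks = []

    def list_networks(self, **filters):
        # Return every stored network whose fields match all given filters.
        matches = [
            net for net in self._networks
            if all(net.get(key) == value for key, value in filters.items())
        ]
        return {"networks": matches}

    def create_network(self, body):
        # Assign a simple sequential id in place of a real UUID.
        net = dict(body["network"], id="net-%d" % (len(self._networks) + 1))
        self._networks.append(net)
        return {"network": net}


def ensure_network(client, name):
    """Return the existing network called `name`, creating it only if absent."""
    existing = client.list_networks(name=name)["networks"]
    if existing:
        # Reuse the network so its id stays the same across reruns.
        return existing[0]
    return client.create_network({"network": {"name": name}})["network"]


if __name__ == "__main__":
    client = FakeNeutron()
    first = ensure_network(client, "ctlplane")
    second = ensure_network(client, "ctlplane")
    print(first["id"] == second["id"])
```

Because the second call finds the existing network instead of creating a new one, a rerun neither fails with a conflict nor changes the network's id, which is the property the backup/restore case depends on.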

Regards,

Graeme

Comment 16 Jorn Argelo 2015-09-10 07:15:13 UTC
So as a user I figured it may be worthwhile to weigh in with my €0.02: a --force-cleanup option would be highly desirable! I've re-kickstarted my RDO lab box several times because you have no clue what state the undercloud deployment is in. Upgrading RPMs that require an altered database will also put you in a spot where dropping the databases only gets the tables re-created, but the tables themselves are not filled with the data the undercloud expects.

I figured out that removing /opt/stack/.undercloud-setup and rerunning undercloud install is a way to get the tables filled again, but it's quite time consuming to figure out this stuff by yourself. So yeah, being able to consistently start from scratch, especially in a lab environment, is very useful indeed.

Comment 17 James Slagle 2015-09-10 15:38:40 UTC
(In reply to Graeme Gillies from comment #15)
> Please note that there is another use case when this occurs.
> 
> When doing backups and restoration of the undercloud, you restore the
> configuration files and databases, before finally running openstack
> undercloud install so that it will recreate the ovs bridges etc.
> 
> Overall what needs to happen is two things
> 
> * openstack undercloud install by default needs to be fully idempotent. No
> matter what, I should be able to run it multiple times and it shouldn't fail
> (it should detect everything is done/correct and not do anything, only
> making the changes it needs to do). This would solve the case where you are
> restoring the undercloud from backup (deleting the ctlplane network in this
> case is not an option as you want the ctlplane network to have the same uuid
> in neutron as it always has, not changed)

you may like to file an RFE for this if there isn't one already.

I'm in agreement with the spirit of what you're asking for, but not sure it's quite this simple. There are valid times I'd think the installer should stop running and not apply requested changes. Such as if the ctlplane is in use (as you say), yet someone is mistakenly trying to change the subnet IP range via undercloud.conf and rerunning the installer.
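
As a rough sketch of that kind of guard (not the installer's actual logic), a rerun could compare the requested range against the current one and refuse only when the range changes while ports are allocated. The function name, the exception, and the `ports_in_use` count are all hypothetical; how the current CIDR and port count would be obtained is left out.

```python
import ipaddress


class NetworkInUseError(Exception):
    """Raised when a rerun would change a control plane subnet that is in use."""


def check_subnet_change(current_cidr, requested_cidr, ports_in_use):
    """Return True if the subnet needs updating, False if nothing changes.

    Raises NetworkInUseError when the range differs and ports are allocated.
    """
    current = ipaddress.ip_network(current_cidr)
    requested = ipaddress.ip_network(requested_cidr)
    if current == requested:
        # Same range: rerunning is harmless, nothing to apply.
        return False
    if ports_in_use > 0:
        raise NetworkInUseError(
            "refusing to change %s to %s: %d port(s) allocated"
            % (current, requested, ports_in_use)
        )
    return True
```

With this shape, an unchanged undercloud.conf reruns cleanly even when the network is in use, and only an actual range change on a network with allocated ports stops the run.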

> 
> * openstack cli needs a --force-cleanup or something similar in the cases
> where you do want it to trash all existing data and completely start from a
> clean slate (for whatever reason)

i'm not convinced a --force-cleanup flag is the correct technical solution to the feature you're requesting. I'd like to focus on the feature, and not the requested technical implementation. It sounds like the feature is "trash all existing data and completely start from a clean slate".

We do already have superior solutions to that, far more effective than any programmatic/scripted (puppet, whatever) based solution could ever accomplish.

One such wrinkle is that puppet knows about the resources it's told about. Suppose wrong repos were enabled, packages/deps installed from the repos, and an error encountered. My feeling around the expectation of a --force-clean would be that it would completely clean all of that up. That problem is probably solvable with enough code, but I do contend that such a solution might not be the best way to accomplish such a thing. There may very well be existing puppet modules that solve this problem, but it's a question of identifying all such scenarios and solving them all in order to at least say we strive for 100% consistency with a --force-clean.

Otherwise, we could say, it's best effort. But, again, that doesn't sound like what you're asking for.

I think a container based undercloud could/would offer a --force-clean. So perhaps the answer is wrapped up in that.

Anyway, the plan for this particular bugzilla was to fix the reported problem and solve it as I laid out in comment 14.

Please file an RFE if you'd like to continue the discussion about these other 2 points in a new bugzilla.

Comment 18 Graeme Gillies 2015-09-11 00:18:39 UTC
(In reply to James Slagle from comment #17)
> (In reply to Graeme Gillies from comment #15)
> > Please note that there is another use case when this occurs.
> > 
> > When doing backups and restoration of the undercloud, you restore the
> > configuration files and databases, before finally running openstack
> > undercloud install so that it will recreate the ovs bridges etc.
> > 
> > Overall what needs to happen is two things
> > 
> > * openstack undercloud install by default needs to be fully idempotent. No
> > matter what, I should be able to run it multiple times and it shouldn't fail
> > (it should detect everything is done/correct and not do anything, only
> > making the changes it needs to do). This would solve the case where you are
> > restoring the undercloud from backup (deleting the ctlplane network in this
> > case is not an option as you want the ctlplane network to have the same uuid
> > in neutron as it always has, not changed)
> 
> you may like to file an RFE for this if there isn't one already.
> 
> I'm in agreement with the spirit of what you're asking for, but not sure
> it's quite this simple. There are valid times I'd think the installer should
> stop running and not apply requested changes. Such as if the ctlplane is in
> use (as you say), yet someone is mistakenly trying to change the subnet IP
> range via undercloud.conf and rerunning the installer.

Oh yes definitely agreed.

> 
> > 
> > * openstack cli needs a --force-cleanup or something similar in the cases
> > where you do want it to trash all existing data and completely start from a
> > clean slate (for whatever reason)
> 
> i'm not convinced a --force-cleanup flag is the correct technical solution
> to the feature you're requesting. I'd like to focus on the feature, and not
> the requested technical implementation. It sounds like the feature is "trash
> all existing data and completely start from a clean slate".
> 
> We do already have superior solutions to that, far more effective than any
> programmatic/scripted (puppet, whatever) based solution could ever
> accomplish.
> 
> One such wrinkle is that puppet knows about the resources it's told about.
> Suppose wrong repos were enabled, packages/deps installed from the repos,
> and an error encountered. My feeling around the expectation of a
> --force-clean would be that it would completely clean all of that up. That
> problem is probably solvable with enough code, but I do contend that such a
> solution might not be the best way to accomplish such a thing. There may
> very well be existing puppet modules that solve this problem, but it's a
> question of identifying all such scenarios and solving them all in order to
> at least say we strive for 100% consistency with a --force-clean.
> 
> Otherwise, we could say, it's best effort. But, again, that doesn't sound
> like what you're asking for.
> 
> I think a container based undercloud could/would offer a --force-clean. So
> perhaps the answer is wrapped up in that.
> 
> Anyway, the plan for this particular bugzilla was to fix the reported
> problem and solve it as I laid out in comment 14.
> 
> Please file an RFE if you'd like to continue the discussion about these
> other 2 points in a new bugzilla.

Agreed also. You're right, I probably got too far into the implementation instead of expressing the idea. We want to be able to "revert" the undercloud state in some fashion, and as you mentioned, doing that completely correctly is incredibly hard without immutable infrastructure patterns, so deferring it until we achieve that (which I believe is on the roadmap) makes sense.

Regards,

Graeme

Comment 19 James Slagle 2015-09-11 21:58:24 UTC
i still need to add the doc text here

Comment 21 Omri Hochman 2015-09-24 18:00:08 UTC
Verified with :
----------------
instack-0.0.7-1.el7ost.noarch
instack-undercloud-2.1.2-26.el7ost.noarch



After the undercloud was installed, I changed the parameters in undercloud.conf and reran `openstack undercloud install`.

Results: 
----------
The undercloud reinstalled successfully and the changes from undercloud.conf took effect.

Comment 22 James Slagle 2015-09-24 20:49:35 UTC
*** Bug 1266201 has been marked as a duplicate of this bug. ***

Comment 23 James Slagle 2015-09-24 21:01:01 UTC
moving back to MODIFIED to indicate there's a new build available.

the additional verification should be covered by doing a HA deploy on baremetal per:
https://bugzilla.redhat.com/show_bug.cgi?id=1266201

Comment 25 James Slagle 2015-09-25 19:57:48 UTC
although the spec was updated, the patch wasn't included in the build. not sure how that happened, but I didn't notice.

rebuilding again into instack-undercloud-2.1.2-29.el7ost

Comment 27 Mike Burns 2015-09-25 20:38:02 UTC
The workaround for this if using an older build is to manually restart the openstack-nova-api service prior to deploying.

Comment 28 Omri Hochman 2015-09-30 21:55:54 UTC
Verified with: instack-undercloud-2.1.2-29.el7ost.noarch 

Deployed the undercloud, redeployed it with the same command, and then successfully deployed an HA overcloud.

Comment 30 errata-xmlrpc 2015-10-08 12:08:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:1862

