Bug 1189502

Summary: [RFE][sahara]: Clean up clusters that are in non-final state for a long time
Product: Red Hat OpenStack
Reporter: RHOS Integration <rhos-integ>
Component: openstack-sahara
Assignee: Elise Gafford <egafford>
Status: CLOSED ERRATA
QA Contact: Luigi Toscano <ltoscano>
Severity: high
Priority: high
Version: unspecified
CC: dnavale, jschluet, kbasil, markmc, matt, mimccune, sclewis, yeylon
Target Milestone: beta
Keywords: FutureFeature
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
URL: https://blueprints.launchpad.net/sahara/+spec/periodic-cleanup
Whiteboard: upstream_milestone_kilo-3 upstream_definition_approved upstream_status_implemented
Fixed In Version: openstack-sahara-3.0.0-3.el7ost
Doc Type: Enhancement
Doc Text:
With this update, configuration settings now exist to set timeouts after which clusters that have failed to reach the 'Active' state are automatically deleted.
Last Closed: 2016-04-07 21:00:28 UTC
Bug Depends On: 1233159

Description RHOS Integration 2015-02-05 13:47:15 UTC
Cloned from launchpad blueprint https://blueprints.launchpad.net/sahara/+spec/periodic-cleanup.

Description:

Currently a sahara cluster can become stuck for various reasons (e.g. if the sahara service was restarted during provisioning, or neutron failed to assign a floating IP). Such clusters can hold resources for a long time. This can happen in different tenants, and it is hard to check for such conditions manually.

Proposed solution: delete a sahara cluster in a non-final state if it has not been updated for a long time.

Specification URL (additional information):

http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
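
The proposed solution above can be sketched roughly as follows. This is an illustrative sketch only, not sahara's actual implementation: the `Cluster` class, the `FINAL_STATES` set, and the function names are hypothetical, and both the choice of which statuses count as final and the treatment of a non-positive timeout as "disabled" are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: statuses treated as "final" for this sketch. The bug report
# itself only names 'Active' as the state a healthy cluster should reach.
FINAL_STATES = {"Active", "Error", "Deleting"}


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster record."""
    name: str
    status: str
    updated_at: datetime


def is_stale(cluster, now, max_age_hours):
    """Return True if the cluster is in a non-final state and has not
    been updated for more than max_age_hours."""
    if max_age_hours <= 0:
        # Assumption: a non-positive timeout disables the cleanup.
        return False
    if cluster.status in FINAL_STATES:
        return False
    return now - cluster.updated_at > timedelta(hours=max_age_hours)


def select_stale(clusters, now, max_age_hours):
    """Pick the clusters a periodic cleanup job would delete."""
    return [c for c in clusters if is_stale(c, now, max_age_hours)]


if __name__ == "__main__":
    now = datetime(2016, 3, 4, 17, 0, tzinfo=timezone.utc)
    clusters = [
        Cluster("stuck", "Spawning", now - timedelta(hours=2)),
        Cluster("fresh", "Waiting", now - timedelta(minutes=10)),
        Cluster("healthy", "Active", now - timedelta(days=3)),
    ]
    for c in select_stale(clusters, now, max_age_hours=1):
        print("would delete:", c.name)
```

Only the cluster that is both in a non-final state and older than the timeout is selected; a recently updated cluster that is still provisioning is left alone.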

Comment 5 Elise Gafford 2015-06-26 18:41:39 UTC
Hi Keith,

The feature "[RFE][sahara]: Clean up clusters that are in non-final state for a long time" [1] (documented upstream at [2]), is bugged. Luigi found that there is an error on security context creation [3][4]: at present, the job attempts to clean up all old clusters, but requires a delegated trust id in order to do so. Such trust ids are only created for transient clusters. 

In order to properly fix this issue, I have signed on to implement a new feature (creating delegated trusts on long-running clusters, to be deleted after provisioning is complete [5]). I will complete this task; it's relatively simple technically. However, there is a question of timeline.

The original RFE was, effectively, never completed successfully upstream. At this point, in order to finish it for RHOS 7, we'll be implementing a new, security-related feature, potentially without significant time for upstream review. As the RFE feature is a convenience feature, it seems to me that this is an unnecessary risk, and that the prudent course would be to let the security change go through its full upstream process before backporting to stable/kilo and to RHOS 7 (likely in the first point release.)

However, if we think the feature is important enough for RHOS 7 that it warrants rapid implementation and backport, I'm happy to mark this as a blocker and do that; just looking for your opinion given the new data.

Thanks,
Ethan

[1]: RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1189502
[2]: Spec: http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
[3]: Bug on RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1233159
[4]: Upstream bug: https://bugs.launchpad.net/sahara/+bug/1468722
[5]: Feature required to finish RFE: https://blueprints.launchpad.net/sahara/+spec/cluster-creation-with-trust

Comment 7 Elise Gafford 2015-09-09 16:38:33 UTC
Agreed not to backport this feature to RHOS 7, as it is only repairable via addition of a fairly major security feature. Pushing to RHOS 8.

Comment 11 Luigi Toscano 2016-03-04 17:29:54 UTC
With cleanup_time_for_incomplete_clusters set to 1 (== 1 hour) and cluster provisioning forcibly interrupted (by restarting the -engine daemon while the cluster is in the initialization phase), the periodic cluster cleanup process is triggered once that interval has elapsed, and the cluster stuck in a non-final state ("Spawning", "Waiting" or "Preparing") is removed.

Verified on:
openstack-sahara-api-3.0.1-1.el7ost.noarch
openstack-sahara-common-3.0.1-1.el7ost.noarch
openstack-sahara-engine-3.0.1-1.el7ost.noarch
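
For reference, the setting used in the verification above could be read from a configuration file along these lines. This is a sketch under assumptions: placing the option in the `[DEFAULT]` section and falling back to 0 when it is absent are assumptions, the `cleanup_timeout_hours` helper is hypothetical, and the sahara services themselves do not read their configuration this way.

```python
import configparser

# Assumption: the option lives in the [DEFAULT] section of the sahara
# configuration file. The value is in hours, as in the verification above.
SAMPLE_CONF = """
[DEFAULT]
cleanup_time_for_incomplete_clusters = 1
"""


def cleanup_timeout_hours(conf_text):
    """Hypothetical helper: return the cleanup timeout in hours, or 0
    (an assumed fallback) if the option is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getint(
        "DEFAULT", "cleanup_time_for_incomplete_clusters", fallback=0
    )


if __name__ == "__main__":
    print("cleanup timeout (hours):", cleanup_timeout_hours(SAMPLE_CONF))
```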

Comment 13 errata-xmlrpc 2016-04-07 21:00:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0603.html