Cloned from launchpad blueprint https://blueprints.launchpad.net/sahara/+spec/periodic-cleanup.

Description: Currently, a sahara cluster can become stuck for various reasons (e.g. if the sahara service was restarted during provisioning, or neutron failed to assign a floating IP). Stuck clusters can hold resources for a long time. This can happen in different tenants, and such conditions are hard to check for manually.

Proposed solution: delete a sahara cluster that is in a non-final state if it has not been updated for a long time.

Specification URL (additional information): http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
Hi Keith,

The feature "[RFE][sahara]: Clean up clusters that are in non-final state for a long time" [1] (documented upstream at [2]) is bugged. Luigi found that there is an error on security context creation [3][4]: at present, the job attempts to clean up all old clusters, but requires a delegated trust id in order to do so. Such trust ids are only created for transient clusters.

In order to properly fix this issue, I have signed on to implement a new feature (creating delegated trusts on long-running clusters, to be deleted after provisioning is complete [5]). I will complete this task; it's relatively simple technically.

However, there is a question of timeline. The original RFE was, effectively, never completed successfully upstream. At this point, in order to finish it for RHOS 7, we'll be implementing a new, security-related feature, potentially without significant time for upstream review. As the RFE feature is a convenience feature, it seems to me that this is an unnecessary risk, and that the prudent course would be to let the security change go through its full upstream process before backporting to stable/kilo and to RHOS 7 (likely in the first point release).

However, if we think the feature is important enough for RHOS 7 that it warrants rapid implementation and backport, I'm happy to mark this as a blocker and do that; just looking for your opinion given the new data.

Thanks,
Ethan

[1]: RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1189502
[2]: Spec: http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
[3]: Bug on RFE: https://bugzilla.redhat.com/show_bug.cgi?id=1233159
[4]: Upstream bug: https://bugs.launchpad.net/sahara/+bug/1468722
[5]: Feature required to finish RFE: https://blueprints.launchpad.net/sahara/+spec/cluster-creation-with-trust
Agreed not to backport this feature to RHOS 7, as it is only repairable via addition of a fairly major security feature. Pushing to RHOS 8.
If cleanup_time_for_incomplete_clusters is set to 1 (== 1 hour) and cluster provisioning is forcibly interrupted (by restarting the sahara-engine daemon while the cluster is in the initialization phase), the periodic cluster cleanup process is triggered once the cleanup_time_for_incomplete_clusters interval has elapsed, and the cluster in a non-final state ("Spawning", "Waiting" or "Preparing") is removed.

Verified on:
openstack-sahara-api-3.0.1-1.el7ost.noarch
openstack-sahara-common-3.0.1-1.el7ost.noarch
openstack-sahara-engine-3.0.1-1.el7ost.noarch
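For readers unfamiliar with the feature, the selection rule exercised in the verification above can be sketched roughly as follows. This is an illustrative sketch only, not sahara's actual code: the `Cluster` class, the `select_stale_clusters` helper and the `FINAL_STATUSES` set are hypothetical names, and both the set of final states and the treatment of a value of 0 as "cleanup disabled" are assumptions not stated in this bug.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: these statuses count as "final". The bug only names
# "Spawning", "Waiting" and "Preparing" as examples of non-final states.
FINAL_STATUSES = {"Active", "Deleting", "Error"}


@dataclass
class Cluster:
    """Hypothetical stand-in for a cluster record."""
    name: str
    status: str
    updated_at: datetime


def select_stale_clusters(clusters, cleanup_hours, now):
    """Return clusters in a non-final state not updated for cleanup_hours.

    cleanup_hours plays the role of cleanup_time_for_incomplete_clusters.
    Assumption: a value of 0 (or less) disables the cleanup entirely.
    """
    if cleanup_hours <= 0:
        return []
    cutoff = now - timedelta(hours=cleanup_hours)
    return [
        c for c in clusters
        if c.status not in FINAL_STATUSES and c.updated_at < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2016, 1, 1, 12, 0, 0)
    clusters = [
        Cluster("stuck", "Spawning", now - timedelta(hours=2)),
        Cluster("fresh", "Waiting", now - timedelta(minutes=10)),
        Cluster("done", "Active", now - timedelta(hours=5)),
    ]
    # With a 1 hour threshold only the "stuck" cluster qualifies.
    print([c.name for c in select_stale_clusters(clusters, 1, now)])
```

In the real service the selected clusters would then be deleted by the periodic task, which is the step that needs the delegated trust discussed earlier in this bug; the sketch deliberately stops at selection.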
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0603.html