Bug 1233159 - Periodic cleanup of non-final clusters moves the cluster into Error instead of removing it
Summary: Periodic cleanup of non-final clusters moves the cluster into Error instead of removing it
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-sahara
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: beta
Target Release: 8.0 (Liberty)
Assignee: Elise Gafford
QA Contact: Luigi Toscano
URL:
Whiteboard:
Depends On:
Blocks: 1189502
 
Reported: 2015-06-18 11:04 UTC by Luigi Toscano
Modified: 2016-04-07 21:01 UTC
CC List: 8 users

Fixed In Version: openstack-sahara-3.0.0-3.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, the tenant context information was not available to the periodic task responsible for cleaning up stale clusters. With this update, temporary trusts are established between the tenant and admin, allowing the periodic job to use this trust to delete stale clusters.
Clone Of:
Environment:
Last Closed: 2016-04-07 21:01:57 UTC
Target Upstream Version:
Embargoed:




Links:
- Launchpad 1468722
- Red Hat Product Errata RHEA-2016:0603 (SHIPPED_LIVE): Red Hat OpenStack Platform 8 Enhancement Advisory, last updated 2016-04-08 00:53:53 UTC

Description Luigi Toscano 2015-06-18 11:04:32 UTC
Description of problem:

Since Kilo, Sahara can run a periodic job to clean up stale clusters, that is, clusters stuck in a non-final state (any state other than Active, Error, or Deleting).
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html

In order to test this, the cleanup_time_for_incomplete_clusters key was set to 1 (i.e. 1 hour).
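
For reference, the sahara.conf change is roughly the following (I believe both options live in [DEFAULT]; double-check against the packaged defaults):

[DEFAULT]
# Periodic "reaper" for clusters stuck in non-final states;
# the value is in hours, and 0 (the default) disables the cleanup.
cleanup_time_for_incomplete_clusters = 1
# The cleanup path is gated on Identity API v3 being enabled.
use_identity_api_v3 = true
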
A suggested way to trigger the "reaper" is to restart the -engine daemon while the cluster is in the initialization phase ("Waiting" state).
What I observed is that the cleanup process is triggered 1 hour after the restart of the -engine daemon, but something goes wrong and the cluster is moved into the "Error" state:

2015-06-17 20:35:51.343 29763 DEBUG sahara.service.periodic [-] Terminating old clusters in non-final state terminate_incomplete_clusters /usr/lib/python2.7/site-packages/sahara/service/periodic.py:174
2015-06-17 20:35:51.444 29763 DEBUG sahara.service.periodic [-] Terminating incomplete cluster cccc2 in "Waiting" state with id 55715c6c-b5dd-4ae1-905b-92c8089ac142 terminate_cluster /usr/lib/python2.7/site-packages/sahara/service/periodic.py:90
2015-06-17 20:35:51.771 29763 ERROR sahara.service.ops [-] Error during operating on cluster cccc2 (reason: Service "compute" not found in service catalog
Error ID: 0367176a-933b-447a-8b91-30bf0a7eef63)
2015-06-17 20:35:52.008 29763 ERROR sahara.service.ops [-] Error during rollback of cluster cccc2 (reason: Service "compute" not found in service catalog
Error ID: 78e16133-8a20-417e-86a5-ebb9b29f7eda)


I'm opening the issue here first because it could be related to this specific setup. If it turns out to be more general, it will be cloned on Launchpad.


Version-Release number of selected component (if applicable):
openstack-sahara-common-2015.1.0-2.el7ost.noarch
openstack-sahara-engine-2015.1.0-2.el7ost.noarch
openstack-sahara-api-2015.1.0-2.el7ost.noarch
python-saharaclient-0.9.0-1.el7ost.noarch

Comment 2 Elise Gafford 2015-06-23 23:29:28 UTC
Hi Luigi,

After some wrestling, I believe that this is a bug, that it affects the upstream, and that it is not a blocker.

Initially, I found a similar result in the master upstream codebase (for heat rather than nova, but otherwise identical).

Essentially, the issue here is that:
1) The periodic cluster cleanup task depends on the existence of a trust_id (see sahara.service.trusts.trusts.use_os_admin_auth_token and sahara.context.get_admin_context).
2) The trust_id is only populated for transient clusters (see sahara.service.ops._provision_cluster).
3) The periodic cluster cleanup task selects clusters based only on CONF.use_identity_api_v3, without regard to whether the cluster has a trust_id (see sahara.service.periodic.terminate_cluster).

So the logical gate to create a trust seems to be more restrictive than the logical gate to enter the periodic cleanup flow, which depends on the existence of that trust to populate its context.
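
To make the mismatch concrete, here is a condensed paraphrase of the two gates (illustrative pseudocode, not verbatim from the modules cited above):

# sahara/service/periodic.py -- gate into the cleanup (paraphrased):
if CONF.use_identity_api_v3:                 # gate 1: config flag only
    terminate_cluster(ctx, cluster)          # assumes a usable admin context

# sahara/service/ops.py -- but the trust backing that admin context
# is only created for transient clusters (paraphrased):
if cluster.is_transient:                     # gate 2: stricter than gate 1
    trusts.use_os_admin_auth_token(cluster)  # populates cluster.trust_id

# A non-transient cluster therefore passes gate 1 without a trust_id;
# its "admin" context lacks a service catalog, and termination fails
# with 'Service "compute" not found in service catalog'.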

This issue can be fixed in one of three ways:
1) Gate cluster cleanup on whether a cluster's trust_id exists (sketched after this list). This allows only transient clusters to be cleaned; other clusters would enter an "AwaitingTermination" state.
2) Always establish a trust if CONF.use_identity_api_v3 (even if the cluster isn't transient). This allows any cluster in a bad state to be cleaned. (I'll note that in this case, we should also do 1 for the deep corner case of backward-compatibility.) This isn't necessarily desirable: if a long-running cluster fails, and it automatically self-destructs, critical debug information is lost.
3) Change the way sahara.context.get_admin_context works, so that it actually populates a fully functional, cross-tenant admin context with a real client, and not just an empty object that says it is an admin but lacks certain nice-to-haves like a keystone URL until they're given to it later in a code path that isn't guaranteed to happen.
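
As a rough sketch of option 1 (hypothetical code: the guard, the conductor call, and the 'AwaitingTermination' status string are illustrative, not shipped code):

def terminate_cluster(ctx, cluster):
    # Hypothetical guard: only clean up when a trust actually exists.
    if CONF.use_identity_api_v3 and cluster.trust_id:
        trusts.use_os_admin_auth_token(cluster)
        # ... existing termination flow continues here ...
    else:
        # No trust to act under: park the cluster rather than letting
        # the termination attempt fail it into Error.
        conductor.cluster_update(ctx, cluster,
                                 {'status': 'AwaitingTermination'})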

So this is a real issue, and it exists both upstream and downstream. I'm removing the blocker flag, though, because it only affects the case in which a long-running cluster times out during creation (the flow as it exists should work for transient clusters). Having these clusters sit around to debug if needed is actually useful, and the desired 'AwaitingTermination' state is effectively synonymous with 'ErrorByTimeout', so it seems a little harsh to block the release just because the state says 'Error' instead; 'Error' is really pretty correct. This does render the specified feature less than fully functional (it works only for transient clusters), and we should fix it, but blocker still seems too severe for the issue. Please speak up if you disagree now that we understand the scope better.

Feel free to file an upstream bug, or to ask me to; if you file, please assign me; I understand the code path to take for each of the above solutions.

Also, I will note that if you permit yourself a bit of hackery, and haven't already: changing the spacing kwarg of the periodic_task.periodic_task decorator on sahara.service.periodic.terminate_incomplete_clusters, and the multiplier applied to CONF.cleanup_time_for_incomplete_clusters in the same method (say, both to 60), will drastically speed up your runs of this test if you're doing it manually. :)
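
Concretely, the throwaway local edit looks something like this (paraphrased; the exact code around the decorator and the multiplier in periodic.py may differ):

# Local testing hack only -- do not ship.
@periodic_task.periodic_task(spacing=60)    # check for stale clusters every minute
def terminate_incomplete_clusters(self, ctx):
    ...
    # Treat cleanup_time_for_incomplete_clusters as minutes rather than
    # hours by shrinking the multiplier from 3600 to 60:
    max_cluster_age = CONF.cleanup_time_for_incomplete_clusters * 60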

Thanks,
Ethan

Comment 3 Luigi Toscano 2015-06-24 10:17:08 UTC
(In reply to Ethan Gafford from comment #2)
> Hi Luigi,
> 

Thanks for the analysis!

> So this is a real issue, and it's both upstream and downstream. I'm removing
> the blocker flag, though, because it only affects the case in which a
> long-running cluster times out its creation (the flow as it exists should
> work for transient clusters.) As having these clusters sit around to debug
> if needed is actually useful, and as the desired 'AwaitingTermination' state
> is effectively synonymous with 'ErrorByTimeout', it seems a little harsh to
> block release because it says 'Error' instead.

I disagree: if I read the blueprint correctly:
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
the feature was created exactly to clean up long-running clusters that failed for any reason.

Moreover, it's not enabled by default, so enabling it would be a conscious choice by the OpenStack operator.

Asking upstream would probably help.

> Feel free to file an upstream bug, or to ask me to; if you file, please
> assign me; I understand the code path to take for each of the above
> solutions.

I will do, thanks.

Comment 4 Elise Gafford 2015-06-24 18:31:37 UTC
Recording for posterity that, per IRC conversation, I agree with tosky that this bug breaks the feature. Blocker status is still undetermined.

I'll discuss with alazarev once the upstream bug is posted, and we'll move from there.

Thanks,
Ethan

Comment 5 Elise Gafford 2015-06-29 15:17:55 UTC
Per kbasil, after review of this issue:

(11:05:10 AM) kbasil: egafford, tosky: work it upstream as normal, but with a security focus.. it's not a blocker.
(11:06:08 AM) tosky: noslzzp, egafford: should we remove the RFE which is then not working from RHOS7 errata, and readd it in a point release when it's fixed?
(11:06:25 AM) kbasil: tosky: yep, point release..

Comment 6 Elise Gafford 2015-07-16 17:54:20 UTC
Changes in review upstream:

Spec: https://review.openstack.org/#/c/200275/
Code: https://review.openstack.org/#/c/200719/

Comment 7 Elise Gafford 2015-09-09 16:44:32 UTC
Agreed not to backport this fix, as it is a fairly major security feature that would be required in RHOS 7 only to enable a fairly minor convenience feature. Pushing validation to RHOS 8 against the upstream codebase (where this patch has landed at this time).

Comment 11 Luigi Toscano 2016-03-04 17:20:55 UTC
The steps in the bug "Description" field now lead to successful removal of a cluster in a non-final state after the requested recheck time.

Verified on:
openstack-sahara-api-3.0.1-1.el7ost.noarch
openstack-sahara-common-3.0.1-1.el7ost.noarch
openstack-sahara-engine-3.0.1-1.el7ost.noarch

Comment 13 errata-xmlrpc 2016-04-07 21:01:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0603.html

