Bug 1233159
| Summary: | Periodic cleanup of non-final clusters moves the cluster into Error instead of removing it | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Luigi Toscano <ltoscano> |
| Component: | openstack-sahara | Assignee: | Elise Gafford <egafford> |
| Status: | CLOSED ERRATA | QA Contact: | Luigi Toscano <ltoscano> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 (Liberty) | CC: | dnavale, jschluet, kbasil, ltoscano, matt, mimccune, pkshiras, yeylon |
| Target Milestone: | beta | | |
| Target Release: | 8.0 (Liberty) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-sahara-3.0.0-3.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Previously, the tenant context information was not available to the periodic task responsible for cleaning up stale clusters. With this update, temporary trusts are established between the tenant and admin, allowing the periodic job to use this trust to delete stale clusters. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-04-07 21:01:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1189502 | | |
Description
Luigi Toscano
2015-06-18 11:04:32 UTC
Hi Luigi,

After some wrestling, I believe that this is a bug, that it affects the upstream, and that it is not a blocker. Initially, I found a similar result in the master upstream codebase (for heat rather than nova, but otherwise identical).

Essentially, the issue here is that:

1) The periodic cluster cleanup task depends on the existence of a trust_id (see sahara.service.trusts.trusts.use_os_admin_auth_token and sahara.context.get_admin_context).
2) The trust_id is only populated for transient clusters (see sahara.service.ops._provision_cluster).
3) The periodic cluster cleanup task selects clusters based only on CONF.use_identity_api_v3, without regard to whether the cluster has a trust_id (see sahara.service.periodic.terminate_cluster).

So the logical gate to create a trust is more restrictive than the logical gate to enter the periodic cleanup flow, even though that flow depends on the existence of the trust to populate its context.

This issue can be fixed in one of three ways:

1) Gate cluster cleanup on whether a cluster's trust_id exists. This allows only transient clusters to be cleaned; other clusters will enter an "AwaitingTermination" state.
2) Always establish a trust if CONF.use_identity_api_v3 (even if the cluster isn't transient). This allows any cluster in a bad state to be cleaned. (I'll note that in this case, we should also do 1 for the deep corner case of backward compatibility.) This isn't necessarily desirable: if a long-running cluster fails and automatically self-destructs, critical debug information is lost.
3) Change the way sahara.context.get_admin_context works, so that it actually populates a fully functional, cross-tenant admin context with a real client, and not just an empty object that claims to be an admin but lacks essentials such as a keystone URL until they are supplied later in a code path that isn't guaranteed to happen.

So this is a real issue, and it exists both upstream and downstream. I'm removing the blocker flag, though, because it only affects the case in which a long-running cluster times out during creation (the flow as it exists should work for transient clusters). As having these clusters sit around to debug if needed is actually useful, and as the desired 'AwaitingTermination' state is effectively synonymous with 'ErrorByTimeout', it seems a little harsh to block release because it says 'Error' instead. 'Error' is really pretty correct. This does render the specified feature less than fully functional (only functional for transient clusters), and we should fix this, but blocker still seems severe to me for the issue. Please speak up if you disagree now that we understand the scope better.

Feel free to file an upstream bug, or to ask me to; if you file, please assign it to me; I understand the code path to take for each of the above solutions.

Also, I will note that if you permit yourself a bit of hackery, and haven't already, changing the periodic_task.periodic_task spacing kwarg on the decorator on sahara.service.periodic.terminate_incomplete_clusters and the multiplier of CONF.cleanup_time_for_incomplete_clusters in the same method (say, both to 60) will drastically speed up your runs of this test if you're doing it manually. :)

Thanks,
Ethan
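A minimal sketch of the mismatch described above. The option and function names follow the comment; the bodies, the dict-based cluster records, and the helper are illustrative stubs, not the actual Sahara implementation, and spacing=60 simply reflects the manual-testing tip.

```python
# Hypothetical, condensed model of the gate mismatch -- not Sahara code.
from oslo_config import cfg
from oslo_service import periodic_task

CONF = cfg.CONF
CONF.register_opts([
    cfg.BoolOpt('use_identity_api_v3', default=True),
    cfg.IntOpt('cleanup_time_for_incomplete_clusters', default=1),  # hours
])


def _create_trust(cluster):
    # Stand-in for the real Keystone trust creation.
    return 'trust-for-%s' % cluster['name']


def provision_cluster(cluster):
    # Narrow gate: a trust_id is stored only for *transient* clusters.
    if CONF.use_identity_api_v3 and cluster['is_transient']:
        cluster['trust_id'] = _create_trust(cluster)


# Per the tip above: a small spacing value speeds up manual testing.
@periodic_task.periodic_task(spacing=60)
def terminate_incomplete_clusters(context, stale_clusters):
    # Wide gate: only the config flag is checked, so long-running clusters
    # that never received a trust_id are selected as well...
    if not CONF.use_identity_api_v3:
        return
    for cluster in stale_clusters:
        if cluster.get('trust_id'):
            cluster['status'] = 'Deleted'
        else:
            # ...and for those, the admin context has no tenant credentials,
            # so termination fails and the cluster lands in 'Error'.
            cluster['status'] = 'Error'
```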
(In reply to Ethan Gafford from comment #2)
> Hi Luigi,
> Thanks for the analysis!
>
> So this is a real issue, and it's both upstream and downstream. I'm removing
> the blocker flag, though, because it only affects the case in which a
> long-running cluster times out its creation (the flow as it exists should
> work for transient clusters.) As having these clusters sit around to debug
> if needed is actually useful, and as the desired 'AwaitingTermination' state
> is effectively synonymous with 'ErrorByTimeout', it seems a little harsh to
> block release because it says 'Error' instead.

I disagree: if I read the blueprint correctly:
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
the feature was created exactly to clean up long-running clusters which failed for any reason. Moreover, it's not enabled by default, so it would be a conscious choice of the OpenStack operator. Asking upstream would probably help.

> Feel free to file an upstream bug, or to ask me to; if you file, please
> assign me; I understand the code path to take for each of the above
> solutions.

I will do, thanks.

Recording for posterity that I agree with tosky that the bug fails the feature, per IRC conversation. Blocker status still undetermined. I'll discuss with alazarev once the upstream bug is posted, and we'll move from there.

Thanks,
Ethan

Per kbasil, after review of this issue:

(11:05:10 AM) kbasil: egafford, tosky: work it upstream as normal, but with a security focus.. it's not a blocker.
(11:06:08 AM) tosky: noslzzp, egafford: should we remove the RFE which is then not working from RHOS7 errata, and readd it in a point release when it's fixed?
(11:06:25 AM) kbasil: tosky: yep, point release..

Changes in review upstream:

Spec: https://review.openstack.org/#/c/200275/
Code: https://review.openstack.org/#/c/200719/

Agreed not to backport this fix, as it is a fairly major security feature which is required in RHOS 7 only to enable a fairly minor convenience feature. Pushing validation to RHOS 8 against the upstream codebase (where this patch has landed at this time).

The steps in the bug "Description" field now lead to a successful removal of the cluster in a non-final state after the requested recheck time.

Verified on:
openstack-sahara-api-3.0.1-1.el7ost.noarch
openstack-sahara-common-3.0.1-1.el7ost.noarch
openstack-sahara-engine-3.0.1-1.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0603.html
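For reference, a conceptual sketch of the fix direction recorded in the Doc Text above: establish a temporary trust between the tenant and admin, and let the periodic job authenticate through that trust to delete stale clusters. This is not the upstream patch (see the review links above); cluster['trust_id'], the admin credential parameters, and delete_cluster_resources() are assumed names, and keystoneauth1 is used only to show what a trust-scoped session looks like.

```python
# Conceptual sketch only -- not the upstream change.
from keystoneauth1 import session
from keystoneauth1.identity import v3


def trust_scoped_session(auth_url, admin_user, admin_password, trust_id):
    # Authenticate as the admin user, scoped to the tenant's project through
    # the stored trust, so the cleanup job can act on the tenant's resources.
    auth = v3.Password(auth_url=auth_url,
                       username=admin_user,
                       password=admin_password,
                       user_domain_name='Default',
                       trust_id=trust_id)
    return session.Session(auth=auth)


def delete_cluster_resources(cluster, sess):
    # Stub: the real code would tear down the cluster's resources (stack,
    # instances, DB record) using the trust-scoped session.
    pass


def cleanup_stale_cluster(cluster, auth_url, admin_user, admin_password):
    sess = trust_scoped_session(auth_url, admin_user, admin_password,
                                cluster['trust_id'])
    # With a trust-scoped session the periodic job can remove the cluster
    # instead of failing and leaving it in 'Error'.
    delete_cluster_resources(cluster, sess)
```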