Bug 1233159
| Summary: | Periodic cleanup of non-final clusters moves the cluster into Error instead of removing it | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Luigi Toscano <ltoscano> |
| Component: | openstack-sahara | Assignee: | Elise Gafford <egafford> |
| Status: | CLOSED ERRATA | QA Contact: | Luigi Toscano <ltoscano> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 (Liberty) | CC: | dnavale, jschluet, kbasil, ltoscano, matt, mimccune, pkshiras, yeylon |
| Target Milestone: | beta | | |
| Target Release: | 8.0 (Liberty) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-sahara-3.0.0-3.el7ost | Doc Type: | Bug Fix |
| Doc Text: | Previously, the tenant context information was not available to the periodic task responsible for cleaning up stale clusters. With this update, temporary trusts are established between the tenant and admin, allowing the periodic job to use this trust to delete stale clusters. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-04-07 21:01:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1189502 | | |
Description
Luigi Toscano
2015-06-18 11:04:32 UTC
Hi Luigi,

After some wrestling, I believe that this is a bug, that it affects the upstream, and that it is not a blocker. Initially, I found a similar result in the master upstream codebase (for heat rather than nova, but otherwise identical).

Essentially, the issue here is that:

1) The periodic cluster cleanup task depends on the existence of a trust_id (see sahara.service.trusts.trusts.use_os_admin_auth_token and sahara.context.get_admin_context).
2) The trust_id is only populated for transient clusters (see sahara.service.ops._provision_cluster).
3) The periodic cluster cleanup task selects clusters based only on CONF.use_identity_api_v3, without regard to whether the cluster has a trust_id (see sahara.service.periodic.terminate_cluster).

So the logical gate to create a trust is more restrictive than the logical gate to enter the periodic cleanup flow, even though that flow depends on the existence of the trust to populate its context.

This issue can be fixed in one of three ways:

1) Gate cluster cleanup on whether a cluster's trust_id exists. This allows only transient clusters to be cleaned; other clusters will enter an "AwaitingTermination" state.
2) Always establish a trust if CONF.use_identity_api_v3 (even if the cluster isn't transient). This allows any cluster in a bad state to be cleaned. (I'll note that in this case, we should also do 1 for the deep corner case of backward compatibility.) This isn't necessarily desirable: if a long-running cluster fails and automatically self-destructs, critical debug information is lost.
3) Change the way sahara.context.get_admin_context works, so that it actually populates a fully functional, cross-tenant admin context with a real client, and not just an empty object that claims to be an admin but lacks essentials such as a keystone URL until they are supplied later in a code path that isn't guaranteed to happen.

So this is a real issue, and it exists both upstream and downstream. I'm removing the blocker flag, though, because it only affects the case in which a long-running cluster times out during creation (the flow as it exists should work for transient clusters). As having these clusters sit around to debug if needed is actually useful, and as the desired 'AwaitingTermination' state is effectively synonymous with 'ErrorByTimeout', it seems a little harsh to block release because it says 'Error' instead. 'Error' is really pretty correct. This does render the specified feature less than fully functional (only functional for transient clusters), and we should fix this, but blocker still seems severe to me for the issue. Please speak up if you disagree now that we understand the scope better.

Feel free to file an upstream bug, or to ask me to; if you file, please assign it to me; I understand the code path to take for each of the above solutions.

Also, I will note that if you permit yourself a bit of hackery, and haven't already, changing the periodic_task.periodic_task spacing kwarg on the decorator on sahara.service.periodic.terminate_incomplete_clusters and the multiplier of CONF.cleanup_time_for_incomplete_clusters in the same method (say, both to 60) will drastically speed up your runs of this test if you're doing it manually. :)

Thanks,
Ethan
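A minimal sketch of the mismatch described above. The option and function names follow the comment; the bodies, the dict-based cluster records, and the helper are illustrative stubs, not the actual Sahara implementation, and spacing=60 simply reflects the manual-testing tip.

```python
# Hypothetical, condensed model of the gate mismatch -- not Sahara code.
from oslo_config import cfg
from oslo_service import periodic_task

CONF = cfg.CONF
CONF.register_opts([
    cfg.BoolOpt('use_identity_api_v3', default=True),
    cfg.IntOpt('cleanup_time_for_incomplete_clusters', default=1),  # hours
])


def _create_trust(cluster):
    # Stand-in for the real Keystone trust creation.
    return 'trust-for-%s' % cluster['name']


def provision_cluster(cluster):
    # Narrow gate: a trust_id is stored only for *transient* clusters.
    if CONF.use_identity_api_v3 and cluster['is_transient']:
        cluster['trust_id'] = _create_trust(cluster)


# Per the tip above: a small spacing value speeds up manual testing.
@periodic_task.periodic_task(spacing=60)
def terminate_incomplete_clusters(context, stale_clusters):
    # Wide gate: only the config flag is checked, so long-running clusters
    # that never received a trust_id are selected as well...
    if not CONF.use_identity_api_v3:
        return
    for cluster in stale_clusters:
        if cluster.get('trust_id'):
            cluster['status'] = 'Deleted'
        else:
            # ...and for those, the admin context has no tenant credentials,
            # so termination fails and the cluster lands in 'Error'.
            cluster['status'] = 'Error'
```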
(In reply to Ethan Gafford from comment #2)
> Hi Luigi,
> Thanks for the analysis!
>
> So this is a real issue, and it's both upstream and downstream. I'm removing
> the blocker flag, though, because it only affects the case in which a
> long-running cluster times out its creation (the flow as it exists should
> work for transient clusters.) As having these clusters sit around to debug
> if needed is actually useful, and as the desired 'AwaitingTermination' state
> is effectively synonymous with 'ErrorByTimeout', it seems a little harsh to
> block release because it says 'Error' instead.

I disagree: if I read the blueprint correctly:
http://specs.openstack.org/openstack/sahara-specs/specs/kilo/periodic-cleanup.html
the feature was created exactly to clean up long-running clusters which failed for any reason. Moreover, it's not enabled by default, so it would be a conscious choice of the OpenStack operator. Asking upstream would probably help.

> Feel free to file an upstream bug, or to ask me to; if you file, please
> assign me; I understand the code path to take for each of the above
> solutions.

I will do, thanks.

Recording for posterity that I agree with tosky that the bug fails the feature, per IRC conversation. Blocker status still undetermined. I'll discuss with alazarev once the upstream bug is posted, and we'll move from there.

Thanks,
Ethan

Per kbasil, after review of this issue:

(11:05:10 AM) kbasil: egafford, tosky: work it upstream as normal, but with a security focus.. it's not a blocker.
(11:06:08 AM) tosky: noslzzp, egafford: should we remove the RFE which is then not working from RHOS7 errata, and readd it in a point release when it's fixed?
(11:06:25 AM) kbasil: tosky: yep, point release..

Changes in review upstream:

Spec: https://review.openstack.org/#/c/200275/
Code: https://review.openstack.org/#/c/200719/

Agreed not to backport this fix, as it is a fairly major security feature which is required in RHOS 7 only to enable a fairly minor convenience feature. Pushing validation to RHOS 8 against the upstream codebase (where this patch has landed at this time).

The steps in the bug "Description" field now lead to a successful removal of the cluster in a non-final state after the requested recheck time.

Verified on:
openstack-sahara-api-3.0.1-1.el7ost.noarch
openstack-sahara-common-3.0.1-1.el7ost.noarch
openstack-sahara-engine-3.0.1-1.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0603.html
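For reference, a conceptual sketch of the fix direction recorded in the Doc Text above: establish a temporary trust between the tenant and admin, and let the periodic job authenticate through that trust to delete stale clusters. This is not the upstream patch (see the review links above); cluster['trust_id'], the admin credential parameters, and delete_cluster_resources() are assumed names, and keystoneauth1 is used only to show what a trust-scoped session looks like.

```python
# Conceptual sketch only -- not the upstream change.
from keystoneauth1 import session
from keystoneauth1.identity import v3


def trust_scoped_session(auth_url, admin_user, admin_password, trust_id):
    # Authenticate as the admin user, scoped to the tenant's project through
    # the stored trust, so the cleanup job can act on the tenant's resources.
    auth = v3.Password(auth_url=auth_url,
                       username=admin_user,
                       password=admin_password,
                       user_domain_name='Default',
                       trust_id=trust_id)
    return session.Session(auth=auth)


def delete_cluster_resources(cluster, sess):
    # Stub: the real code would tear down the cluster's resources (stack,
    # instances, DB record) using the trust-scoped session.
    pass


def cleanup_stale_cluster(cluster, auth_url, admin_user, admin_password):
    sess = trust_scoped_session(auth_url, admin_user, admin_password,
                                cluster['trust_id'])
    # With a trust-scoped session the periodic job can remove the cluster
    # instead of failing and leaving it in 'Error'.
    delete_cluster_resources(cluster, sess)
```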