Bug 1916956
| Summary: | openstack usage tool giving error 'nova.exception.OrphanedObjectError' | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Allan Greentree <agreentr> | |
| Component: | openstack-nova | Assignee: | melanie witt <mwitt> | |
| Status: | CLOSED ERRATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | medium | |||
| Version: | 16.1 (Train) | CC: | dasmith, eglynn, gkadam, jhakimra, jmelvin, jparker, kchamart, mhofmann, mporrato, mwitt, nweinber, rhopp, sbauza, sgordon, vromanso | |
| Target Milestone: | beta | Keywords: | Triaged | |
| Target Release: | 17.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | openstack-nova-23.0.0-0.20210326171618.4a285b1.el8ost | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1918039 1944899 2203391 (view as bug list) | Environment: | ||
| Last Closed: | 2022-09-21 12:13:29 UTC | Type: | Bug | |
| Bug Blocks: | 1918039, 1944899, 1944913, 2203391 | |||
Description
Allan Greentree
2021-01-15 23:34:38 UTC
The root cause of this problem is a partially archived database in which an 'instances' record marked as deleted (deleted=ID) remains in the database after one or more of its dependent records (for example, 'instance_extra') have already been moved to the shadow table(s). As a result, any API that can access deleted (but not purged) instances will raise an OrphanedObjectError.

This isn't surprising, considering how the archive operation works. Today, we reverse-sort the tables so that "leaves" come first, so that our deletes of records [from the normal tables] don't get blocked by foreign key constraint violations. This means that all the records that depend on an instance are archived before the 'instances' table record itself. If archiving happens to stop in the middle of a particular instance's records, you hit this bug.

The 'nova-manage db archive_deleted_rows' command is intended to run to completion, either by running it repeatedly until it returns 0 (meaning there is no more work to do) or by running it with the --until-complete option. The --until-complete option can be problematic because there is no way to pause between batches of rows: it runs in a tight loop and might cause issues in a busy environment where many other requests are trying to write to the database.

I've opened an upstream bug, https://bugs.launchpad.net/nova/+bug/1912579, and proposed a patch that adds a --sleep option, https://review.opendev.org/c/openstack/nova/+/771749, to address the problem with --until-complete.

To [mostly] prevent this from happening, we would have to change the archive logic to keep all of an instance's records "together" and move them to the shadow tables in a single transaction. AFAIK, we don't have support for cascading deletes in all or most of the nova tables, so the logic would likely have to be implemented manually.
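The failure mode described above (dependent rows archived first, the batch ending before the parent row is reached) can be reproduced in miniature. The sketch below is not nova's archive code: the two-table schema, the `archive_batch` and `orphaned_instances` helpers, and the use of SQLite are all assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical, heavily simplified schema; real nova tables have many more columns.
SCHEMA = """
CREATE TABLE instances (id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER);
CREATE TABLE instance_extra (id INTEGER PRIMARY KEY, instance_uuid TEXT, deleted INTEGER);
CREATE TABLE shadow_instances (id INTEGER PRIMARY KEY, uuid TEXT, deleted INTEGER);
CREATE TABLE shadow_instance_extra (id INTEGER PRIMARY KEY, instance_uuid TEXT, deleted INTEGER);
"""

# Leaf table first, parent table last, mirroring the ordering described above.
ARCHIVE_ORDER = ["instance_extra", "instances"]


def archive_batch(conn, max_rows):
    """Move up to max_rows deleted rows into the shadow tables; return rows moved."""
    moved = 0
    for table in ARCHIVE_ORDER:
        budget = max_rows - moved
        if budget <= 0:
            break  # row budget exhausted before reaching the parent table
        ids = [row[0] for row in conn.execute(
            f"SELECT id FROM {table} WHERE deleted != 0 ORDER BY id LIMIT ?",
            (budget,))]
        for row_id in ids:
            conn.execute(
                f"INSERT INTO shadow_{table} SELECT * FROM {table} WHERE id = ?",
                (row_id,))
            conn.execute(f"DELETE FROM {table} WHERE id = ?", (row_id,))
        moved += len(ids)
    conn.commit()
    return moved


def orphaned_instances(conn):
    """Deleted instances whose instance_extra row has left the main table."""
    rows = conn.execute(
        "SELECT i.uuid FROM instances i "
        "LEFT JOIN instance_extra e ON e.instance_uuid = i.uuid "
        "WHERE i.deleted != 0 AND e.id IS NULL ORDER BY i.uuid")
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for i, uuid in enumerate(["uuid-a", "uuid-b"], start=1):
        # deleted is set to the row id, following the deleted=ID convention.
        conn.execute("INSERT INTO instances VALUES (?, ?, ?)", (i, uuid, i))
        conn.execute("INSERT INTO instance_extra VALUES (?, ?, ?)", (i, uuid, i))
    first = archive_batch(conn, max_rows=2)  # one batch only: leaves are moved
    after_first = orphaned_instances(conn)
    while archive_batch(conn, max_rows=2):   # keep going until nothing is left
        pass
    after_all = orphaned_instances(conn)
    return first, after_first, after_all
```

Calling `demo()` runs a single batch, lists the instances left without their dependent rows, then archives to completion and lists them again. Stopping after the first batch corresponds to the partially archived state in this report; running until a batch moves nothing clears it.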
The potential issues with such a change are (1) complexity, (2) maintenance if/when more tables related to an instance are added, and (3) the fact that some records related to an instance reside in a separate database (for example, nova_api), so they cannot be moved in a single transaction alongside the nova/nova_cell0/nova_cell1 database records and could still end up separated if the process is interrupted for some reason.

We need to discuss as a team whether we should implement something like the single-transaction approach described above, or whether it is enough to provide a --sleep option and augment the documentation to emphasize that archive_deleted_rows needs to be run to completion whenever it is run.

On the triage call last Friday, we agreed to let the --sleep option and the command documentation serve as mitigations for this issue in the nearer term, with the docs stating [1]:

"Return Codes

| Return code | Description |
|---|---|
| 0 | Nothing was archived. |
| 1 | Some number of rows were archived. |
| 2 | Invalid value for --max_rows. |
| 3 | No connection to the API database could be established using api_database.connection. |
| 4 | Invalid value for --before. |
| 255 | An unexpected error occurred. |

If automating, this should be run continuously while the result is 1, stopping at 0, or use the --until-complete option."

It is important to run the database archive command to completion, until there is nothing left to archive, whenever it is run. Future work can include investigating whether to follow FK relationships during database archiving to help prevent separation of instance records.

[1] https://docs.openstack.org/nova/train/cli/nova-manage.html#nova-database

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543
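The documented guidance to keep running the command while it returns 1 and stop at 0 can be sketched as a small wrapper that also pauses between batches. This is only an illustration and an external stand-in for the proposed --sleep option, not that option itself: the function names, the `pause_seconds` parameter, and the default batch size of 1000 are assumptions, and only the `--max_rows` flag and the return codes 0 and 1 are taken from the documentation quoted above.

```python
import subprocess
import time

# Return codes as quoted from the nova-manage documentation.
NOTHING_ARCHIVED = 0
ROWS_ARCHIVED = 1


def run_archive_batch(max_rows=1000):
    """Run one batch of the archive command and return its exit code."""
    cmd = ["nova-manage", "db", "archive_deleted_rows",
           "--max_rows", str(max_rows)]
    return subprocess.run(cmd, check=False).returncode


def archive_until_complete(run_batch=run_archive_batch, pause_seconds=5.0,
                           sleep=time.sleep):
    """Call run_batch until it reports nothing left to archive.

    Returns the number of batches that archived rows. Any exit code other
    than 0 or 1 is treated as an error and stops the loop.
    """
    batches = 0
    while True:
        code = run_batch()
        if code == NOTHING_ARCHIVED:
            return batches
        if code != ROWS_ARCHIVED:
            raise RuntimeError(
                f"archive_deleted_rows exited with code {code}")
        batches += 1
        # Hypothetical pause so other database writers get a chance to run.
        sleep(pause_seconds)
```

Passing `run_batch` and `sleep` as parameters is a design choice for this sketch only: it lets the loop be exercised without a nova deployment by substituting a function that returns a scripted sequence of exit codes.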