Bug 1916956

Summary: openstack usage tool giving error 'nova.exception.OrphanedObjectError'
Product: Red Hat OpenStack
Component: openstack-nova
Version: 16.1 (Train)
Reporter: Allan Greentree <agreentr>
Assignee: melanie witt <mwitt>
Status: CLOSED ERRATA
QA Contact: OSP DFG:Compute <osp-dfg-compute>
Severity: medium
Priority: medium
CC: dasmith, eglynn, gkadam, jhakimra, jmelvin, jparker, kchamart, mhofmann, mporrato, mwitt, nweinber, rhopp, sbauza, sgordon, vromanso
Target Milestone: beta
Target Release: 17.0
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-nova-23.0.0-0.20210326171618.4a285b1.el8ost
Last Closed: 2022-09-21 12:13:29 UTC
Type: Bug
Bug Blocks: 1918039, 1944899, 1944913, 2203391

Description Allan Greentree 2021-01-15 23:34:38 UTC
Description of problem:

$ openstack usage list                                                                                                                                                                                                               

Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.OrphanedObjectError'> (HTTP 500) (Request-ID: req-6fbb3b4c-039e-44cf-b6df-9e60c4e94c13)

This appears to occur when an instance uuid is present in nova.instances and nova.shadow_instance_extra but missing from the nova.instance_extra table, and the usage utility treats this state as an error.

Additionally observed: when a db instance archiving run completes (moving instance records from instance_extra to shadow_instance_extra), things are momentarily fixed. Subsequent archiving runs then cause the issue to appear again, in a cycle.
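The orphaned state described above can be checked for directly. The following is a sketch that reproduces the condition on a toy sqlite3 schema; the real tables live in the nova cell MySQL database, and the table and column names here simply mirror the ones mentioned above. The LEFT JOIN is the essence of the check one would run there:

```shell
# Toy schema: a deleted instance whose instance_extra row has already
# been archived away is exactly the state that triggers
# OrphanedObjectError in the usage API.
db=$(mktemp)
sqlite3 "$db" <<'SQL'
CREATE TABLE instances (uuid TEXT, deleted INTEGER);
CREATE TABLE instance_extra (instance_uuid TEXT);
-- A deleted instance whose instance_extra row is already gone:
INSERT INTO instances VALUES ('orphan-uuid', 1);
-- A deleted instance whose dependent row is still present:
INSERT INTO instances VALUES ('ok-uuid', 1);
INSERT INTO instance_extra VALUES ('ok-uuid');
SQL
# List deleted instances with no matching instance_extra row.
orphans=$(sqlite3 "$db" "
    SELECT i.uuid FROM instances i
    LEFT JOIN instance_extra ie ON ie.instance_uuid = i.uuid
    WHERE i.deleted != 0 AND ie.instance_uuid IS NULL;")
echo "orphaned: $orphans"
rm -f "$db"
```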

Version-Release number of selected component (if applicable):

How reproducible:

always

Steps to Reproduce:

To reproduce in a devstack VM, create the following script:

#!/bin/bash
# Repeatedly create and delete batches of servers to build up
# soft-deleted rows for archiving to act on.

BATCH_SIZE=5
LOOPS=50

for i in $(seq $LOOPS); do
        echo Step $i
        # Boot a batch of throwaway servers.
        openstack server create --flavor m1.tiny --key-name mporrato --image cirros-0.5.1-x86_64-disk --network private --min $BATCH_SIZE --max $BATCH_SIZE deleteme
        # Wait for the whole batch to become ACTIVE.
        while [ $(openstack server list -n --name deleteme- --status active -f value -c ID | wc -l) -lt $BATCH_SIZE ] ; do
                sleep 5
        done
        # Delete the batch and wait for all of the servers to go away.
        openstack server list -n --name deleteme- -f value -c ID | xargs openstack server delete
        while [ $(openstack server list -n --name deleteme- --status active -f value -c ID | wc -l) -gt 0 ] ; do
                sleep 5
        done
done

Let it run for about half an hour and then, while it is still running, start manually running:

nova-manage db archive_deleted_rows --verbose --max_rows 350 --all-cells

until it starts reporting that it deleted some rows from instance_extra:

stack@devstack:~/devstack$ nova-manage db archive_deleted_rows --verbose --max_rows 350 --all-cells
+----------------------------+-------------------------+
| Table                      | Number of Rows Archived |
+----------------------------+-------------------------+
| cell1.instance_actions     | 11                      |
| cell1.instance_extra       | 197                     |
| cell1.instance_info_caches | 142                     |
+----------------------------+-------------------------+
NOTE: --all-cells is only needed on devstack

Actual results:

stack@devstack:~/devstack$ openstack usage show
Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.OrphanedObjectError'> (HTTP 500) (Request-ID: req-17dd0411-7fed-4165-821f-5369c8b0cff1)

Expected results:

for example:
(overcloud) [stack@undercloud-0 ~]$ openstack usage show
Usage from 2020-12-18 to 2021-01-16 on project aad08502db9d4b2ea413633d4278dfab: 
+---------------+-----------+
| Field         | Value     |
+---------------+-----------+
| CPU Hours     | 723.97    |
| Disk GB-Hours | 7239.72   |
| RAM MB-Hours  | 370673.72 |
| Servers       | 3         |
+---------------+-----------+

Additional info:
maybe related to https://bugs.launchpad.net/nova/+bug/1837995

Comment 7 melanie witt 2021-01-21 20:36:57 UTC
The root cause of this problem is a partially archived database, where a deleted=ID 'instances' record remains in the database after one or more of its dependent records (example: 'instance_extra') have already been moved to the shadow table(s). So any API that can access deleted (but not yet purged) instances will raise an OrphanedObjectError.

This isn't surprising, considering how the archive operation works.

Today, we reverse-sort the tables so that "leaves" come first, so that our deletes of records [from the normal tables] are not blocked by foreign key constraint violations. This means that all of the records that depend on an instance are archived before the instances table record itself. And if archiving happens to stop in the middle of a particular instance's records, you hit this bug.

The 'nova-manage db archive_deleted_rows' command is intended to run to completion, either by running it repeatedly until it returns 0 (meaning no more work to do) or by running it with the --until-complete option. The --until-complete option can be problematic because there is no way to pause between batches of rows: it runs in a tight loop and can cause trouble in a busy environment where many other requests are trying to write to the database. I've opened an upstream bug https://bugs.launchpad.net/nova/+bug/1912579 and proposed a patch to add a --sleep option https://review.opendev.org/c/openstack/nova/+/771749 to address the problem with --until-complete.

In order to [mostly] prevent this from happening at all, we would have to change the archive logic to keep all of an instance's records "together" and move them to the shadow tables in a single transaction. AFAIK, we don't have support for cascading deletes in all or most of the nova tables, so the logic would likely have to be implemented manually. The potential issues with that are (1) complexity, (2) maintenance if/when more tables related to an instance are added, and (3) some records related to an instance reside in a separate database (example: nova_api) and thus cannot be moved in a single transaction alongside nova/nova_cell0/nova_cell1 database records, so they could still end up separated if the process is interrupted for some reason.
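As a concrete illustration of the single-transaction idea, here is a toy sqlite3 sketch, not nova's actual archive code (real nova uses MySQL via SQLAlchemy and has many more per-instance tables): all of one instance's rows move to the shadow tables atomically, so an interruption mid-archive cannot leave instances and instance_extra out of step.

```shell
# Move one instance's rows from the live tables to the shadow tables
# inside a single transaction: either everything moves or nothing does.
db=$(mktemp)
sqlite3 "$db" <<'SQL'
CREATE TABLE instances (uuid TEXT, deleted INTEGER);
CREATE TABLE instance_extra (instance_uuid TEXT);
CREATE TABLE shadow_instances (uuid TEXT, deleted INTEGER);
CREATE TABLE shadow_instance_extra (instance_uuid TEXT);
INSERT INTO instances VALUES ('uuid-1', 1);
INSERT INTO instance_extra VALUES ('uuid-1');
BEGIN;
-- Archive the dependent record and the instance record together.
INSERT INTO shadow_instance_extra
    SELECT instance_uuid FROM instance_extra WHERE instance_uuid = 'uuid-1';
INSERT INTO shadow_instances
    SELECT uuid, deleted FROM instances WHERE uuid = 'uuid-1';
DELETE FROM instance_extra WHERE instance_uuid = 'uuid-1';
DELETE FROM instances WHERE uuid = 'uuid-1';
COMMIT;
SQL
left=$(sqlite3 "$db" "SELECT count(*) FROM instances;")
moved=$(sqlite3 "$db" "SELECT count(*) FROM shadow_instances;")
echo "live: $left shadow: $moved"
rm -f "$db"
```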

We need to discuss as a team whether we should implement something like the above ^ or whether providing a --sleep option and augmenting documentation to emphasize the fact that archive_deleted_rows needs to be run to completion when it is run is enough.

Comment 8 melanie witt 2021-01-25 16:25:27 UTC
On the triage call last Friday, we agreed to let the --sleep option and the command documentation serve as mitigations for this issue in the nearer term, with the docs stating [1]:

"Return Codes

Return code Description

0           Nothing was archived.

1           Some number of rows were archived.

2           Invalid value for --max_rows.

3           No connection to the API database could be established using api_database.connection.

4           Invalid value for --before.

255         An unexpected error occurred.

If automating, this should be run continuously while the result is 1, stopping at 0, or use the --until-complete option."

Whenever the database archive command is run, it is important to run it to completion, until there is nothing left to archive.
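The automation the documentation describes can be sketched as a shell loop on the return code. Here archive_batch is a stand-in for `nova-manage db archive_deleted_rows --max_rows N` (the real command needs a live deployment), and the pause between batches is what the proposed --sleep option automates:

```shell
# Keep invoking the archive command while it returns 1 (some rows
# archived) and stop at 0 (nothing left to archive).
remaining=3
archive_batch() {
    # Stand-in: pretend three batches of deleted rows remain.
    if [ "$remaining" -gt 0 ]; then
        remaining=$((remaining - 1))
        return 1    # return code 1: some rows were archived
    fi
    return 0        # return code 0: nothing was archived
}

runs=0
until archive_batch; do
    runs=$((runs + 1))
    sleep 1    # pause between batches so the database isn't hammered
done
echo "archive runs before completion: $runs"
```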

Future work can involve investigation into following FK relationships during database archiving to help prevent separation of instance records.

[1] https://docs.openstack.org/nova/train/cli/nova-manage.html#nova-database

Comment 24 errata-xmlrpc 2022-09-21 12:13:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 17.0 (Wallaby)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:6543