Bug 1778814 - [OSP16] deleted instances in multi cellv2 environment won't get archived/purged
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 16.0 (Train on RHEL 8.1)
Assignee: Martin Schuppert
QA Contact: Paras Babbar
URL:
Whiteboard:
Duplicates: 1778905
Depends On:
Blocks: 1784091 1784092
Reported: 2019-12-02 14:49 UTC by Martin Schuppert
Modified: 2020-03-03 09:45 UTC (History)

Fixed In Version: puppet-nova-15.4.1-0.20191210120219.7d002fd.el8ost openstack-tripleo-heat-templates-11.3.1-0.20191214200147.d19b42d.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1784091
Environment:
Last Closed: 2020-03-03 09:45:05 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 697299 0 'None' MERGED Adds --before archive parameter to cron job 2020-09-25 03:35:47 UTC
OpenStack gerrit 697621 0 'None' MERGED Adds --all-cells to the archive command 2020-09-25 03:35:47 UTC
OpenStack gerrit 698411 0 None MERGED New Parameter NovaCronArchiveDeleteAllCells and NovaCronArchiveDeleteRowsAge 2020-09-25 03:35:47 UTC
Red Hat Product Errata RHBA-2020:0655 0 None None None 2020-03-03 09:45:47 UTC

Description Martin Schuppert 2019-12-02 14:49:35 UTC
Description of problem:

When checking on the cell controller of a cell (cell1 in this example), the deleted instances of the cell do not get archived:
MariaDB [nova]> select count(*) from instances;
+----------+
| count(*) |
+----------+
|    86255 |
+----------+
1 row in set (0.021 sec)

Running instances:
MariaDB [nova]> select count(*) from instances where deleted_at is null;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.023 sec)

Deleted instances:
MariaDB [nova]> select count(*) from instances where deleted_at is not null;
+----------+
| count(*) |
+----------+
|    86266 |
+----------+
1 row in set (0.024 sec)

The numbers do not add up exactly because a rally job is still running.
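
The three queries above rely on soft deletion: a deleted instance keeps its row and only gets deleted_at set, so the table keeps growing until the rows are archived. A minimal sketch of that check, assuming a toy in-memory SQLite table with only the deleted_at column (the real nova instances table has many more columns, and the row counts here are hypothetical):

```python
import sqlite3

# Toy stand-in for nova's `instances` table: only the column the queries
# above filter on. This schema is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, deleted_at TEXT)")

# Hypothetical data: 3 running instances and 5 soft-deleted ones.
rows = [(None,)] * 3 + [("2019-12-01 10:00:00",)] * 5
conn.executemany("INSERT INTO instances (deleted_at) VALUES (?)", rows)


def count(where=""):
    """Return COUNT(*) from instances, optionally filtered by a WHERE clause."""
    sql = "SELECT COUNT(*) FROM instances " + where
    return conn.execute(sql).fetchone()[0]


total = count()
running = count("WHERE deleted_at IS NULL")
soft_deleted = count("WHERE deleted_at IS NOT NULL")

# Soft-deleted rows stay in the table until an archive run moves them out.
print(total, running, soft_deleted)
```

On a quiescent database the running and deleted counts sum to the total; with a job still creating and deleting instances between queries they can drift apart, as in the report.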

Version-Release number of selected component (if applicable):
OSP16

How reproducible:
always

Steps to Reproduce:
1. create instances in an additional cell
2. delete instances from the cell
3. check if instances get archived

Actual results:


Expected results:


Additional info:

Comment 1 Martin Schuppert 2019-12-02 15:02:24 UTC
We should change the current archive_deleted_rows cron command to something like:

()[root@controller-0 /]$ nova-manage db archive_deleted_rows --before `date --date='today - 2 days' +\%F` --until-complete --all-cells --verbose
Archiving.....................................................................................................................................................................................................................complete
+--------------------------------+-------------------------+
| Table                          | Number of Rows Archived |
+--------------------------------+-------------------------+
| API_DB.instance_group_member   | 0                       |
| API_DB.instance_mappings       | 11270                   |
| API_DB.request_specs           | 11270                   |
| cell1.block_device_mapping     | 7792                    |
| cell1.instance_actions         | 7792                    |
| cell1.instance_extra           | 3896                    |
| cell1.instance_id_mappings     | 3896                    |
| cell1.instance_info_caches     | 3896                    |
| cell1.instance_system_metadata | 35064                   |
| cell1.instances                | 3896                    |
| cell2.block_device_mapping     | 1684                    |
| cell2.instance_actions         | 1684                    |
| cell2.instance_actions_events  | 1684                    |
| cell2.instance_extra           | 842                     |
| cell2.instance_id_mappings     | 842                     |
| cell2.instance_info_caches     | 842                     |
| cell2.instance_system_metadata | 7578                    |
| cell2.instances                | 842                     |
| cell4.block_device_mapping     | 5734                    |
| cell4.instance_actions         | 5734                    |
| cell4.instance_actions_events  | 5734                    |
| cell4.instance_extra           | 2867                    |
| cell4.instance_id_mappings     | 2867                    |
| cell4.instance_info_caches     | 2867                    |
| cell4.instance_system_metadata | 25803                   |
| cell4.instances                | 2867                    |
| cell5.block_device_mapping     | 3194                    |
| cell5.instance_actions         | 3194                    |
| cell5.instance_actions_events  | 3194                    |
| cell5.instance_extra           | 1597                    |
| cell5.instance_id_mappings     | 1597                    |
| cell5.instance_info_caches     | 1597                    |
| cell5.instance_system_metadata | 14373                   |
| cell5.instances                | 1597                    |
| cell6.block_device_mapping     | 3594                    |
| cell6.instance_actions         | 3594                    |
| cell6.instance_actions_events  | 3594                    |
| cell6.instance_extra           | 1797                    |
| cell6.instance_id_mappings     | 1797                    |
| cell6.instance_info_caches     | 1797                    |
| cell6.instance_system_metadata | 16173                   |
| cell6.instances                | 1797                    |
| cell7.block_device_mapping     | 542                     |
| cell7.instance_actions         | 542                     |
| cell7.instance_actions_events  | 542                     |
| cell7.instance_extra           | 271                     |
| cell7.instance_id_mappings     | 271                     |
| cell7.instance_info_caches     | 271                     |
| cell7.instance_system_metadata | 2439                    |
| cell7.instances                | 271                     |
+--------------------------------+-------------------------+
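
The cutoff passed to --before in the command above is simply "today minus N days" in YYYY-MM-DD form. A minimal sketch of how such a command line could be assembled, assuming a hypothetical helper named archive_command (it only builds the argument list and does not run nova-manage; the flags are the ones used in this comment):

```python
from datetime import date, timedelta


def archive_command(age_days, today=None, all_cells=True):
    """Build the archive command as an argument list (hypothetical helper).

    age_days mirrors the "today - 2 days" expression used above;
    `today` can be injected so the result is reproducible.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=age_days)
    cmd = [
        "nova-manage", "db", "archive_deleted_rows",
        "--before", cutoff.isoformat(),  # same YYYY-MM-DD form as `date +%F`
        "--until-complete",
    ]
    if all_cells:
        cmd.append("--all-cells")
    cmd.append("--verbose")
    return cmd


print(" ".join(archive_command(2, today=date(2019, 12, 2))))
```

Keeping the age as a parameter matches the direction of the linked gerrit change that introduces NovaCronArchiveDeleteRowsAge, although how that parameter is wired into the cron job is not shown here.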

Comment 2 Martin Schuppert 2019-12-02 15:11:58 UTC
Current cron jobs:

[root@controller-0 cron]# cat  /var/lib/config-data/puppet-generated/nova/var/spool/cron/nova 
# HEADER: This file was autogenerated at 2019-11-19 11:16:27 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: nova-manage db archive_deleted_rows
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 0 * * * sleep `expr ${RANDOM} \% 3600`; nova-manage db archive_deleted_rows  --max_rows 100 --until-complete >>/var/log/nova/nova-rowsflush.log 2>&1
# Puppet Name: nova-manage db purge
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
0 5 * * * sleep `expr ${RANDOM} \% 3600`; nova-manage db purge --before `date --date='today - 14 days' +\%D`                       >>/var/log/nova/nova-rowspurge.log 2>&1
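
Note the \% in the crontab above: cron treats an unescaped % in the command field as a newline, so date format strings such as +%F have to be written as +\%F. A minimal sketch of rendering such an entry, assuming a hypothetical helper named cron_entry (the entry it produces is illustrative and is not the line generated by the actual fix):

```python
import re


def cron_entry(minute, hour, command, log_path):
    """Render one daily crontab line (hypothetical helper).

    Any '%' not already preceded by a backslash is escaped, because cron
    would otherwise interpret it as a newline.
    """
    escaped = re.sub(r"(?<!\\)%", r"\\%", command)
    return f"{minute} {hour} * * * {escaped} >>{log_path} 2>&1"


command = (
    "nova-manage db archive_deleted_rows "
    "--before `date --date='today - 2 days' +%F` "
    "--until-complete --all-cells"
)
entry = cron_entry(1, 0, command, "/var/log/nova/nova-rowsflush.log")
print(entry)
```

The negative lookbehind keeps the helper idempotent, so a command that already contains \% is left unchanged.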

Comment 3 melanie witt 2019-12-04 02:13:56 UTC
I agree that we need to add the --before and --all-cells options to the 'nova-manage db archive_deleted_rows' and 'nova-manage db purge' commands in our cron jobs. We need --before to prevent the orphaning of libvirt guests if nova-compute is down when a db archive cron job fires, and we need --all-cells to (1) ensure the cell0 database is archived in a single-cell deployment and (2) ensure additional cell databases are archived in a multi-cell deployment.

The following rhbz's (which are also cloned for OSP15, OSP14, and OSP13) could be added as dependencies for this rhbz:

--before: https://bugzilla.redhat.com/show_bug.cgi?id=1749382
--all-cells: https://bugzilla.redhat.com/show_bug.cgi?id=1703091

Also, just yesterday I opened this rhbz to add the --all-cells option to our cron jobs:

https://bugzilla.redhat.com/show_bug.cgi?id=1778905

I think you could close it as a duplicate of this rhbz and let Paras know, as he should be the QE contact here.

Comment 4 Martin Schuppert 2019-12-04 09:20:26 UTC
*** Bug 1778905 has been marked as a duplicate of this bug. ***

Comment 12 Alex McLeod 2020-02-19 12:39:27 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 15 errata-xmlrpc 2020-03-03 09:45:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0655

