Bug 1778814

Summary: [OSP16] deleted instances in multi cellv2 environment won't get archived/purged
Product: Red Hat OpenStack                      Reporter: Martin Schuppert <mschuppe>
Component: openstack-tripleo-heat-templates     Assignee: Martin Schuppert <mschuppe>
Status: CLOSED ERRATA                           QA Contact: Paras Babbar <pbabbar>
Severity: high
Priority: high
Version: 16.0 (Train)                           CC: mburns, mwitt, slinaber
Target Milestone: z1                            Keywords: Patch, Triaged
Target Release: 16.0 (Train on RHEL 8.1)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-nova-15.4.1-0.20191210120219.7d002fd.el8ost openstack-tripleo-heat-templates-11.3.1-0.20191214200147.d19b42d.el8ost
Doc Type: No Doc Update
Clones: 1784091 (view as bug list)
Last Closed: 2020-03-03 09:45:05 UTC
Type: Bug
Bug Blocks: 1784091, 1784092

Description Martin Schuppert 2019-12-02 14:49:35 UTC
Description of problem:

When checking on the cell controller of, e.g., cell1, the deleted cell instances don't get archived:
MariaDB [nova]> select count(*) from instances;
+----------+
| count(*) |
+----------+
|    86255 |
+----------+
1 row in set (0.021 sec)

Running instances:
MariaDB [nova]> select count(*) from instances where deleted_at is null;
+----------+
| count(*) |
+----------+
|        3 |
+----------+
1 row in set (0.023 sec)

Deleted instances:
MariaDB [nova]> select count(*) from instances where deleted_at is not null;
+----------+
| count(*) |
+----------+
|    86266 |
+----------+
1 row in set (0.024 sec)

The counts differ slightly because a rally job is still creating and deleting instances while the queries run.

Version-Release number of selected component (if applicable):
OSP16

How reproducible:
always

Steps to Reproduce:
1. create instances in an additional cell
2. delete instances from the cell
3. check if instances get archived

Actual results:

Instances deleted in the additional cell remain in that cell's database; the cron jobs never archive or purge them.

Expected results:

Deleted instances in all cells (cell0 and any additional cells) are archived and purged by the scheduled cron jobs.

Additional info:

Comment 1 Martin Schuppert 2019-12-02 15:02:24 UTC
We should change the current archive_deleted_rows cron command to something like:

()[root@controller-0 /]$ nova-manage db archive_deleted_rows --before `date --date='today - 2 days' +\%F` --until-complete --all-cells --verbose
Archiving.....................................................................................................................................................................................................................complete
+--------------------------------+-------------------------+
| Table                          | Number of Rows Archived |
+--------------------------------+-------------------------+
| API_DB.instance_group_member   | 0                       |
| API_DB.instance_mappings       | 11270                   |
| API_DB.request_specs           | 11270                   |
| cell1.block_device_mapping     | 7792                    |
| cell1.instance_actions         | 7792                    |
| cell1.instance_extra           | 3896                    |
| cell1.instance_id_mappings     | 3896                    |
| cell1.instance_info_caches     | 3896                    |
| cell1.instance_system_metadata | 35064                   |
| cell1.instances                | 3896                    |
| cell2.block_device_mapping     | 1684                    |
| cell2.instance_actions         | 1684                    |
| cell2.instance_actions_events  | 1684                    |
| cell2.instance_extra           | 842                     |
| cell2.instance_id_mappings     | 842                     |
| cell2.instance_info_caches     | 842                     |
| cell2.instance_system_metadata | 7578                    |
| cell2.instances                | 842                     |
| cell4.block_device_mapping     | 5734                    |
| cell4.instance_actions         | 5734                    |
| cell4.instance_actions_events  | 5734                    |
| cell4.instance_extra           | 2867                    |
| cell4.instance_id_mappings     | 2867                    |
| cell4.instance_info_caches     | 2867                    |
| cell4.instance_system_metadata | 25803                   |
| cell4.instances                | 2867                    |
| cell5.block_device_mapping     | 3194                    |
| cell5.instance_actions         | 3194                    |
| cell5.instance_actions_events  | 3194                    |
| cell5.instance_extra           | 1597                    |
| cell5.instance_id_mappings     | 1597                    |
| cell5.instance_info_caches     | 1597                    |
| cell5.instance_system_metadata | 14373                   |
| cell5.instances                | 1597                    |
| cell6.block_device_mapping     | 3594                    |
| cell6.instance_actions         | 3594                    |
| cell6.instance_actions_events  | 3594                    |
| cell6.instance_extra           | 1797                    |
| cell6.instance_id_mappings     | 1797                    |
| cell6.instance_info_caches     | 1797                    |
| cell6.instance_system_metadata | 16173                   |
| cell6.instances                | 1797                    |
| cell7.block_device_mapping     | 542                     |
| cell7.instance_actions         | 542                     |
| cell7.instance_actions_events  | 542                     |
| cell7.instance_extra           | 271                     |
| cell7.instance_id_mappings     | 271                     |
| cell7.instance_info_caches     | 271                     |
| cell7.instance_system_metadata | 2439                    |
| cell7.instances                | 271                     |
+--------------------------------+-------------------------+

Comment 2 Martin Schuppert 2019-12-02 15:11:58 UTC
current cron jobs:

[root@controller-0 cron]# cat  /var/lib/config-data/puppet-generated/nova/var/spool/cron/nova 
# HEADER: This file was autogenerated at 2019-11-19 11:16:27 +0000 by puppet.
# HEADER: While it can still be managed manually, it is definitely not recommended.
# HEADER: Note particularly that the comments starting with 'Puppet Name' should
# HEADER: not be deleted, as doing so could cause duplicate cron jobs.
# Puppet Name: nova-manage db archive_deleted_rows
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 0 * * * sleep `expr ${RANDOM} \% 3600`; nova-manage db archive_deleted_rows  --max_rows 100 --until-complete >>/var/log/nova/nova-rowsflush.log 2>&1
# Puppet Name: nova-manage db purge
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
0 5 * * * sleep `expr ${RANDOM} \% 3600`; nova-manage db purge --before `date --date='today - 14 days' +\%D`                       >>/var/log/nova/nova-rowspurge.log 2>&1
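Based on the command from comment 1, a sketch of what the updated cron entries could look like once --before and --all-cells are wired in. This is not the shipped fix — the exact form is whatever puppet-nova ends up generating — but it keeps the random sleep, batching, and log paths from the current entries:

```shell
# Puppet Name: nova-manage db archive_deleted_rows  (sketch, assuming comment 1's flags)
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 0 * * * sleep `expr ${RANDOM} \% 3600`; nova-manage db archive_deleted_rows --before `date --date='today - 2 days' +\%F` --max_rows 100 --until-complete --all-cells >>/var/log/nova/nova-rowsflush.log 2>&1
# Puppet Name: nova-manage db purge  (sketch; --all-cells purges every cell database)
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
0 5 * * * sleep `expr ${RANDOM} \% 3600`; nova-manage db purge --before `date --date='today - 14 days' +\%F` --all-cells >>/var/log/nova/nova-rowspurge.log 2>&1
```

Note the % signs stay escaped as \% because % is special in crontab entries.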

Comment 3 melanie witt 2019-12-04 02:13:56 UTC
I agree that we need to add the --before and --all-cells options to the 'nova-manage db archive_deleted_rows' and 'nova-manage db purge' commands in our cron jobs. We need --before to prevent the orphaning of libvirt guests if/when nova-compute is down when a db archive cron job fires, and we need --all-cells to (1) ensure the cell0 database is archived in a single-cell deployment and (2) ensure additional cell databases are archived in a multi-cell deployment.
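The --before cutoff is computed with date. A quick sanity check of the two format strings appearing in this bug (%F in comment 1's archive command vs %D in the current purge cron entry), runnable anywhere GNU date is available:

```shell
#!/bin/sh
# %F prints ISO 8601 (YYYY-MM-DD), the unambiguous form used in comment 1
before_iso=$(date --date='today - 2 days' +%F)
# %D prints MM/DD/YY, the form the current purge cron entry uses
before_us=$(date --date='today - 14 days' +%D)
echo "$before_iso"
echo "$before_us"
```

Both forms are accepted by nova-manage, but %F avoids any day/month ambiguity in the logs.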

The following rhbz's (also cloned for OSP15, OSP14, and OSP13) could be added as dependencies for this rhbz:

--before: https://bugzilla.redhat.com/show_bug.cgi?id=1749382
--all-cells: https://bugzilla.redhat.com/show_bug.cgi?id=1703091

Also, I had just opened this rhbz to add the --all-cells option to our cron jobs yesterday:

https://bugzilla.redhat.com/show_bug.cgi?id=1778905

and I think you could close it ^ as a duplicate of this rhbz. Please also let Paras know, as he should be the QE contact here.

Comment 4 Martin Schuppert 2019-12-04 09:20:26 UTC
*** Bug 1778905 has been marked as a duplicate of this bug. ***

Comment 12 Alex McLeod 2020-02-19 12:39:27 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 15 errata-xmlrpc 2020-03-03 09:45:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0655