Bug 1873238 - panko-expirer jobs with nested SQL SELECT statements never complete
Summary: panko-expirer jobs with nested SQL SELECT statements never complete
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-panko
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
high
medium
Target Milestone: z15
: 13.0 (Queens)
Assignee: Pradeep Kilambi
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-08-27 17:01 UTC by Freddy Wissing
Modified: 2023-12-15 19:03 UTC (History)
8 users (show)

Fixed In Version: puppet-panko-12.4.0-4.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-18 13:08:47 UTC
Target Upstream Version:
Embargoed:
lnatapov: needinfo-


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 753154 0 None MERGED Adds new configuration for panko-expirer 2021-02-09 16:40:47 UTC
OpenStack gerrit 753550 0 None MERGED Allow configuration of limit of batch size for exirer 2021-02-09 16:40:47 UTC
Red Hat Issue Tracker OSP-30864 0 None None None 2023-12-15 19:03:56 UTC
Red Hat Product Errata RHBA-2021:0932 0 None None None 2021-03-18 13:10:06 UTC

Description Freddy Wissing 2020-08-27 17:01:12 UTC
Description of problem:

Galera memory size continually increases, and DB deadlocks occur in large environments due to SQL statements that run as jobs (panko-expirer) and never finish.


263 DELETE FROM trait_datetime WHERE trait_datetime.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.079434)
103 DELETE FROM event WHERE event.`generated` < 1598227262.079434
50 DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227261.902322)
50 DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.517131)
633 DELETE FROM trait_int WHERE trait_int.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.079434)
1304 DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.079434)

Version-Release number of selected component (if applicable):

OSP 13 z10

How reproducible:

Should be reproducible, but this issue is similar to ones already resolved for other OSP components.

Steps to Reproduce:

Requires a large environment running for a very long time to build up giant record sets

Actual results:

The nested SQL statements never fully execute. Deadlocks overcome the database as it attempts to lock a huge number of records. Records are never deleted, and the memory footprint of the Galera database rises each time the panko-expirer job runs and fails. 


Expected results:

The SQL statement deleting the records succeeds, even with a large record set returned.

Additional info:

The use of the nested select statement should be examined when considered against a large record set.

Comment 3 Martin Magr 2020-09-09 14:49:10 UTC
As a workaround you can delete outstanding events if that data is not required any more:

1. Stop panko services on all controllers:
[root@controller-0 ~]# docker stop panko_api panko_api_cron       
[root@controller-1 ~]# docker stop panko_api panko_api_cron       
[root@controller-2 ~]# docker stop panko_api panko_api_cron                                                                                                                                                                            
panko_api                                                                                                                                                                                                                              
panko_api_cron                                                                                                                                                                                                                         
[root@controller-2 ~]#
Broadcast message from systemd-journald@controller-2 (Wed 2020-09-09 14:40:24 UTC):

haproxy[64609]: proxy panko has no server available!

2. Perform manual cleanup:

[root@controller-0 ~]# docker exec -it galera-bundle-docker-0 bash
()[root@controller-0 /]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12943
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use panko
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [panko]> show tables;
+-----------------+
| Tables_in_panko |
+-----------------+
| event           |
| event_type      |
| trait_datetime  |
| trait_float     |
| trait_int       |
| trait_text      |
+-----------------+
6 rows in set (0.00 sec)

MariaDB [panko]> delete from trait_datetime; delete from trait_float; delete from trait_int; delete from trait_text; delete from event;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 233 rows affected (0.02 sec)

Query OK, 101 rows affected (0.01 sec)

MariaDB [panko]>

3. Restart panko services:
[root@controller-0 ~]# docker start panko_api panko_api_cron
[root@controller-1 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker ps | grep panko
fab926ce3a8b        192.168.24.1:8787/rh-osbs/rhosp13-openstack-panko-api:20200730.1                   "dumb-init --singl..."   About an hour ago   Up About a minute (healthy)                       panko_api
4cddc2d33980        192.168.24.1:8787/rh-osbs/rhosp13-openstack-panko-api:20200730.1                   "dumb-init --singl..."   About an hour ago   Up About a minute                                 panko_api_cron

Comment 6 Freddy Wissing 2020-09-10 13:37:47 UTC
Seems the panko expirer appeared in z5, so they had a huge dataset by then.

Comment 7 Martin Magr 2020-09-15 12:19:46 UTC
I managed to reproduce the issue when panko-expirer fails to clean DB which is bigger than HW resources can potentially handle. I believe that patch [1] should be backported to OSP13. We should probably also add TripleO configuration possibility to OSP versions where Panko is still supported to enable reasonable setting of the deletion limit.

[1] https://github.com/openstack/panko/commit/ec84ed8aa7c75d83b50ddac3edb4025462a61084

Comment 8 Martin Magr 2020-09-15 13:21:26 UTC
In case you have issues with cleaning panko database, you can try instead:

1. Stop panko services on all controllers:
[root@controller-0 ~]# docker stop panko_api panko_api_cron       
[root@controller-1 ~]# docker stop panko_api panko_api_cron       
[root@controller-2 ~]# docker stop panko_api panko_api_cron                                                                                                                                                                            
panko_api                                                                                                                                                                                                                              
panko_api_cron                                                                                                                                                                                                                         
[root@controller-2 ~]#
Broadcast message from systemd-journald@controller-2 (Wed 2020-09-09 14:40:24 UTC):

haproxy[64609]: proxy panko has no server available!

2. Drop whole database:
[root@controller-0 ~]# docker exec -it galera-bundle-docker-0 bash
()[root@controller-0 /]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12943
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database panko;                                                                                                 
Query OK, 6 rows affected (0.02 sec)

MariaDB [(none)]> create database panko;
Query OK, 1 row affected (0.01 sec)

MariaDB [(none)]>

3. initialize db:
[root@controller-0 ~]# docker start panko_db_sync
panko_db_sync
[root@controller-0 ~]#

4. start panko containers
[root@controller-0 ~]# docker start panko_api panko_api_cron
[root@controller-1 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker start panko_api panko_api_cron

Comment 9 Martin Magr 2020-09-15 13:33:46 UTC
Backported the patch all the way down to stable/queens on upstream side: https://review.opendev.org/#/q/Icf83ffe089301b3782273923f18efd4d209131c2

Comment 14 spower 2020-11-10 09:45:58 UTC
This was not approved as a blocker for 13z14. It is not a regression

Comment 19 Leonid Natapov 2020-12-08 09:10:27 UTC
FailedQA. puppet-panko is missing

Comment 20 Martin Magr 2020-12-08 10:45:46 UTC
Adding missing puppet-panko build.

Comment 28 Leonid Natapov 2021-03-07 16:59:51 UTC
puppet-panko included.

Comment 32 errata-xmlrpc 2021-03-18 13:08:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932


Note You need to log in before you can comment on or make changes to this bug.