Bug 1873238

Summary:	panko-expirer jobs with nested SQL SELECT statements never complete
Product:	Red Hat OpenStack	Reporter:	Freddy Wissing <fwissing>
Component:	puppet-panko	Assignee:	Pradeep Kilambi <pkilambi>
Status:	CLOSED ERRATA	QA Contact:	Leonid Natapov <lnatapov>
Severity:	medium	Docs Contact:
Priority:	high
Version:	13.0 (Queens)	CC:	ahyder, csibbitt, lmadsen, mmagr, pkilambi, slinaber, spower, ssmolyak
Target Milestone:	z15	Keywords:	Triaged, ZStream
Target Release:	13.0 (Queens)	Flags:	lnatapov: needinfo-
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	puppet-panko-12.4.0-4.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-03-18 13:08:47 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Freddy Wissing 2020-08-27 17:01:12 UTC

Description of problem:

Galera memory size continually increases, and DB deadlocks occur in large environments due to SQL statements that run as jobs (panko-expirer) and never finish.


263 DELETE FROM trait_datetime WHERE trait_datetime.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.079434)
103 DELETE FROM event WHERE event.`generated` < 1598227262.079434
50 DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227261.902322)
50 DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.517131)
633 DELETE FROM trait_int WHERE trait_int.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.079434)
1304 DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id \nFROM event \nWHERE event.`generated` < 1598227262.079434)

Version-Release number of selected component (if applicable):

OSP 13 z10

How reproducible:

Should be reproducible, but this issue is similar to ones already resolved for other OSP components.

Steps to Reproduce:

Requires a large environment running for a very long time to build up giant record sets

Actual results:

The nested SQL statements never fully execute. Deadlocks overcome the database as it attempts to lock a huge number of records. Records are never deleted, and the memory footprint of the Galera database rises each time the panko-expirer job runs and fails. 


Expected results:

The SQL statement deleting the records succeeds, even with a large record set returned.

Additional info:

The use of the nested select statement should be examined when considered against a large record set.

Comment 3 Martin Magr 2020-09-09 14:49:10 UTC

As a workaround you can delete outstanding events if that data is not required any more:

1. Stop panko services on all controllers:
[root@controller-0 ~]# docker stop panko_api panko_api_cron       
[root@controller-1 ~]# docker stop panko_api panko_api_cron       
[root@controller-2 ~]# docker stop panko_api panko_api_cron                                                                                                                                                                            
panko_api                                                                                                                                                                                                                              
panko_api_cron                                                                                                                                                                                                                         
[root@controller-2 ~]#
Broadcast message from systemd-journald@controller-2 (Wed 2020-09-09 14:40:24 UTC):

haproxy[64609]: proxy panko has no server available!

2. Perform manual cleanup:

[root@controller-0 ~]# docker exec -it galera-bundle-docker-0 bash
()[root@controller-0 /]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12943
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use panko
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [panko]> show tables;
+-----------------+
| Tables_in_panko |
+-----------------+
| event           |
| event_type      |
| trait_datetime  |
| trait_float     |
| trait_int       |
| trait_text      |
+-----------------+
6 rows in set (0.00 sec)

MariaDB [panko]> delete from trait_datetime; delete from trait_float; delete from trait_int; delete from trait_text; delete from event;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 233 rows affected (0.02 sec)

Query OK, 101 rows affected (0.01 sec)

MariaDB [panko]>

3. Restart panko services:
[root@controller-0 ~]# docker start panko_api panko_api_cron
[root@controller-1 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker ps | grep panko
fab926ce3a8b        192.168.24.1:8787/rh-osbs/rhosp13-openstack-panko-api:20200730.1                   "dumb-init --singl..."   About an hour ago   Up About a minute (healthy)                       panko_api
4cddc2d33980        192.168.24.1:8787/rh-osbs/rhosp13-openstack-panko-api:20200730.1                   "dumb-init --singl..."   About an hour ago   Up About a minute                                 panko_api_cron

Comment 6 Freddy Wissing 2020-09-10 13:37:47 UTC

Seems the panko expirer appeared in z5, so they had a huge dataset by then.

Comment 7 Martin Magr 2020-09-15 12:19:46 UTC

I managed to reproduce the issue when panko-expirer fails to clean DB which is bigger than HW resources can potentially handle. I believe that patch [1] should be backported to OSP13. We should probably also add TripleO configuration possibility to OSP versions where Panko is still supported to enable reasonable setting of the deletion limit.

[1] https://github.com/openstack/panko/commit/ec84ed8aa7c75d83b50ddac3edb4025462a61084

Comment 8 Martin Magr 2020-09-15 13:21:26 UTC

In case you have issues with cleaning panko database, you can try instead:

1. Stop panko services on all controllers:
[root@controller-0 ~]# docker stop panko_api panko_api_cron       
[root@controller-1 ~]# docker stop panko_api panko_api_cron       
[root@controller-2 ~]# docker stop panko_api panko_api_cron                                                                                                                                                                            
panko_api                                                                                                                                                                                                                              
panko_api_cron                                                                                                                                                                                                                         
[root@controller-2 ~]#
Broadcast message from systemd-journald@controller-2 (Wed 2020-09-09 14:40:24 UTC):

haproxy[64609]: proxy panko has no server available!

2. Drop whole database:
[root@controller-0 ~]# docker exec -it galera-bundle-docker-0 bash
()[root@controller-0 /]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12943
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database panko;                                                                                                 
Query OK, 6 rows affected (0.02 sec)

MariaDB [(none)]> create database panko;
Query OK, 1 row affected (0.01 sec)

MariaDB [(none)]>

3. initialize db:
[root@controller-0 ~]# docker start panko_db_sync
panko_db_sync
[root@controller-0 ~]#

4. start panko containers
[root@controller-0 ~]# docker start panko_api panko_api_cron
[root@controller-1 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker start panko_api panko_api_cron

Comment 9 Martin Magr 2020-09-15 13:33:46 UTC

Backported the patch all the way down to stable/queens on upstream side: https://review.opendev.org/#/q/Icf83ffe089301b3782273923f18efd4d209131c2

Comment 14 spower 2020-11-10 09:45:58 UTC

This was not approved as a blocker for 13z14. It is not a regression

Comment 19 Leonid Natapov 2020-12-08 09:10:27 UTC

FailedQA. puppet-panko is missing

Comment 20 Martin Magr 2020-12-08 10:45:46 UTC

Adding missing puppet-panko build.

Comment 28 Leonid Natapov 2021-03-07 16:59:51 UTC

puppet-panko included.

Comment 32 errata-xmlrpc 2021-03-18 13:08:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932