Description of problem:
Galera memory usage grows continually, and database deadlocks occur in large environments because the SQL statements issued by the panko-expirer job never finish. Statements observed, with occurrence counts:

 263  DELETE FROM trait_datetime WHERE trait_datetime.event_id IN (SELECT event.id FROM event WHERE event.`generated` < 1598227262.079434)
 103  DELETE FROM event WHERE event.`generated` < 1598227262.079434
  50  DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id FROM event WHERE event.`generated` < 1598227261.902322)
  50  DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id FROM event WHERE event.`generated` < 1598227262.517131)
 633  DELETE FROM trait_int WHERE trait_int.event_id IN (SELECT event.id FROM event WHERE event.`generated` < 1598227262.079434)
1304  DELETE FROM trait_text WHERE trait_text.event_id IN (SELECT event.id FROM event WHERE event.`generated` < 1598227262.079434)

Version-Release number of selected component (if applicable):
OSP 13 z10

How reproducible:
Should be reproducible; this issue is similar to ones already resolved for other OSP components.

Steps to Reproduce:
Requires a large environment that has been running for a very long time, so that giant record sets build up.

Actual results:
The nested SQL statements never complete. Deadlocks overwhelm the database as it attempts to lock a huge number of records. Records are never deleted, and the memory footprint of the Galera database rises each time the panko-expirer job runs and fails.

Expected results:
The SQL statements deleting the records succeed, even with a large record set returned.

Additional info:
The use of the nested SELECT statement should be re-examined against large record sets (a batched approach is sketched below).
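For illustration, a batched approach bounds the number of rows each statement has to lock; this is only a sketch, not the shipped fix, and the LIMIT value and cutoff timestamp are placeholders taken from the statements above. MariaDB rejects LIMIT directly inside an IN subquery, so the trait deletes join against a derived table instead. Each statement is repeated until it affects 0 rows:

-- Delete trait rows for at most 10000 expired events per pass
-- (same pattern for trait_float, trait_int, trait_text):
DELETE td FROM trait_datetime td
  JOIN (SELECT id FROM event
        WHERE event.`generated` < 1598227262.079434
        LIMIT 10000) e
    ON td.event_id = e.id;
-- Once the trait tables no longer reference them, delete the events
-- themselves in the same bounded fashion:
DELETE FROM event
  WHERE event.`generated` < 1598227262.079434
  LIMIT 10000;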
As a workaround, you can delete the outstanding events if that data is not required any more:

1. Stop panko services on all controllers:

[root@controller-0 ~]# docker stop panko_api panko_api_cron
[root@controller-1 ~]# docker stop panko_api panko_api_cron
[root@controller-2 ~]# docker stop panko_api panko_api_cron
panko_api
panko_api_cron
[root@controller-2 ~]#
Broadcast message from systemd-journald@controller-2 (Wed 2020-09-09 14:40:24 UTC):
haproxy[64609]: proxy panko has no server available!

2. Perform manual cleanup:

[root@controller-0 ~]# docker exec -it galera-bundle-docker-0 bash
()[root@controller-0 /]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12943
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use panko
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [panko]> show tables;
+-----------------+
| Tables_in_panko |
+-----------------+
| event           |
| event_type      |
| trait_datetime  |
| trait_float     |
| trait_int       |
| trait_text      |
+-----------------+
6 rows in set (0.00 sec)

MariaDB [panko]> delete from trait_datetime; delete from trait_float; delete from trait_int; delete from trait_text; delete from event;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 233 rows affected (0.02 sec)
Query OK, 101 rows affected (0.01 sec)

MariaDB [panko]>

3. Restart panko services:

[root@controller-0 ~]# docker start panko_api panko_api_cron
[root@controller-1 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker ps | grep panko
fab926ce3a8b  192.168.24.1:8787/rh-osbs/rhosp13-openstack-panko-api:20200730.1  "dumb-init --singl..."  About an hour ago  Up About a minute (healthy)  panko_api
4cddc2d33980  192.168.24.1:8787/rh-osbs/rhosp13-openstack-panko-api:20200730.1  "dumb-init --singl..."  About an hour ago  Up About a minute             panko_api_cron
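Note that on the very large datasets this bug describes, even these unqualified DELETE statements can run for a long time and hold many row locks. Since the workaround discards all event data anyway, TRUNCATE may be a faster alternative. This is an untested sketch: TRUNCATE is DDL, and the foreign keys from the trait_* tables to event must be disabled for the session or truncating event will fail:

MariaDB [panko]> SET FOREIGN_KEY_CHECKS = 0;
MariaDB [panko]> TRUNCATE TABLE trait_datetime;
MariaDB [panko]> TRUNCATE TABLE trait_float;
MariaDB [panko]> TRUNCATE TABLE trait_int;
MariaDB [panko]> TRUNCATE TABLE trait_text;
MariaDB [panko]> TRUNCATE TABLE event;
MariaDB [panko]> SET FOREIGN_KEY_CHECKS = 1;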
It seems the panko expirer first appeared in z5, so a huge dataset had already built up by then.
I managed to reproduce the issue: panko-expirer fails to clean a database that has grown larger than the hardware resources can handle. I believe that patch [1] should be backported to OSP 13. We should probably also add a TripleO configuration option, for the OSP versions where Panko is still supported, so that the deletion limit can be set to a reasonable value.

[1] https://github.com/openstack/panko/commit/ec84ed8aa7c75d83b50ddac3edb4025462a61084
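With that patch in place, the expirer deletes events in configurable batches instead of one unbounded statement. As a sketch of what the tuning could look like in panko.conf (the option name events_delete_batch_size and its section are my reading of the upstream commit, so verify them against the shipped package; 10000 is an arbitrary example value):

[database]
# Assumed option name from the upstream patch; check your release.
# Maximum number of events panko-expirer deletes per pass.
events_delete_batch_size = 10000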
In case you have issues with cleaning the panko database, you can try this instead:

1. Stop panko services on all controllers:

[root@controller-0 ~]# docker stop panko_api panko_api_cron
[root@controller-1 ~]# docker stop panko_api panko_api_cron
[root@controller-2 ~]# docker stop panko_api panko_api_cron
panko_api
panko_api_cron
[root@controller-2 ~]#
Broadcast message from systemd-journald@controller-2 (Wed 2020-09-09 14:40:24 UTC):
haproxy[64609]: proxy panko has no server available!

2. Drop the whole database:

[root@controller-0 ~]# docker exec -it galera-bundle-docker-0 bash
()[root@controller-0 /]# mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12943
Server version: 10.1.20-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database panko;
Query OK, 6 rows affected (0.02 sec)

MariaDB [(none)]> create database panko;
Query OK, 1 row affected (0.01 sec)

MariaDB [(none)]>

3. Initialize the database:

[root@controller-0 ~]# docker start panko_db_sync
panko_db_sync
[root@controller-0 ~]#

4. Start the panko containers:

[root@controller-0 ~]# docker start panko_api panko_api_cron
[root@controller-1 ~]# docker start panko_api panko_api_cron
[root@controller-2 ~]# docker start panko_api panko_api_cron
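For reference, the drop and recreate in step 2 can also be done non-interactively from the host (a sketch, assuming the same galera-bundle-docker-0 container name as above):

[root@controller-0 ~]# docker exec galera-bundle-docker-0 mysql -e "DROP DATABASE panko; CREATE DATABASE panko;"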
Backported the patch all the way down to stable/queens on the upstream side:
https://review.opendev.org/#/q/Icf83ffe089301b3782273923f18efd4d209131c2
This was not approved as a blocker for 13z14. It is not a regression.
FailedQA. puppet-panko is missing.
Adding missing puppet-panko build.
puppet-panko included.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0932