Bug 1404324 - Token flush does not complete when the number of expired tokens becomes large enough
Summary: Token flush does not complete when the number of expired tokens becomes large enough
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-keystone
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: Upstream M3
Target Release: 12.0 (Pike)
Assignee: John Dennis
QA Contact: Prasanth Anbalagan
URL:
Whiteboard: PerfScale
Duplicates: 1438294 1466091 (view as bug list)
Depends On:
Blocks: 1451122 1467120 1469457 1470221 1470226 1470227 1470230 1473713
 
Reported: 2016-12-13 15:25 UTC by Alex Krzos
Modified: 2022-11-24 05:47 UTC (History)
22 users

Fixed In Version: puppet-keystone-11.1.0-0.20170519122658.5fb6cce.el7ost
Doc Type: Enhancement
Doc Text:
The token flush cron job has been modified to run hourly instead of once a day. This was changed because of issues raised in larger deployments, where the operation would take too long and sometimes even fail because the transaction was too large. Note that this only affects deployments using the UUID token provider.
Clone Of:
Clones: 1451122 1469457 1470221 1470226 1470227 1470230 1473713 (view as bug list)
Environment:
Last Closed: 2017-12-13 20:54:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Token flush job Disk IO % util (63.73 KB, image/png)
2016-12-13 15:25 UTC, Alex Krzos
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1649616 0 None None None 2016-12-20 02:47:55 UTC
OpenStack gerrit 454351 0 None MERGED Make flushing tokens more robust 2021-01-28 20:32:10 UTC
OpenStack gerrit 456182 0 None MERGED Run token flush cron job hourly by default 2021-01-28 20:32:10 UTC
OpenStack gerrit 457553 0 None MERGED Change keystone token flush to run hourly 2021-01-28 20:32:11 UTC
OpenStack gerrit 480287 0 None MERGED Fixing flushing tokens workflow 2021-01-28 20:32:11 UTC
Red Hat Issue Tracker OSP-963 0 None None None 2022-11-24 05:47:37 UTC
Red Hat Knowledge Base (Solution) 3122971 0 None None None 2017-10-15 07:07:01 UTC
Red Hat Knowledge Base (Solution) 3167441 0 None None None 2017-09-26 08:44:27 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Alex Krzos 2016-12-13 15:25:18 UTC
Created attachment 1231251 [details]
Token flush job Disk IO % util

Description of problem:
The Keystone token flush job can get into a state where it will never complete because the transaction size exceeds the MySQL Galera transaction size limit, wsrep_max_ws_size (1073741824).


Version-Release number of selected component (if applicable):
Discovered in RHOP 10 (Newton), but perhaps affects older releases too
RHOP, Build: 2016-11-04.2


How reproducible:
You need to generate a large number of UUID tokens to reproduce. Once in this state, the job never recovers.


Steps to Reproduce:
1. Authenticate many times (>1 million)
2. Observe that the keystone token flush job runs for a very long time depending on disk speed (>20 hours in my environment)
3. Observe errors in mysql.log indicating a transaction that is too large


Actual results:
Expired tokens are not actually flushed from the database, and no errors appear in keystone.log. Errors appear only in mysql.log.


Expected results:
Expired tokens to be removed from the database


Additional info:
It is likely that you can demonstrate this with fewer than 1 million tokens: the >1 million row token table is larger than 13 GiB, while the max transaction size is 1 GiB. My token benchmarking Browbeat job simply creates more than needed.

Once the token flush job cannot complete, the token table will never decrease in size and eventually the cloud will run out of disk space.

Furthermore, each flush attempt consumes considerable disk I/O. This was demonstrated on slow disks (a single 7.2K SATA disk). On faster disks you have more capacity to generate tokens, so you can exceed the transaction size even faster.
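As a quick way to confirm a deployment is in this state, the token table size can be compared against the Galera writeset limit. A minimal sketch, assuming the default keystone schema and a mysql client with credentials on the controller:

# Current Galera transaction (writeset) size limit
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';"

# Approximate on-disk size of the token table, in GiB
mysql -e "SELECT table_name, ROUND((data_length + index_length)/1024/1024/1024, 2) AS size_gib FROM information_schema.tables WHERE table_schema = 'keystone' AND table_name = 'token';"

If size_gib approaches or exceeds 1 GiB, a single-transaction DELETE of all expired rows is likely to hit the limit.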

Log evidence:
[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202
[root@overcloud-controller-0 log]# tail mysqld.log 
161208  1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208  1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209  9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209  9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211  1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211  1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212  1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212  1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213  1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213  1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2


Disk utilization issue graph is attached. The entire job in that graph runs from the first spike in disk utilization (~5:18 UTC) and culminates in roughly 90 minutes of pegging the disk (between 1:09 UTC and 2:43 UTC).
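For reference, one possible manual workaround (not the shipped fix) is to delete expired rows in bounded batches so that no single transaction approaches wsrep_max_ws_size. A rough sketch, assuming mysql client access on a controller; the one-day grace interval and the 10000-row batch size are arbitrary examples:

# Repeatedly delete small batches of expired tokens until none remain;
# each DELETE commits on its own, so the transaction size stays bounded.
while true; do
    rows=$(mysql -N -e "DELETE FROM keystone.token WHERE expires < DATE_SUB(UTC_TIMESTAMP(), INTERVAL 1 DAY) LIMIT 10000; SELECT ROW_COUNT();")
    [ "$rows" -eq 0 ] && break
    sleep 1   # let other writers in between batches
done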

Comment 1 Alex Krzos 2016-12-13 15:38:37 UTC
Upstream bug https://bugs.launchpad.net/keystone/+bug/1649616

Comment 3 Vincent S. Cojot 2017-04-06 19:05:46 UTC
Hi,

On Monday, Apr 3rd, there was a problem with MySQL on OSP10 and I realized that we had gone beyond 3 million tokens in keystone.token.
This came to my attention as the openstack commands started to fail with a timeout.

After repairing the MySQL/Galera DB, things went fine for a few days until today (Apr 6th), when I realized that we had 3.6 million tokens in keystone.token.

So even though keystone-manage token_flush was invoked every day (confirmed by looking at cron jobs), it appears that this had been ineffective.

I had to issue a 'truncate token' manually to recover (it took 10 minutes).

This needs to be fixed.

Also, the cron line in /var/spool/cron/keystone doesn't write a log file:

# grep flush /var/spool/cron/keystone
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/dev/null 2>&1

Perhaps this should be re-written as:
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1

Comment 4 David Hill 2017-04-10 13:56:38 UTC
This should be run hourly IMHO.
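A sketch of what an hourly schedule in /var/spool/cron/keystone could look like, also picking up the logging suggestion from comment 3 (the exact schedule and random splay that puppet-keystone ends up shipping may differ):

1 * * * * sleep `expr ${RANDOM} \% 600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1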

Comment 7 John Dennis 2017-04-13 15:55:36 UTC
*** Bug 1438294 has been marked as a duplicate of this bug. ***

Comment 20 Vincent S. Cojot 2017-05-08 14:04:23 UTC
On a freshly installed OSP10 cloud, I started generating tokens at a rate of about 3-4 tokens/second this past Friday (05/05 around 4pm EST).

Here's what I had at the time:

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 hour) ;
+----------+
| count(*) |
+----------+
|     5435 |
+----------+
1 row in set (0.01 sec)


This monday (2017/05/08) here's what I have:

MariaDB [(none)]> select count(*) from keystone.token;
+----------+
| count(*) |
+----------+
|   494164 |
+----------+
1 row in set (0.06 sec)

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 day) ;
+----------+
| count(*) |
+----------+
|   487672 |
+----------+
1 row in set (0.12 sec)

So it appears that token_flush failed to do what it was supposed to do over the weekend.
Still, there are signs on -all- 3 controllers that it ran:

[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld0 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-0 CROND[454010]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld1 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-1 CROND[820407]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld2 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-2 CROND[700900]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)

Comment 21 Vincent S. Cojot 2017-05-08 14:04:54 UTC
Here's what I see in the logs on all 3 controllers:

Happened on all three controllers:

[root@pbnec1-l-rh-ocld-0 keystone]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-1 ~]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-2 keystone]#  zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

Comment 26 Harry Rybacki 2017-07-11 16:57:50 UTC
*** Bug 1466091 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2017-12-13 20:54:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

