Bug 1404324 - Token flush does not complete when the number of expired tokens becomes large enough
Summary: Token flush does not complete when the number of expired tokens becomes large enough
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: puppet-keystone
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: Upstream M3
Target Release: 12.0 (Pike)
Assignee: John Dennis
QA Contact: Prasanth Anbalagan
URL:
Whiteboard: PerfScale
Duplicates: 1438294 1466091 (view as bug list)
Depends On:
Blocks: 1451122 1467120 1469457 1470221 1470226 1470227 1470230 1473713
 
Reported: 2016-12-13 15:25 UTC by Alex Krzos
Modified: 2022-11-24 05:47 UTC (History)
22 users

Fixed In Version: puppet-keystone-11.1.0-0.20170519122658.5fb6cce.el7ost
Doc Type: Enhancement
Doc Text:
The token flush cron job has been modified to run hourly instead of once a day. This was changed because of issues raised in larger deployments, where the operation would take too long and sometimes even fail because the transaction was too large. Note that this only affects deployments using the UUID token provider.
Clone Of:
Clones: 1451122 1469457 1470221 1470226 1470227 1470230 1473713 (view as bug list)
Environment:
Last Closed: 2017-12-13 20:54:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Token flush job Disk IO % util (63.73 KB, image/png)
2016-12-13 15:25 UTC, Alex Krzos
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1649616 0 None None None 2016-12-20 02:47:55 UTC
OpenStack gerrit 454351 0 None MERGED Make flushing tokens more robust 2021-01-28 20:32:10 UTC
OpenStack gerrit 456182 0 None MERGED Run token flush cron job hourly by default 2021-01-28 20:32:10 UTC
OpenStack gerrit 457553 0 None MERGED Change keystone token flush to run hourly 2021-01-28 20:32:11 UTC
OpenStack gerrit 480287 0 None MERGED Fixing flushing tokens workflow 2021-01-28 20:32:11 UTC
Red Hat Issue Tracker OSP-963 0 None None None 2022-11-24 05:47:37 UTC
Red Hat Knowledge Base (Solution) 3122971 0 None None None 2017-10-15 07:07:01 UTC
Red Hat Knowledge Base (Solution) 3167441 0 None None None 2017-09-26 08:44:27 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Alex Krzos 2016-12-13 15:25:18 UTC
Created attachment 1231251 [details]
Token flush job Disk IO % util

Description of problem:
The Keystone token flush job can get into a state where it will never complete because the transaction size exceeds the MySQL Galera transaction size limit, wsrep_max_ws_size (1073741824).


Version-Release number of selected component (if applicable):
Discovered in RHOP 10 (Newton), but perhaps affects older releases too
RHOP, Build: 2016-11-04.2


How reproducible:
You need to generate a large number of UUID tokens to reproduce. Once in this state, the job never recovers.


Steps to Reproduce:
1. Authenticate many times (>1 million)
2. Observe that the keystone token flush job runs for a very long time depending on disk speed (>20 hours in my environment)
3. Observe errors in mysql.log indicating a transaction that is too large


Actual results:
Expired tokens are not actually flushed from the database, and no errors appear in keystone.log. Errors appear only in mysql.log.


Expected results:
Expired tokens to be removed from the database


Additional info:
It is likely that you can demonstrate this with fewer than 1 million tokens: the >1 million row token table is larger than 13 GiB, while the max transaction size is 1 GiB. My token benchmarking Browbeat job simply creates more than needed.

Once the token flush job cannot complete, the token table will never decrease in size and eventually the cloud will run out of disk space.

Furthermore, each flush attempt consumes considerable disk I/O. This was demonstrated on slow disks (a single 7.2K SATA disk). On faster disks you have more capacity to generate tokens, so you can exceed the transaction size even faster.
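As a quick way to confirm a deployment is in this state, the token table size can be compared against the Galera writeset limit. A minimal sketch, assuming the default keystone schema and a mysql client with credentials on the controller:

# Current Galera transaction (writeset) size limit
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';"

# Approximate on-disk size of the token table, in GiB
mysql -e "SELECT table_name, ROUND((data_length + index_length)/1024/1024/1024, 2) AS size_gib FROM information_schema.tables WHERE table_schema = 'keystone' AND table_name = 'token';"

If size_gib approaches or exceeds 1 GiB, a single-transaction DELETE of all expired rows is likely to hit the limit.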

Log evidence:
[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202
[root@overcloud-controller-0 log]# tail mysqld.log 
161208  1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208  1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209  9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209  9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211  1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211  1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212  1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212  1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213  1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213  1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2


Disk utilization issue graph is attached. The entire job in that graph runs from the first spike in disk utilization (~5:18 UTC) and culminates in roughly 90 minutes of pegging the disk (between 1:09 UTC and 2:43 UTC).
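For reference, one possible manual workaround (not the shipped fix) is to delete expired rows in bounded batches so that no single transaction approaches wsrep_max_ws_size. A rough sketch, assuming mysql client access on a controller; the one-day grace interval and the 10000-row batch size are arbitrary examples:

# Repeatedly delete small batches of expired tokens until none remain;
# each DELETE commits on its own, so the transaction size stays bounded.
while true; do
    rows=$(mysql -N -e "DELETE FROM keystone.token WHERE expires < DATE_SUB(UTC_TIMESTAMP(), INTERVAL 1 DAY) LIMIT 10000; SELECT ROW_COUNT();")
    [ "$rows" -eq 0 ] && break
    sleep 1   # let other writers in between batches
done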

Comment 1 Alex Krzos 2016-12-13 15:38:37 UTC
Upstream bug https://bugs.launchpad.net/keystone/+bug/1649616

Comment 3 Vincent S. Cojot 2017-04-06 19:05:46 UTC
Hi,

On Monday, Apr 3rd, there was a problem with MySQL on OSP10 and I realized that we had gone beyond 3 million tokens in keystone.token.
This came to my attention as the openstack commands started to fail with a timeout.

After repairing the MySQL/Galera DB, things went fine for a few days until today (Apr 6th), when I realized that we had 3.6 million tokens in keystone.token.

So even though keystone-manage token_flush was invoked every day (confirmed by looking at cron jobs), it appears that this had been ineffective.

I had to issue a 'truncate token' manually to recover (it took 10 minutes).

This needs to be fixed.

Also, the cron line in /var/spool/cron/keystone doesn't write a log file:

# grep flush /var/spool/cron/keystone
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/dev/null 2>&1

Perhaps this should be re-written as:
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1

Comment 4 David Hill 2017-04-10 13:56:38 UTC
This should be run hourly IMHO.
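A sketch of what an hourly schedule in /var/spool/cron/keystone could look like, also picking up the logging suggestion from comment 3 (the exact schedule and random splay that puppet-keystone ends up shipping may differ):

1 * * * * sleep `expr ${RANDOM} \% 600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1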

Comment 7 John Dennis 2017-04-13 15:55:36 UTC
*** Bug 1438294 has been marked as a duplicate of this bug. ***

Comment 20 Vincent S. Cojot 2017-05-08 14:04:23 UTC
On a freshly installed OSP10 cloud, I started generating tokens at a rate of about 3-4 tokens/second this past Friday (05/05 around 4pm EST).

Here's what I had at the time:

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 hour) ;
+----------+
| count(*) |
+----------+
|     5435 |
+----------+
1 row in set (0.01 sec)


This monday (2017/05/08) here's what I have:

MariaDB [(none)]> select count(*) from keystone.token;
+----------+
| count(*) |
+----------+
|   494164 |
+----------+
1 row in set (0.06 sec)

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 day) ;
+----------+
| count(*) |
+----------+
|   487672 |
+----------+
1 row in set (0.12 sec)

So it appears that token_flush failed to do what it was supposed to do over the weekend.
Still, there are signs on -all- 3 controllers that it ran:

[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld0 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-0 CROND[454010]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld1 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-1 CROND[820407]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld2 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-2 CROND[700900]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)

Comment 21 Vincent S. Cojot 2017-05-08 14:04:54 UTC
Here's what I see in the logs on all 3 controllers:

Happened on all three controllers:

[root@pbnec1-l-rh-ocld-0 keystone]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-1 ~]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-2 keystone]#  zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

Comment 26 Harry Rybacki 2017-07-11 16:57:50 UTC
*** Bug 1466091 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2017-12-13 20:54:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

