Bug 1404324

Summary: Token flush does not complete when the number of expired tokens becomes large enough
Product: Red Hat OpenStack
Reporter: Alex Krzos <akrzos>
Component: puppet-keystone
Assignee: John Dennis <jdennis>
Status: CLOSED ERRATA
QA Contact: Prasanth Anbalagan <panbalag>
Severity: high
Priority: urgent
Version: 10.0 (Newton)
CC: akrzos, cshastri, dhill, dshevrin, ggillies, jdennis, jjoyce, josorior, jschluet, kbasil, lmiccini, mlopes, nchandek, nkinder, pcaruana, racedoro, rcritten, rmascena, slinaber, srevivo, tvignaud, vcojot
Target Milestone: Upstream M3
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Whiteboard: PerfScale
Fixed In Version: puppet-keystone-11.1.0-0.20170519122658.5fb6cce.el7ost
Doc Type: Enhancement
Doc Text:
The token flush cron job has been modified to run hourly instead of once a day. This was changed because of issues raised in larger deployments, where the operation would take too long and sometimes even fail because the transaction was too large. Note that this only affects deployments using the UUID token provider.
Clone Of:
: 1451122 1469457 1470221 1470226 1470227 1470230 1473713 (view as bug list)
Last Closed: 2017-12-13 20:54:56 UTC
Type: Bug
Bug Blocks: 1451122, 1467120, 1469457, 1470221, 1470226, 1470227, 1470230, 1473713    
Attachments:
Token flush job Disk IO % util

Description Alex Krzos 2016-12-13 15:25:18 UTC
Created attachment 1231251 [details]
Token flush job Disk IO % util

Description of problem:
The Keystone token flush job can get into a state where it will never complete because the transaction size exceeds the MySQL Galera transaction size limit, wsrep_max_ws_size (1073741824).
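For reference, a quick way to confirm this condition on a controller is to compare the configured write-set limit with the size of the token table. This is only a diagnostic sketch; the mysql client invocation assumes local root/socket access on the controller, which is not taken from this report:

# Show the Galera write-set (transaction) size limit, in bytes
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';"

# Approximate on-disk size of the keystone token table
mysql -e "SELECT table_name, ROUND((data_length + index_length) / 1024 / 1024) AS size_mb FROM information_schema.tables WHERE table_schema = 'keystone' AND table_name = 'token';"

# Count of expired rows that a single token_flush run would try to delete
mysql -e "SELECT COUNT(*) FROM keystone.token WHERE expires <= NOW();"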


Version-Release number of selected component (if applicable):
Discovered in RHOP 10 (Newton), but it perhaps affects older releases too
RHOP, Build: 2016-11-04.2


How reproducible:
You need to generate a large number of UUID tokens to reproduce; once in this state, the flush job never recovers.


Steps to Reproduce:
1. Authenticate many times (>1 million tokens); a token-generation sketch is shown after this list
2. Observe that the keystone token flush job runs for a very long time depending on disk speed (>20 hours in my environment)
3. Observe errors in mysql.log indicating a transaction that is too large
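A minimal way to drive token creation for step 1, assuming an overcloudrc with valid credentials is sourced; this loop is only an illustration, not the Browbeat benchmarking job actually used here:

# Generate a large number of UUID tokens against the overcloud keystone
source ~/overcloudrc
for i in $(seq 1 1000000); do
    openstack token issue -f value -c id > /dev/null
done

In practice a single serial loop like this is slow; the point is simply to accumulate enough expired rows in keystone.token.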


Actual results:
Expired tokens are not actually flushed from the database, and no errors appear in keystone.log.  Errors only appear in mysql.log.


Expected results:
Expired tokens to be removed from the database


Additional info:
It is likely that this can be demonstrated with fewer than 1 million tokens, since the >1 million-row token table is larger than 13GiB while the maximum transaction size is 1GiB; my token benchmarking Browbeat job simply creates more than needed.

Once the token flush job can no longer complete, the token table will never decrease in size and the cloud will eventually run out of disk space.

Furthermore, each flush attempt consumes considerable disk I/O.  This was demonstrated on slow disks (a single 7.2K SATA disk).  On faster disks you have more capacity to generate tokens, so you can exceed the transaction size limit even faster.
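As a manual workaround on an affected deployment, the expired rows can be removed in bounded batches so that each transaction stays well under wsrep_max_ws_size. This is only a sketch of the general approach (the batch size of 10000 and local mysql access are assumptions), not the fix shipped for this bug:

# Delete expired tokens in small batches until a batch removes nothing
while :; do
    deleted=$(mysql -N -s -e "DELETE FROM keystone.token WHERE expires <= NOW() LIMIT 10000; SELECT ROW_COUNT();")
    [ "${deleted:-0}" -eq 0 ] && break
    echo "deleted ${deleted} expired token rows"
done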

Log evidence:
[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202
[root@overcloud-controller-0 log]# tail mysqld.log 
161208  1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208  1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209  9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209  9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211  1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211  1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212  1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212  1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213  1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213  1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2


The disk utilization graph is attached.  The entire job in that graph runs from the first spike in disk utilization (~5:18 UTC) and culminates in about 90 minutes of pegging the disk (between 1:09 UTC and 2:43 UTC).

Comment 1 Alex Krzos 2016-12-13 15:38:37 UTC
Upstream bug https://bugs.launchpad.net/keystone/+bug/1649616

Comment 3 Vincent S. Cojot 2017-04-06 19:05:46 UTC
Hi,

On Monday, Apr 3rd, there was a problem with mysql on OSP10 and I realized that we had gone beyond 3 million tokens in keystone.token.
This came to my attention as the openstack commands started to fail with a timeout.

After repairing the mysql/galera DB, things went fine for a few days until today (Apr 6th), when I realized that we had 3.6 million tokens in keystone.token.

So even though keystone-manage token_flush was invoked every day (confirmed by looking at cron jobs), it appears that this had been ineffective.

I had to issue a 'truncate token' manually to recover (it took 10 minutes).

This needs to be fixed.

Also, the line from /var/spool/cron/keystone doesn't leave a logfile:

# grep flush /var/spool/cron/keystone
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/dev/null 2>&1

Perhaps this should be re-written as:
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1

Comment 4 David Hill 2017-04-10 13:56:38 UTC
This should be run hourly IMHO.
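For illustration, an hourly version of the existing cron entry, keeping the same jitter style but shrunk to fit within the hour and logging to a file as suggested in comment 3, could look like the line below. This is only a sketch; it is not the exact entry later shipped by puppet-keystone:

# /var/spool/cron/keystone - hourly token_flush with up to 5 minutes of jitter (sketch)
1 * * * * sleep `expr ${RANDOM} \% 300`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1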

Comment 7 John Dennis 2017-04-13 15:55:36 UTC
*** Bug 1438294 has been marked as a duplicate of this bug. ***

Comment 20 Vincent S. Cojot 2017-05-08 14:04:23 UTC
On a freshly installed OSP10 cloud, I started generating tokens at a rate of about 3-4 tokens/second this past Friday (05/05 around 4pm EST).

Here's what I had at the time:

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 hour) ;
+----------+
| count(*) |
+----------+
|     5435 |
+----------+
1 row in set (0.01 sec)


This monday (2017/05/08) here's what I have:

MariaDB [(none)]> select count(*) from keystone.token;
+----------+
| count(*) |
+----------+
|   494164 |
+----------+
1 row in set (0.06 sec)

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 day) ;
+----------+
| count(*) |
+----------+
|   487672 |
+----------+
1 row in set (0.12 sec)

So it appears that token_flush failed to do what it was supposed to do over the weekend.
Still, there are signs on -all- 3 controllers that it ran:

[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld0 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-0 CROND[454010]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld1 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-1 CROND[820407]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld2 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-2 CROND[700900]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)

Comment 21 Vincent S. Cojot 2017-05-08 14:04:54 UTC
Here's what I see in the logs on all 3 controllers:

Happened on all three controllers:

[root@pbnec1-l-rh-ocld-0 keystone]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-1 ~]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-2 keystone]#  zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

Comment 26 Harry Rybacki 2017-07-11 16:57:50 UTC
*** Bug 1466091 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2017-12-13 20:54:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462