Created attachment 1231251 [details]
Token flush job disk I/O % util

Description of problem:
The Keystone token flush job can get into a state where it will never complete, because the transaction size exceeds the MySQL Galera transaction size limit, wsrep_max_ws_size (1073741824).

Version-Release number of selected component (if applicable):
Discovered in RHOP 10 (Newton), but it may affect older releases too.
RHOP Build: 2016-11-04.2

How reproducible:
You need to generate a large number of UUID tokens to reproduce; once in this state, it never gets out.

Steps to Reproduce:
1. Authenticate many times (>1 million tokens)
2. Observe that the keystone token flush job runs for a very long time, depending on disk (>20 hours in my environment)
3. Observe errors in mysql.log indicating a transaction that is too large

Actual results:
Expired tokens are not actually flushed from the database, and no errors appear in keystone.log. Errors appear only in mysql.log.

Expected results:
Expired tokens are removed from the database.

Additional info:
It is likely that you can demonstrate this with fewer than 1 million tokens, since the >1 million row token table is larger than 13 GiB while the maximum transaction size is 1 GiB; my token benchmarking Browbeat job simply creates more than needed. Once the token flush job cannot complete, the token table never decreases in size and eventually the cloud will run out of disk space. Furthermore, the flush job consumes disk utilization resources. This was demonstrated on slow disks (a single 7.2K SATA disk). On faster disks you have more capacity to generate tokens, so you can exceed the transaction size even faster.

Log evidence:
[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202

[root@overcloud-controller-0 log]# tail mysqld.log
161208 1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208 1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209 9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209 9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211 1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211 1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212 1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212 1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213 1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213 1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2

The disk utilization graph is attached. The entire job in that graph runs from the first spike in disk utilization (~5:18 UTC) and culminates in about 90 minutes of pegging the disk (between 1:09 UTC and 2:43 UTC).
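For reference, the transaction size limit and the current on-disk size of the token table can be checked on a controller with something like the following. This is only a sketch: it assumes the mysql client can already authenticate (e.g. via /root/.my.cnf on the overcloud controllers) and that the schema is named 'keystone'.

# Show the Galera writeset size limit the flush job is hitting
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';"

# Show the approximate data size of the token table in GiB
mysql -e "SELECT table_name, ROUND(data_length/1024/1024/1024, 1) AS data_gib
          FROM information_schema.tables
          WHERE table_schema = 'keystone' AND table_name = 'token';"

If data_gib is anywhere near (or above) the 1 GiB writeset limit, a single-statement DELETE of the expired rows cannot replicate.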
Upstream bug https://bugs.launchpad.net/keystone/+bug/1649616
Hi,

On Monday, Apr 3rd there was a problem with MySQL on OSP10, and I realized that we had gone beyond 3 million tokens in keystone.token. This came to my attention when the openstack commands started to fail with a timeout. After repairing the MySQL/Galera DB, things went fine for a few days until today (Apr 6th), when I realized that we had 3.6 million tokens in keystone.token. So even though keystone-manage token_flush was invoked every day (confirmed by looking at the cron jobs), it appears that this has been ineffective. I had to issue a 'truncate token' manually to recover (it took 10 minutes). This needs to be fixed.

Also, the line from /var/spool/cron/keystone doesn't leave a logfile:

# grep flush /var/spool/cron/keystone
1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/dev/null 2>&1

Perhaps this should be rewritten as:

1 0 * * * sleep `expr ${RANDOM} \% 3600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1
This should be run hourly, IMHO.
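An hourly variant combining both suggestions might look like the entry below. This is an untested sketch, not the shipped cron configuration; the shorter random sleep (up to 600 seconds instead of 3600) is my own assumption so each run comfortably finishes within its hour.

0 * * * * sleep `expr ${RANDOM} \% 600`; keystone-manage token_flush >>/var/log/keystone/token_flush.log 2>&1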
*** Bug 1438294 has been marked as a duplicate of this bug. ***
On a freshly installed OSP10 cloud, I started generating tokens at a rate of about 3-4 tokens/second this past Friday (05/05, around 4pm EST). Here's what I had at the time:

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 hour) ;
+----------+
| count(*) |
+----------+
|     5435 |
+----------+
1 row in set (0.01 sec)

This Monday (2017/05/08), here's what I have:

MariaDB [(none)]> select count(*) from keystone.token;
+----------+
| count(*) |
+----------+
|   494164 |
+----------+
1 row in set (0.06 sec)

MariaDB [(none)]> select count(*) from keystone.token where expires < date_sub(now(), interval 1 day) ;
+----------+
| count(*) |
+----------+
|   487672 |
+----------+
1 row in set (0.12 sec)

So it appears that token_flush failed to do what it was supposed to do over the weekend. Still, there are signs on all 3 controllers that it ran:

[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld0 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-0 CROND[454010]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld1 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-1 CROND[820407]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
[stack@pbnec1-l-rh-ucld-0 ~]$ ssh ocld2 sudo grep token_flush /var/log/cron
May  8 00:01:01 pbnec1-l-rh-ocld-2 CROND[700900]: (keystone) CMD (sleep `expr ${RANDOM} % 3600`; keystone-manage token_flush >>/dev/null 2>&1)
Here's what I see in the logs; it happened on all three controllers:

[root@pbnec1-l-rh-ocld-0 keystone]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:51:44.952 313749 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-1 ~]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170507.gz:2017-05-07 00:13:26.671 695266 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:44:32.074 1011279 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]

[root@pbnec1-l-rh-ocld-2 keystone]# zgrep DBDeadlock /var/log/keystone/keystone.log*gz
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 CRITICAL keystone [-] DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
/var/log/keystone/keystone.log-20170508.gz:2017-05-08 00:59:06.181 942308 ERROR keystone DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'DELETE FROM token WHERE token.expires <= %s'] [parameters: (datetime.datetime(2017, 5, 6, 0, 39, 24),)]
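Until a fixed token_flush is available, one possible manual workaround (a sketch only, not the supported procedure) is to delete expired rows in small batches, so each transaction stays well under wsrep_max_ws_size and holds row locks only briefly. The batch size of 10000 and the credential setup are assumptions; adjust for your environment.

# Sketch: batch-delete expired tokens on one controller.
# Assumes mysql client auth is already configured (e.g. /root/.my.cnf)
# and that a 10000-row batch fits comfortably under wsrep_max_ws_size.
while true; do
  rows=$(mysql -N -B keystone -e \
    "DELETE FROM token WHERE expires <= UTC_TIMESTAMP() LIMIT 10000; SELECT ROW_COUNT();")
  echo "$(date -u) deleted ${rows} rows"
  [ "${rows}" -eq 0 ] && break
done

Each iteration commits a small transaction, so it replicates through Galera and avoids the lock wait timeouts seen above when the single large DELETE collides with normal token traffic.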
*** Bug 1466091 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462