Bug 1466091 - Keystone Token Flush job does not complete in HA deployed environment
Summary: Keystone Token Flush job does not complete in HA deployed environment
Keywords:
Status: CLOSED DUPLICATE of bug 1404324
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-keystone
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: John Dennis
QA Contact: Prasanth Anbalagan
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-06-29 01:29 UTC by Graeme Gillies
Modified: 2017-07-11 16:57 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-11 16:57:50 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID           Private  Priority  Status  Summary  Last Updated
Launchpad 1649616   0        None      None    None     2017-06-29 01:29:30 UTC

Description Graeme Gillies 2017-06-29 01:29:30 UTC
This is basically a link to the upstream bug

https://bugs.launchpad.net/keystone/+bug/1649616

I'll paste the description here:

[Impact]

 * The Keystone token flush job can get into a state where it will never complete, because the transaction size exceeds the MySQL Galera transaction size limit, wsrep_max_ws_size (1073741824 bytes).
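
For reference, the configured Galera writeset limit can be checked directly on a controller (a minimal sketch, assuming local root access to the mariadb/galera node):

# Show the configured Galera transaction/writeset size limit, in bytes
mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws_size';"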

[Test Case]

1. Authenticate many times
2. Observe that the keystone token flush job runs for a very long time (depending on disk speed; >20 hours in my environment)
3. Observe errors in mysql.log indicating a transaction that is too large
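
To make step 3 concrete, the flush can also be run by hand and the Galera log watched for the size-limit errors (a minimal sketch; the mysqld.log path may vary by deployment, e.g. /var/log/mariadb/mysqld.log):

# Run the token flush manually as the keystone user (the command the flush cron job typically invokes)
sudo -u keystone keystone-manage token_flush

# Look for the Galera writeset-size warnings/errors generated by the flush
grep -E "transaction size limit|rbr write fail" /var/log/mariadb/mysqld.log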

Actual results:
Expired tokens are not actually flushed from the database, and no errors appear in keystone.log. Errors appear only in mysql.log.

Expected results:
Expired tokens are removed from the database.

[Additional info]

It is likely that this can be demonstrated with fewer than 1 million tokens, since the >1 million-row token table is larger than 13 GiB while the maximum transaction size is 1 GiB; my token-benchmarking Browbeat job creates more tokens than needed.
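
A rough way to see how large the token table has grown (a minimal sketch, assuming the default 'keystone' database name):

# Count expired rows still sitting in the token table
mysql keystone -e "SELECT COUNT(*) FROM token WHERE expires < UTC_TIMESTAMP();"

# Report the on-disk size of the token table, in GiB
mysql -e "SELECT (data_length + index_length)/1024/1024/1024 AS size_gib FROM information_schema.tables WHERE table_schema='keystone' AND table_name='token';"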

Once the token flush job can no longer complete, the token table will never decrease in size and the cloud will eventually run out of disk space.

Furthermore, each flush attempt consumes significant disk I/O. This was demonstrated on slow disks (a single 7.2K SATA disk); faster disks give you more capacity to generate tokens, so you can exceed the transaction size limit even faster.
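
The disk pressure during a flush attempt can be observed while the job runs (assuming the sysstat package is installed):

# Sample extended device statistics every 5 seconds; watch the %util column for the disk backing /var/lib/mysql
iostat -x 5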

Log evidence:
[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202
[root@overcloud-controller-0 log]# tail mysqld.log
161208 1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208 1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209 9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209 9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211 1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211 1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212 1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212 1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213 1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213 1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2

A graph of the disk utilization issue is attached. In that graph, the entire job runs from the first spike in disk utilization (~5:18 UTC) and culminates in about 90 minutes of pegging the disk (between 1:09 UTC and 2:43 UTC).

[Regression Potential]
* Not identified

Comment 1 Michael Bayer 2017-06-29 13:49:53 UTC
The current strategy here is to ensure the token flush job runs at least hourly; additionally, upstream is working on putting each batched delete in its own transaction, see https://review.openstack.org/#/c/469514/.
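
For context, "at least hourly" means tightening the keystone token flush cron entry; a minimal sketch of what an hourly line in the keystone user's crontab might look like (the log path here is an assumption, not taken from the review above):

# Run the token flush at the top of every hour (the logs above show it currently running roughly daily)
0 * * * * keystone-manage token_flush >> /var/log/keystone/keystone-tokenflush.log 2>&1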

Comment 2 Harry Rybacki 2017-07-11 16:57:50 UTC
Fixes for this bug are being handled in BZ#1404324.

*** This bug has been marked as a duplicate of bug 1404324 ***

