Bug 1451122 - Cloudforms causes a Token Storm on OSP10 overcloud
Summary: Cloudforms causes a Token Storm on OSP10 overcloud
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: C&U Capacity and Utilization
Version: unspecified
Hardware: x86_64
OS: Linux
urgent
high
Target Milestone: GA
: 5.9.0
Assignee: Marek Aufart
QA Contact: Ido Ovadia
URL:
Whiteboard: openstack
Depends On: 1404324 1469457 1470221 1470226 1470227 1470230 1473713
Blocks: 1456021 1460318
 
Reported: 2017-05-15 21:00 UTC by Vincent S. Cojot
Modified: 2021-06-10 12:19 UTC (History)
29 users

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1404324
: 1456021 1460318
Environment:
Last Closed: 2018-03-06 15:50:14 UTC
Category: ---
Cloudforms Team: Openstack
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1649616 0 None None None 2017-05-15 21:00:14 UTC
OpenStack gerrit 456182 0 None MERGED Run token flush cron job hourly by default 2020-10-05 09:44:18 UTC
OpenStack gerrit 457553 0 None MERGED Change keystone token flush to run hourly 2020-10-05 09:44:18 UTC

Comment 2 Vincent S. Cojot 2017-05-15 21:03:03 UTC
Hi,
I am cloning this bug (Eng needs to fix the token_flush issue anyway) and routing it to CloudForms.
It appears that our token flush issue on OSP10 is being caused by CloudForms generating a *lot* of tokens.
Here's an example:
$ sudo mysql keystone -e 'select count(token.id) as count,name from token,local_user where token.user_id=local_user.user_id group by name order by count desc;'
+---------+-------------------------+
| count   | name                    |
+---------+-------------------------+
| 3359889 | cloudforms              |
|  115602 | neutron                 |
|   87496 | ceilometer              |
|   55042 | nova                    |
|   27849 | glance                  |
|   27281 | gnocchi                 |
|   26843 | heat                    |
|   24704 | cinder                  |
|   21809 | sevone                  |
|    7472 | admin                   |
|    3270 | swift                   |
|     928 | foobar                  |
|     366 | evillal                 |
|      59 | heat_stack_domain_admin |
|      13 | aodh                    |
|      10 | heat-cfn                |
|       8 | coombsu                 |
+---------+-------------------------+

Why is CloudForms generating this many tokens?
Is this normal behaviour?
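
For reference on the overcloud side, the gerrit reviews linked above change the keystone token flush cron job to run hourly. A minimal sketch of what that job amounts to, assuming the stock keystone-manage CLI is available on a controller with the SQL token backend (the wrapper function and invocation here are illustrative only, not the actual tripleo cron definition):

    # Illustrative only: the linked gerrit changes schedule this via cron on the
    # overcloud controllers. keystone-manage token_flush purges expired rows from
    # the keystone token table, so counts like the ones above stop growing unbounded.
    import subprocess

    def flush_expired_tokens():
        # Assumes keystone-manage is on PATH and run with sufficient privileges.
        subprocess.run(["keystone-manage", "token_flush"], check=True)

    if __name__ == "__main__":
        flush_expired_tokens()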

Comment 3 Greg Blomquist 2017-05-15 21:18:27 UTC
The only reason I can think of for generating so many tokens is a CloudForms worker dying and restarting repeatedly. Each time the worker restarts, it might try to reconnect to OSP to gather data.

Comment 11 Tzu-Mainn Chen 2017-05-16 13:39:43 UTC
Could we also get some information as to which version of CFME is being used?

Comment 19 Tzu-Mainn Chen 2017-05-25 21:32:00 UTC
I've tested and merged Marek's PR:

https://github.com/ManageIQ/manageiq-providers-openstack/pull/45

The other one (https://github.com/ManageIQ/manageiq-gems-pending/pull/160) should not be considered part of this BZ.

By my testing, Marek's PR reduces token generation to a quarter of what it was before, which is hopefully enough to avoid the token storm issue. Marek, are there any other optimizations you wanted to look into?
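
For context, the bulk of this kind of saving comes from reusing one authenticated session (and therefore one token) across successive API calls instead of re-authenticating per call. ManageIQ itself is Ruby, so the snippet below is only a Python illustration of the idea using keystoneauth1; the endpoint URLs and credentials are placeholders:

    # Illustration of token reuse with keystoneauth1 (not ManageIQ code).
    # A Session built from a single Password auth plugin requests one token
    # and reuses it until it expires, instead of minting one token per request.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    auth = v3.Password(
        auth_url="http://controller:5000/v3",   # placeholder endpoint
        username="cloudforms",
        password="secret",                       # placeholder credentials
        project_name="admin",
        user_domain_name="Default",
        project_domain_name="Default",
    )
    sess = session.Session(auth=auth)

    # Both calls below ride on the same cached token; a naive client that built
    # a new auth plugin and session per call would request a new token each time.
    servers = sess.get("http://controller:8774/v2.1/servers").json()
    networks = sess.get("http://controller:9696/v2.0/networks").json()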

Comment 20 Marek Aufart 2017-05-26 08:53:34 UTC
https://github.com/ManageIQ/manageiq-providers-openstack/pull/45 should be the fix on CF side.

https://github.com/ManageIQ/manageiq-gems-pending/pull/160 is more of an enhancement and refactoring with only a minor effect on this issue, so I don't think we need to backport it.

Another change that could decrease the number of Keystone auth tokens significantly is merging the EventWorkers (Cloud, Network, and Storage if present) into one worker and distributing the captured events inside CF. That would be a bigger change to the CF codebase and providers architecture, which looks to me to be outside the scope of this BZ.

Comment 22 Rafael Urena 2017-06-01 15:56:41 UTC
The version is 5.7.0.17, if you still need it.

Comment 27 Tzu-Mainn Chen 2017-07-10 13:51:43 UTC
Ah, yep - do you have any information about the question I asked in https://bugzilla.redhat.com/show_bug.cgi?id=1451122#c25?

Comment 28 Rafael Urena 2017-07-10 14:01:38 UTC
I don't see comment 25.

Comment 29 Tzu-Mainn Chen 2017-07-10 14:13:04 UTC
Oh, whoops:

Thanks for the information! Just for clarification - NEC1 seems to be the environment where token generation is most under control, while MSC1, MSC2, and NEC2 seem to have unsustainable growth - does that match what you're seeing? What's the difference between these four environments?

Comment 31 Rafael Urena 2017-07-10 14:39:51 UTC
Usually if a change is made on one environment we make the change across all 4 environments. All 4 environments were set up using the same procedure and were done within a month of each other. There are 3 controllers on all 4 environments and 14 or 32 computes (NEC or MSC, respectively).

NEC1 and NEC2 are the same hardware and network setup. There are newer packages on NEC2 due to an update we performed for a security vendor operating in this environment. I believe these will be pushed to all the environments eventually. We have been collecting data on NEC1 for about a month. We've seen it go up to a little over 3k tokens. NEC1 is the least used of the 4 environments by outside vendors.

NEC2 is being used to check for vulnerabilities and has Satellite Puppet modules installed for OpenSCAP scans.

MSC1 and MSC2 are identical in hardware/network/software. The only difference I know of is that different vendors are on each environment. They work on hosting networks for wireless communications.

Comment 32 Tzu-Mainn Chen 2017-07-10 19:49:46 UTC
Thanks for the info! Looking at the logs, on MSC2 it looks like refreshes cause a small spike in tokens - maybe 9 or so - while seven tokens are generated every 45 seconds (roughly 560 per hour, or about 13,000 per day, on top of the refresh spikes). The high frequency indicates that this has something to do with the event worker. We'll focus our efforts there.

Comment 33 Tzu-Mainn Chen 2017-07-11 21:38:02 UTC
It looks like there was an issue where certain kinds of API requests - including those for events - would result in ManageIQ looping over every single tenant and making a connection for each. https://github.com/ManageIQ/manageiq-providers-openstack/pull/62 should fix this issue.
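
To illustrate the shape of the problem: a freshly scoped connection per tenant per poll mints a new token for every tenant each time, while caching and reusing a session per tenant does not. The Python sketch below, again using keystoneauth1 with placeholder names, is only a hand-drawn analogue of the pattern described above, not the actual ManageIQ/fog code or the content of PR 62:

    # Placeholder sketch, not ManageIQ code: per-tenant re-authentication vs. cached sessions.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    AUTH_URL = "http://controller:5000/v3"       # placeholder endpoint
    TENANTS = ["admin", "tenant-a", "tenant-b"]  # placeholder tenant list

    def _scoped_auth(tenant):
        return v3.Password(auth_url=AUTH_URL, username="cloudforms",
                           password="secret", project_name=tenant,
                           user_domain_name="Default",
                           project_domain_name="Default")

    def poll_naive():
        # Anti-pattern: a brand-new scoped auth (and therefore a new token) for
        # every tenant on every poll; with many tenants and a 45-second poll
        # interval this adds up to counts like those in comment 2.
        for tenant in TENANTS:
            sess = session.Session(auth=_scoped_auth(tenant))
            sess.get("http://controller:8774/v2.1/servers")

    _session_cache = {}

    def poll_cached():
        # One way to avoid it: keep one session (and token) per tenant and
        # reuse it across polls until the token expires.
        for tenant in TENANTS:
            sess = _session_cache.setdefault(
                tenant, session.Session(auth=_scoped_auth(tenant)))
            sess.get("http://controller:8774/v2.1/servers")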

Comment 40 Ido Ovadia 2018-02-04 12:11:43 UTC
Verified
========
CFME 5.9.0.19 + RHOS11

Comment 41 Ido Ovadia 2018-02-04 12:14:05 UTC
(In reply to Ido Ovadia from comment #40)
> Verified
> ========
> CFME 5.9.0.19 + RHOS11

Typo fix: the verification was made on RHOS 10.

