Bug 1451122 - Cloudforms causes a Token Storm on OSP10 overcloud
Summary: Cloudforms causes a Token Storm on OSP10 overcloud
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: C&U Capacity and Utilization
Version: unspecified
Hardware: x86_64
OS: Linux
urgent
high
Target Milestone: GA
: 5.9.0
Assignee: Marek Aufart
QA Contact: Ido Ovadia
URL:
Whiteboard: openstack
Depends On: 1404324 1469457 1470221 1470226 1470227 1470230 1473713
Blocks: 1456021 1460318
 
Reported: 2017-05-15 21:00 UTC by Vincent S. Cojot
Modified: 2021-06-10 12:19 UTC (History)
29 users

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1404324
: 1456021 1460318
Environment:
Last Closed: 2018-03-06 15:50:14 UTC
Category: ---
Cloudforms Team: Openstack
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1649616 0 None None None 2017-05-15 21:00:14 UTC
OpenStack gerrit 456182 0 None MERGED Run token flush cron job hourly by default 2020-10-05 09:44:18 UTC
OpenStack gerrit 457553 0 None MERGED Change keystone token flush to run hourly 2020-10-05 09:44:18 UTC

Comment 2 Vincent S. Cojot 2017-05-15 21:03:03 UTC
Hi,
I am cloning this bug (Eng needs to fix the token_flush issue anyway) and routing it to CloudForms.
It appears that our token flush issue on OSP10 is being caused by CloudForms generating a *lot* of tokens.
Here's an example:
$ sudo mysql keystone -e 'select count(token.id) as count,name from token,local_user where token.user_id=local_user.user_id group by name order by count desc;'
+---------+-------------------------+
| count   | name                    |
+---------+-------------------------+
| 3359889 | cloudforms              |
|  115602 | neutron                 |
|   87496 | ceilometer              |
|   55042 | nova                    |
|   27849 | glance                  |
|   27281 | gnocchi                 |
|   26843 | heat                    |
|   24704 | cinder                  |
|   21809 | sevone                  |
|    7472 | admin                   |
|    3270 | swift                   |
|     928 | foobar                  |
|     366 | evillal                 |
|      59 | heat_stack_domain_admin |
|      13 | aodh                    |
|      10 | heat-cfn                |
|       8 | coombsu                 |
+---------+-------------------------+

Why is CloudForms generating this many tokens?
Is this normal behaviour?
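
For reference on the overcloud side, the gerrit reviews linked above change the keystone token flush cron job to run hourly. A minimal sketch of what that job amounts to, assuming the stock keystone-manage CLI is available on a controller with the SQL token backend (the wrapper function and invocation here are illustrative only, not the actual tripleo cron definition):

    # Illustrative only: the linked gerrit changes schedule this via cron on the
    # overcloud controllers. keystone-manage token_flush purges expired rows from
    # the keystone token table, so counts like the ones above stop growing unbounded.
    import subprocess

    def flush_expired_tokens():
        # Assumes keystone-manage is on PATH and run with sufficient privileges.
        subprocess.run(["keystone-manage", "token_flush"], check=True)

    if __name__ == "__main__":
        flush_expired_tokens()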

Comment 3 Greg Blomquist 2017-05-15 21:18:27 UTC
The only reason I can think of for generating so many tokens is a CloudForms worker dying and restarting repeatedly. Each time the worker restarts, it might try to reconnect to OSP to gather data.

Comment 11 Tzu-Mainn Chen 2017-05-16 13:39:43 UTC
Could we also get some information as to which version of CFME is being used?

Comment 19 Tzu-Mainn Chen 2017-05-25 21:32:00 UTC
I've tested and merged Marek's PR:

https://github.com/ManageIQ/manageiq-providers-openstack/pull/45

The other one (https://github.com/ManageIQ/manageiq-gems-pending/pull/160) should not be considered part of this BZ.

By my testing, Marek's PR reduces token generation to a quarter of what it was before, which is hopefully enough to avoid the token storm issue. Marek, are there any other optimizations you wanted to look into?
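
For context, the bulk of this kind of saving comes from reusing one authenticated session (and therefore one token) across successive API calls instead of re-authenticating per call. ManageIQ itself is Ruby, so the snippet below is only a Python illustration of the idea using keystoneauth1; the endpoint URLs and credentials are placeholders:

    # Illustration of token reuse with keystoneauth1 (not ManageIQ code).
    # A Session built from a single Password auth plugin requests one token
    # and reuses it until it expires, instead of minting one token per request.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    auth = v3.Password(
        auth_url="http://controller:5000/v3",   # placeholder endpoint
        username="cloudforms",
        password="secret",                       # placeholder credentials
        project_name="admin",
        user_domain_name="Default",
        project_domain_name="Default",
    )
    sess = session.Session(auth=auth)

    # Both calls below ride on the same cached token; a naive client that built
    # a new auth plugin and session per call would request a new token each time.
    servers = sess.get("http://controller:8774/v2.1/servers").json()
    networks = sess.get("http://controller:9696/v2.0/networks").json()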

Comment 20 Marek Aufart 2017-05-26 08:53:34 UTC
https://github.com/ManageIQ/manageiq-providers-openstack/pull/45 should be the fix on CF side.

https://github.com/ManageIQ/manageiq-gems-pending/pull/160 is more of an enhancement and refactoring with only a minor effect on this issue, so I don't think we need to backport it.

Another change that could decrease the number of Keystone auth tokens significantly is merging the EventWorkers (Cloud, Network, and Storage if present) into one worker and distributing the captured events inside CF. That would be a bigger change to the CF codebase and providers architecture, which looks to me to be outside the scope of this BZ.

Comment 22 Rafael Urena 2017-06-01 15:56:41 UTC
The version is 5.7.0.17, if you still need it.

Comment 27 Tzu-Mainn Chen 2017-07-10 13:51:43 UTC
Ah, yep - do you have any information about the question I asked in https://bugzilla.redhat.com/show_bug.cgi?id=1451122#c25?

Comment 28 Rafael Urena 2017-07-10 14:01:38 UTC
I don't see comment 25.

Comment 29 Tzu-Mainn Chen 2017-07-10 14:13:04 UTC
Oh, whoops:

Thanks for the information! Just for clarification - NEC1 seems to be the environment where token generation is most under control, while MSC1, MSC2, and NEC2 seem to have unsustainable growth - does that match what you're seeing? What's the difference between these four environments?

Comment 31 Rafael Urena 2017-07-10 14:39:51 UTC
Usually if a change is made on one environment we make the change across all 4 environments. All 4 environments were set up using the same procedure and were done within a month of each other. There are 3 controllers on all 4 environments and 14 or 32 computes (NEC or MSC, respectively).

NEC1 and NEC2 are the same hardware and network setup. There are newer packages on NEC2 due to an update we performed for a security vendor operating in this environment. I believe these will be pushed to all the environments eventually. We have been collecting data on NEC1 for about a month. We've seen it go up to a little over 3k tokens. NEC1 is the least used of the 4 environments by outside vendors.

NEC2 is being used to check for vulnerabilities and has Satellite Puppet modules installed for OpenSCAP scans.

MSC1 and MSC2 are identical in hardware/network/software. The only difference I know of is that different vendors are on each environment. They work on hosting networks for wireless communications.

Comment 32 Tzu-Mainn Chen 2017-07-10 19:49:46 UTC
Thanks for the info! Looking at the logs, on MSC2 it looks like refreshes cause a small spike in tokens - maybe 9 or so - while seven tokens are generated every 45 seconds (roughly 560 per hour, or about 13,000 per day, on top of the refresh spikes). The high frequency indicates that this has something to do with the event worker. We'll focus our efforts there.

Comment 33 Tzu-Mainn Chen 2017-07-11 21:38:02 UTC
It looks like there was an issue where certain kinds of API requests - including those for events - would result in ManageIQ looping over every single tenant and making a connection for each. https://github.com/ManageIQ/manageiq-providers-openstack/pull/62 should fix this issue.
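
To illustrate the shape of the problem: a freshly scoped connection per tenant per poll mints a new token for every tenant each time, while caching and reusing a session per tenant does not. The Python sketch below, again using keystoneauth1 with placeholder names, is only a hand-drawn analogue of the pattern described above, not the actual ManageIQ/fog code or the content of PR 62:

    # Placeholder sketch, not ManageIQ code: per-tenant re-authentication vs. cached sessions.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session

    AUTH_URL = "http://controller:5000/v3"       # placeholder endpoint
    TENANTS = ["admin", "tenant-a", "tenant-b"]  # placeholder tenant list

    def _scoped_auth(tenant):
        return v3.Password(auth_url=AUTH_URL, username="cloudforms",
                           password="secret", project_name=tenant,
                           user_domain_name="Default",
                           project_domain_name="Default")

    def poll_naive():
        # Anti-pattern: a brand-new scoped auth (and therefore a new token) for
        # every tenant on every poll; with many tenants and a 45-second poll
        # interval this adds up to counts like those in comment 2.
        for tenant in TENANTS:
            sess = session.Session(auth=_scoped_auth(tenant))
            sess.get("http://controller:8774/v2.1/servers")

    _session_cache = {}

    def poll_cached():
        # One way to avoid it: keep one session (and token) per tenant and
        # reuse it across polls until the token expires.
        for tenant in TENANTS:
            sess = _session_cache.setdefault(
                tenant, session.Session(auth=_scoped_auth(tenant)))
            sess.get("http://controller:8774/v2.1/servers")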

Comment 40 Ido Ovadia 2018-02-04 12:11:43 UTC
Verified
========
CFME 5.9.0.19 + RHOS11

Comment 41 Ido Ovadia 2018-02-04 12:14:05 UTC
(In reply to Ido Ovadia from comment #40)
> Verified
> ========
> CFME 5.9.0.19 + RHOS11

Typo fix: the verification was made on RHOS 10.

