Bug 1451122

Summary: Cloudforms causes a Token Storm on OSP10 overcloud
Product: Red Hat CloudForms Management Engine
Reporter: Vincent S. Cojot <vcojot>
Component: C&U Capacity and Utilization
Assignee: Marek Aufart <maufart>
Status: CLOSED CURRENTRELEASE
QA Contact: Ido Ovadia <iovadia>
Severity: high
Docs Contact:
Priority: urgent
Version: unspecified
CC: akaris, akrzos, cpelland, dajo, dhill, ealcaniz, gblomqui, igortiunov, ikaur, iovadia, jdennis, jdeubel, jhardy, josorior, kbasil, lmiccini, maufart, nchandek, niroy, nkinder, nlevinki, obarenbo, rspagnol, rurena, sacpatil, simaishi, srevivo, tzumainn, vcojot
Target Milestone: GA
Keywords: TestOnly, ZStream
Target Release: 5.9.0
Hardware: x86_64
OS: Linux
Whiteboard: openstack
Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1404324
: 1456021 1460318
Environment:
Last Closed: 2018-03-06 15:50:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: Openstack
Target Upstream Version:
Embargoed:
Bug Depends On: 1404324, 1469457, 1470221, 1470226, 1470227, 1470230, 1473713
Bug Blocks: 1456021, 1460318

Comment 2 Vincent S. Cojot 2017-05-15 21:03:03 UTC
Hi,
I am cloning this bug (Eng needs to fix the token_flush issue anyway) and routing it to CloudForms.
It appears that our token flush issue on OSP10 is being caused by CloudForms generating a -lot- of tokens.
Here's an example:
$ sudo mysql keystone -e 'select count(token.id) as count,name from token,local_user where token.user_id=local_user.user_id group by name order by count desc;'
+---------+-------------------------+
| count   | name                    |
+---------+-------------------------+
| 3359889 | cloudforms              |
|  115602 | neutron                 |
|   87496 | ceilometer              |
|   55042 | nova                    |
|   27849 | glance                  |
|   27281 | gnocchi                 |
|   26843 | heat                    |
|   24704 | cinder                  |
|   21809 | sevone                  |
|    7472 | admin                   |
|    3270 | swift                   |
|     928 | foobar                  |
|     366 | evillal                 |
|      59 | heat_stack_domain_admin |
|      13 | aodh                    |
|      10 | heat-cfn                |
|       8 | coombsu                 |
+---------+-------------------------+

How did CloudForms end up generating this many tokens?
Is this normal behaviour?

Comment 3 Greg Blomquist 2017-05-15 21:18:27 UTC
The only reason I can think of for generating so many tokens is a CloudForms worker dying and restarting repeatedly.  Each time the worker restarts, it might try to reconnect to OSP to gather data.

Comment 11 Tzu-Mainn Chen 2017-05-16 13:39:43 UTC
Could we also get some information as to which version of CFME is being used?

Comment 19 Tzu-Mainn Chen 2017-05-25 21:32:00 UTC
I've tested and merged Marek's PR:

https://github.com/ManageIQ/manageiq-providers-openstack/pull/45

The other one (https://github.com/ManageIQ/manageiq-gems-pending/pull/160) should not be considered part of this BZ.

In my testing, Marek's PR reduces token generation to a quarter of what it was before, which is hopefully enough to avoid the token storm issue.  Marek, are there any other optimizations that you wanted to look into?

Comment 20 Marek Aufart 2017-05-26 08:53:34 UTC
https://github.com/ManageIQ/manageiq-providers-openstack/pull/45 should be the fix on CF side.

https://github.com/ManageIQ/manageiq-gems-pending/pull/160 is more of an enhancement and refactoring with only a minor effect on this issue, so I don't think we need to backport it.

Another change that could significantly decrease the number of Keystone auth tokens is merging the EventWorkers (Cloud, Network, and Storage, if present) into one worker and distributing the captured events inside CF. That would be a bigger change to the CF codebase and providers architecture, which looks to me to be outside the scope of this BZ.
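
For illustration only (a hedged sketch, not the ManageIQ code in the PRs above): the general technique for cutting token creation is to authenticate once and reuse the resulting session for all subsequent API calls instead of re-authenticating per request. A minimal Python sketch using keystoneauth1 and novaclient, with a placeholder endpoint and credentials:

from keystoneauth1.identity import v3
from keystoneauth1.session import Session
from novaclient import client as nova_client

# Endpoint and credentials below are placeholders, not taken from this BZ.
auth = v3.Password(
    auth_url="http://controller:5000/v3",
    username="cloudforms",
    password="secret",
    project_name="admin",
    user_domain_name="Default",
    project_domain_name="Default",
)

# A single Session obtains a token on first use and keeps reusing it until
# it expires, so repeated API calls do not each create a new Keystone token.
sess = Session(auth=auth)
nova = nova_client.Client("2.1", session=sess)

# Both calls below go through the same session, hence the same token.
print(len(nova.servers.list()))
print(len(nova.flavors.list()))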

Comment 22 Rafael Urena 2017-06-01 15:56:41 UTC
The version is 5.7.0.17, if you still need it.

Comment 27 Tzu-Mainn Chen 2017-07-10 13:51:43 UTC
Ah, yep - do you have any information regarding the question I asked in https://bugzilla.redhat.com/show_bug.cgi?id=1451122#c25?

Comment 28 Rafael Urena 2017-07-10 14:01:38 UTC
I don't see comment 25.

Comment 29 Tzu-Mainn Chen 2017-07-10 14:13:04 UTC
Oh, whoops:

Thanks for the information! Just for clarification - NEC1 seems to be the environment where token generation is most under control, while MSC1, MSC2, and NEC2 seem to have unsustainable growth - does that match what you're seeing? What's the difference between these four environments?

Comment 31 Rafael Urena 2017-07-10 14:39:51 UTC
Usually if a change is made on one environment we make the change across all 4 environments. All 4 environments were set up using the same procedure and were done within a month of each other. There are 3 controllers on all 4 environments and 14 or 32 computes (NEC or MSC, respectively).

NEC1 and NEC2 are the same hardware and network setup. There are newer packages on NEC2 due to an update we performed for a security vendor operating in this environment. I believe these will be pushed to all the environments eventually. We have been collecting data on NEC1 for about a month, and we've seen it go up to a little over 3k tokens. NEC1 is the least used of the 4 environments by outside vendors.

NEC2 is being used to check for vulnerabilities and has Satellite Puppet modules installed for OpenSCAP scans.

MSC1 and MSC2 are identical hardware/network/software. The only difference I know of is that different vendors are on each environment. They work on hosting networks for wireless communications.

Comment 32 Tzu-Mainn Chen 2017-07-10 19:49:46 UTC
Thanks for the info! Looking at the logs, on MSC2 it looks like refreshes cause a small spike in tokens - maybe 9 or so - while seven tokens are generated every 45 seconds. The high frequency indicates that this has something to do with the event worker. We'll focus our efforts there.
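
As a rough aid for correlating spikes like this with refresh or event-worker activity, the per-user count from comment 2 can be sampled periodically. A sketch assuming local mysql access to the keystone database and the token table queried in comment 2:

import subprocess
import time

# Same join as the query in comment 2, limited to the cloudforms user.
QUERY = ("select count(token.id) from token, local_user "
         "where token.user_id = local_user.user_id "
         "and local_user.name = 'cloudforms';")

while True:
    count = subprocess.run(
        ["mysql", "keystone", "-N", "-e", QUERY],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(time.strftime("%H:%M:%S"), "cloudforms tokens:", count)
    time.sleep(45)  # roughly the cadence observed above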

Comment 33 Tzu-Mainn Chen 2017-07-11 21:38:02 UTC
It looks like there was an issue where certain kinds of API requests - including those for events - would result in ManageIQ looping over every single tenant and making a connection for each. https://github.com/ManageIQ/manageiq-providers-openstack/pull/62 should fix this issue.
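
To illustrate the pattern described above (a hedged sketch with placeholder names, not the actual ManageIQ code changed in the PR): creating a separately authenticated connection per tenant yields one Keystone token per tenant on every poll, while event collection only needs a single scoped session.

from keystoneauth1.identity import v3
from keystoneauth1.session import Session

def make_session(project_name):
    # Each session authenticates (and gets its own token) the first time it
    # is used, so one session per tenant means one token per tenant.
    auth = v3.Password(
        auth_url="http://controller:5000/v3",   # placeholder endpoint
        username="cloudforms",
        password="secret",                      # placeholder credentials
        project_name=project_name,
        user_domain_name="Default",
        project_domain_name="Default",
    )
    return Session(auth=auth)

tenants = ["admin", "tenant-a", "tenant-b"]      # illustrative tenant list

# Pattern described above: a separate connection per tenant, even for data
# (such as events) that does not need per-tenant scoping.
per_tenant = {name: make_session(name) for name in tenants}

# Pattern after the fix: a single scoped session reused for every request.
shared = make_session("admin")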

Comment 40 Ido Ovadia 2018-02-04 12:11:43 UTC
Verified
========
CFME 5.9.0.19 + RHOS11

Comment 41 Ido Ovadia 2018-02-04 12:14:05 UTC
(In reply to Ido Ovadia from comment #40)
> Verified
> ========
> CFME 5.9.0.19 + RHOS11

Typo fix: the verification was made on RHOS 10.