1343328 – [RFE] change Amazon cloud provider discovery process

Summary: [RFE] change Amazon cloud provider discovery process

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Providers
Sub Component:
Version:	5.6.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	GA
Target Release:	cfme-future
Assignee:	John Hardy
QA Contact:	Jiri Stefanisin
Docs Contact:
URL:
Whiteboard:	provider:cloud:discovery:ec2
Depends On:
Blocks:	1347703

Reported:	2016-06-07 07:23 UTC by Aziza Karol
Modified:	2017-10-02 12:31 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1347703
Environment:
Last Closed:	2017-08-28 14:50:10 UTC
Category:	---
Cloudforms Team:	---
Target Upstream Version:
Embargoed:

Attachments
error (21.04 KB, image/png) 2016-06-07 07:23 UTC, Aziza Karol	no flags

Description Aziza Karol 2016-06-07 07:23:55 UTC

Created attachment 1165517
error

Description of problem:


Version-Release number of selected component (if applicable):
5.6.0.0-rc2

How reproducible:
100%

Steps to Reproduce:
1. Navigate to Compute -> Clouds -> Providers.
2. From the Configuration menu, run an Amazon cloud provider discovery.

Actual results:
During Amazon provider discovery, the CFME processes constantly run at high CPU utilization. The UI becomes unresponsive and a proxy error is thrown. See the attached screenshot.
 
Expected results:


Additional info:

Comment 3 Marcel Hild 2016-06-07 14:19:48 UTC

And did you do discovery with public image discovery enabled?

Comment 4 Aziza Karol 2016-06-08 07:15:04 UTC

Marcel, I have shared the credentials by email.

I tried discovery with public image discovery enabled on 5.6.0.10 and hit the same proxy error.

Comment 5 Marcel Hild 2016-06-16 15:42:29 UTC

https://github.com/ManageIQ/manageiq/issues/9253

Comment 6 Marcel Hild 2016-06-17 07:32:13 UTC

I'm inclined to mark this as NOTABUG, because for 10+ regions and public image discovery the default appliance simply does not have enough memory.

If @gblomqui and @akrol don't step in, I'll do so :)


My findings:
When doing a cloud discovery on Amazon with get_public_images: true and instances in 10 regions, the discovery creates 10 CloudManagers + 10 NetworkManagers. Every manager starts its own RefreshWorker, which scans public_images.
RAM and swap are depleted; the UI worker gets killed because it is unresponsive, and it cannot start again because 60% of swap is in use.
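
The worker count behind these findings can be written out as a short Python sketch. The function names are hypothetical and are not part of ManageIQ; `mb_per_worker` and `available_mb` are caller-supplied placeholders, since no per-worker memory figure was measured in this report.

```python
def managers_created(regions_with_instances):
    """Each region with instances yields one CloudManager and one NetworkManager."""
    return 2 * regions_with_instances


def refresh_workers_started(regions_with_instances):
    """Every manager starts its own RefreshWorker, so the counts are equal."""
    return managers_created(regions_with_instances)


def fits_in_memory(regions_with_instances, mb_per_worker, available_mb):
    """Rough capacity check. Both memory arguments are assumptions supplied
    by the caller, not values taken from the appliance."""
    needed_mb = refresh_workers_started(regions_with_instances) * mb_per_worker
    return needed_mb <= available_mb
```

For the 10-region case described above, `refresh_workers_started(10)` gives 20 RefreshWorkers starting at roughly the same time.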

Because public images are not necessarily available in all regions, we cannot push the refresh of those up to, say, an Amazon provider.

We could:

1. wait until skeletor refresh is ready, so we can spread the load better - @Fryguy ?
2. renice RefreshWorkers or increase the timeout for the UiWorker being killed - @jrafanie ?
I'm not sure if this might be rather an edge case and we'd say, if you want public images, then you have to have more memory.

I'd say we go for 1. and consider it a rare scenario

This was on an appliance with 8 GB RAM / 4 cores.

Comment 8 Dave Johnson 2016-06-22 21:00:43 UTC

I spoke to Brad Ascar and Dan Clarizio about this one. We have to change the process and, at a minimum, communicate a warning that this requires more than one appliance to accommodate such a workload. Dan proposed possibly adding a list of regions with checkboxes so that users could select only the regions they care about.
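
A minimal Python sketch of the checkbox idea, assuming discovery returns a list of region names and the UI hands back the ones the user ticked. The function name `regions_to_manage` is hypothetical and does not exist in the product.

```python
def regions_to_manage(discovered, selected):
    """Keep only the discovered regions the user selected,
    preserving the order in which discovery reported them."""
    chosen = set(selected)
    return [region for region in discovered if region in chosen]
```

Providers would then be created only for the returned regions, so regions the user did not select never start any workers.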

Comment 9 Marcel Hild 2016-07-22 13:21:07 UTC

To summarize briefly:
The problem with provider discovery is that it creates and enables all discovered providers. This means each one has its workers started immediately. This can drain resources (mainly memory) if it discovers a lot of providers. This is mostly the case for AWS users who have VMs or templates in a lot of regions.

DanC, AdamG, and I discussed several solutions to this and found the following to be the most appropriate.

The discovery should create the providers with valid authentication but mark them as disabled. The user can then manually enable each discovered provider. The process could even be smart and enable all providers if there are only as many as the setup can handle.

For this we would need the UI to visualize a disabled provider and have a toggle to enable/disable them. The provider side would need to implement that toggle as well.
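
The proposal above could be sketched roughly as follows, in Python for illustration only. The `Provider` class, the function names, and the `max_active` limit are all hypothetical; in particular, how the setup's capacity would be determined is left open in this discussion.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Hypothetical stand-in for a discovered provider record."""
    region: str
    enabled: bool = False  # a disabled provider starts no workers


def create_discovered_providers(regions, max_active):
    """Create one provider per region, disabled by default.

    If the number of discovered providers does not exceed `max_active`
    (an assumed capacity limit for the setup), enable all of them.
    """
    providers = [Provider(region=r) for r in regions]
    if len(providers) <= max_active:
        for provider in providers:
            provider.enabled = True
    return providers


def set_enabled(provider, enabled):
    """The enable/disable toggle that the UI would call."""
    provider.enabled = enabled
    return provider
```

With three discovered regions and `max_active=2`, all three providers stay disabled until the user enables them one by one through `set_enabled`.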

This would not only solve the problem but also add a new feature to the product. Right now we have no way to pause the workers for a provider. The only way would be to add invalid credentials, but even that has been prohibited by the UI.
So if a customer has a reason to pause, such as a maintenance window, they have no option for that.

@jhardy your choice, Mr PM :)

Comment 12 Chris Pelland 2017-08-28 14:50:10 UTC

This bug has been open for more than a year and is assigned to an older release of CloudForms. 

If you would like to keep this Bugzilla open and if the issue is still present in the latest version of the product, please file a new Bugzilla which will be added and assigned to the latest release of CloudForms.

Comment 13 Marcel Hild 2017-08-29 07:32:18 UTC

Brad,
now that we have the ability to pause a provider, would it make sense to move this forward?
I think an easy solution would be to create paused providers during the discovery process. This would reduce the chance of having too many active providers consuming too much memory.
