Created attachment 1165517 (error)
Description of problem:
Version-Release number of selected component (if applicable): 5.6.0.0-rc2
How reproducible: 100%
Steps to Reproduce:
1. Navigate to Compute -> Clouds -> Providers
2. From Configuration, run an Amazon cloud provider discovery
3.
Actual results: During Amazon provider discovery, the CFME processes run at constantly high CPU utilization. The UI becomes unresponsive and a proxy error is thrown. See attached screenshot.
Expected results:
Additional info:
And did you do discovery with public image discovery enabled?
Marcel, I have shared the credentials by email. I tried discovery with public image discovery enabled on 5.6.0.10 and hit the same proxy error.
https://github.com/ManageIQ/manageiq/issues/9253
I'm inclined to mark this as NOTABUG, because for 10+ regions with public image discovery the default appliance simply does not have enough memory. If @gblomqui and @akrol don't step in, I'll do so :)
My findings: When doing a cloud discovery on Amazon with get_public_images: true and instances in 10 regions, the discovery creates 10 CloudManagers + 10 NetworkManagers. Every manager starts its own RefreshWorker, which scans public_images (see here). RAM and swap are depleted, the UI worker gets killed because it's unresponsive, and it cannot start again because 60% of swap is in use.
Because public images are not necessarily available in all regions, we cannot push the refresh of those up to, say, an Amazon provider.
We could:
1. wait until skeletor refresh is ready, so we can spread the load better - @Fryguy ?
2. renice RefreshWorkers or increase the timeout for the UiWorker being killed - @jrafanie ?
I'm not sure whether this is rather an edge case where we'd say: if you want public images, you have to have more memory. I'd say we go for 1. and consider it a rare scenario on an appliance with 8 GB RAM / 4 cores.
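To make the scaling in the findings above concrete, here is a minimal sketch of the arithmetic. It is an illustration only: the function names are hypothetical and are not ManageIQ code, and the per-worker memory figure is left as a parameter because no measured value is given in this report.

```python
def refresh_workers_started(regions: int, managers_per_region: int = 2) -> int:
    """Count the refresh workers started by a discovery.

    Per the findings above, each discovered region gets one CloudManager
    and one NetworkManager, and every manager starts its own RefreshWorker.
    """
    return regions * managers_per_region


def fits_in_memory(regions: int, worker_mb: int, available_mb: int) -> bool:
    """Return True if all refresh workers fit into the available memory.

    worker_mb is an assumed per-worker footprint supplied by the caller;
    it is not a measured value from this bug.
    """
    return refresh_workers_started(regions) * worker_mb <= available_mb


# 10 regions lead to 20 refresh workers starting at the same time.
print(refresh_workers_started(10))
```

The point is only that the worker count grows with twice the number of regions, so any fixed memory budget is exceeded once enough regions are discovered.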
I spoke to Brad Ascar and Dan Clarizio about this one. We have to change the process and, at a minimum, communicate a warning that this requires more than one appliance to accommodate such a workload. Dan proposed possibly adding a list of regions with checkboxes, so the user can select only the regions they care about managing.
To summarize briefly: The problem with provider discovery is that it creates and enables all discovered providers. This means every one of them has its workers started immediately. This can drain resources (mainly memory) if it discovers a lot of providers. Mostly this is the case for AWS users who have VMs or templates in a lot of regions.
DanC, AdamG, and I discussed several solutions to this and found the following to be the most appropriate. The discovery should create the providers with valid authentication but mark them as disabled. The user can then manually enable each discovered provider. The process could even be smart and enable all providers automatically if there are few enough for the setup to handle. For this we would need the UI to visualize a disabled provider and offer a toggle to enable/disable it. The provider side would need to implement that toggle as well.
This would not only solve the problem but also add a new feature to the product. Right now we have no way to pause the workers for a provider. The only way would be to enter invalid credentials, but even that has been prohibited by the UI. So if a customer has a reason to pause, such as a maintenance window, they have no option for that.
@jhardy your choice, Mr PM :)
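The proposal above could be sketched roughly as follows. This is a hypothetical model, not the ManageIQ implementation: DiscoveredProvider, create_discovered_providers, and max_active are names invented for illustration, and how the "can be handled by the setup" limit would actually be determined is left open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscoveredProvider:
    """Hypothetical stand-in for a provider record created by discovery."""
    region: str
    enabled: bool = False  # created disabled, so no workers start yet


def create_discovered_providers(regions: List[str], max_active: int) -> List[DiscoveredProvider]:
    """Create one disabled provider per region.

    "Smart" variant from the comment above: if the whole set is small
    enough for the setup to handle (max_active, an assumed limit), enable
    all of them; otherwise leave every provider disabled so the user can
    enable them one by one.
    """
    providers = [DiscoveredProvider(region=r) for r in regions]
    if len(providers) <= max_active:
        for provider in providers:
            provider.enabled = True
    return providers


# With more regions than the setup can handle, nothing is enabled.
found = create_discovered_providers(["us-east-1", "eu-west-1", "ap-south-1"], max_active=2)
print([p.enabled for p in found])
```

Leaving everything disabled when the limit is exceeded, rather than enabling the first few, keeps the choice of which regions matter with the user.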
This bug has been open for more than a year and is assigned to an older release of CloudForms. If you would like to keep this Bugzilla open and if the issue is still present in the latest version of the product, please file a new Bugzilla which will be added and assigned to the latest release of CloudForms.
Brad, now that we have the ability to pause a provider, would it make sense to move this forward? I think an easy solution would be to create paused providers during the discovery process. This would reduce the chance of having too many active providers consuming too much memory.