Bug 1662972

Summary:	Provision dialog for ec2 with public images is broken after selecting image - first step
Product:	Red Hat CloudForms Management Engine	Reporter:	Matouš Mojžíš <mmojzis>
Component:	Appliance	Assignee:	Nick LaMuro <nlamuro>
Status:	CLOSED DEFERRED	QA Contact:	Sudhir Mallamprabhakara <smallamp>
Severity:	high	Docs Contact:	Red Hat CloudForms Documentation <cloudforms-docs>
Priority:	high
Version:	5.10.0	CC:	abellott, bmidwood, dmetzger, gtanzill, hkataria, lavenel, mfeifer, mpovolny, nlamuro, obarenbo, simaishi
Target Milestone:	GA	Flags:	mfeifer: mirror+
Target Release:	5.11.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:	ui:ec2
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-06-23 18:41:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	CFME Core	Target Upstream Version:
Embargoed:

Description Matouš Mojžíš 2019-01-02 15:09:46 UTC

Description of problem:
It looks the same as in this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1588082
Another problem with this dialog was fixed in BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1610927

Version-Release number of selected component (if applicable):
5.10.0.28

How reproducible:
Always

Steps to Reproduce:
1. Add ec2 provider with region us-east-1 and public images enabled 
2. Wait for full refresh
3. Go to Compute -> Cloud -> Instances
4. Select provision an instance
5. Select an image and go to the next step

Actual results:
When I select an image and then click Continue, I get "Error requesting data from server", and evm.log shows:
[----] E, [2018-12-11T09:58:27.492295 #13082:10d2f90] ERROR -- : MIQ(MiqServer#monitor) can't modify frozen Hash
[----] E, [2018-12-11T09:58:27.492440 #13082:10d2f90] ERROR -- : [RuntimeError]: can't modify frozen Hash  Method:[block (2 levels) in <class:LogProxy>]
[----] E, [2018-12-11T09:58:27.492501 #13082:10d2f90] ERROR -- : /opt/rh/cfme-gemset/gems/activerecord-5.0.7.1/lib/active_record/attribute_set.rb:53:in `write_from_user'

Expected results:


Additional info:

Comment 6 Marianne Feifer 2019-01-03 14:37:33 UTC

Please retry with latest build

Comment 7 Nick LaMuro 2019-01-03 20:17:47 UTC

I have a replicated environment to work with, and wanted to give an update on what I have found so far.

Two corrections/notes regarding the original description:

* The error quoted in the description is a red herring and unrelated. The log to look at is `production.log`, not `evm.log`.
* The request being triggered is actually timing out, so the UI error "Error requesting data from server" is just a symptom of that timeout.


After determining the bug was something else, I decided to tail the specific request that was being made in the last reproduction step by doing the following:

```
[root@dhcp-8-198-2 vmdb]# tail -f log/production.log | grep pre_prov
[----] I, [2019-01-03T14:50:53.014387 #20149:486e828]  INFO -- : Started POST "/vm_cloud/pre_prov?button=continue" for 127.0.0.1 at 2019-01-03 14:50:53 -0500
[----] I, [2019-01-03T14:50:53.806026 #20149:486e828]  INFO -- : Processing by VmCloudController#pre_prov as JS
```

I noticed the PID stopped reporting after that.  I then watched the PID in `top` and saw its memory spike quite quickly, up to 1.5 GB, before the process finally died or was killed off.  I did a quick profile of a different PID making the same request using `rbspy`, and it seems to be running an ActiveRecord query nearly the entire time, so I will be looking into that.
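
For anyone retracing this, the "PID stopped reporting" check can be scripted.  Below is a minimal Ruby sketch, assuming the log line layout shown in the excerpt above and the usual Rails `Started ...` / `Completed ...` request lines logged from the same PID and thread.  The helper name `stalled_requests` is hypothetical and not part of ManageIQ.

```ruby
# Sketch only: finds requests that logged "Started" but never "Completed".
# Assumes the log layout from the excerpt above:
#   [----] I, [<timestamp> #<pid>:<tid>]  INFO -- : <message>
LINE_RE = /\[(?<time>\S+) #(?<pid>\d+):(?<tid>\h+)\]\s+\w+ -- : (?<msg>.*)\z/

# stalled_requests is a hypothetical helper, not part of ManageIQ.
def stalled_requests(lines)
  open_requests = {}
  lines.each do |line|
    m = LINE_RE.match(line.chomp)
    next unless m # skip lines that do not follow the assumed layout

    key = "#{m[:pid]}:#{m[:tid]}"
    if m[:msg].start_with?("Started ")
      # Remember the request until the same PID:thread reports completion.
      open_requests[key] = { pid: m[:pid].to_i, time: m[:time], request: m[:msg] }
    elsif m[:msg].start_with?("Completed ")
      open_requests.delete(key)
    end
  end
  open_requests.values
end
```

Calling `stalled_requests(File.foreach("log/production.log"))` would list the requests that were still open at the end of the log, which is a starting point for picking a PID to watch in `top`.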


* * *


This seems to be an issue with a bad query.  I suspect it is specific to the AWS provider, but I will have to dig further into the profile data to know more.  I will update again when I find out more.

Comment 11 Nick LaMuro 2019-01-14 16:55:25 UTC

Update:

I have created three patches to address this issue:

- https://github.com/ManageIQ/manageiq/pull/18353
- https://github.com/ManageIQ/manageiq-schema/pull/322
- https://github.com/ManageIQ/manageiq/pull/18354


The first is a collection of isolated fixes that are relatively noninvasive.  This patch speeds up the request by about 50% and cuts memory usage in half.  Unfortunately, that only brings it down to 1.5 GB of memory in my tests, which is still within the threshold for the process being killed on the reference appliance.
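
For a rough local comparison of a code path before and after a change like this, something along the following lines could be used.  It is only a sketch: `measure` is a hypothetical helper, it counts allocated Ruby objects rather than process memory as seen in `top`, and it is not how the figures in this comment were produced.

```ruby
# Sketch only: times a block and counts the Ruby objects it allocates.
# measure is a hypothetical helper, not part of ManageIQ.
def measure
  GC.start # start from a reasonably clean heap
  allocated_before = GC.stat(:total_allocated_objects)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  result = yield

  seconds = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  allocated = GC.stat(:total_allocated_objects) - allocated_before
  { result: result, seconds: seconds, allocated_objects: allocated }
end
```

Wrapping the old and new implementations in `measure { ... }` and comparing `seconds` and `allocated_objects` gives a first indication of whether a change helps, before confirming on a real appliance.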

The next two patches are a bit more involved: the first of them is a migration to add some caching columns, and the second implements the changes needed to get the maximum benefit from those new columns.  This brings the request to under ten seconds, with a memory footprint of around 200 MB above idle.  While this is a much better scenario, the patches require some significant changes to take full effect, and may or may not be desired.
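
To illustrate the general idea behind a caching column (this is not the content of the patches), here is a small plain-Ruby sketch.  `Template`, `Disk`, and `cached_disk_total` are hypothetical names chosen for the example.  A derived value is stored when the record changes, so the read path can filter on it without loading the associated records.

```ruby
# Illustration of the general "caching column" idea only. Template, Disk and
# cached_disk_total are hypothetical names and do not come from the patches.
Disk = Struct.new(:size)

Template = Struct.new(:name, :disks, :cached_disk_total) do
  # Slow path: walk every associated record each time the value is needed.
  def disk_total
    disks.sum(&:size)
  end

  # Write path: refresh the stored value whenever the disks change.
  def refresh_cache!
    self.cached_disk_total = disk_total
    self
  end
end

# Read path: uses the stored value and never touches the associated records.
def templates_over(templates, minimum)
  templates.select { |t| t.cached_disk_total >= minimum }.map(&:name)
end
```

The trade-off is the usual one for denormalized data: reads get cheaper, but every write path has to keep the cached value in sync, which is part of why such a change is more invasive.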



-Nick

Comment 13 Nick LaMuro 2019-08-19 21:58:04 UTC

https://github.com/ManageIQ/manageiq/pull/18353 has been merged, waiting for backport.

Comment 14 CFME Bot 2019-08-23 18:41:04 UTC

New commit detected on ManageIQ/manageiq/ivanchuk:

https://github.com/ManageIQ/manageiq/commit/1a1a2147ce289159621d158e40ce610625cb6603
commit 1a1a2147ce289159621d158e40ce610625cb6603
Author:     Greg McCullough <gmccullo>
AuthorDate: Mon Aug 19 17:55:51 2019 -0400
Commit:     Greg McCullough <gmccullo>
CommitDate: Mon Aug 19 17:55:51 2019 -0400

    Merge pull request #18353 from NickLaMuro/miq_provision_virt_workflow_better_allowed_templates

    Performance improvements to MiqProvisionVirtWorkflow#allowed_templates

    (cherry picked from commit e1ed394bf103e5dd921748b2ee5508e6bd1a454e)

    https://bugzilla.redhat.com/show_bug.cgi?id=1662972

 app/models/miq_provision_virt_workflow.rb | 78 +-
 1 file changed, 55 insertions(+), 23 deletions(-)

Comment 15 Matouš Mojžíš 2019-09-05 17:03:54 UTC

Nick,

I tried to verify this in 5.11.0.22 and I got the same issue when selecting images, except that I don't get any error and there is no error in the logs.
I was able to select an image once, after not doing anything on the appliance for an hour.

Comment 16 Nick LaMuro 2019-09-11 18:11:01 UTC

Alright, I will try and take a look at this later today, but a PR was created recently that addressed a bug I caused:

https://github.com/ManageIQ/manageiq/pull/19237


So I am unsure whether that was part of the issue you were seeing or not.


* * *



That said, the change that did get merged for this bug only includes some of the fixes I proposed.  As a result, this is still not a complete solution and there is still a decent amount of memory bloat.  We will most likely look at addressing this further in the next release.