Bug 1571223

Summary:

[upstream][v2v] Manage IQ performs slowly over remote site

Product:

Red Hat CloudForms Management Engine

Reporter:

Mor <mkalfon>

Component:

Performance

Assignee:

Martin Hradil <mhradil>

Status:

CLOSED ERRATA

QA Contact:

Yadnyawalk Tale <ytale>

Severity:

high

Docs Contact:

Priority:

high

Version:

5.9.0

CC:

bthurber, cbudzilo, cpelland, dagur, dmetzger, hkataria, istein, kbrock, lavenel, mhradil, mlehrer, mpovolny, mshriver, obarenbo, simaishi, smallamp

Target Milestone:

Keywords:

Reopened

Target Release:

5.10.0

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

v2v

Fixed In Version:

5.10.0.22

Doc Type:

If docs needed, set a value

Doc Text:

This release of Red Hat CloudForms implements a UI optimization which improves performance when using Chrome as the browser accessing an appliance.

Story Points:

---

Clone Of:

Environment:

Last Closed:

2019-02-07 23:01:43 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

CFME Core

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
environment 1 - evm.log	none
environment 2 - evm.log	none
First evm.log of 2nd environment	none
evm full logs (from clean install)	none
works in chrome 67	none
works in firefox 60	none
screenshots for 10.12.69.26	none

Description Mor 2018-04-24 10:55:11 UTC

Description of problem:
Upon installing ManageIQ from upstream on a VM, the engine UI, specifically menu item navigation, performs very slowly. After a while (probably hours), it starts to behave normally.

VM (appliance) specification: 16 GB of RAM, 4 cores, one 66 GB disk. Default installation procedure, running on oVirt 4.2.2.

We didn't notice this issue on former versions that we worked on.

Version-Release number of selected component (if applicable):
MIQ master.20180417121918_2dd64f8 

How reproducible:
100% - reproduced on two different environments.

Steps to Reproduce:
1. Install MIQ using ansible playbook.

Actual results:
MIQ performance is slow.

Expected results:
Should behave as normal.

Additional info:

Comment 2 Mor 2018-04-24 11:01:44 UTC

Created attachment 1425947
environment 1 - evm.log

Comment 3 Mor 2018-04-24 11:02:15 UTC

Created attachment 1425949
environment 2 - evm.log

Comment 4 dmetzger 2018-04-24 13:52:39 UTC

Closing this ticket as it is for ManageIQ; BZ tickets are for CloudForms. Please open a GitHub issue instead.

In the meantime I've added this issue to the performance team whiteboard.

Comment 8 Daniel Gur 2018-04-25 09:30:26 UTC

Removing Need Info as this bug is already closed.

Comment 9 Daniel Gur 2018-04-25 11:06:40 UTC

Reopening, as this is a high-priority V2V-related bug. It is actually slowing the V2V workflow and our testing efforts.

We did not have such issues in CFME 5.9 or other CFME versions.
MIQ is usually not tested by QE, only for this V2V effort. That's why we are only seeing it now.

Comment 10 dmetzger 2018-04-25 12:10:51 UTC

Daniel,

I closed this ticket (re-closing) because we do not track ManageIQ issues via Bugzilla. We only track CFME issues/bugs via Bugzilla, which is why I requested the issue be raised on the upstream Github for ManageIQ (mkalfon thanks for doing that). 

We are actively investigating this issue and will update the Github issue going forward.

If you reproduce this behavior with a CFME appliance, file a ticket for that.

Comment 11 Ilanit Stein 2018-04-26 06:48:54 UTC

Created attachment 1426997
First evm.log of 2nd environment

Comment 12 Daniel Gur 2018-04-26 06:57:50 UTC

Hello Dennis, you are right: in the usual case we do not track ManageIQ issues via Bugzilla.
But for the V2V initiative that is currently in progress, the instructions were changed. As V2V is currently supported only by MIQ, QE was specifically instructed to open Bugzilla tickets for such bugs and mark them V2V in the Keywords.
I will forward you the mail from Brett Thurber with those instructions.

There are many more bugs QE opened on V2V upstream lately and they are all handled, this one is no different from them.
Check this filter : https://bugzilla.redhat.com/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=POST&bug_status=MODIFIED&bug_status=ON_DEV&bug_status=ON_QA&bug_status=VERIFIED&bug_status=RELEASE_PENDING&bug_status=CLOSED&list_id=8740349&query_format=advanced&short_desc=%5Bupstream%5D%5Bv2v%5D&short_desc_type=allwordssubstr

To avoid further ping-pong of closing and reopening the bug, please contact me if you have any questions before closing it.

Daniel.

Comment 13 Mor 2018-04-26 09:49:27 UTC

Created attachment 1427105
evm full logs (from clean install)

Comment 14 Mor 2018-04-26 09:50:54 UTC

I have included old logs from this setup; please let me know if you need more logs. The entire log set in a compressed archive takes about 60 MB, which is more than can be attached here.

Comment 15 dmetzger 2018-04-26 12:26:26 UTC

A few questions and a request:

1. Can you provide an archive log set from the appliance(s) encountering the issue.
2. Which provider(s) are being managed by the appliance(s)?
3. What is the inventory size of the provider(s) being managed?
4. Have the provider(s) being used changed in the time the performance has been observed to slow down? "Changed" includes different providers or a significant change in the inventory size of the provider(s).
5. Are all pages / menu "slow", or is there a specific set of pages / menus?
6. Can you provide a DB dump of the appliance DB?

Comment 16 Mor 2018-04-26 13:29:59 UTC

1. See the latest attached archive, it contains the log from the first run of MIQ (after the clean install).
2. RHV 4.2 and VMware 6.5.
3. In RHV - we manage 6 clusters, 407 hosts, 8 data-stores, 4045 VMs. In VMware 4 hosts, around 15 VMs.
4. No the provider(s) did not change. The issue appears right after you run MIQ on a clean install.
5. All of them are slow; it feels like general slowness no matter what you do or enter in the UI (including login).
6. Can you tell me how to provide that dump?

Comment 17 dmetzger 2018-04-26 14:04:34 UTC

Regarding requested logs, I'm looking for the full archive log set (created by the CEE script https://github.com/redhat-cfme-support/cloudforms_4.X_log_collection/blob/master/collect_CFME_archive_script.sh) or similar, which provides a more comprehensive set of logs / view of the appliance.

I use pg_dump to create a DB dump:

pg_dump -U root -h localhost -F custom -f DumpFileName  vmdb_production

Comment 26 CFME Bot 2018-05-03 20:17:13 UTC

https://github.com/ManageIQ/manageiq-appliance/pull/191

Comment 27 CFME Bot 2018-05-03 22:02:23 UTC

New commit detected on ManageIQ/manageiq-appliance/master:

https://github.com/ManageIQ/manageiq-appliance/commit/7bd338eea1f2c609ccb0f636a592ea93562ab1c1
commit 7bd338eea1f2c609ccb0f636a592ea93562ab1c1
Author:     Keenan Brock <keenan>
AuthorDate: Thu May  3 13:46:33 2018 -0400
Commit:     Keenan Brock <keenan>
CommitDate: Thu May  3 13:46:33 2018 -0400

    Add cache headers for webpack

    We convert ETags to max-age for asset pipeline assets
    ETag:
     - the server needs to calculate the checksum
     - the client needs to ask the server if it has changed
       incurring the cost of an HTTP request, but not transferring the assets
    max-age:
     - the server just sends the current date plus one year
     - the assets have a checksum in the filename. so a file will never change
     - the client does not need to ask the server if a date has changed
       no HTTP request

    Chrome seems to mostly ignore cache headers, but firefox better follows
    the contract. For assets using ETag, firefox is hitting the server checking
    assets often.

    We were using max-age for asset pipeline, but etags for webpack assets.

    Now, we are using max-age for both.
    In addition, we are now setting Cache-control: public stating
    proxy servers can cache these assets

    Changing this value reduces the requests made to the server, especially
    for firefox.
    This makes most impact for users with a higher latency connection

    https://bugzilla.redhat.com/show_bug.cgi?id=1571223

 COPY/etc/httpd/conf.d/manageiq-https-application.conf | 9 +
 1 file changed, 9 insertions(+)
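
A minimal sketch of the header policy described in the commit message above, written in Python rather than in the Apache configuration the commit actually changed. The filename pattern, the function name, and the `no-cache` fallback for non-fingerprinted files are assumptions for illustration; only the one-year max-age and `Cache-Control: public` come from the commit message.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60

# Assumption: fingerprinted assets carry a long hex checksum before the extension,
# e.g. application-<checksum>.js, so their content never changes under that name.
FINGERPRINTED = re.compile(r"-[0-9a-f]{20,}\.(?:js|css|png)$")


def cache_headers(path, now):
    """Return response headers for a static asset requested at time `now` (UTC)."""
    if FINGERPRINTED.search(path):
        # max-age: the client may reuse the file for a year without asking the server.
        expires = now + timedelta(seconds=ONE_YEAR_SECONDS)
        return {
            "Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}",
            "Expires": format_datetime(expires, usegmt=True),
        }
    # Hypothetical fallback: force revalidation (the ETag-style round trip).
    return {"Cache-Control": "no-cache"}


example = cache_headers(
    "/assets/application-82b76a4b24285414b6b3e9bbc645af16dc593370c8ec31a007c74d410dd70c8a.js",
    datetime(2018, 5, 3, tzinfo=timezone.utc),
)
```

A file without a checksum in its name, such as custom.css, falls through to the revalidation branch in this sketch.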

Comment 31 Keenan Brock 2018-05-07 14:30:28 UTC

The perf_primed_cache.png is the issue. It shows 10 pages being served from cache for me.

Also, custom.css - that file is one of the slowest files for me. We need to add a location or something for that file. It will probably only be changed once by the customer, but tricky to know when

Comment 32 Mor 2018-06-18 20:39:09 UTC

Issue still very relevant on CFME 5.10.0.0.20180613200131_887cc81

Comment 33 Mor 2018-06-20 09:34:14 UTC

Updating the status (if it's not clear yet): I'm browsing the interface from TLV, while the appliance is in RDU (remote site). When I work from within RDU, the performance improves significantly.

Comment 36 Keenan Brock 2018-06-20 16:35:59 UTC

I would like to restate the problem:

When a client accesses a new server, the client does not have the css/js resources cached. So this will download all the resources.
This BZ describes a new server, which falls into the above description.

So this will affect any new client, regardless of how long the server has been running.

For some reason, it takes the client a long time to determine which resources can be cached. Most pages require a dozen HTTP requests, and most of those requests result in downloading data. The HTML page (Rails response) tends to be only 10k, but the extra requests for static resources (mostly css and js files, though png files are included) take over 500mb.

For local connections, 0.6GB per page is a little slow, but for a remote office, this takes a long time to download and is very susceptible to VPN congestion.

Firefox (especially on Fedora) seems to properly respect headers, while Chrome is more lenient. This results in Fedora downloading all the resources over and over again while Chrome downloads fewer resources. Firefox ends up being slow.

The solution to this problem seems to be a typical rails / httpd server optimization:
- reduce the number of assets needed. (Since we use webpack and asset pipeline, we have a bunch of duplicates here)
- change headers to return long running timeouts instead of using etag where possible.
- custom css (~100 bytes) takes up a large amount of time for local machines, which is due to not being able to properly set the header names. Finding another way to bundle this with the other css files would gain a lot, but this task is potentially tricky because of the need to educate/walk customers through the bundling process.

In the previous PRs, I did change the product to use long page cache timeouts for webpack files - but it apparently hasn't made enough of a difference.

Throughput AND latency will both be a factor here, since we are requesting many resources and downloading JavaScript that is very large.

I think figuring out headers would be a bigger payoff than reducing the size of the javascript files. Although, it does look like we have quite a large amount of javascript to be downloaded.

Comment 37 Keenan Brock 2018-06-21 16:51:43 UTC

ugh, so sorry, the units in my previous comment were completely wrong.

login page

file | count / mb | requests / cached
-----|------------|------------
js   | 5 / 10.2   | 5 / 0
css  | 2 /  0.5   | 2 / 0.2kb
png  | 4 /  0.1   | 4 / 0
html | 1 /  4k    | 1 / 4k

ems_infra (empty)

file | count / mb | requests / cached
-----|------------|------------
js   | 5 / 10.2   | 5 / 0
css  | 2 /  0.5   | 2 / 0.2kb
png  | 4 /  0.1   | 4 / 0
html | 4 /  0.06  | 1 / 4k
xhr  | 3 /  0.004 | 3 / 4k

note: the apache logs seem to suggest some of these files are not downloaded.

QUESTION:
Mor, Please confirm that this is still an issue with an upstream master appliance.
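
As a quick cross-check of the login page table above, the per-type counts and sizes can be summed. This is only arithmetic on the numbers already listed; treating "4k" as 0.004 MB is an assumption about the units.

```python
# (request count, size in MB) per asset type, copied from the login page table.
LOGIN_PAGE = {
    "js": (5, 10.2),
    "css": (2, 0.5),
    "png": (4, 0.1),
    "html": (1, 0.004),  # assumption: "4k" read as 0.004 MB
}


def totals(page):
    """Return (total requests, total MB rounded to 3 decimals) for one page."""
    requests = sum(count for count, _ in page.values())
    megabytes = sum(size for _, size in page.values())
    return requests, round(megabytes, 3)


login_requests, login_megabytes = totals(LOGIN_PAGE)
```

This gives 12 requests and about 10.8 MB for an uncached login page, almost all of it JavaScript.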

Comment 39 Dan Clarizio 2018-06-21 19:35:17 UTC

I will defer to MartinH as to our current efforts.  Other than that, if we want to prioritize some UI optimization for the next release, I'm sure we could come up with some things to research and implement.

Martin, please re-read through some of the comments above and see if there is something we can do to correct the caching of the assets in the various browsers.  Thx, Dan

Comment 40 Mor 2018-06-22 06:55:17 UTC

This issue is still relevant on upstream master.20180619230249_84e9fa9.

The JS asset (../assets/application-82b76a4b24285414b6b3e9bbc645af16dc593370c8ec31a007c74d410dd70c8a.js) ~7.5MB in size is downloaded every time I switch between menu items. It takes ~13 sec to complete over a good VPN tunnel connection (bandwidth and latency). I haven't checked on TLV locally, but I assume that this asset still takes the most to download when switching menu items. So if we can define cache for this asset, we will gain significant improvement.
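
For scale, the figures in this comment (about 7.5 MB in about 13 seconds) imply an effective throughput that can be worked out directly. A small sketch, assuming decimal megabytes and ignoring connection setup time:

```python
def effective_throughput(size_mb, seconds):
    """Return (MB per second, Mbit per second) for one transfer."""
    mb_per_s = size_mb / seconds
    return mb_per_s, mb_per_s * 8


# Numbers taken from the comment above: ~7.5 MB asset, ~13 s over the VPN.
mb_per_s, mbit_per_s = effective_throughput(7.5, 13)
```

That is a little under 0.6 MB/s (roughly 4.6 Mbit/s), so every menu switch that misses the cache costs the full 13 seconds at this rate.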

Comment 41 Martin Hradil 2018-06-22 11:54:27 UTC

Mor... are you sure your cache is not disabled when looking at the network tab?

Or, are any non-manageiq proxies involved?



I can confirm I'm seeing the same JS asset being downloaded and taking 7.5 MB.
(The same asset hash, but the appliance version is  5.10.0.1.20180619163011_900fdc4 in my case.)



When going to any other menu item, that same asset gets served from memory cache on chrome.

The same is happening on firefox, the file gets served from cache the first time I change menu items and on any subsequent try.


That's firefox 60 and chromium 67.


So, from what I can tell, this is already fixed.
Attaching screenshots, maybe somebody can tell me what I'm doing wrong.

Comment 42 Martin Hradil 2018-06-22 11:55:13 UTC

Created attachment 1453716
works in chrome 67

Comment 43 Martin Hradil 2018-06-22 11:55:39 UTC

Created attachment 1453717
works in firefox 60

Comment 44 Mor 2018-06-23 20:28:22 UTC

5.10 is different from what we run. We run on MIQ master. I'm running Firefox 60, and I still see the JS being downloaded. Can you please try to login to the server and check? (see comment #35 for hostname).

Comment 45 Martin Hradil 2018-06-25 11:33:05 UTC

OK, tried on 10.12.69.26.

Please note that that machine's version is 5.9.3.2.20180619200710_4f909bc which does not match the "master" claim.

---

Observing the same result in both chrome and firefox, attached screenshots for both.
(The only thing I notice: firefox outputs [full size / transferred size], so 10.44 MB / 627.18 KB still looks like a lot, until you only read the second number.)

Comment 46 Martin Hradil 2018-06-25 11:36:02 UTC

Created attachment 1454327
screenshots for 10.12.69.26

Comment 47 Mor 2018-06-26 04:05:35 UTC

(In reply to Martin Hradil from comment #45)
> OK, tried on 10.12.69.26.
> 
> Please note that that machine's version is 5.9.3.2.20180619200710_4f909bc
> which does not match the "master" claim.
> 
> ---
> 
> Observing the same result in both chrome and firefox, attached screenshots
> for both.
> (The only thing I notice, firefox outputs [full size / transfered size], so
> 10.44 MB / 627.18 KB still looks like a lot, until you only read the second
> number.)

Sorry, the server name was wrong. 

This is the correct FQDN:
https://manageiq.rhev.openstack.engineering.redhat.com/

Please give it a try

Comment 48 Martin Hradil 2018-06-26 13:43:17 UTC

OK, I tried on that machine, and I'm still observing that the second request is served from the cache.

But: having waited a while after that and trying yet another menu item, I can indeed see that the file was downloaded again..

But seeing this only in firefox, not chrome.


So.. yes, I can see there's a bug somewhere.

But.. IMO the cache headers are correct, and firefox is simply choosing to aggressively re-download.

(Cache-Control comes straight from the MDN guide, and Expires is supposed to be ignored when max-age is provided)


If that's so, maybe the solution could be simple: we could add `immutable` to the Cache-Control header, since these files will never change anyway.
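
The precedence mentioned above (max-age wins over Expires) can be sketched as a small freshness calculation. This is an illustrative Python helper, not code from the product or from any browser; the function name and the simplified directive parsing are assumptions.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return how long a response may be reused, in seconds, or None if unstated."""
    directives = {}
    for part in headers.get("Cache-Control", "").split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')
    # max-age takes precedence; Expires is only consulted when max-age is absent.
    if "max-age" in directives:
        return int(directives["max-age"])
    if "Expires" in headers and "Date" in headers:
        delta = parsedate_to_datetime(headers["Expires"]) - parsedate_to_datetime(headers["Date"])
        return max(int(delta.total_seconds()), 0)
    return None


both = freshness_lifetime({
    "Cache-Control": "public, max-age=31536000, immutable",
    "Date": "Thu, 03 May 2018 00:00:00 GMT",
    "Expires": "Fri, 04 May 2018 00:00:00 GMT",
})
expires_only = freshness_lifetime({
    "Date": "Thu, 03 May 2018 00:00:00 GMT",
    "Expires": "Fri, 04 May 2018 00:00:00 GMT",
})
```

With both headers present the one-year max-age is used; with only Date and Expires the difference between them (one day here) is used. The `immutable` directive does not change the lifetime in this sketch; it is only a hint that the client need not revalidate while the response is fresh.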

Comment 49 Martin Hradil 2018-06-26 13:47:27 UTC

.. but alas, no, firefox seems to happily ignore even the immutable flag.

I'm sorry, right now, my only solution is: use chrome.

Comment 50 Martin Hradil 2018-06-27 08:42:54 UTC

A different idea: maybe the file doesn't get cached because it's too big.

But, looks like current versions of Firefox have the maximum set to 51 megabytes, which should be enough.

This can be checked in about:config, under browser.cache.disk.max_entry_size

Comment 51 Keenan Brock 2018-09-04 18:51:50 UTC

I used mitmproxy to look at the information outside of firefox.
I think this is unrelated to the original effort.
It is looking like the headers are working great in firefox 60.0.2 (Mac)

This is frustrating because everything looks good. I am currently going on the assumption that a) a cache hit is better than b) a 304, and those are better than c) a full download of a resource.

I'm also going on the assumption that we are getting hit by the number of resources rather than the quantity of data downloaded. I'm also assuming the size of a download is insignificant until it is 50k or larger.

1. https://support.mozilla.org/en-US/questions/1169302 -- you want to turn this off. I've noticed a lot of traffic from my browser for this. It was introduced in Firefox 52.

thoughts on the server

1. It would be nice if we could have the menu use the proper URLs (e.g.: /cloud_volume vs /cloud_volume/show_list). It is a 140ms delay on a 756ms page (20%). That delay ends up delaying all resources, since we can't determine the other files to download until after the redirect occurs. I'm assuming this resource is slower over a WAN/VPN. I think this fix would be in menu.rb.

2. report_data (177ms 0.6k) seems like it could be encoded directly into the page. Not sure if it would slow the page itself down or if this request is even possible from an architectural point of view.

3. haml templates like notification-heading.html and notification-subheading.html (200ms / 0.1k) have a bunch of 200 requests over xhr. MartinH mentioned the ability to precompile these. I noticed the dashboard downloading at least 4 copies of this file and using another 8 cached copies. From my naive perspective, this sounds like more effort than it is worth, but I wanted to mention it here.
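
The assumptions in this comment (cache hit beats 304, which beats a full download) can be put into a rough per-resource cost model. This is a sketch only: the function, the link parameters, and the example values are hypothetical, and it ignores parallel connections and connection setup.

```python
def fetch_cost_ms(outcome, size_kb, rtt_ms, bandwidth_kb_per_ms):
    """Estimate the time to obtain one resource, in milliseconds."""
    if outcome == "cache_hit":
        return 0.0                      # no HTTP request at all
    if outcome == "not_modified":
        return float(rtt_ms)            # one round trip for the 304, no body
    if outcome == "download":
        return rtt_ms + size_kb / bandwidth_kb_per_ms
    raise ValueError(f"unknown outcome: {outcome}")


# Hypothetical remote link: 150 ms round trip, 0.5 KB per ms (about 0.5 MB/s).
costs = {
    outcome: fetch_cost_ms(outcome, size_kb=7500, rtt_ms=150, bandwidth_kb_per_ms=0.5)
    for outcome in ("cache_hit", "not_modified", "download")
}

# Share of the page time spent on the menu redirect, from the figures in item 1.
redirect_share = 140 / 756
```

Under these made-up link numbers the three outcomes cost 0 ms, 150 ms, and about 15 seconds for a 7.5 MB file. The redirect share works out to roughly 18.5%, which the comment rounds to 20%.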

Comment 52 Martin Hradil 2018-09-05 12:52:06 UTC

1. agreed, definitely doable, app/presenters/menu/default_menu.rb

2. true, but this would go directly against the goal of consuming report data from the API, so not sure it is worth it (that said, the current hybrid approach generates some data twice, so finishing that might speed this up too)

3. the only complication there is that we'd need a separate bundle for each language, and we'd likely lose the ability to use ruby helpers in haml files (as they would get compiled by javascript code if we were precompiling) ... but maybe we could start with a fix to the angular loader, so that it does not try to request a resource multiple times in parallel if a request has already been made
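
The loader fix suggested in item 3 amounts to request coalescing: callers asking for the same template while a request is already in flight share that one request. A minimal sketch of the idea in Python with asyncio, not the Angular loader itself; the class name and the fake fetch function are hypothetical.

```python
import asyncio


class CoalescingLoader:
    """Share one fetch per URL among all callers, including concurrent ones."""

    def __init__(self, fetch):
        self._fetch = fetch   # async callable: url -> body
        self._tasks = {}      # url -> Task, either in flight or already finished

    async def load(self, url):
        task = self._tasks.get(url)
        if task is None:
            # The first caller starts the request; later callers await the same task.
            task = asyncio.ensure_future(self._fetch(url))
            self._tasks[url] = task
        return await task


async def demo():
    requests_made = []

    async def fake_fetch(url):
        requests_made.append(url)
        await asyncio.sleep(0)  # yield so the other callers arrive while in flight
        return f"<div>{url}</div>"

    loader = CoalescingLoader(fake_fetch)
    bodies = await asyncio.gather(
        *(loader.load("notification-heading.html") for _ in range(4))
    )
    return requests_made, bodies
```

Running `asyncio.run(demo())` issues a single fetch for four concurrent loads. Note that this sketch also keeps a failed task around, so a real loader would want to drop the entry on error.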

Comment 54 CFME Bot 2018-10-11 14:31:01 UTC

New commit detected on ManageIQ/manageiq-ui-classic/master:

https://github.com/ManageIQ/manageiq-ui-classic/commit/408ac643d4b7fc4c5fc872890d3bb07cf8b47a88
commit 408ac643d4b7fc4c5fc872890d3bb07cf8b47a88
Author:     Martin Hradil <mhradil>
AuthorDate: Wed Oct 10 11:56:05 2018 -0400
Commit:     Martin Hradil <mhradil>
CommitDate: Wed Oct 10 11:56:05 2018 -0400

    Default menu - fix all menu items to use full url

    previously there were a lot of Menu::Item instances using an url like `/container`.

    This is OK, but it means a redirect to `/container/show_list` every time that menu item is accessed.

    Updating to make all menu items use the default redirect URL.

    This means the only non-external urls remaining without a method part are: `/bottlenecks`, `/graphql_explorer`, `/planning` and `/utilization` - all of these work without a redirect.

    https://bugzilla.redhat.com/show_bug.cgi?id=1571223

 app/presenters/menu/default_menu.rb | 124 +-
 1 file changed, 62 insertions(+), 62 deletions(-)

Comment 55 CFME Bot 2018-10-11 14:31:08 UTC

New commit detected on ManageIQ/manageiq-ui-classic/hammer:

https://github.com/ManageIQ/manageiq-ui-classic/commit/80cf760f5b9acd316cc5a054f91e332da10b9a19
commit 80cf760f5b9acd316cc5a054f91e332da10b9a19
Author:     Milan Zázrivec <mzazrivec>
AuthorDate: Thu Oct 11 05:10:03 2018 -0400
Commit:     Milan Zázrivec <mzazrivec>
CommitDate: Thu Oct 11 05:10:03 2018 -0400

    Merge pull request #4752 from himdel/specific-menu

    Default menu - fix all menu items to use full url

    (cherry picked from commit eafe89a4016ab4379b4a4b4a33620c4920694f93)

    https://bugzilla.redhat.com/show_bug.cgi?id=1571223

 app/presenters/menu/default_menu.rb | 124 +-
 1 file changed, 62 insertions(+), 62 deletions(-)

Comment 56 Martin Hradil 2018-10-23 12:14:58 UTC

Created https://github.com/ManageIQ/manageiq-ui-classic/pull/4813 to reduce the number of notification-related http requests from 6 to 2.

Marking the PR as fixing this bz, as I don't think there's any more we can do here, not without a specific problem.

Comment 57 CFME Bot 2018-10-24 06:26:28 UTC

New commit detected on ManageIQ/manageiq-ui-classic/master:

https://github.com/ManageIQ/manageiq-ui-classic/commit/97d83f2997d27951582d64c7da14e1bfc3ad9e6b
commit 97d83f2997d27951582d64c7da14e1bfc3ad9e6b
Author:     Martin Hradil <mhradil>
AuthorDate: Tue Oct 23 08:04:49 2018 -0400
Commit:     Martin Hradil <mhradil>
CommitDate: Tue Oct 23 08:04:49 2018 -0400

    notifications - notificationBodyInclude - use render :partial instead of ng-include

    large template, so not inlining, but at least we can drop the extra async request for it

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1571223

 app/assets/javascripts/controllers/notifications/notification-drawer.directive.js | 1 -
 app/views/layouts/_notifications_drawer.html.haml | 1 -
 app/views/static/notification_drawer/_notification-body.html.haml | 32 +
 app/views/static/notification_drawer/notification-body.html.haml | 32 -
 app/views/static/notification_drawer/notification-drawer.html.haml | 4 +-
 5 files changed, 34 insertions(+), 36 deletions(-)

Comment 58 CFME Bot 2018-10-24 15:31:38 UTC

New commit detected on ManageIQ/manageiq-ui-classic/hammer:

https://github.com/ManageIQ/manageiq-ui-classic/commit/0cd47efe5127962aed5f24d838b1d107ddbfe82d
commit 0cd47efe5127962aed5f24d838b1d107ddbfe82d
Author:     Milan Zázrivec <mzazrivec>
AuthorDate: Wed Oct 24 02:22:13 2018 -0400
Commit:     Milan Zázrivec <mzazrivec>
CommitDate: Wed Oct 24 02:22:13 2018 -0400

    Merge pull request #4813 from himdel/ng-include

    Remove ng-include in notifications

    (cherry picked from commit ec520f4f81accba173aee2d045329b18ea20e591)

    Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1571223

 app/assets/javascripts/controllers/notifications/notification-drawer.directive.js | 4 -
 app/views/layouts/_notifications_drawer.html.haml | 3 -
 app/views/static/notification_drawer/_notification-body.html.haml | 32 +
 app/views/static/notification_drawer/notification-body.html.haml | 32 -
 app/views/static/notification_drawer/notification-drawer.html.haml | 16 +-
 app/views/static/notification_drawer/notification-heading.html.haml | 2 -
 app/views/static/notification_drawer/notification-subheading.html.haml | 2 -
 7 files changed, 40 insertions(+), 51 deletions(-)

Comment 63 Ilanit Stein 2019-02-04 07:37:58 UTC

Martin,

Thank you for all the work done to improve things.

Moving the bug to verified, as I experience remote slowness only with the known Firefox issue.

Also adding require_doc_text to document the remote connection limitation.

Comment 65 errata-xmlrpc 2019-02-07 23:01:43 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0212