Bug 1272099 - UserPortal is not responsive/is slow if there is some network latency
Status: CLOSED NEXTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-userportal
Version: 3.5.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ovirt-3.6.1
Target Release: 3.6.0
Assigned To: Moran Goldboim
QA Contact: Pavel Novotny
Whiteboard: infra
Keywords: Tracking
Depends On: 1281729
Blocks: 1213937
Reported: 2015-10-15 09:39 EDT by Petr Spacek
Modified: 2016-02-10 14:14 EST

Doc Type: Bug Fix
Last Closed: 2015-11-24 10:11:32 EST
Type: Bug
oVirt Team: Infra


Attachments
video: login sequence, VM creation, VM startup (1.32 MB, application/octet-stream)
2015-10-15 09:40 EDT, Petr Spacek
screen: User Portal login page load time comparison (XHRs) (152.39 KB, image/png)
2015-10-21 10:29 EDT, Pavel Novotny


External Trackers
Tracker ID Priority Status Summary Last Updated
oVirt gerrit 47912 master MERGED userportal,webadmin: sub tab proxy event Never

Description Petr Spacek 2015-10-15 09:39:56 EDT
Description of problem:
UserPortal is not responsive, and using it all day, every day makes users very uncomfortable.

Version-Release number of selected component (if applicable):
3.5.4.2-1.3.el6ev

How reproducible:
100 %

Steps to Reproduce:
0. Get a RHEV-M instance that is 100-200 ms of network latency away from you
1. Log in to the user portal
2. Create a new VM from template
3. Start VM
4. Wait for VM FQDN to be shown in the user portal

Actual results:
It takes ~ 4 minutes. Most of the time is spent waiting for user interface to do something.

It seems that there is no single 'very long' step, but all the small steps (e.g. selecting a VM template) take enough time to make the user nervous. At the same time, the delay is not long enough to warrant a 'context switch' in the user's head (to another task), so in the end the user is watching the web interface and pushing with his eyeballs to make it quicker.


Expected results:
User portal should have very quick responses, especially when initial loading of the web interface is done.


Additional info:
I'm attaching a video (recorded at 2 fps, which is why the mouse movement is not fluent) so you can see how painful it is to use UserPortal over WAN.

Latency between the machine running web browser and RHEV-M varies between 100 and 200 ms.

VM boot here takes < 20 seconds, guest agent is configured with 'report_application_rate = 10' so the overall delay caused by guest should not be more than 30 seconds.
Comment 1 Petr Spacek 2015-10-15 09:40 EDT
Created attachment 1083265
video: login sequence, VM creation, VM startup
Comment 2 Einav Cohen 2015-10-19 11:16:17 EDT
Petr, can you please perform parallel tests in RHEV-M 3.6 and elaborate on the experience compared to RHEV 3.5?
Comment 3 Alexander Wels 2015-10-19 15:06:31 EDT
Honestly don't see why this is marked high priority. We don't recommend using webadmin/userportal on a wan. This is one of the reasons why.
Comment 4 Petr Spacek 2015-10-20 04:45:21 EDT
I do not have access to equivalent deployment on RHEV 3.6 so I cannot test it unless you give me such environment.
Comment 6 Petr Spacek 2015-10-20 04:55:12 EDT
(In reply to Alexander Wels from comment #3)
> We don't recommend using webadmin/userportal on a wan.

As a mere RHEV user, I have to say that this recommendation (when followed literally) seriously limits the usefulness of RHEV for use-cases where people are distributed across the globe. Considering e.g. RH with people in ~ 90 countries, would you recommend deploying 90 RHEV instances as an 'optimization'?

Given that hypervisors cannot be shared among multiple RHEV-M instances, this recommendation cannot be easily followed without huge impact on operational cost.
Comment 7 Einav Cohen 2015-10-20 07:58:09 EDT
(In reply to Petr Spacek from comment #4)
> I do not have access to equivalent deployment on RHEV 3.6 so I cannot test
> it unless you give me such environment.

Pavel, we will need your help with performing some performance tests on the user portal (in addition to the tests that you are performing for the web-admin in the context of bug 1264809) - if there are any pain-points that we can address for the user-portal or for both apps - we will do our best to address them.
Comment 8 Alexander Wels 2015-10-20 10:15:27 EDT
(In reply to Petr Spacek from comment #6)
> (In reply to Alexander Wels from comment #3)
> > We don't recommend using webadmin/userportal on a wan.
> 
> As a mere RHEV user, I have to say that this recommendation (when followed
> literally) seriously limits usefulness of RHEV for use-cases where people
> are distributed across the globe. Considering e.g. RH with people in ~ 90
> countries, would you recommend to deploy 90 RHEV instances as
> 'optimization'?
> 
> Given that hypervisors cannot be shared among multiple RHEV-M instances,
> this recommendation cannot be easily followed without huge impact on
> operational cost.

No, we simply recommend you don't do it. And if you do, you can expect to have a sluggish UI. I am not quite sure what you are expecting to happen when you connect to a machine with a ping of 200 ms. The UI is designed for low latency connections: it's highly interactive, makes a lot of requests, etc. Now with a high latency connection in the mix, all those requests are going to be a lot slower (take a minimum of 200 ms, plus processing time). This, compounded with the concurrent connection limit in browsers, makes for a poor experience.

Without a full re-architecture of the application, there is not much I can do about the high latency issue.
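
The interaction between per-request latency and the browser's concurrent connection limit can be illustrated with a rough lower-bound model. This is only a sketch: the function name is hypothetical, the request count and connection limit used in the example are illustrative assumptions rather than measurements from this bug, and processing time, payload transfer, and connection setup are all ignored.

```python
import math


def min_load_time_ms(requests: int, max_parallel: int, rtt_ms: int) -> int:
    """Lower bound on wall-clock time for a batch of independent requests.

    Requests go out in waves of at most `max_parallel`; each wave costs at
    least one round-trip. Server processing time is deliberately ignored.
    """
    waves = math.ceil(requests / max_parallel)
    return waves * rtt_ms


# Illustrative assumptions: 20 requests, 6 parallel connections, 200 ms RTT.
print(min_load_time_ms(requests=20, max_parallel=6, rtt_ms=200))  # 800
```

The result is a floor, not an estimate: any delay observed above it has to come from something other than raw network latency, such as server-side processing or requests that depend on one another and therefore cannot run in parallel.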
Comment 9 Petr Spacek 2015-10-20 10:43:51 EDT
To clarify: I'm just trying to point out that there are real use-cases where the assumption about low-latency link will often not hold.

To answer your question what I'm expecting:
I'm expecting as smooth and responsive a UI as possible. Naturally, an action which has to be done server-side will take its time, but just displaying a new dialog with a form for the user to fill in, or just to confirm an action, should not take a long time.

An example of responsive UI is e.g. FreeIPA project, try the demo here: https://ipa.demo1.freeipa.org/ (you have to click through the cert warning and login as "admin" user with password "Secret123"). For me it has ~ 200 ms latency but the UI is still responsive.

I hope this explains what I would expect. Thank you for understanding!
Comment 10 Alexander Wels 2015-10-20 10:56:26 EDT
For me the ping is around 80ms, and it is not significantly better than when I use the user portal. Can you point me to an example of the user portal being super slow? I have tested on 200ms webadmin/user portal before and while the experience isn't great, it is not horrible either. I am getting all kinds of reports of the UI being unusable and I simply have never seen it. And I have tried both high and low latency connections.
Comment 11 Petr Spacek 2015-10-20 11:25:03 EDT
One example is the video attached to this bug. Any modern version of Firefox should be able to play it.

* Login sequence (starting at position 0:10) takes ~ 20 seconds to display the basic UI. Please note that authentication is handled by mod_auth_kerb in Apache, so the authentication itself is already done by the time the login form is displayed.
* Displaying empty New VM form (0:36) takes 7 seconds.
* Selecting a template (0:52) takes 20 seconds to update the form dialog.
* Creating a new VM after confirming New VM dialog (1:16) takes ~ 30 seconds to close the dialog and to return back to list of VMs.

Of course, there will be some latency, but even 7 seconds is out of all proportion to 200 ms of latency. 7 seconds / 200 ms leaves time for about 35 round-trips, which sounds like far too many for displaying one dialog to the user, if we wanted to count round-trips instead of real time.

Naturally, there might be some server-side slowness, but even 1/3 of the (user-visible) delays I listed above would be too much for a smooth user experience.
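
The same back-of-the-envelope conversion can be applied to all four delays listed above. This is a minimal sketch: the 200 ms figure is the upper end of the latency range given in the description, the step labels are shorthand, and the helper name is made up for illustration.

```python
# Worst-case latency from the bug description (100-200 ms range).
RTT_MS = 200

# User-visible delays from the list above, in seconds.
delays_s = {
    "login to basic UI": 20,
    "empty New VM form": 7,
    "template selection": 20,
    "closing New VM dialog": 30,
}


def round_trips(delay_s: int, rtt_ms: int = RTT_MS) -> int:
    """How many strictly sequential round-trips fit into the delay."""
    return (delay_s * 1000) // rtt_ms


for step, delay in delays_s.items():
    print(f"{step}: {delay} s ~ {round_trips(delay)} round-trips")
```

Integer milliseconds are used to avoid floating-point rounding in the division.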

Does it answer your question? Should I supply additional details?
Comment 12 Alexander Wels 2015-10-20 13:43:48 EDT
Actually this does give me a lot of clues. A couple of observations:

1. Open up the firefox developer tools, and switch to the network tab. ctrl-shift-i should open it up. Then reload the page and watch the network traffic. Since you are using the kerb sso, I get a feeling you will see duplicates of each request, once with and once without the needed headers for the automatic authentication. 

This is due to how the request/response works for the kerb sso, nothing I can do about it. This does however immediately double the number of requests and over a high latency connection that just increases the problems.

If possible try the same thing with a non sso user and see if the experience is better.

2. The application loads pieces of itself on demand; this is to help with the initial login time. It loads a core part, and then as you are navigating the application, other pieces are loaded on demand. One of the on-demand pieces is the set of sub tabs shown when you select a VM. Due to an unfixed issue, it loads ALL the sub tab code at once, instead of just the one that is active. This causes a request for each sub tab; as you can see, we have about 10 of them, so it will make a request for each. Multiply this by the SSO issue above, and we have at least 20 or so requests happening at once. The browser limits the concurrent connections to 4-8 (depending on the browser). So I can see it easily taking 10-20 seconds to load them all at this point. In the webadmin I have actually fixed the same issue for all the main tabs, where the main tab only loads when it is active; I might have to extend that fix to include the sub tabs as well.

Btw once the code has been loaded it doesn't load again, so the second time you open up the sub tabs it should be significantly faster since the whole loading step is not there.

3. I noticed your VM status not updating immediately, but after 30 seconds or so. That basically tells me there is something wrong with the code that is supposed to quickly poll the status after it detects that an action (like start VM) is performed. I added that code almost 2 years ago, but I know it got broken at some point since then, and I have fixed it again; I just don't know if the version you have has the fix in it. If you have the network tab in the developer tools open, and you start (or stop) the VM, do you see a whole bunch of requests being fired rapidly? It should be like poll every 200 ms 3x, then 500 ms 3x, then 1 s 3x, until it settles back to whatever interval you have defined by default.
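
The rapid-poll schedule described in point 3 can be sketched as a sequence of delays. This is not the engine's actual code (the portal is not written in Python, and the real schedule may contain more steps than the three named here); the function name and the 5 second default interval in the example are assumptions for illustration only.

```python
from itertools import chain, islice, repeat
from typing import Iterator


def poll_delays_ms(default_interval_ms: int) -> Iterator[int]:
    """Delays between status polls after a user action such as Start VM.

    Three polls at 200 ms, three at 500 ms, three at 1 s, then the
    regular refresh interval indefinitely.
    """
    burst = chain(repeat(200, 3), repeat(500, 3), repeat(1000, 3))
    return chain(burst, repeat(default_interval_ms))


# First eleven delays, assuming a 5 s default refresh interval.
print(list(islice(poll_delays_ms(5000), 11)))
```

With this schedule the burst phase lasts 5.1 seconds in total (3 x 0.2 + 3 x 0.5 + 3 x 1.0), so in the network tab it would appear as nine closely spaced requests right after the action, followed by the regular refresh cadence.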
Comment 13 Einav Cohen 2015-10-20 14:24:03 EDT
let's focus on fixing 2 and 3 in comment #12.
Comment 14 Einav Cohen 2015-10-20 14:26:04 EDT
let's fix 2 for 'master' only at this point (as the change may be too invasive to introduce in 3.6 at this time), and let's ensure that 3 is ok (i.e. refresh is invoked properly when performing an action).
Comment 15 Petr Spacek 2015-10-21 09:09:07 EDT
(In reply to Alexander Wels from comment #12)
> 1. ...
> This is due to how the request/response works for the kerb sso, nothing I
> can do about it. This does however immediately double the number of requests
> and over a high latency connection that just increases the problems.

Yes, I can see the doubled requests! Apparently RHEV is not doing any optimization in this regard. The recommended way to use mod_auth_kerb module is to guard limited set of '/login' URLs. The application sitting on these '/login' URLs should generate a cookie or something like that and redirect the user to '/app' URLs which are guarded by cookie and not by mod_auth_kerb. (This is how the optimization is implemented in FreeIPA, BTW.)

I spent some time looking into the Apache configuration, but this seems simply impossible because everything in the RHEV UI is under the same URL prefix, and I did not find any way to exclude particular URLs because I do not know how to use RHEV-generated cookies...

Theoretically mod_auth_gssapi can do that automatically using session cookies, but it is not available for RHEL < 7 and RHEV-M cannot run on RHEL != 6.


> 2. The application loads pieces of itself on demand, this is to help with the initial login time. It loads a core part, and then as you are navigating the ...

This sounds like a reasonable optimization.

Alternatively, it might help to download everything in one request at once, which allows efficient compression and caching, and requires only one TCP round-trip, which is the main bottleneck here. (How often do users upgrade RHEV-M? The cache can live for months if there is some simple mechanism to detect a RHEV-M upgrade.)

AFAIK FreeIPA loads everything in (almost) one go and there are plans to cache everything, including some metadata, in browser off-line storage (or whatever it is called).


> 3. I noticed your VM status not updating immediately, but after 30 seconds
> or so. That basically tells me there is something wrong with the code that
> is supposed to quickly poll the status after it detects that an action (like
...
> you see a whole bunch of requests being fired rapidly? It should be like
> poll every 200 ms 3x, then 500ms 3x, then 1s 3x, until it settles back to
> whatever interval you have defined by default.

I *think* that I can see something like that, but I'm not very sure. There is an infinite stream of POST requests towards the /ovirt-engine/userportal/GenericApiGWTService URL, so it is hard to tell.


I hope this helps. Have a nice day!
Comment 16 Pavel Novotny 2015-10-21 10:27:30 EDT
(In reply to Einav Cohen from comment #7)
[snip]
> 
> Pavel, we will need your help with performing some performance tests on the
> user portal (in addition to the tests that you are performing for the
> web-admin in the context of bug 1264809) - if there are any pain-points that
> we can address for the user-portal or for both apps - we will do our best to
> address them.

I started with measuring load time of the user portal login page on two RHEVM instances:
- RHEVM instance#1 (nott04) - the problematic one from reproducer video in comment#1
- RHEVM instance#2 (rhevm-3) - a "regular" one for comparison

While rhevm-3 took 1 second to load, nott04 took over 4 seconds.
Practically every request took 4x longer on nott04.
See attached screenshot for details.

In my opinion, there seems to be some factor(s) on nott04 causing general degradation of UI responsiveness.


For the record:
Both RHEVM instances are of the same version (3.5.4.2-1.3.el6ev),
both have ping time around 100ms
and both are located at the same geographical place (TLV). 
Client machine is located in BRQ, Firefox 41.0.
Comment 17 Pavel Novotny 2015-10-21 10:29 EDT
Created attachment 1085168
screen: User Portal login page load time comparison (XHRs)
Comment 18 Einav Cohen 2015-10-21 10:41:52 EDT
thanks, Pavel. I agree - I believe that a nott04-specific issue is causing the general degradation of UI responsiveness. My assumption is that nott04 is simply a more overloaded engine (more running Hosts/VMs/...) than the 'rhevm-3' setup.
Comment 19 Alexander Wels 2015-10-21 11:24:01 EDT
(In reply to Petr Spacek from comment #15)
> (In reply to Alexander Wels from comment #12)
> > 1. ...
> > This is due to how the request/response works for the kerb sso, nothing I
> > can do about it. This does however immediately double the number of requests
> > and over a high latency connection that just increases the problems.
> 
> Yes, I can see the doubled requests! Apparently RHEV is not doing any
> optimization in this regard. The recommended way to use mod_auth_kerb module
> is to guard limited set of '/login' URLs. The application sitting on these
> '/login' URLs should generate a cookie or something like that and redirect
> the user to '/app' URLs which are guarded by cookie and not by
> mod_auth_kerb. (This is how the optimization is implemented in FreeIPA, BTW.)
> 
> I spent some time looking into Apache configuration but this seems simply
> impossible because everything in RHEV UI is under the same URL prefix and I
> did not find any way how to exclude particular URLs because I do not know
> how to use RHEV-generated cookies...

People a lot smarter than I have looked at trying to get this working, and the solution is going to be the entire SSO implementation (which is being worked on as we speak), but for now we are sort of stuck with this crappy version.

> 
> Theoretically mod_auth_gssapi can do that automatically using session
> cookies, but it is not available for RHEL < 7 and RHEV-M cannot run on RHEL
> != 6.
> 

I believe if we get the full SSO implemented for 3.6 (might be 4.0) the login will be on a separate URL, and thus we should be able to get that particular issue solved.

> 
> > 2. The application loads pieces of itself on demand, this is to help with the initial login time. It loads a core part, and then as you are navigating the ...
> 
> This sounds like a reasonable optimization.
> 
> Alternatively it might help to download everything in one request at once,
> which allows efficient compression, caching, and requires only one TCP
> round-trip which is the main bottleneck here. (How often users upgrade
> RHEV-M? The cache can live for months if there is some simple mechanism to
> detect RHEV-M upgrade.)
> 
> AFAIK FreeIPA loads everything in (almost) one go and there are plans to
> cache everything including some metadata in browser off-line storage (or how
> is it called).
> 

The total amount of JavaScript is huge; it's on the order of 10M in total. If we download that on login, it makes the login process take forever. This is the reason we split it up into fragments. In the near future we are going to do a complete redesign of the application with a different technology, and we will address those issues at that point in time.

For now I am going to fix the one issue described so we can have a more responsive UI when loading sub tabs (1 at a time, instead of 10 at once, in 10 separate requests).
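
The intended behaviour (load a sub tab's code only when that tab becomes active, and never load it twice) can be sketched as follows. This is an illustration only, not the code from the actual patch; the class, its attributes, and the tab names are hypothetical, and the fetch functions stand in for network requests.

```python
from typing import Callable, Dict


class SubTabLoader:
    """Loads a sub tab's code on first activation and caches it afterwards."""

    def __init__(self, fetchers: Dict[str, Callable[[], str]]) -> None:
        self._fetchers = fetchers          # one fetch function per sub tab
        self._loaded: Dict[str, str] = {}  # code already downloaded
        self.requests_made = 0             # counts simulated network requests

    def activate(self, tab: str) -> str:
        if tab not in self._loaded:
            self.requests_made += 1        # request happens only on first use
            self._loaded[tab] = self._fetchers[tab]()
        return self._loaded[tab]


# Ten hypothetical sub tabs; only the ones actually opened cost a request.
tabs = {f"tab{i}": (lambda i=i: f"code for tab{i}") for i in range(10)}
loader = SubTabLoader(tabs)
loader.activate("tab0")
loader.activate("tab0")  # served from the cache, no new request
print(loader.requests_made)  # 1
```

Compared with fetching all ten fragments when a VM is selected, the user pays for one request when the first sub tab is shown and one more for each additional sub tab opened later.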

> 
> > 3. I noticed your VM status not updating immediately, but after 30 seconds
> > or so. That basically tells me there is something wrong with the code that
> > is supposed to quickly poll the status after it detects that an action (like
> ...
> > you see a whole bunch of requests being fired rapidly? It should be like
> > poll every 200 ms 3x, then 500ms 3x, then 1s 3x, until it settles back to
> > whatever interval you have defined by default.
> 
> I *think* that I can see something like that, but I'm not very sure. There
> is an infinite stream of POST requests towards
> /ovirt-engine/userportal/GenericApiGWTService URL, so it is hard to tell.
> 

Well that is the main entry point of the application, and yes it does make it hard to see what exactly is going on since everything looks the same. But normally you get one request every X seconds where X is the refresh interval selected (5-60 seconds). I just tested 3.6 and it is working there so there is no immediate action item for me.
Comment 21 Dotan Paz 2015-11-08 09:38:34 EST
Hi Barak ,
Please see below info on how much time it took me to perform actions :
Over VPN:
Login 7 seconds
Load admin portal - 4-5 seconds

Create VM: 12 seconds from clicking "new vm" till the configuration box loads
           8  seconds to change the cluster (in which the vm will be created)
           2 seconds to pick a template
           almost 1 min from the time I clicked on "OK" until I got the portal back
           6 minutes from the creation time until the vm was available (switched from image locked to down)
           ~6 seconds to power on the vm
           ~14 seconds from the time I clicked on the console button till spice launched and connected to the console
           
From the LAN:
Log in to rhev-tlv - 6 seconds from the time I clicked "Login"
Load admin portal : 6 seconds until the list of vms appeared
Create VM: 8 seconds from clicking on "new vm" till the configuration box loads.
           6 seconds to change the cluster (in which the vm will be created)
           >1 second to pick a template
           1:09 minutes from the time I clicked on "OK" until I got the portal back
           5:40 minutes from the creation time until the vm was available (switched from image locked to down)
           ~6 seconds to power on the vm
           ~5 seconds from the time I clicked on the console button till spice launched and connected to the console

Template name: RHEL7-0_TLV .


Thanks,
Dotan
Comment 27 Barak 2015-11-12 11:40:12 EST
Moved Bug to Moran & set as a tracker & moved to infra
Comment 40 Nicolas Ecarnot 2016-02-01 03:23:29 EST
(In reply to Dotan Paz from comment #21)
> Hi Barak ,
> Please see below info on how much time it took me to perform actions :
> Over VPN:
> Login 7 seconds
> Load admin portal - 4-5 seconds
> 
> Create VM: 12 seconds from clicking "new vm" till the configuration box
> loads
>            8  seconds to change the cluster (in which the vm will be created)
>            2 seconds to pick a template
>            almost 1 min from the time I clicked on "OK" until I got the
> portal back
>            6 minutes from the creation time until the vm was available
> (switched from image locked to down)
>            ~6 seconds to power on the vm
>            ~14 seconds from the time I clicked on the console button till
> spice launched and connected to the console
>            
> From the LAN:
> Log in to rhev-tlv - 6 seconds from the time I clicked "Login"
> Load admin portal : 6 seconds until the list of vms appeared
> Create VM: 8 seconds from clicking on "new vm" till the configuration box
> loads.
>            6 seconds to change the cluster (in which the vm will be created)
>            >1 second to pick a template
>            1:09 minutes from the time I clicked on "OK" until I got the
> portal back
>            5:40 minutes from the creation time until the vm was
> available (switched from image locked to down)
>            ~6 seconds to power on the vm
>            ~5 seconds from the time I clicked on the console button till
> spice launched and connected to the console
> 
> Template name: RHEL7-0_TLV .
> 
> 
> Thanks,
> Dotan

Hello,

I'm witnessing similar delays on a very light DC (3 hosts, 12 very lightly loaded VMs), which was upgraded some days ago to 3.6.2 (CentOS 7.2 hosts, CentOS 6.7 dedicated engine).
This bug is not fixed.

I'd be glad to add a screencast, but it would be useless, as you would see the same behaviour as in the attachment on this BZ.
