Description of problem:
Installed the undercloud on a bare metal machine with 12 CPUs and 24 GB of RAM, located in a lab two floors below me. When browsing to the machine from the local office network, logging in takes far too long, and every navigation to every page also takes about one minute.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install the undercloud on a strong bare metal machine
2. Browse to the UI from the local network

Actual results:
Very slow responses from the web server

Expected results:
It should be lightning fast, given the machine specs I have and the office network we have
I have tested the latest poodle in a virtual setup and I am getting stable responses of up to 15s. Loading the Overview page takes around 16s (it is quite API-request-heavy). The time is the same when I log in and get redirected to the Overview page.
The Deployment Roles page is the only place where I am getting an extraordinary 31s delay. I am trying to investigate further, but the only suspicious thing I see in the horizon error log is "tuskar_ui.api.tuskar Couldn't obtain image with id overcloud-full". It appears quite early after I click the link, though, so I don't think that's what delays the response.
Does anyone have some baseline response times to work from? Even 15s is an extremely long time for a page load.
Reassigning this to Ryan as he had some success with speeding up the UI.
Patch available at https://code.engineering.redhat.com/gerrit/#/c/52100/.
I added some profiling middleware to horizon and ran through each of the tuskar-ui URLs to get a baseline time for each page load. On every page, the majority of the load time was spent in socket.py, which pointed at the API client communication, as Jirka stated above. I experimented with adding memoization to the API clients and found a heavily used function (list) in tuskarclient that called an API inside a list comprehension. Adding memoization to tuskarclient decreased load times; adding memoization to the overcloud_keystoneclient and ironicclient had a negative impact on page load time.
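To illustrate the kind of change described above: a minimal, hypothetical sketch of memoization applied to a client call that would otherwise be hit once per item in a list comprehension. The decorator and function names are illustrative, not the actual tuskar-ui/tuskarclient patch (see the gerrit link above for that).

```python
import functools


def memoize(func):
    """Cache results for repeated identical calls (hypothetical sketch)."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper


# Stand-in for an API client call; counts how often the "API" is hit.
call_count = 0

@memoize
def get_resource(resource_id):
    global call_count
    call_count += 1
    return {"id": resource_id}


# A list comprehension that looks up the same few resources repeatedly
# now triggers only one underlying call per distinct argument.
resources = [get_resource(i % 3) for i in range(30)]
```

The trade-off noted above is real: memoizing clients whose results are cheap to fetch or change between calls (as with overcloud_keystoneclient and ironicclient) can cost more than it saves.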
Tested with openstack-tuskar-ui-0.3.0-8.el7ost.noarch on the same machine. The times I get are still terrible. It takes about 30 seconds to get to the overview page, and about 20 seconds to get to the nodes page...
Udi provided me with access to two separate baremetal environments (puma01, puma42) for profiling and testing changes. I compared these to a virt environment running the same packages. I ran a profile for each main URL and recorded it in a table.
There was a significant difference in the main overview URL between the two baremetal environments, which indicates a possible environmental difference. The average page load times between the virt and puma42 environments are not too far apart in most cases and are statistically close enough to at least be considered consistent. I agree these page load times are not great from a user-experience standpoint, but IMO they don't seem to be slow enough to be a blocker to the GA.
I did attempt to reduce redundant API calls by adding additional caching, which showed a minor improvement in the virt environment. In a future sprint, I think the page load times should be addressed at the same time we migrate to using common code with the CLI, possibly through tripleo-common.
Just to set some expectations here. Almost all UI pages require multiple API calls in order to retrieve the relevant information. The most commonly used ones are probably heat stack-list and tuskar plan-list. Each of these API calls can take several seconds alone, especially on a long-running deployment. (I remember a recent complaint about a heat stack-list taking 6 seconds on the CLI.) Given this, page load times of ~20 seconds are expected.
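A back-of-the-envelope sketch of why ~20s pages are expected when API calls are issued sequentially. The per-call latencies below are illustrative assumptions, not measured values (only the 6s heat stack-list figure comes from the comment above); the point is that the page cannot load faster than the sum of its blocking calls.

```python
# Hypothetical per-call latencies (seconds) for the API calls one page
# makes; illustrative numbers, not measurements from this bug.
api_calls = {
    "heat stack-list": 6.0,       # figure mentioned in this thread
    "tuskar plan-list": 4.0,      # assumed
    "ironic node-list": 3.0,      # assumed
    "keystone token validation": 1.0,  # assumed
}

# Issued sequentially, these calls put a hard floor under the page
# load time, before any rendering overhead is added.
expected_floor = sum(api_calls.values())
print(f"lower bound on page load: {expected_floor:.1f}s")
```

Anything below this floor requires either caching, parallelizing the calls, or making the services themselves faster.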
The original blocker that prompted opening this bug (page loads above 100 s) was solved with the first patch (https://review.openstack.org/#/c/197409/). Since the page load went down drastically after this (the page load time is now around 20 s for the page in question), I wouldn't consider this a blocker anymore.
Reopening the bug once more. The bug should stay open until a real solution is found. If you don't consider this a blocker for the GA you can move the target release to A1 - but we need the bug to stay open in order to track the issue.
What are the success criteria here? Simply saying "it's slow" is not enough. Expecting load times of a few seconds is not realistic here. As I explained in Comment#11, the UI cannot be faster than the underlying services.
What are the load times that you are currently seeing?
Created attachment 1051487 [details]
Original baseline of puma01 env
Created attachment 1051488 [details]
example of puma01 profile without a stack
Just to be clear:
* attachment 1051487 [details] is data from calling real APIs on an environment
* attachment 1051488 [details] is data from calling mock APIs
Is that it?
Both of the attachments are real API calls on a real baremetal environment. The difference is in the data coming from external services.
I've added two attachments to illustrate where the stack is slow going to the main /infrastructure url.
If you look at lines 6 & 7 in https://bugzilla.redhat.com/attachment.cgi?id=1051487, you'll notice that 99.76% of the page load time is devoted to socket.py, which provides socket operations and related functions. There isn't anything obvious the UI can do about that beyond the caching I've already added. This attachment is the actual profile from Udi's baremetal environment where he noticed the issue.
https://bugzilla.redhat.com/attachment.cgi?id=1051488 is the profile for puma01 without any stack, which is the current state of Udi's environment. The services are returning empty result sets quickly, and the page load time reflects that at 3.805 secs vs the original time above. This run had caching enabled, but since there's nothing to cache at this point it doesn't seem to make much of a difference.
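For reproducing this kind of analysis: a small sketch of profiling a view function with the stdlib cProfile/pstats and reading off where cumulative time goes. The profiled function here is a trivial stand-in; in the actual runs the dominant entries were in socket.py, i.e. time spent waiting on the external APIs.

```python
import cProfile
import io
import pstats


def fetch_pages():
    # Stand-in for the view code under profile; real runs would issue
    # API requests here, and the wait shows up under socket.py.
    total = 0
    for i in range(1000):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
result = fetch_pages()
profiler.disable()

# Sort by cumulative time, same view used to spot socket.py dominating.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```

Filtering the report for `socket` (e.g. `print_stats("socket")`) is a quick way to check how much of a page load is network wait rather than UI work.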
I agree that performance is an issue and we should have a bug on it, but if we're going to do this then we need to do it right with profiling the external services and the UI, analyzing the profiling data, making the changes and comparing the results.
Ok, thanks for the explanation Ryan. In that case the bottleneck is isolated to the API calls, and we need to make them faster to improve response time. I suggest we close this bug, since the possible fixes were already done, and work on API response time improvements in future cycles.
Moving back to ON_QA.
Feel free to open other bug(s) for possible improvements of the response time of the APIs.
Ryan, you say that socket.py points at API calls - can you specify which APIs those are? Are you calling the APIs from the CLI or from the service URL (please specify the exact CLI command/URL that you use)? And what do you mean by "with a stack" and "without a stack"? I always had a stack already deployed.
I will close this bug if I can see APIs that really take this long (20-30 seconds), but I need to be sure because we never suffered like this while working with the CLI. Thanks.
The APIs being called are keystone, heat, ironic and tuskar.
The API clients in tuskar-ui use the same method as horizon for accessing the APIs: looking up the endpoint URL by the service name.
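A minimal sketch of that endpoint lookup against a keystone-style service catalog. The catalog structure here is simplified and hypothetical (horizon actually gets this from the authenticated session), but the idea is the same: resolve a service name to a URL and talk to that endpoint directly.

```python
# Hypothetical, simplified catalog lookup; not horizon's actual code.
def get_endpoint(catalog, service_name, interface="public"):
    """Return the URL for a named service from a catalog-like structure."""
    for service in catalog:
        if service["name"] == service_name:
            for endpoint in service["endpoints"]:
                if endpoint["interface"] == interface:
                    return endpoint["url"]
    raise LookupError(f"no {interface} endpoint for {service_name}")


# Example catalog with documentation-range addresses.
catalog = [
    {"name": "heat",
     "endpoints": [{"interface": "public",
                    "url": "http://192.0.2.1:8004/v1"}]},
    {"name": "tuskar",
     "endpoints": [{"interface": "public",
                    "url": "http://192.0.2.1:8585/v2"}]},
]

heat_url = get_endpoint(catalog, "heat")
```

This means the UI's latency per call is whatever the resolved service endpoint delivers; the lookup itself is negligible.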
I was able to profile your environment both with and without a stack. You had reprovisioned your environment but had not deployed a stack when I ran that specific profile on 7/13/2015.
<snip>"I will close this bug if I can see APIs that really take this long"</snip>
Most of the original claims of this bug were refuted with actual profile data, and the solutions provided fit within the given timeline and showed an improvement across the available baremetal and virt test environments. Any additional effort here will require a much larger refactoring, which would likely be wasted effort since there is a desire to migrate the UI to use the same client access code as the plugin in the future.
You're not really going to *know* for sure how long the API calls take until they are profiled. You may feel you never suffered previously while using the CLI, but without actual data that is anecdotal at best.
An effort to profile all of the services is a task large enough that it needs to be included in sprint planning and have dedicated time and resources. There is also some coordination needed between the dev teams and the CI team to really handle performance properly. If you are uncomfortable closing this bug immediately, I suggest opening a new one as an RFE whose description lists the proper scope of "profiling all the things!" and ensures performance is treated as a feature moving forward.
Clearing the needinfo flag that appeared while I was responding.
Performance is now much better than 20 seconds :). Except for when the UI is doing some tasks, like node discovery or deployment... but I will open a separate bug on that.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.