Bug 2006395 - Developer Topology view slow and unresponsive with large number of workloads
Summary: Developer Topology view slow and unresponsive with large number of workloads
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Dev Console
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 4.12.0
Assignee: Christoph Jerolimov
QA Contact: spathak@redhat.com
URL:
Whiteboard:
Duplicates: 2008237
Depends On:
Blocks:
 
Reported: 2021-09-21 16:18 UTC by Andrew Collins
Modified: 2023-09-18 04:26 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
*Previously, there were unnecessary re-renderings and calculations when rendering the topology graph. As a result, the topology performance wasn’t good when showing hundreds of nodes. With this fix, there are several improvements on the topology page to enhance the performance. As a result, the topology can now handle many workloads and works better with hundreds of workloads. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2006395#[*BZ#2006395*])
Clone Of:
Environment:
Last Closed: 2022-10-19 10:17:45 UTC
Target Upstream Version:
Embargoed:
jakumar: needinfo-



Description Andrew Collins 2021-09-21 16:18:01 UTC
Description of problem:
A user experiences a slow and unresponsive developer perspective when the Topology view tries to build a graph for a large project.
Since the Topology view is the default view when first logging in to the developer perspective, this makes for a poor customer experience.

The size of the project is well within the cluster's documented maximum limits:
500 pods, 250 replica sets, 99 deployments, 230 jobs

Version-Release number of selected component (if applicable):
OCP 4.8.10

How reproducible:
100%


Steps to Reproduce:
1. Log into Console
2. Select this large project
3. Select Developer perspective (if logging in as cluster-admin)

Actual results:
Browser tab is slow to respond or can become unresponsive (based on the user machine) while the graph is being rendered and/or displayed.

Expected results:
Several possibilities might "fix" this UX issue:
1. Building/rendering the graph may be an intensive task, but it should be isolated from the rest of the console so that a user can still navigate the UI.
2. If this is a known limitation without a solution, do not load the Topology view by default, or expose an option that lets cluster administrators choose whether to load this view by default.


Additional info:

I used Google Chrome developer tools to profile the console.
Memory usage for the tab containing this load went as high as 1.5GB and CPU went to >150 while on the Topology view.

In my own experience, the page renders an already-built graph in about 30 seconds, after which I can navigate to another view within the perspective.

Comment 4 Karthik Jeeyar 2021-09-29 05:09:27 UTC
*** Bug 2008237 has been marked as a duplicate of this bug. ***

Comment 21 Christoph Jerolimov 2022-03-04 09:54:20 UTC
We can confirm that the first wave of patches, which primarily reduce memory usage, is part of these releases:

These fixes/improvements are part of the upcoming 4.10 GA:
* https://bugzilla.redhat.com/show_bug.cgi?id=1999796
* https://bugzilla.redhat.com/show_bug.cgi?id=2039315
* https://bugzilla.redhat.com/show_bug.cgi?id=2042829

These backports are available in 4.9.19 and newer:
* https://bugzilla.redhat.com/show_bug.cgi?id=2044287
* https://bugzilla.redhat.com/show_bug.cgi?id=2044292
* https://bugzilla.redhat.com/show_bug.cgi?id=2044259

These backports are available in 4.8.32 and newer:
* https://bugzilla.redhat.com/show_bug.cgi?id=2046051
* https://bugzilla.redhat.com/show_bug.cgi?id=2046215
* https://bugzilla.redhat.com/show_bug.cgi?id=2046043 (this was released in 4.8.31)

As mentioned, we are continuing our work to improve the performance even further.

Can any of the customers already confirm fewer or no browser crashes with these changes/releases?

These changes increase the load the topology can handle, especially if a namespace contains a lot of Secrets.

We will also implement and backport two features for affected customers so that they can skip the topology on namespaces with a high load. For this we created two tickets:

* https://bugzilla.redhat.com/show_bug.cgi?id=2060325 to allow customers to configure a landing page other than the topology.
* https://bugzilla.redhat.com/show_bug.cgi?id=2060329 to show a warning if the number of workloads in the topology leads us to expect issues.

We expect to deliver both within this month.

Comment 22 Christoph Jerolimov 2022-03-04 10:30:09 UTC
I missed 4.7. The backports to 4.7 are merged and should be part of the next release. They are not available yet.

Comment 36 Jaivardhan Kumar 2022-10-19 10:17:45 UTC
Closing this based on comment https://bugzilla.redhat.com/show_bug.cgi?id=2006395#c34 

========================================================================================

From the engineering side, we worked on several fronts to handle this scenario. The issue is observed while rendering workloads in the topology view, and only when the number of workloads is large. Below are the individual tickets in which we improved load-time performance. Because the topology view is graphical and CPU intensive, rendering has improved, but it still cannot scale to a huge number of workloads.
 
- https://bugzilla.redhat.com/show_bug.cgi?id=1999796 (Topology performance: Reduce the amount of data for Secrets), backported down to 4.8
- https://bugzilla.redhat.com/show_bug.cgi?id=2042829 (Topology performance: HPA was fetched for each Deployment (Pod Ring)), backported down to 4.8
- https://bugzilla.redhat.com/show_bug.cgi?id=2043064 (Topology performance: Unnecessary rerenderings in topology nodes (unchanged mobx props)), in 4.10

Although the above improves performance to some extent (currently no issues are observed with 100 workloads), lag and slowness can still be seen beyond 400 workloads.
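
The idea behind the re-rendering ticket listed above (2043064) can be illustrated with a minimal TypeScript sketch. This is not the console's actual code: `shallowEqual` and `createMemoizedRenderer` are hypothetical helpers, and the sketch only assumes that a node does not need to be rendered again when its props are unchanged.

```typescript
// Hypothetical illustration only; not taken from the OpenShift console source.
type Props = Record<string, unknown>;

// Returns true when both objects have the same keys with identical values
// (compared with Object.is, i.e. no deep comparison).
function shallowEqual(a: Props, b: Props): boolean {
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  if (aKeys.length !== bKeys.length) {
    return false;
  }
  return aKeys.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && Object.is(a[key], b[key]),
  );
}

// Wraps a render function so that it is skipped when the props did not change
// since the previous call; the cached result is returned instead.
function createMemoizedRenderer<P extends Props, R>(render: (props: P) => R): (props: P) => R {
  let last: { props: P; result: R } | undefined;
  return (props: P): R => {
    if (last !== undefined && shallowEqual(last.props, props)) {
      return last.result;
    }
    const result = render(props);
    last = { props, result };
    return result;
  };
}
```

With hundreds of nodes, skipping the render call for every node whose props are unchanged avoids most of the repeated work when only a few workloads actually change.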

To handle this scenario, we also introduced a check that notifies the user when the topology is loaded for a large number of workloads (100), which prevents the page from hanging. The details can be seen here:

- https://bugzilla.redhat.com/show_bug.cgi?id=2060329 (Detect the unsupported amount of workloads before rendering a lazy or crashing topology), backported down to 4.9; the 4.8 backport PR is already merged and is in process

Screenshots and a GIF of this can be seen in the PR https://github.com/openshift/console/pull/11334.
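
The workload check described above could look roughly like the following TypeScript sketch. It is an assumption-based illustration, not the code from the PR: the names `WORKLOAD_WARNING_THRESHOLD` and `decideTopologyRender` are hypothetical, the threshold of 100 is taken from this comment, and whether the real check uses "greater than" or "greater than or equal" is not stated here.

```typescript
// Hypothetical illustration only; names and the exact comparison are assumptions.
const WORKLOAD_WARNING_THRESHOLD = 100; // value mentioned in this comment

type TopologyDecision = "render" | "warn";

// Decides whether to render the graph right away or to show a warning first.
// The graph is rendered for large namespaces only after the user has
// explicitly acknowledged the warning.
function decideTopologyRender(
  workloadCount: number,
  userAcknowledged: boolean,
  threshold: number = WORKLOAD_WARNING_THRESHOLD,
): TopologyDecision {
  if (workloadCount > threshold && !userAcknowledged) {
    return "warn";
  }
  return "render";
}
```

For example, `decideTopologyRender(500, false)` returns `"warn"`, while the same namespace returns `"render"` once the user has chosen to continue anyway.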

Let us know in case of any issues.


cc @cjerolim

Comment 37 Red Hat Bugzilla 2023-09-18 04:26:19 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

