Bug 1264809

Summary: WebUI unusable - operations take many seconds using 100% cpu
Product: [oVirt] ovirt-engine
Reporter: Nir Soffer <nsoffer>
Component: Frontend.WebAdmin
Assignee: Greg Sheremeta <gshereme>
Status: CLOSED DUPLICATE
QA Contact: Pavel Novotny <pnovotny>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: awels, bugs, frolland, michal.skrivanek, nicolas, oourfali, pnovotny, tnisan, ykaul
Target Milestone: ovirt-4.0.0-alpha
Keywords: Reopened
Target Release: ---
Flags: ecohen: ovirt-4.0.0?, rule-engine: planning_ack?, ecohen: devel_ack+, rule-engine: testing_ack?
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-16 07:29:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team:
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:

Description	Flags
Chrome cpu profile during slow operation	none
Chrome cpu profile during tab left pane item switching	none
Chrome cpu profile during tab switching	none
Chrome cpu profile during putting host to maintenance	none
Firefox engine profile during slow operations	none
20160121_ovirt-3.5.6_chromium-47_ubuntu-15.10_profile	none
20160121_ovirt-3.5.6_firefox-43_disabled-extensions_safe-mode_ubuntu-15.10_profile	none
20160121_ovirt-3.5.6_firefox-43_ubuntu-15.10_profile	none
Firefox profile during hich CPU	none

Description Nir Soffer 2015-09-21 09:21:05 UTC

Created attachment 1075427 [details]
Chrome cpu profile during slow operation

Description of problem:

Engine is horribly slow, almost unusable.

Switching tabs, selecting items in lists (host, dc) can take many
seconds.

Looking in system monitor, chrome takes 100% cpu (of 400%) when engine
stalls.

Trying to select and copy text (e.g. version from the about dialog) can
take 30 seconds, chrome uses 100% cpu (of 400%) during the wait.

Version-Release number of selected component (if applicable):
oVirt Engine Version: 3.6.0-0.0.master.20150901142224.git8df944a.fc22

How reproducible:
Always

Steps to Reproduce:
Issue seems to start after creating a pool with 50 vms, but seems
to continue after I deleted the pool.

Browser: Chrome Version 45.0.2454.85 (64-bit)
OS: Fedora 22
Hardware: Lenovo T430s, 8G RAM, 60% used
No other process running, only one tab with engine.

Attached cpu profile - created like this:
1. Start with Virtual Machines tab
2. Start profiling
3. Click on Host tab
4. Wait a couple of seconds
5. Host tab appears
6. Stop profiling

Comment 1 Nir Soffer 2015-09-21 09:42:33 UTC

Created attachment 1075429 [details]
Chrome cpu profile during tab left pane item switching

Comment 2 Nir Soffer 2015-09-21 09:43:20 UTC

Created attachment 1075430 [details]
Chrome cpu profile during tab switching

Comment 3 Nir Soffer 2015-09-21 09:44:19 UTC

Created attachment 1075443 [details]
Chrome cpu profile during putting host to maintenance

Comment 4 Nir Soffer 2015-09-21 09:52:53 UTC

Created attachment 1075446 [details]
Firefox engine profile during slow operations

Comment 5 Nir Soffer 2015-09-21 09:56:19 UTC

Same issue reproduced on Firefox:
Build identifier: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0

Comment 6 Tal Nisan 2015-10-06 13:04:39 UTC

Seems like either a ux issue or a memory leak/Chrome issue, Einav, your thoughts?

Comment 7 Einav Cohen 2015-10-06 15:13:33 UTC

(In reply to Tal Nisan from comment #6)
> Seems like either a ux issue or a memory leak/Chrome issue, Einav, your
> thoughts?

will take a look. thanks.

Comment 8 Einav Cohen 2015-10-06 15:19:08 UTC

this could be related to bug 1260499.

Comment 9 Red Hat Bugzilla Rules Engine 2015-10-19 10:53:20 UTC

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

Comment 10 Sandro Bonazzola 2015-10-26 12:32:13 UTC

this is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted to next week, Nov 4th 2015.
Please review this bug and if not a blocker, please postpone to a later release.
All bugs not postponed on GA release will be automatically re-targeted to

- 3.6.1 if severity >= high
- 4.0 if severity < high

Comment 11 Alexander Wels 2015-11-02 16:43:47 UTC

Pushing to 4.0. I am completely unable to reproduce.

Comment 12 Einav Cohen 2015-11-09 14:27:43 UTC

(In reply to Alexander Wels from comment #11)
> Pushing to 4.0. I am completely unable to reproduce.

@Pavel, any thoughts or insights on your end? if none - I will close this BZ. Thanks.

Comment 13 Nir Soffer 2015-11-09 15:26:49 UTC

I could not reproduce it after upgrading engine to 
oVirt Engine Version: 3.6.0-0.0.master.20150915165908.git51dfb6e.fc22

Since the upgrade I created several big pools (100 vms each) and did
not experience this issue.

This was probably related to the version mentioned in the bug, or to 
the version of fedora where web browser was running.

I would close this as it seems to be fixed.

Comment 14 Alexander Wels 2015-12-14 18:05:58 UTC

Closing this as this does not seem to occur anymore.

Comment 15 Nir Soffer 2016-01-21 06:59:36 UTC

It seems to occur again - relevant reports from user list:
- http://lists.ovirt.org/pipermail/users/2016-January/037144.html (likely)
- http://lists.ovirt.org/pipermail/users/2016-January/037383.html (maybe)

Comment 16 Nicolas Ecarnot 2016-01-21 07:12:20 UTC

Reporting the same on 4 independent DC :
- 2 DC in 3.5.6
- 2 DC in 3.6.1

Witnessing exactly the same :
- any action is taking seconds to do (click, selection, console launching, editing, switching tab, anything)
- CPU is 100%
- seen on chrome 47, firefox 43, on Linux XUbuntu 15.10 64b, Windows 2008 srv 64b, and other platforms.

Please don't close this bug.

It

is

happening

for

real...

Comment 17 Nir Soffer 2016-01-21 07:34:08 UTC

(In reply to Nicolas Ecarnot from comment #16)
Nicolas,
We need more info:
- List of browser extensions installed in chrome and firefox
- Can you reproduce when extensions are disabled?
- Can you provide cpu profiles in chrome and firefox?

Creating a profile in Chrome:
1. Press Ctrl + Shift + I
2. Open Profiles tab
3. Select "Collect JavaScript CPU Profile"
4. Click "Start" button
5. Do some operations that reproduce this
6. Click "Stop" button
7. Click the "Save" link near the profile name in the profiles list

Creating a profile in firefox:
1. Press Shift + F5
2. Click "Start Recording Performance"
3. Reproduce...
4. Click "Stop Recording Performance"
5. Click "Save" link in the recordings list

Comment 18 Nicolas Ecarnot 2016-01-21 10:51:47 UTC

(In reply to Nir Soffer from comment #17)
> (In reply to Nicolas Ecarnot from comment #16)
> Nicolas,
> We need more info:
> - List of browser extensions installed in chrome and firefox

Chrome :
- none

Firefox :
- Adblock Edge
- BehindTheOverlay
- Disconnect
- Flash control
- Tab Tree
- Video DownloadHelper
- Ubuntu Modifications

> - Can you reproduce when extensions are disabled?

Firefox, with disabled extensions :
- exact same user experience and feeling. As sluggish with or without safe-mode.

> - Can you provide cpu profiles in chrome and firefox?

See below for the attachments.

With Chromium, some actions are sometimes less slow, but this is random and mostly as slow as with Firefox.

For the sake of a sound comparison, I created the profiles in the next 3 attachments on one and the same DC running oVirt 3.5.6.
If asked, I can reproduce the same 3 profiles with a 3.6.1 DC.

Thank you.

Comment 19 Nicolas Ecarnot 2016-01-21 10:53:25 UTC

Created attachment 1116894 [details]
20160121_ovirt-3.5.6_chromium-47_ubuntu-15.10_profile

Comment 20 Nicolas Ecarnot 2016-01-21 10:54:51 UTC

Created attachment 1116896 [details]
20160121_ovirt-3.5.6_firefox-43_disabled-extensions_safe-mode_ubuntu-15.10_profile

Comment 21 Nicolas Ecarnot 2016-01-21 10:55:41 UTC

Created attachment 1116897 [details]
20160121_ovirt-3.5.6_firefox-43_ubuntu-15.10_profile

Comment 22 Alexander Wels 2016-01-21 20:07:53 UTC

@Nicolas,

Can you tell me the exact version of oVirt you are running? That way I can look up the symbol maps to look up the methods which are taking a lot of time.

Comment 23 Alexander Wels 2016-01-21 20:17:02 UTC

Never mind, I see you mentioned it's 3.5.6

Comment 24 Alexander Wels 2016-01-22 13:56:48 UTC

I have looked over the profiles provided by Nicolas, but nothing jumps out at me as saying hey this Javascript code is doing a lot of processing. In fact it appears to me that most of the time nothing is going on and we are waiting for the server to provide some information (data/code fragments/etc).

@Nicolas,
Can you check the load on the engine machine and tell me how loaded it is?

Also if possible can you do an experiment for me. I am assuming you have a proper DNS configured, but on the engine machine can you add itself to /etc/hosts and see if that improves the situation? We have had instances where misconfigured DNS or /etc/hosts would cause the engine not to be able to resolve itself which would cause a sluggish UI (minutes to log in, and minutes to basically do anything in the UI). I just want to eliminate DNS issues (overloaded/misconfigured/etc).
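
The self-resolution check described above can also be timed directly on the engine machine. What follows is only a minimal sketch in Python, not something taken from the engine code: it assumes that `socket.getfqdn()` returns the name the engine is configured with (substitute the real FQDN if it differs), and `time_lookup` is a hypothetical helper name.

```python
import socket
import time


def time_lookup(hostname):
    """Return (seconds, addresses) for one forward lookup of hostname.

    addresses is an empty list when the name does not resolve.
    """
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, None)
        # Each result's sockaddr tuple starts with the address string.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    return time.monotonic() - start, addresses


if __name__ == "__main__":
    name = socket.getfqdn()  # assumption: this matches the engine FQDN
    elapsed, addresses = time_lookup(name)
    print(f"{name}: {elapsed:.3f}s -> {addresses or 'unresolved'}")
```

A lookup that takes seconds, or that does not resolve at all, would point towards the DNS or /etc/hosts problem; a fast lookup makes that cause unlikely.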

Comment 25 Nicolas Ecarnot 2016-01-22 14:48:47 UTC

(In reply to Alexander Wels from comment #24)
> I have looked over the profiles provided by Nicolas, but nothing jumps out
> at me as saying hey this Javascript code is doing a lot of processing. In
> fact it appears to me that most of the time nothing is going on and we are
> waiting for the server to provide some information (data/code fragments/etc).
> 
> @Nicolas,
> Can you check the load on the engine machine and tell me how loaded it is?

Commonly, I'm seeing a very decent load on the engine, never hurting my retina.
Precisely, never above 1 except for a few seconds after reboot, and most of the time under 0.5.

> Also if possible can you do an experiment for me. I am assuming you have a
> proper DNS configured, but on the engine machine can you add itself to
> /etc/hosts and see if that improves the situation? We have had instances
> where misconfigured DNS or /etc/hosts would cause the engine not to be able
> to resolve itself which would cause a sluggish UI (minutes to log in, and
> minutes to basically do anything in the UI). I just want to eliminate DNS
> issues (overloaded/misconfigured/etc).

I just added it in /etc/hosts, but do you want me to add the host name for its public ip address, or for 127.0.0.1?

For now, I added it for the public ip, rebooted, and did not see any improvement.

Comment 26 Alexander Wels 2016-01-22 15:03:29 UTC

Okay, I just wanted to eliminate the engine part of the equation. Loads at 0.5 are great, so definitely no problem there. And yes, I wanted the public ip in /etc/hosts on the ENGINE. You don't even need to restart the engine for that to take effect. So if there is no improvement then that is definitely not the problem either.

Comment 27 Nicolas Ecarnot 2016-01-22 15:07:13 UTC

(In reply to Alexander Wels from comment #24)
> I have looked over the profiles provided by Nicolas, but nothing jumps out
> at me as saying hey this Javascript code is doing a lot of processing. In
> fact it appears to me that most of the time nothing is going on and we are
> waiting for the server to provide some information (data/code fragments/etc).
> 
> @Nicolas,
> Can you check the load on the engine machine and tell me how loaded it is?
> 
> Also if possible can you do an experiment for me. I am assuming you have a
> proper DNS configured, but on the engine machine can you add itself to
> /etc/hosts and see if that improves the situation? We have had instances
> where misconfigured DNS or /etc/hosts would cause the engine not to be able
> to resolve itself which would cause a sluggish UI (minutes to log in, and
> minutes to basically do anything in the UI). I just want to eliminate DNS
> issues (overloaded/misconfigured/etc).

Would you advise me to install a local caching-only DNS, like nscd or bind,
so that we could rule out the DNS issues?

Comment 28 Alexander Wels 2016-01-22 15:27:12 UTC

No, doing the /etc/hosts test tells me it's not the DNS, but something else. The question becomes what else. So I am going to ask a bunch of probably stupid questions; I am just trying to narrow down possible causes.

1. Can you tell me approximately how many VMs you have in your data center?
2. Also how many hosts?
3. Any complicated network setup? like hosts with 8 nics bonded with a bunch of vlans?
4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)
5. Does the problem go away if you filter your VMs down to, let's say, one? In the search bar just search for a particular VM name.

Thanks,
Alexander

Comment 29 Nicolas Ecarnot 2016-01-22 16:02:38 UTC

(In reply to Alexander Wels from comment #28)
> No doing the /etc/hosts test tells me its not the DNS, but something else.
> The question becomes what else. So I am going to ask a bunch of probably
> stupid questions, I am just trying to narrow down possible causes.

Never, no stupid questions, we're all aiming for the same goal :)

> 1. Can you tell me approximately how many hosts you have in your data center.
> 2. Also how many hosts?

In this one :
- 48 VMs
- 13 hosts

> 3. Any complicated network setup? like hosts with 8 nics bonded with a bunch
> of vlans?

I don't think so. Well... see :

- Every host has 2 bonded NICs accessing the management network and 4 other specific tagged VLAN networks.
- Every host also has 4 additional NICs bonded (mode 1) to reach the iSCSI network (non VLANed, physical and dedicated)

> 4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)

All on VNC.

> 5. Does the problem go away if you filter your VMs down to lets say one. In
> the search bar just search for a particular VM name.

It slightly improves, but this is not great at all.


Other questions or comments :
- I have been using oVirt for more than 4 years, and during this time I created more DCs and added more hosts and VMs. I have seen all these DCs grow, and get slower and sloooower over time. Is there something we could check in the database that could be purged, that could be responsible for this behaviour?
- We have not changed our DNS setup in a long time, but we still saw oVirt slow down.
- That may sound obvious but anyway : I see it slower during the day, in production time, when users are using, servers are serving. At rest, this is still slow, but less slow (not that much, though)
- Would it be valuable if I spend time scanning network ports to see if network queries are sent or waited for on the engine? (I truly want to help, see)
- I honestly don't think I'm impatient. I reckon I could do my work on a poor 1970 green and black tty behind a 9600b line. I think there is really something we can improve.

Comment 30 Alexander Wels 2016-01-22 16:36:30 UTC

(In reply to Nicolas Ecarnot from comment #29)
> (In reply to Alexander Wels from comment #28)
> > No doing the /etc/hosts test tells me its not the DNS, but something else.
> > The question becomes what else. So I am going to ask a bunch of probably
> > stupid questions, I am just trying to narrow down possible causes.
> 
> Never, no stupid questions, we're all aiming the same goal :)
> 
> > 1. Can you tell me approximately how many hosts you have in your data center.
> > 2. Also how many hosts?
> 
> In this one :
> - 48 VMs
> - 13 hosts
> 

That is a fairly small setup. Nothing that could explain this particular problem, I know of people with 100s of hosts and 1000s of VMs.

> > 3. Any complicated network setup? like hosts with 8 nics bonded with a bunch
> > of vlans?
> 
> I don't think so. Well... see :
> 
> - Every host has 2 bonded NICs accessing the management network and 4 other
> specific tagged VLAN networks.
> - Every host also has 4 additional NICs bonded (mode 1) to reach the iSCSI
> network (non VLANed, physical and dedicated)
> 

I would put that in the fairly complicated network setup. My development setup is a single nic for everything so very simple. I will have to take a deeper look at the performance profile you posted to see if I can find network related code.

> > 4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)
> 
> All on VNC.
> 

I use SPICE for everything, so I can definitely try switching them all to VNC to see if that makes it reproducible. And I can check for display related code in the performance profile.

> > 5. Does the problem go away if you filter your VMs down to lets say one. In
> > the search bar just search for a particular VM name.
> 
> I lightly improves but this is not great at all.
> 

I didn't think it would, but it was worth a try since it is an easy try.

> 
> Other questions or comments :
> - I'm using oVirt for more than 4 years, and during this time, I created
> more DCs, and added more hosts and VMs. I see all these DCs grow and slow
> and sloooow time after time. Is there something we could check in the
> database that could be purged, that could be responsible for this behaviour?

Most likely your biggest table is going to be the event table and that table gets queried constantly.

> - We didn't change our DNS setup since long, but we saw oVirt slow down.

The DNS test is for one particular issue that we know causes horrible slowdowns in the UI. We have found half a dozen misconfigured DNS setups from people complaining about a horribly slow UI (minutes to log into the UI, minutes to do anything).

> - That may sound obvious but anyway : I see it slower during the day, in
> production time, when users are using, servers are serving. At rest, this is
> still slow, but less slow (not that much, though)

Here is something else to try that is fairly simple to do. On the top right of your UI window there is a little refresh icon with a dropdown next to it. This sets the automatic refresh rate, which by default is 5 seconds. Try setting it to 60 seconds and see if that improves the situation. The application only refreshes the active tab, but it also refreshes the events at the bottom on the same interval. If there are a lot of events happening at once that will produce an update each time. Fewer updates should generate fewer slowdowns.

> - Would it be valuable if I spend time scanning network ports to see if
> network queries are sent or waited on the engine? (I truly want to help, see)

If you could monitor the browser network traffic from the browser developer tools that might provide some useful information. ctrl-shift-i should open the developer tools, click the network tab. Then filter on XHR requests (those are the javascript queries to the backend). In particular I am interested in calls to GenericApiGWTService. The tools should provide you with how long it takes to get a response. These are usually broken down into 4 sections:

- Blocking
- Sending <-- this is the time it takes for the request to get from the browser to the engine.
- Waiting <-- this is the time engine uses to process the request and return a response.
- Receiving <-- this is the time it takes to receive the actual bytes from the response.

Normally waiting is the longest part of the request, but let me know if either sending or receiving is long.
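
If the browser lets you export the network tab as a HAR file, the four phases can be averaged over all matching requests instead of being read off one by one. This is only a rough sketch in Python: it assumes the standard HAR layout (`log.entries[].request.url` and `log.entries[].timings`), and the function name `summarize_har` and the default filter string are illustrative choices, not part of oVirt.

```python
import json
import sys


def summarize_har(har, url_fragment="GenericApiGWTService"):
    """Average the HAR timing phases (in ms) over entries whose URL matches.

    Returns an empty dict when no entry matches url_fragment.
    """
    phases = ("blocked", "send", "wait", "receive")
    totals = dict.fromkeys(phases, 0.0)
    count = 0
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue
        count += 1
        for phase in phases:
            value = entry["timings"].get(phase, -1)
            # HAR uses -1 for "not applicable"; treat that as zero time.
            if value is not None and value > 0:
                totals[phase] += value
    if count == 0:
        return {}
    return {phase: totals[phase] / count for phase in phases}


if __name__ == "__main__":
    # Usage: python summarize_har.py exported.har  (file name is hypothetical)
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(summarize_har(json.load(handle)))
```

A large average for "wait" next to small "send" and "receive" values would suggest the time is spent on the engine side rather than on the network.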


> - I honestly don't think I'm impatient. I consider I could make my work in a
> poor 1970 green and black tty behind a 9600b line. I think there is really
> something we can improve.

If you are saying it is taking seconds to do anything, that is a problem, and we need to figure out what the problem is. Unfortunately I can't reproduce it here, so we end up with "let's poke this bit and see if that changes anything" type of testing, which is slow and painful.

Comment 31 Nicolas Ecarnot 2016-01-26 13:27:57 UTC

(In reply to Alexander Wels from comment #30)

> > > 3. Any complicated network setup? like hosts with 8 nics bonded with a bunch
> > > of vlans?
> > 
> > I don't think so. Well... see :
> > 
> > - Every host has 2 bonded NICs accessing the management network and 4 other
> > specific tagged VLAN networks.
> > - Every host also has 4 additional NICs bonded (mode 1) to reach the iSCSI
> > network (non VLANed, physical and dedicated)
> > 
> 
> I would put that in the fairly complicated network setup. My development
> setup is a single nic for everything so very simple. I will have to take a
> deeper look at the performance profile you posted to see if I can find
> network related code.

OK

> > > 4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)
> > 
> > All on VNC.
> > 
> 
> I use all spice, so that is definitely something I can try to see if I
> switch them all to VNC if that makes it re-producible. And I can check for
> display related code in the performance profile.

I don't get why the console type would make any difference?
I am witnessing this slowness all the time, not only when using a console. In fact, we are using consoles quite rarely.

> > - I'm using oVirt for more than 4 years, and during this time, I created
> > more DCs, and added more hosts and VMs. I see all these DCs grow and slow
> > and sloooow time after time. Is there something we could check in the
> > database that could be purged, that could be responsible for this behaviour?
> 
> Most likely your biggest table is going to be the event table and that table
> gets queried constantly.

Is it worth wiping this table?

> Here is something else to try that is fairly simple to do. On the top right
> of your UI window there is a little refresh icon with a dropdown next to it.
> This sets the automatic refresh rate, which by default is 5 seconds. Try
> setting it to 60 seconds and see if that improves the situation. The
> application only refreshes the active tab, but it also refreshes the events
> at the bottom on the same interval. If there are a lot of events happening
> at once that will produce an update each time. Fewer updates should generate
> fewer slowdowns.

I am already aware of this trick. I tried it, but with no luck.

But this leads me to another question, see at the end of this msg.

> > - Would it be valuable if I spend time scanning network ports to see if
> > network queries are sent or waited on the engine? (I truly want to help, see)
> 
> If you could monitor the browser network traffic from the browser developer
> tools that might provide some useful information. ctrl-shift-i should open
> the developer tools, click the network tab. Then filter on XHR requests
> (those are the javascript queries to the backend). In particular I am
> interested in calls to GenericApiGWTService. The tools should provide you
> with how long it takes to get a response. These are usually broken down into
> 4 sections:
> 
> - Blocking
> - Sending <-- this is the time it takes for the request to get from the
> browser to the engine.
> - Waiting <-- this is the time engine uses to process the request and return
> a response.
> - Receiving <-- this is the time it takes to receive the actual bytes from
> the response.
> 
> Normally waiting is the longest part of the request, but let me know if
> either sending or receiving is long.

I followed your hint, and I'm seeing a very quick sending and response trip, but a very long waiting part.



Another point that may be worth noting :
- some of our VMs are cluster peers, using STONITH and fencing them by using fence_rhevm. So they are frequently (every minute) logging in, status checking, and closing session (I hope), using the admin@internal account.
This is filling the event log quite heavily.
Could that explain some slowness?
How could I empty the event table?
Should I?

Comment 32 Alexander Wels 2016-01-27 14:21:50 UTC

> 
> I don't get why console type is making any difference?
> I am witnessing those slowness all the time, not only when using a console.
> In fact, we are using consoles quite rarely.
> 

It could potentially make a difference if there is an issue with the objects that are being returned by the backend. I know they recently got changed to be bigger and more versatile. Also when looking through the browser profiles you posted some of the really slow parts looked like they might be related to some display related event handlers.

Note the huge amount of potentially/maybe/could. Aka I am not sure, but something to look at.

> > > - I'm using oVirt for more than 4 years, and during this time, I created
> > > more DCs, and added more hosts and VMs. I see all these DCs grow and slow
> > > and sloooow time after time. Is there something we could check in the
> > > database that could be purged, that could be responsible for this behaviour?
> > 
> > Most likely your biggest table is going to be the event table and that table
> > gets queried constantly.
> 
> Is it worth to wipe this table?
> 

Potentially, see below.

> > Here is something else to try that is fairly simple to do. On the top right
> > of your UI window there is a little refresh icon with a dropdown next to it.
> > This sets the automatic refresh rate, which by default is 5 seconds. Try
> > setting it to 60 seconds and see if that improves the situation. The
> > application only refreshes the active tab, but it also refreshes the events
> > at the bottom on the same interval. If there are a lot of events happening
> > at once that will produce an update each time. Fewer updates should generate
> > fewer slowdowns.
> 
> I am already aware of this trick. I tried it, but with no luck.
> 
> But this leads me to another question, see at the end of this msg.
> 
> > > - Would it be valuable if I spend time scanning network ports to see if
> > > network queries are sent or waited on the engine? (I truly want to help, see)
> > 
> > If you could monitor the browser network traffic from the browser developer
> > tools that might provide some useful information. ctrl-shift-i should open
> > the developer tools, click the network tab. Then filter on XHR requests
> > (those are the javascript queries to the backend). In particular I am
> > interested in calls to GenericApiGWTService. The tools should provide you
> > with how long it takes to get a response. These are usually broken down into
> > 4 sections:
> > 
> > - Blocking
> > - Sending <-- this is the time it takes for the request to get from the
> > browser to the engine.
> > - Waiting <-- this is the time engine uses to process the request and return
> > a response.
> > - Receiving <-- this is the time it takes to receive the actual bytes from
> > the response.
> > 
> > Normally waiting is the longest part of the request, but let me know if
> > either sending or receiving is long.
> 
> I followed your hint, and I'm seeing a very quick sending and response trip,
> but a very long waiting part.
> 

So I think this is an important hint: it appears the engine is taking a long time to process the request. The network tab should also be telling you the size of the data being sent. There should be 2 numbers, the raw size and the compressed size. Considering the size of your environment I would guess the average size of a refresh should be in the 100k range. Can you tell me if it is larger than that?

> 
> 
> Another point that may be worth noting :
> - some of our VM are clusters peers, using STONITH and fencing them by using
> fence_rhevm. So they are frequently (every minute) logging in, status
> checking, and closing session (I hope), using the admin@internal account.
> This is filling the event log quite heavily.
> Could that explain some slowness?
> How could I empty the event table?
> Should I?

Now to answer this question. The table that contains the events is called 'audit_log', which does what its name says: it keeps an audit log of important events. You can run a simple SQL statement to clear the table, but that would cause you to lose all your audit log data. I would first check to see if the table is actually huge or not. Let me know if you need instructions on how to query the table.
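
For the size check, counting the rows is a reasonable first step. Below is a minimal sketch in Python, under the assumption that you can open a DB-API connection to the engine database (for example with psycopg2; the connection parameters are site specific and not shown). The helper name `audit_log_row_count` is made up for this example; only the table name 'audit_log' comes from the explanation above.

```python
def audit_log_row_count(connection):
    """Return the number of rows in audit_log using any DB-API connection."""
    cursor = connection.cursor()
    try:
        # Read-only query: it counts rows and changes nothing.
        cursor.execute("SELECT COUNT(*) FROM audit_log")
        return cursor.fetchone()[0]
    finally:
        cursor.close()
```

With psycopg2 installed this could be called as `audit_log_row_count(psycopg2.connect(...))`, with the connection arguments filled in for your setup. The count only tells you whether the table is large; it does not by itself show that the table is the cause of the slowness.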

Comment 33 Alexander Wels 2016-01-28 21:01:14 UTC

Some additional information. To rule out the event (audit_log) table I created a script that uses the REST api to quickly create and remove VMs. Each create and remove adds an entry to the log. I threaded this script and made it do the create/remove at different intervals, so each refresh I would be sure that my VM list and events were different.

After running this for an hour, I had tons of events in the log and the list of VMs was constantly changing. The script caused the load on the engine to be ~15 due to all the processing of creating/removing VMs. The browser CPU usage sometimes spiked to 100%, but was mostly below 25%.

So I can definitely say that the events have little to nothing to do with this particular problem. Time to try something else.

Comment 34 Nicolas Ecarnot 2016-01-29 07:16:12 UTC

(In reply to Alexander Wels from comment #33)
> Some additional information. To rule out the event (audit_log) table I
> created a script that uses the REST api to quickly create and remove VMs.
> Each create and remove adds an entry to the log. I threaded this script and
> made it do the create/remove at different intervals, so each refresh I would
> be sure that my VM list and events were different.
> 
> After running this for an hour, I had tons of events in the log and the list
> of VMs was constantly changing. The script caused the load on the engine to
> be ~15 due to all the processing of creating/removing VMs. But the browser
> CPU usage sometimes spiked to 100%, but was mostly below 25%.
> 
> So I can definitely say that the events have little to nothing to do with
> this particular problem. Time to try something else.

Thanks for that.
I'm on vacation now, so my spirit is answering there.
I cannot easily connect to run SQL queries, but according to what you wrote, there is no point in searching for a huge db table?

Another track : as you mentioned that my network setup was complex, that leads me to another thought : when answering, does the engine query (not SQL, but something else I don't know) every host to ask them things, or at least query one host for some info, and could that query be what is taking the time?

Would precise network sniffing help? (if I knew what to look for)

Comment 35 Nicolas Ecarnot 2016-02-08 07:03:15 UTC

Alex,

Here are things I could try, let me know your opinion :

- compare the time it takes to do an action via the web gui, and the same action via the ovirt-shell (that would help to check if we are waiting for the engine)
- debug any certificate concerns. I am not comfortable with certificates, but I'd like to know if I could completely re-create and replace any certificates, just to be sure this is not an issue
- try to install a graphical layer (Xorg) on the engine and run firefox on it, thus bypassing any network issue
- I'm worrying about what happened on the packages and the repos, because this is a DC that got upgraded and upgraded. Maybe installing a brand new engine would get rid of any old packages.

These are only ideas, what do you think about these?

Comment 36 Fred Rolland 2016-02-22 13:26:30 UTC

Created attachment 1129312 [details]
Firefox profile during hich CPU

I am encountering this issue as well.
See the attached Firefox profile, captured while the high CPU symptom was occurring.

Comment 37 Oved Ourfali 2016-03-15 08:06:20 UTC

Greg - please take a look at the capture and see if it gives valuable information.

Comment 38 Greg Sheremeta 2016-03-15 20:38:34 UTC

@Fred, do you encounter it regularly? If so, I'll ask you to try a few things for me. First, try applying this patch [ https://gerrit.ovirt.org/#/c/54503/ ] and see if the issue goes away. If it doesn't, next, apply this patch [ https://gerrit.ovirt.org/#/c/54310/ ].

@Oved, Fred's profile shows high CPU due to GC thrashing (37% of browser time spent doing GC). Same as what Mordechai and I have been seeing. Memory leaks for sure.

Comment 39 Greg Sheremeta 2016-03-15 20:39:26 UTC

@Nir, feel free to try the patches from comment 38 as well.

Comment 40 Oved Ourfali 2016-03-16 07:29:42 UTC


*** This bug has been marked as a duplicate of bug 1294678 ***

Comment 41 Fred Rolland 2016-03-29 08:08:46 UTC

Hi,
It does not happen regularly.
If I encounter it again, I will try to work out which steps I took.