Bug 1264809
Summary: | WebUI unusable - operations take many seconds using 100% cpu | ||
---|---|---|---|
Product: | [oVirt] ovirt-engine | Reporter: | Nir Soffer <nsoffer> |
Component: | Frontend.WebAdmin | Assignee: | Greg Sheremeta <gshereme> |
Status: | CLOSED DUPLICATE | QA Contact: | Pavel Novotny <pnovotny> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 3.6.0 | CC: | awels, bugs, frolland, michal.skrivanek, nicolas, oourfali, pnovotny, tnisan, ykaul |
Target Milestone: | ovirt-4.0.0-alpha | Keywords: | Reopened |
Target Release: | --- | Flags: | ecohen: ovirt-4.0.0? rule-engine: planning_ack? ecohen: devel_ack+ rule-engine: testing_ack? |
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2016-03-16 07:29:42 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | UX | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Created attachment 1075429 [details]
Chrome cpu profile during tab left pane item switching
Created attachment 1075430 [details]
Chrome cpu profile during tab switching
Created attachment 1075443 [details]
Chrome cpu profile during putting host to maintenance
Created attachment 1075446 [details]
Firefox engine profile during slow operations
Same issue reproduced on Firefox:
Build identifier: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0

Seems like either a ux issue or a memory leak/Chrome issue, Einav, your thoughts?

(In reply to Tal Nisan from comment #6)
> Seems like either a ux issue or a memory leak/Chrome issue, Einav, your
> thoughts?

will take a look. thanks.

this could be related to bug 1260499.

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for an oVirt release.

this is an automated message. oVirt 3.6.0 RC3 has been released and GA is targeted to next week, Nov 4th 2015.
Please review this bug and if not a blocker, please postpone to a later release.
All bugs not postponed on GA release will be automatically re-targeted to
- 3.6.1 if severity >= high
- 4.0 if severity < high

Pushing to 4.0. I am completely unable to reproduce.

(In reply to Alexander Wels from comment #11)
> Pushing to 4.0. I am completely unable to reproduce.

@Pavel, any thoughts or insights on your end? if none - I will close this BZ. Thanks.

I could not reproduce it after upgrading engine to
oVirt Engine Version: 3.6.0-0.0.master.20150915165908.git51dfb6e.fc22

Since the upgrade I created several big pools (100 vms each) and did not experience this issue. This was probably related to the version mentioned in the bug, or to the version of fedora where web browser was running.

I would close this as it seems to be fixed.

Closing this as this does not seem to occur anymore.
It seems to occur again - relevant reports from user list:
- http://lists.ovirt.org/pipermail/users/2016-January/037144.html (likely)
- http://lists.ovirt.org/pipermail/users/2016-January/037383.html (maybe)

Reporting the same on 4 independent DC :
- 2 DC in 3.5.6
- 2 DC in 3.6.1

Witnessing exactly the same :
- any action is taking seconds to do (click, selection, console launching, editing, switching tab, anything)
- CPU is 100%
- seen on chrome 47, firefox 43, on Linux XUbuntu 15.10 64b, Windows 2008 srv 64b, and other platforms.

Please don't close this bug. It is happening for real...

(In reply to Nicolas Ecarnot from comment #16)

Nicolas,

We need more info:
- List of browser extensions installed in chrome and firefox
- Can you reproduce when extensions are disabled?
- Can you provide cpu profiles in chrome and firefox?

Creating a profile in Chrome:
1. Press Ctrl + Shift + I
2. Open Profiles tab
3. Select "Collect JavaScript CPU Profile"
4. Click "Start" button
5. Do some operations that reproduce this
6. Click "Stop" button
7. Click the "Save" link near the profile name in the profiles list

Creating a profile in firefox:
1. Press Shift + F5
2. Click "Start Recording Performance"
3. Reproduce...
4. Click "Stop Recording Performance"
5. Click "Save" link in the recordings list

(In reply to Nir Soffer from comment #17)
> (In reply to Nicolas Ecarnot from comment #16)
> Nicolas,
> We need more info:
> - List of browser extensions installed in chrome and firefox

Chrome :
- none

Firefox :
- Adblock Edge
- BehindTheOverlay
- Disconnect
- Flash control
- Tab Tree
- Video DownloadHelper
- Ubuntu Modifications

> - Can you reproduce when extensions are disabled?

Firefox, with disabled extensions :
- exact same user experience and feeling. As sluggish with or without safe-mode.

> - Can you provide cpu profiles in chrome and firefox?

See below for the attachments.

With Chromium, some actions are sometimes less slow, but this is random and mostly as slow as with Firefox.

For the sake of a sound comparison, I created the profiles of the next 3 attachments on one and the same DC in oVirt 3.5.6. If asked, I can reproduce the same 3 profiles with a 3.6.1 DC.

Thank you.

Created attachment 1116894 [details]
20160121_ovirt-3.5.6_chromium-47_ubuntu-15.10_profile
Created attachment 1116896 [details]
20160121_ovirt-3.5.6_firefox-43_disabled-extensions_safe-mode_ubuntu-15.10_profile
Created attachment 1116897 [details]
20160121_ovirt-3.5.6_firefox-43_ubuntu-15.10_profile
@Nicolas,

Can you tell me the exact version of oVirt you are running? That way I can look up the symbol maps to look up the methods which are taking a lot of time.

Nevermind, I see you mentioned it's 3.5.6

I have looked over the profiles provided by Nicolas, but nothing jumps out at me as saying hey this Javascript code is doing a lot of processing. In fact it appears to me that most of the time nothing is going on and we are waiting for the server to provide some information (data/code fragments/etc).

@Nicolas,
Can you check the load on the engine machine and tell me how loaded it is?

Also if possible can you do an experiment for me. I am assuming you have a proper DNS configured, but on the engine machine can you add itself to /etc/hosts and see if that improves the situation? We have had instances where misconfigured DNS or /etc/hosts would cause the engine not to be able to resolve itself which would cause a sluggish UI (minutes to log in, and minutes to basically do anything in the UI). I just want to eliminate DNS issues (overloaded/misconfigured/etc).

(In reply to Alexander Wels from comment #24)
> I have looked over the profiles provided by Nicolas, but nothing jumps out
> at me as saying hey this Javascript code is doing a lot of processing. In
> fact it appears to me that most of the time nothing is going on and we are
> waiting for the server to provide some information (data/code fragments/etc).
>
> @Nicolas,
> Can you check the load on the engine machine and tell me how loaded it is?

Commonly, I'm seeing a very decent load on the engine, never hurting my retina.
Precisely, never above 1 except some seconds after reboot, and most of the times under 0.5 .

> Also if possible can you do an experiment for me. I am assuming you have a
> proper DNS configured, but on the engine machine can you add itself to
> /etc/hosts and see if that improves the situation? We have had instances
> where misconfigured DNS or /etc/hosts would cause the engine not to be able
> to resolve itself which would cause a sluggish UI (minutes to log in, and
> minutes to basically do anything in the UI). I just want to eliminate DNS
> issues (overloaded/misconfigured/etc).

I just added it in /etc/hosts, but do you want me to add the host name for its public ip address, or for 127.0.0.1?
Now, I just added it for the public ip, rebooted, and did not see improvement.

Okay just wanted to eliminate the engine part of the equation. Loads at 0.5 are great, so definitely no problem there. And yes I wanted the public ip in /etc/hosts on the ENGINE. You don't even need to restart the engine for that to take effect. So if there is no improvement then that is definitely not the problem either.

(In reply to Alexander Wels from comment #24)
> I have looked over the profiles provided by Nicolas, but nothing jumps out
> at me as saying hey this Javascript code is doing a lot of processing. In
> fact it appears to me that most of the time nothing is going on and we are
> waiting for the server to provide some information (data/code fragments/etc).
>
> @Nicolas,
> Can you check the load on the engine machine and tell me how loaded it is?
>
> Also if possible can you do an experiment for me. I am assuming you have a
> proper DNS configured, but on the engine machine can you add itself to
> /etc/hosts and see if that improves the situation? We have had instances
> where misconfigured DNS or /etc/hosts would cause the engine not to be able
> to resolve itself which would cause a sluggish UI (minutes to log in, and
> minutes to basically do anything in the UI). I just want to eliminate DNS
> issues (overloaded/misconfigured/etc).

Would you advise me to install a local caching-only dns, like nscd or bind, so that would exonerate the DNS issues?

No, doing the /etc/hosts test tells me it's not the DNS, but something else. The question becomes what else.
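
For anyone who wants to repeat the name-resolution check with numbers rather than by feel, a minimal Python sketch follows. It is not part of oVirt. The helper name `time_resolution` is made up for illustration, and it only measures how long the system resolver (which consults /etc/hosts and DNS) takes to resolve a given name on the machine where it runs.

```python
import socket
import time


def time_resolution(hostname, resolver=socket.getaddrinfo, attempts=3):
    """Resolve hostname several times; return [(elapsed_seconds, outcome), ...].

    outcome is a sorted list of addresses, or an error string if the lookup failed.
    The resolver argument exists only so the function can be tested without a network.
    """
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            infos = resolver(hostname, None)
            # Each getaddrinfo entry is a 5-tuple; the address is sockaddr[0].
            outcome = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            outcome = "lookup failed: {}".format(exc)
        results.append((time.perf_counter() - start, outcome))
    return results
```

Run on the engine machine, something like `time_resolution("engine.example.com")` (a placeholder name, substitute the engine's own FQDN) should return almost instantly. Lookups that take seconds, or that fail, would point back at the DNS or /etc/hosts setup.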
So I am going to ask a bunch of probably stupid questions, I am just trying to narrow down possible causes.

1. Can you tell me approximately how many hosts you have in your data center.
2. Also how many hosts?
3. Any complicated network setup? like hosts with 8 nics bonded with a bunch of vlans?
4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)
5. Does the problem go away if you filter your VMs down to let's say one. In the search bar just search for a particular VM name.

Thanks,
Alexander

(In reply to Alexander Wels from comment #28)
> No, doing the /etc/hosts test tells me it's not the DNS, but something else.
> The question becomes what else. So I am going to ask a bunch of probably
> stupid questions, I am just trying to narrow down possible causes.

Never, no stupid questions, we're all aiming the same goal :)

> 1. Can you tell me approximately how many hosts you have in your data center.
> 2. Also how many hosts?

In this one :
- 48 VMs
- 13 hosts

> 3. Any complicated network setup? like hosts with 8 nics bonded with a bunch
> of vlans?

I don't think so. Well... see :
- Every host has 2 bonded NICs accessing the management network and 4 other specific tagged VLAN networks.
- Every host also has 4 additional NICs bonded (mode 1) to reach the iSCSI network (non VLANed, physical and dedicated)

> 4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)

All on VNC.

> 5. Does the problem go away if you filter your VMs down to let's say one. In
> the search bar just search for a particular VM name.

It slightly improves but this is not great at all.

Other questions or comments :
- I'm using oVirt for more than 4 years, and during this time, I created more DCs, and added more hosts and VMs. I see all these DCs grow and slow and sloooow time after time. Is there something we could check in the database that could be purged, that could be responsible for this behaviour?
- We didn't change our DNS setup since long, but we saw oVirt slow down.
- That may sound obvious but anyway : I see it slower during the day, in production time, when users are using, servers are serving. At rest, this is still slow, but less slow (not that much, though)
- Would it be valuable if I spend time scanning network ports to see if network queries are sent or waited on the engine? (I truly want to help, see)
- I honestly don't think I'm impatient. I consider I could make my work in a poor 1970 green and black tty behind a 9600b line. I think there is really something we can improve.

(In reply to Nicolas Ecarnot from comment #29)
> (In reply to Alexander Wels from comment #28)
> > No, doing the /etc/hosts test tells me it's not the DNS, but something else.
> > The question becomes what else. So I am going to ask a bunch of probably
> > stupid questions, I am just trying to narrow down possible causes.
>
> Never, no stupid questions, we're all aiming the same goal :)
>
> > 1. Can you tell me approximately how many hosts you have in your data center.
> > 2. Also how many hosts?
>
> In this one :
> - 48 VMs
> - 13 hosts

That is a fairly small setup. Nothing that could explain this particular problem, I know of people with 100s of hosts and 1000s of VMs.

> > 3. Any complicated network setup? like hosts with 8 nics bonded with a bunch
> > of vlans?
>
> I don't think so. Well... see :
>
> - Every host has 2 bonded NICs accessing the management network and 4 other
> specific tagged VLAN networks.
> - Every host also has 4 additional NICs bonded (mode 1) to reach the iSCSI
> network (non VLANed, physical and dedicated)

I would put that in the fairly complicated network setup. My development setup is a single nic for everything so very simple. I will have to take a deeper look at the performance profile you posted to see if I can find network related code.

> > 4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)
>
> All on VNC.

I use all spice, so that is definitely something I can try to see if I switch them all to VNC if that makes it re-producible. And I can check for display related code in the performance profile.

> > 5. Does the problem go away if you filter your VMs down to let's say one. In
> > the search bar just search for a particular VM name.
>
> It slightly improves but this is not great at all.

I didn't think it would, but it was worth a try since it is an easy try.

> Other questions or comments :
> - I'm using oVirt for more than 4 years, and during this time, I created
> more DCs, and added more hosts and VMs. I see all these DCs grow and slow
> and sloooow time after time. Is there something we could check in the
> database that could be purged, that could be responsible for this behaviour?

Most likely your biggest table is going to be the event table and that table gets queried constantly.

> - We didn't change our DNS setup since long, but we saw oVirt slow down.

The DNS test is for one particular issue that we know causes horrible slow down in the UI. We have found half a dozen mis-configured DNS setups from people complaining about horrible slow UI (minutes to log into the UI, minutes to do anything).

> - That may sound obvious but anyway : I see it slower during the day, in
> production time, when users are using, servers are serving. At rest, this is
> still slow, but less slow (not that much, though)

Here is something else to try that is fairly simple to do. On the top right of your UI window there is a little refresh icon with a dropdown next to it. This sets the automatic refresh rate, which by default is 5 seconds. Try setting it to 60 seconds and see if that improves the situation. The application only refreshes the active tab, but it also refreshes the events at the bottom on the same interval. If there are a lot of events happening at once that will produce an update each time. Fewer updates should generate fewer slowdowns.
> - Would it be valuable if I spend time scanning network ports to see if
> network queries are sent or waited on the engine? (I truly want to help, see)

If you could monitor the browser network traffic from the browser developer tools that might provide some useful information. ctrl-shift-i should open the developer tools, click the network tab. Then filter on XHR requests (those are the javascript queries to the backend). In particular I am interested in calls to GenericApiGWTService. The tools should provide you with how long it takes to get a response. These are usually broken down into 4 sections:

- Blocking
- Sending <-- this is the time it takes for the request to get from the browser to the engine.
- Waiting <-- this is the time engine uses to process the request and return a response.
- Receiving <-- this is the time it takes to receive the actual bytes from the response.

Normally waiting is the longest part of the request, but let me know if either sending or receiving is long.

> - I honestly don't think I'm impatient. I consider I could make my work in a
> poor 1970 green and black tty behind a 9600b line. I think there is really
> something we can improve.

If you are saying it is taking seconds to do anything, that is a problem, and we need to figure out what the problem is. Unfortunately I can't reproduce it here so we end up with, let's poke this bit and see if that changes anything type of testing, which is slow and painful.

(In reply to Alexander Wels from comment #30)
> > > 3. Any complicated network setup? like hosts with 8 nics bonded with a bunch
> > > of vlans?
> >
> > I don't think so. Well... see :
> >
> > - Every host has 2 bonded NICs accessing the management network and 4 other
> > specific tagged VLAN networks.
> > - Every host also has 4 additional NICs bonded (mode 1) to reach the iSCSI
> > network (non VLANed, physical and dedicated)
>
> I would put that in the fairly complicated network setup. My development
> setup is a single nic for everything so very simple. I will have to take a
> deeper look at the performance profile you posted to see if I can find
> network related code.

OK

> > > 4. What console type are you using to connect to your VMs (Spice/VNC/Rdp?)
> >
> > All on VNC.
>
> I use all spice, so that is definitely something I can try to see if I
> switch them all to VNC if that makes it re-producible. And I can check for
> display related code in the performance profile.

I don't get why console type is making any difference?
I am witnessing those slowness all the time, not only when using a console. In fact, we are using consoles quite rarely.

> > - I'm using oVirt for more than 4 years, and during this time, I created
> > more DCs, and added more hosts and VMs. I see all these DCs grow and slow
> > and sloooow time after time. Is there something we could check in the
> > database that could be purged, that could be responsible for this behaviour?
>
> Most likely your biggest table is going to be the event table and that table
> gets queried constantly.

Is it worth to wipe this table?

> Here is something else to try that is fairly simple to do. On the top right
> of your UI window there is a little refresh icon with a dropdown next to it.
> This sets the automatic refresh rate, which by default is 5 seconds. Try
> setting it to 60 seconds and see if that improves the situation. The
> application only refreshes the active tab, but it also refreshes the events
> at the bottom on the same interval. If there are a lot of events happening
> at once that will produce an update each time. Fewer updates should generate
> fewer slowdowns.

I am already aware of this trick. I tried it, but with no luck.

But this leads me to another question, see at the end of this msg.

> > - Would it be valuable if I spend time scanning network ports to see if
> > network queries are sent or waited on the engine? (I truly want to help, see)
>
> If you could monitor the browser network traffic from the browser developer
> tools that might provide some useful information. ctrl-shift-i should open
> the developer tools, click the network tab. Then filter on XHR requests
> (those are the javascript queries to the backend). In particular I am
> interested in calls to GenericApiGWTService. The tools should provide you
> with how long it takes to get a response. These are usually broken down into
> 4 sections:
>
> - Blocking
> - Sending <-- this is the time it takes for the request to get from the
> browser to the engine.
> - Waiting <-- this is the time engine uses to process the request and return
> a response.
> - Receiving <-- this is the time it takes to receive the actual bytes from
> the response.
>
> Normally waiting is the longest part of the request, but let me know if
> either sending or receiving is long.

I followed your hint, and I'm seeing a very quick sending and response trip, but a very long waiting part.

Another point that may be worth noting :
- some of our VM are clusters peers, using STONITH and fencing them by using fence_rhevm. So they are frequently (every minute) logging in, status checking, and closing session (I hope), using the admin@internal account. This is filling the event log quite heavily.

Could that explain some slowness?
How could I empty the event table?
Should I?

> I don't get why console type is making any difference?
> I am witnessing those slowness all the time, not only when using a console.
> In fact, we are using consoles quite rarely.

It could potentially make a difference if there is an issue with the objects that are being returned by the backend. I know they recently got changed to be bigger and more versatile. Also when looking through the browser profiles you posted some of the really slow parts looked like they might be related to some display related event handlers. Note the huge amount of potentially/maybe/could.
Aka I am not sure, but something to look at.

> > > - I'm using oVirt for more than 4 years, and during this time, I created
> > > more DCs, and added more hosts and VMs. I see all these DCs grow and slow
> > > and sloooow time after time. Is there something we could check in the
> > > database that could be purged, that could be responsible for this behaviour?
> >
> > Most likely your biggest table is going to be the event table and that table
> > gets queried constantly.
>
> Is it worth to wipe this table?

Potentially, see below.

> > Here is something else to try that is fairly simple to do. On the top right
> > of your UI window there is a little refresh icon with a dropdown next to it.
> > This sets the automatic refresh rate, which by default is 5 seconds. Try
> > setting it to 60 seconds and see if that improves the situation. The
> > application only refreshes the active tab, but it also refreshes the events
> > at the bottom on the same interval. If there are a lot of events happening
> > at once that will produce an update each time. Fewer updates should generate
> > fewer slowdowns.
>
> I am already aware of this trick. I tried it, but with no luck.
>
> But this leads me to another question, see at the end of this msg.
>
> > > - Would it be valuable if I spend time scanning network ports to see if
> > > network queries are sent or waited on the engine? (I truly want to help, see)
> >
> > If you could monitor the browser network traffic from the browser developer
> > tools that might provide some useful information. ctrl-shift-i should open
> > the developer tools, click the network tab. Then filter on XHR requests
> > (those are the javascript queries to the backend). In particular I am
> > interested in calls to GenericApiGWTService. The tools should provide you
> > with how long it takes to get a response. These are usually broken down into
> > 4 sections:
> >
> > - Blocking
> > - Sending <-- this is the time it takes for the request to get from the
> > browser to the engine.
> > - Waiting <-- this is the time engine uses to process the request and return
> > a response.
> > - Receiving <-- this is the time it takes to receive the actual bytes from
> > the response.
> >
> > Normally waiting is the longest part of the request, but let me know if
> > either sending or receiving is long.
>
> I followed your hint, and I'm seeing a very quick sending and response trip,
> but a very long waiting part.

So I think this is an important hint, it appears the engine is taking a long time to process the request. The network tab should also be telling you the size of the data being sent. There should be 2 numbers, the raw size and the compressed size. Considering the size of your environment I would guess the average size of a refresh should be in the 100k range, can you tell me if it is larger than that.

> Another point that may be worth noting :
> - some of our VM are clusters peers, using STONITH and fencing them by using
> fence_rhevm. So they are frequently (every minute) logging in, status
> checking, and closing session (I hope), using the admin@internal account.
> This is filling the event log quite heavily.
> Could that explain some slowness?
> How could I empty the event table?
> Should I?

Now to answer this question. The table that contains the events is called 'audit_log'. Which basically does what it is named, it keeps an audit log of important events. You can run a simple sql statement to clear the table, but that would cause you to lose all your audit log data. I would first check to see if the table is actually huge or not. Let me know if you need instructions on how to query the table.

Some additional information. To rule out the event (audit_log) table I created a script that uses the REST api to quickly create and remove VMs.
Each create and remove adds an entry to the log. I threaded this script and made it do the create/remove at different intervals, so each refresh I would be sure that my VM list and events were different.

After running this for an hour, I had tons of events in the log and the list of VMs was constantly changing. The script caused the load on the engine to be ~15 due to all the processing of creating/removing VMs. But the browser CPU usage sometimes spiked to 100%, but was mostly below 25%.

So I can definitely say that the events have little to nothing to do with this particular problem. Time to try something else.

(In reply to Alexander Wels from comment #33)
> Some additional information. To rule out the event (audit_log) table I
> created a script that uses the REST api to quickly create and remove VMs.
> Each create and remove adds an entry to the log. I threaded this script and
> made it do the create/remove at different intervals, so each refresh I would
> be sure that my VM list and events were different.
>
> After running this for an hour, I had tons of events in the log and the list
> of VMs was constantly changing. The script caused the load on the engine to
> be ~15 due to all the processing of creating/removing VMs. But the browser
> CPU usage sometimes spiked to 100%, but was mostly below 25%.
>
> So I can definitely say that the events have little to nothing to do with
> this particular problem. Time to try something else.

Thanks for that.
I'm on vacation now, so my spirit is answering there.
I can not easily connect to run sql queries, but according to what you wrote, there's no point searching for huge db table?

Another track : as you mentioned my network setup was complex, that leads me to another thought : when answering, does the engine queries (not sql, but something else I don't know) every host asking them thing, or at least, does it query one host asking it some info, and that would be this query that is taking time? Precise network sniffing may help? (if I knew what to look at)

Alex,

Here are things I could try, let me know your opinion :
- compare the time it takes to do an action via the web gui, and the same action via the ovirt-shell (that would help to check if we are waiting for the engine)
- debug any certificate concerns. I am not comfortable with certificates, but I'd like to know if I could completely re-create and replace any certificates, just to be sure this is not an issue
- try to install a graphical layer (Xorg) on the engine and run firefox on it, thus bypassing any network issue
- I'm worrying about what happened on the packages and the repos, because this is a DC that got upgraded and upgraded. Maybe installing a brand new engine would get rid of any old packages.

These are only ideas, what do you think about these?

Created attachment 1129312 [details]
Firefox profile during high CPU
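
The Blocking / Sending / Waiting / Receiving breakdown requested earlier in this thread can also be read out of a HAR export of the browser's network tab instead of clicking through each request. The following is only a sketch under stated assumptions: the capture was saved as a standard HAR file, the interesting requests contain `GenericApiGWTService` in their URL as mentioned above, and the helper names `summarize_har` and `load_har` are invented for this example. It averages the HAR timing fields `blocked`, `send`, `wait` and `receive` (milliseconds) over the matching requests.

```python
import json

# HAR timing fields that correspond to Blocking, Sending, Waiting, Receiving.
PHASES = ("blocked", "send", "wait", "receive")


def summarize_har(har, url_fragment="GenericApiGWTService"):
    """Average each timing phase (ms) over requests whose URL contains url_fragment."""
    totals = dict.fromkeys(PHASES, 0.0)
    count = 0
    for entry in har["log"]["entries"]:
        if url_fragment not in entry["request"]["url"]:
            continue
        count += 1
        timings = entry.get("timings", {})
        for phase in PHASES:
            value = timings.get(phase, -1)
            # HAR uses -1 for "does not apply"; treat that (and null) as zero.
            if value is not None and value > 0:
                totals[phase] += value
    if count == 0:
        return {"requests": 0}
    summary = {phase: totals[phase] / count for phase in PHASES}
    summary["requests"] = count
    return summary


def load_har(path):
    """Read a HAR file from disk; utf-8-sig tolerates a leading byte-order mark."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)
```

With a capture saved as, say, `webadmin.har` (a placeholder file name), `summarize_har(load_har("webadmin.har"))` returns the per-phase averages. A `wait` value that dwarfs `send` and `receive` would match the "very long waiting part" reported above, i.e. time spent on the engine side rather than on the network.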
I encounter this issue also.
See attached a Firefox profile captured during high CPU symptom.
Greg - please take a look at the capture and see if it gives valuable information.

@Fred, do you encounter it regularly? If so, I'll ask you to try a few things for me.

First, try applying this patch [ https://gerrit.ovirt.org/#/c/54503/ ] and see if the issue goes away.

If it doesn't, next, apply this patch [ https://gerrit.ovirt.org/#/c/54310/ ].

@Oved, Fred's profile shows high CPU due to GC thrashing (37% of browser time spent doing GC). Same as what Mordechai and I have been seeing. Memory leaks for sure.

@Nir, feel free to try the patches from comment 38 as well.

*** This bug has been marked as a duplicate of bug 1294678 ***

Hi,
It does not happen regularly. If I encounter it again, I will try to understand the steps I did. |
Created attachment 1075427 [details]
Chrome cpu profile during slow operation

Description of problem:

Engine is horribly slow, almost unusable. Switching tabs, selecting items in lists (host, dc) can take many seconds. Looking in system monitor, chrome takes 100% cpu (of 400%) when engine stalls.

Trying to select and copy text (e.g. version from the about dialog) can take 30 seconds, chrome uses 100% cpu (of 400%) during the wait.

Version-Release number of selected component (if applicable):
oVirt Engine Version: 3.6.0-0.0.master.20150901142224.git8df944a.fc22

How reproducible:
Always

Steps to Reproduce:
Issue seems to start after creating a pool with 50 vms, but seems to continue after I deleted the pool.

Browser: Chrome Version 45.0.2454.85 (64-bit)
OS: Fedora 22
Hardware: Lenovo T430s, 8G RAM, 60% used
No other process running, only one tab with engine.

Attached cpu profile - created like this:
1. Start with Virtual Machines tab
2. Start profiling
3. Click on Host tab
4. Wait couple of seconds
5. Host tab appears
6. Stop profiling