Bug 1014134
| Field | Value |
|---|---|
| Summary | PRD35 - [RFE][scale] - prevent OutOfMemoryError after starting the dwh service. |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | Lee Yarwood <lyarwood> |
| Component | ovirt-engine-dwh |
| Assignee | Shirly Radco <sradco> |
| Status | CLOSED ERRATA |
| QA Contact | movciari |
| Severity | high |
| Docs Contact | |
| Priority | high |
| Version | 3.2.0 |
| CC | aberezin, bazulay, iheim, jentrena, juan.hernandez, juwu, lnovich, lyarwood, nobody, pablo.iranzo, pep, pstehlik, rbalakri, Rhev-m-bugs, sradco, ybronhei, yeylon, ylavi |
| Target Milestone | --- |
| Keywords | FutureFeature |
| Target Release | 3.5.0 |
| Hardware | All |
| OS | Linux |
| Whiteboard | infra |
| Fixed In Version | oVirt 3.5 Alpha 1 |
| Doc Type | Enhancement |
| Doc Text | Previously, the data warehouse service became unresponsive and hit an OutOfMemoryError on service start when the hourly aggregation tried to aggregate around 1.5 million records. With this update, the service aggregates one hour of data at a time before moving on to the next hour, so data aggregation now scales. |
| Story Points | --- |
| Clone Of | |
| | 1091686 (view as bug list) |
| Environment | |
| Last Closed | 2015-02-11 18:14:14 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | Infra |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1091686, 1142923, 1156165 |
Description (Lee Yarwood, 2013-10-01 13:02:40 UTC)
Yaniv Dary (comment 2):

Please elaborate on the RHEV-M environment: how many hosts, VMs, clusters, and so on?

I see two issues in the logs. One, resolved in 3.2, causes collection to fail for a VM with more than 16 IPs; the other is the heap size, and the solution there is to increase the default. A memory leak is unlikely.

Yaniv

Lee Yarwood (comment 3):

(In reply to Yaniv Dary from comment #2)
> Please elaborate on the rhevm environment. how many hosts\vms\clusters and so on?

Already provided in the description; let me know if you need anything else. Quoting the actual text here:

(In reply to Lee Yarwood from comment #0)
> Description of problem:
>
> The RHEV-M host itself is a Xen hosted guest with 16GB RAM, 8 sockets 1 core each, 1x20GB RAW disk hosted within a mirrored LV. We unfortunately do not have SAR data to highlight the perf hit while running the service.
>
> The env it controls consists of ~20 DCs, ~20 clusters, ~20 SDs, ~128 hosts and ~800 vms. I use ~ as this is a rapidly growing env expected to double in size within 6 months.

Yaniv Dary:

(In reply to Lee Yarwood from comment #3)
> Already provided in the description, let me know if you need anything else.

OK, I think the best course of action is to increase the heap size and see if that resolves the issue. The large amount of data is probably the cause. Juan, do you agree?

Yaniv

Lee Yarwood (comment 5):

(In reply to Yaniv Dary from comment #2)
> I see two issues in the logs. One that was resolved in 3.2 that causes failure of collection due to vm with more than 16 IPs [..]

By the way, could you confirm the BZ? The customer is running the latest 3.2.z version after all.

Yaniv Dary (comment 6):

(In reply to Lee Yarwood from comment #5)
> Btw, could you confirm the BZ? The customer is running the latest 3.2.z version after all.

Then it must be an old log message, and it should work with a larger heap.

(In reply to Yaniv Dary from comment #6)
> Then it must be a old log message and it should work with a larger heap.

Also, because this is a VM, you need to check the storage and host stats: the cause of the slowness can be the large amount of data being written to disk while other VMs are also busy.

Juan Hernández (comment 8):

It is my understanding that the very nature of the DWH application is to load huge amounts of data from the database into memory and then crunch it, so I agree with Yaniv that the only way forward is to increase the size of the heap.
I don't think additional resources are needed: 16 GiB and 8 CPUs should be enough; if they aren't, then in my opinion there is something that we need to fix. In this regard, having a heap dump of the DWH process and a copy of the database could help. Not sure if we want to invest time in this.

I would also suggest restricting the number of CPUs available to the DWH service, so that if it goes out of memory again the garbage collector will not take over all the CPUs of the machine. This can be achieved using cgroups, for example:

    # yum -y install libcgroup
    # chkconfig cgconfig on
    # cat >> /etc/cgconfig.conf <<.
    group dwh {
        cpuset {
            cpuset.cpus = "0,1";  # Use only CPUs 0 and 1
            cpuset.mems = "0";    # Assuming that there is only one memory node
        }
    }
    .
    # service cgconfig restart

Then edit the history_service.sh script and replace the "exec" that runs the Java virtual machine with "cgexec -g cpuset:dwh":

    cgexec -g cpuset:dwh ${JAVA_HOME}/bin/java ...

Note that this is just a hack to avoid the DWH service overloading the machine, not a solution to the problem. The solution is to increase the heap size.

Lee Yarwood:

Brilliant, thanks for the feedback. I'll pass these suggestions on to the customer now.

Yaniv Dary:

Itamar, you added current version flags. What is the fix you want here?

Yaniv

Itamar Heim (comment 11):

IIUC, Juan's suggestions are important and should get their own bug, but do we know why we OOM'd? We need to solve that. I flagged this for z-stream to consider. The question is whether this is a one-off or reproducible.

Yaniv Dary (comment 12):

(In reply to Itamar Heim from comment #11)
> iiuc, juan suggestions are important and should get their own bug.

Please specify what you mean.

> but do we know why we OOM'd? we need to solve that.

We don't give unlimited heap size. The standard is good enough for most cases. In very large environments an increase of the defaults is needed. I don't think we should change the default currently.

Itamar Heim:

(In reply to Yaniv Dary from comment #12)
> Please specify what you mean.

The suggestion in comment 8 to cap CPU, maybe.

> We don't give unlimited heap size. The standard is good enough for most cases.

So how do we track/monitor/warn users so it won't explode for them?

Barak (comment 15):

Lee, is this an external DB?

Lee Yarwood:

(In reply to Barak from comment #15)
> Lee- is this external DB?

Internal.

Yaniv Dary:

Please update.

Yaniv

Liran Zelkha (comment 37):

I have created a test environment with 1000 VMs and 200 hosts running on FakeVDSM. The DWHD process started and ran without any issues. The engine is responsive and relatively fast. Note that I changed the engine process to use 12GB of heap size. Did the customer change the heap size configuration of the engine process?

Lee Yarwood (comment 38):

(In reply to Liran Zelkha from comment #37)
> I have created a test environment with 1000 VMs and 200 hosts running on FakeVDSM.

Hey Liran, which version are you testing here? 3.3?

> Did the customer change the heap size configuration of the engine process?

Pep? Julio? AFAIK they did change the DWH [1] HEAP size but I don't know by how much. I have no idea if they changed the engine HEAP at all [2].
[1] /usr/share/ovirt-engine-dwh/etl/history_service.sh
[2] /etc/ovirt-engine/engine.conf

(In reply to Lee Yarwood from comment #38)
> Pep? Julio? AFAIK they did change the DWH [1] HEAP size but I don't know by how much. I have no idea if they changed the engine HEAP at all [2].

We'll come back to you on [1], so keeping the needinfo. No changes in [2] /etc/ovirt-engine/engine.conf, so the default values for the engine from /usr/share/ovirt-engine/conf/engine.conf.defaults apply:

    ENGINE_HEAP_MIN=1g
    ENGINE_HEAP_MAX=1g
    ENGINE_PERM_MIN=256m
    ENGINE_PERM_MAX=256m

I'm using RHEVM 3.2.

This will never do. The engine heap is too low for so many VMs/hosts. How much memory does the machine have? Increase the memory and the engine will run much faster.

Concerning the engine, my recommendation is to increase to 8GB. For DWH, if it works at 2GB, leave it like that. The engine would work faster and they will probably not feel the DWH.

Juan Hernández (comment 58):

According to the dump, when the OutOfMemory occurred there was one object holding a vector with 1465765 elements (approximately one and a half million), taking a total of 747972248 bytes (approximately 750 MiB). This vector was a local variable created by a thread created in class AggregationToHourly, and that thread was running the following query:

    SELECT
        history_id, history_datetime, current_user_name, vm_id,
        minutes_in_status, cpu_usage_percent, memory_usage_percent,
        user_cpu_usage_percent, system_cpu_usage_percent, vm_ip,
        currently_running_on_host, vm_configuration_version,
        current_host_configuration_version
    FROM vm_samples_history
    WHERE vm_status = 1
      AND history_datetime >= (
          SELECT var_datetime
          FROM history_configuration
          WHERE var_name = 'lastHourAggr'
      )
    ORDER BY history_datetime, current_user_name, vm_id

In order to collect the results of the query, the PostgreSQL driver creates a vector that holds the rows; that is the Vector that is growing. So apparently that query is returning at least 1465765 rows, which is hard to believe. Can you run this query manually in the customer environment and see what the number of results is?

If the query does return this huge number of rows, then we will need to modify the DWH so that it runs queries with the fetch size set to something other than zero. Zero is the default, and it means that the JDBC driver loads all rows into memory before returning the results. For details see here:

http://jdbc.postgresql.org/documentation/92/query.html#fetchsize-example

The code that runs these queries is generated by Talend Studio, so it won't be easy to change. If we need to do this then we will probably have to request it from Talend, or use something like AspectJ to modify the binary code.

If the query doesn't return a large number of rows, then we are probably facing a bug in the JDBC driver, or a bug in the database itself. If the issue can be reproduced then I would suggest generating a dump of the network traffic between the DWH and the database for further analysis.
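The fetch-size mechanism Juan describes can be illustrated with a small, self-contained JDBC sketch. This is not the DWH code (which is generated by Talend Studio): the class name, connection URL, and credentials below are hypothetical, and the query is a trimmed version of the one quoted above. It only shows the mechanism: with autocommit disabled and a non-zero fetch size, the PostgreSQL JDBC driver streams rows in batches instead of materializing the whole result set in memory.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical sketch of cursor-based fetching; names and credentials are illustrative.
public class HourlyAggregationFetchSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost/ovirt_engine_history"; // assumed database name
        try (Connection conn = DriverManager.getConnection(url, "history_user", "secret")) {
            // Cursor-based fetching in the PostgreSQL driver only works outside autocommit.
            conn.setAutoCommit(false);

            String sql = "SELECT history_id, history_datetime, vm_id, minutes_in_status "
                       + "FROM vm_samples_history "
                       + "WHERE vm_status = 1 "
                       + "AND history_datetime >= (SELECT var_datetime FROM history_configuration "
                       + "                         WHERE var_name = 'lastHourAggr') "
                       + "ORDER BY history_datetime, vm_id";

            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                // Fetch size 0 (the default) loads every row into memory before returning;
                // a non-zero value makes the driver pull rows in chunks of this size.
                stmt.setFetchSize(1000);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // Aggregate each row here instead of accumulating all rows in a Vector.
                    }
                }
            }
            conn.commit();
        }
    }
}
```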
Josep 'Pep' Turro Mauri (comment 59):

(In reply to Juan Hernández from comment #58)
> So apparently that query is returning at least 1465765 rows, which is hard to believe. Can you run this query manually in the customer environment and see what the number of results is?

I ran it on a copy of their DB (file 00911410_ovirt_engine_history_sql.xz in the download location, so not the very latest, live environment) and it does suggest that the order of magnitude is correct:

    => SELECT count(*) FROM vm_samples_history
       WHERE vm_status = 1
         AND history_datetime >= (SELECT var_datetime FROM history_configuration
                                  WHERE var_name = 'lastHourAggr');
      count
    ---------
     1297627
    (1 row)

    => SELECT var_datetime FROM history_configuration WHERE var_name = 'lastHourAggr';
          var_datetime
    ------------------------
     2013-08-29 17:00:00+02
    (1 row)

Yaniv, how do we proceed from here?

Yaniv Dary:

(In reply to Josep 'Pep' Turro Mauri from comment #59)
> Yaniv, how do we proceed from here?

By this calculation they have 1297627 rows / 60 minutes per hour = 21627 VM samples to aggregate / 800 VMs we know they should have = 27 hours of non-aggregated data. The first thing will be to change the last aggregation date to now, so this backlog will not be aggregated, and then check the log closely and figure out why samples data is not being aggregated (it should happen every hour). The growing select caused by data not being aggregated is what is causing this issue.

Yaniv

I think the best approach will be to do the hourly aggregation with one select per hour instead of a single select of everything. This will increase the scale that the DWH can handle, and if a failure happens it will not increase the select size; later hours will be handled separately. The downside is that once an hour is skipped, the data will never be aggregated for that hour. I would also advise doing the same for the daily aggregation.

This kind of change would first need to be accepted by PMs and customers (meaning an aggregation failure may lead to a missing hour/day). In addition, such a change cannot be delivered in a z-stream release. Yaniv, for now and to make sure the customer issue has been addressed, please add a comment explaining how to update the last aggregated hour variable to NOW().

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2015-0177.html
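As a postscript to the request above to document how the last aggregated hour variable could be moved to NOW(), here is a minimal SQL sketch built only from the table and column names that appear earlier in this thread (history_configuration, var_name, var_datetime). The exact procedure given to the customer is not recorded in this report, and the suggestion to stop the DWH service first is an assumption, not something stated in the thread.

```sql
-- Illustrative sketch only, not the documented support procedure.
-- Moves the hourly aggregation watermark to the current time so that the next
-- run does not try to aggregate the entire backlog in one huge select.
-- Assumption: run against the ovirt_engine_history database while the DWH
-- service is stopped; any skipped samples would then never be aggregated into
-- hourly data, which is the trade-off discussed above.
UPDATE history_configuration
   SET var_datetime = now()
 WHERE var_name = 'lastHourAggr';
```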