Description of problem:
The ovirt_engine_history database metric sample tables contain more than 24 hours of samples even when hoursToKeepSamples=24. This deployment was used in conjunction with Cloud Forms. The extra data causes Cloud Forms to generate an excessive number of messages (units of work), which raises CPU usage for an extended period of time. The largest backlog observed, as consumed by Cloud Forms, was more than 60 hours of samples.

Version-Release number of selected component (if applicable):
rhevm-dwh-3.3.0-26.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install rhevm-dwh
2. Query the ovirt_engine_history database sample tables/views and check for more than 1440 entries per VM/metric (the number of minutes in 24 hours), as in the sketch below

Actual results:
More than 1440 samples per VM or metric observed at various times of the day

Expected results:
No more than 1440 samples if hoursToKeepSamples=24

Additional info:
An inconsistent number of samples means inconsistent performance for C&U collections on a Cloud Forms appliance introduced to a RHEV environment.
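A minimal sketch of the reproduction query, assuming a vm_samples_history table with vm_id and history_datetime columns (actual table and column names may differ by metric type and DWH version):

  -- Count retained samples per VM; more than 1440 rows means more
  -- than 24 hours of one-minute samples are being kept.
  SELECT vm_id, COUNT(*) AS sample_count
  FROM vm_samples_history
  GROUP BY vm_id
  HAVING COUNT(*) > 1440
  ORDER BY sample_count DESC;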
In the docs we say that samples are deleted every 24 hours, so the sample tables can contain 24-48 hours of data. Restarting the DWH will trigger a deletion only on the next day at the selected time, so 60 hours is possible. In any case this can be solved easily on the consumer side by adding to the WHERE clause: "datetime >= CURRENT_TIMESTAMP - interval '1 day'"

Yaniv
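Applied to a consumer-side query, the suggested filter looks like this (vm_samples_history is the assumed table from the sketch above, and history_datetime the assumed name of its timestamp column):

  -- Read only the most recent 24 hours of samples, regardless of how
  -- much undeleted history the table still holds.
  SELECT *
  FROM vm_samples_history
  WHERE history_datetime >= CURRENT_TIMESTAMP - interval '1 day';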
The ovirt-engine-dwhd service had not been restarted during this time period. Is there anything else that would prevent the deletion of samples?
(In reply to Alex Krzos from comment #2)
> The ovirt-engine-dwhd service had not been restarted during this time
> period. Is there anything else that would prevent the deletion of samples?

It could be that you checked while it was still deleting, since we only delete 1000 rows per run and then redo the delete after 5 minutes until everything has been deleted, for up to 2 hours. Also check that the default time to keep samples was not changed on that system, since that can also be the reason.
(In reply to Yaniv Dary from comment #3)
> (In reply to Alex Krzos from comment #2)
> > The ovirt-engine-dwhd service had not been restarted during this time
> > period. Is there anything else that would prevent the deletion of samples?
>
> It could be that you checked while it was still deleting, since we only
> delete 1000 rows per run and then redo the delete after 5 minutes until
> everything has been deleted, for up to 2 hours. Also check that the default
> time to keep samples was not changed on that system, since that can also be
> the reason.

I have captured the number of samples associated with a single VM's disk and I've seen more than 4300 samples for that particular disk. It also appears that the deletion job is only running every other day, and not exactly at the time indicated in the Defaults.properties file. Could it be that once a threshold number of VMs has been surpassed, the deletion job takes so long to execute that it actually cannot get to a particular VM's samples until an entire day later?
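For reference, the two Defaults.properties keys discussed in this report (only these two appear in the report; comments are mine):

  # keep one-minute samples for this many hours
  hoursToKeepSamples=24
  # hour of day at which the delete job should start
  runDeleteTime=3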
(In reply to Alex Krzos from comment #4)
> (In reply to Yaniv Dary from comment #3)
> > (In reply to Alex Krzos from comment #2)
> > > The ovirt-engine-dwhd service had not been restarted during this time
> > > period. Is there anything else that would prevent the deletion of samples?
> >
> > It could be that you checked while it was still deleting, since we only
> > delete 1000 rows per run and then redo the delete after 5 minutes until
> > everything has been deleted, for up to 2 hours. Also check that the default
> > time to keep samples was not changed on that system, since that can also be
> > the reason.
>
> I have captured the number of samples associated with a single VM's disk
> and I've seen more than 4300 samples for that particular disk. It also
> appears that the deletion job is only running every other day, and not
> exactly at the time indicated in the Defaults.properties file. Could it be
> that once a threshold number of VMs has been surpassed, the deletion job
> takes so long to execute that it actually cannot get to a particular VM's
> samples until an entire day later?

We only delete 1000 rows per run, starting at the set time, and the job reruns every 5 seconds on task end until all is deleted. It will not happen in one go; deletes are very database-heavy, so we divide them. Does that answer your question? Can I close this bug?

Yaniv
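A minimal sketch of the divided-delete approach described above, assuming PostgreSQL and the same vm_samples_history/history_datetime names as earlier (the real dwhd job is implemented in its ETL layer, so this is illustrative only):

  -- Delete at most 1000 expired rows per pass; the job repeats this
  -- statement every few seconds until it affects zero rows.
  DELETE FROM vm_samples_history
  WHERE ctid IN (
      SELECT ctid
      FROM vm_samples_history
      WHERE history_datetime < CURRENT_TIMESTAMP - interval '24 hours'
      LIMIT 1000
  );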
(In reply to Yaniv Dary from comment #5)
> (In reply to Alex Krzos from comment #4)
> > (In reply to Yaniv Dary from comment #3)
> > > (In reply to Alex Krzos from comment #2)
> > > > The ovirt-engine-dwhd service had not been restarted during this time
> > > > period. Is there anything else that would prevent the deletion of samples?
> > >
> > > It could be that you checked while it was still deleting, since we only
> > > delete 1000 rows per run and then redo the delete after 5 minutes until
> > > everything has been deleted, for up to 2 hours. Also check that the default
> > > time to keep samples was not changed on that system, since that can also be
> > > the reason.
> >
> > I have captured the number of samples associated with a single VM's disk
> > and I've seen more than 4300 samples for that particular disk. It also
> > appears that the deletion job is only running every other day, and not
> > exactly at the time indicated in the Defaults.properties file. Could it be
> > that once a threshold number of VMs has been surpassed, the deletion job
> > takes so long to execute that it actually cannot get to a particular VM's
> > samples until an entire day later?
>
> We only delete 1000 rows per run, starting at the set time, and the job
> reruns every 5 seconds on task end until all is deleted. It will not happen
> in one go; deletes are very database-heavy, so we divide them. Does that
> answer your question? Can I close this bug?
>
> Yaniv

1000 deletes occurring every 5 seconds aligns much more closely with the rate of deletes I am seeing on my system. The only misunderstood parameter is when the deletes start. The config file's default setting indicates they should occur nightly at 3 (runDeleteTime=3), but the data shows the job running every 48 hours after the last completed run. For instance, my recorded sample counts for a single VM's disk show:

Sat May 3 04:55:51 EDT 2014  1442
Sat May 3 04:56:52 EDT 2014  * Deletes Completed  1441
Sat May 3 04:57:52 EDT 2014  1442
...
Mon May 5 04:54:04 EDT 2014  4318
Mon May 5 04:55:05 EDT 2014  * Deletes Starting  4319
Mon May 5 04:56:05 EDT 2014  4318
...
Mon May 5 07:21:45 EDT 2014  1443
Mon May 5 07:22:45 EDT 2014  * Deletes Completed  1441 *
Mon May 5 07:23:45 EDT 2014  1442
...
Wed May 7 07:20:01 EDT 2014  4318
Wed May 7 07:21:02 EDT 2014  * Deletes Starting  4319
Wed May 7 07:22:02 EDT 2014  4317

The delete process starts a full 48 hours after the last completed delete run. Depending on the time of day, a query without your suggested WHERE clause can return up to 72 hours' worth of sample data, rather than the 24-48 hours possible in the default out-of-box configuration. I agree completely that the process cannot be a single delete operation and does need to be divided.

-Alex
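A sketch of the per-disk count being polled once a minute in the log above (vm_disk_samples_history and its vm_disk_id column are assumed names for the disk sample view):

  -- Current number of retained samples for one VM disk; the log above
  -- records this value once per minute.
  SELECT COUNT(*)
  FROM vm_disk_samples_history
  WHERE vm_disk_id = '<disk-uuid>';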
Please check the sleep for the delete job and check whether the above problem is noticeable.

Yaniv
This is indeed a bug, and it also affects aggregation in 3.4. We will fix it for 3.5 and consider a 3.4 fix as well.

Yaniv
Will not we fixed for 3.4.

Yaniv
(In reply to Yaniv Dary from comment #10)
correction:
> Will not be fixed for 3.4.
>
> Yaniv
Verified in rhevm-dwh-3.5.0-7.el6ev.noarch, rhevm-3.5.0-0.23.beta.el6ev.noarch (vt13.1). There are no more than 1440 samples per metric per day. Currently my DWH instance lists 1243 samples of VM disk usage per day.
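A sketch of the verification check (same assumed vm_disk_samples_history view as above, with an assumed history_datetime timestamp column):

  -- Samples of VM disk usage retained for the last day; verification
  -- found 1243 rows, under the 1440 ceiling.
  SELECT COUNT(*)
  FROM vm_disk_samples_history
  WHERE history_datetime >= CURRENT_TIMESTAMP - interval '1 day';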
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0177.html