Description of problem:
The ovirt_engine_history database metric sample tables contain more than 24 hours of samples even when hoursToKeepSamples=24. This deployment was used in conjunction with Cloud Forms. The extra data causes Cloud Forms to generate an excessive number of messages (units of work), which raises CPU usage for an extended period of time. The largest backlog observed, as consumed by Cloud Forms, was more than 60 hours of samples.

Version-Release number of selected component (if applicable):
rhevm-dwh-3.3.0-26.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install rhevm-dwh
2. Query the ovirt_engine_history database sample tables/views and check for more than 1440 entries per VM/metric (the number of minutes in 24 hours), as in the sketch below

Actual results:
More than 1440 samples per VM or metric observed at various times of the day

Expected results:
No more than 1440 samples if hoursToKeepSamples=24

Additional info:
An inconsistent number of samples means inconsistent performance for C&U collections on a Cloud Forms appliance introduced to a RHEV environment.
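A minimal sketch of the reproduction query, assuming a vm_samples_history table with vm_id and history_datetime columns (actual table and column names may differ by metric type and DWH version):

  -- Count retained samples per VM; more than 1440 rows means more
  -- than 24 hours of one-minute samples are being kept.
  SELECT vm_id, COUNT(*) AS sample_count
  FROM vm_samples_history
  GROUP BY vm_id
  HAVING COUNT(*) > 1440
  ORDER BY sample_count DESC;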
In the docs we say that samples are deleted every 24 hours, so the sample tables can contain 24-48 hours of data. Restarting the DWH will trigger a deletion only on the next day at the selected time, so 60 hours is possible. In any case this can be solved easily on the consumer side by adding to the WHERE clause: "datetime >= CURRENT_TIMESTAMP - interval '1 day'"

Yaniv
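Applied to a consumer-side query, the suggested filter looks like this (vm_samples_history is the assumed table from the sketch above, and history_datetime the assumed name of its timestamp column):

  -- Read only the most recent 24 hours of samples, regardless of how
  -- much undeleted history the table still holds.
  SELECT *
  FROM vm_samples_history
  WHERE history_datetime >= CURRENT_TIMESTAMP - interval '1 day';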
The ovirt-engine-dwhd service had not been restarted during this time period. Is there anything else that would prevent the deletion of samples?
(In reply to Alex Krzos from comment #2)
> The ovirt-engine-dwhd service had not been restarted during this time
> period. Is there anything else that would prevent the deletion of samples?

It could be that you checked while it was still deleting, since we only delete 1000 rows per run and then redo the delete after 5 minutes until everything has been deleted, for up to 2 hours. Also check that the default time to keep samples was not changed on that system, since that can also be the reason.
(In reply to Yaniv Dary from comment #3)
> (In reply to Alex Krzos from comment #2)
> > The ovirt-engine-dwhd service had not been restarted during this time
> > period. Is there anything else that would prevent the deletion of samples?
>
> It could be that you checked while it was still deleting, since we only
> delete 1000 rows per run and then redo the delete after 5 minutes until
> everything has been deleted, for up to 2 hours. Also check that the default
> time to keep samples was not changed on that system, since that can also be
> the reason.

I have captured the number of samples associated with a single VM's disk and I've seen more than 4300 samples for that particular disk. It also appears that the deletion job is only running every other day, and not exactly at the time indicated in the Defaults.properties file. Could it be that once a threshold number of VMs has been surpassed, the deletion job takes so long to execute that it actually cannot get to a particular VM's samples until an entire day later?
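For reference, the two Defaults.properties keys discussed in this report (only these two appear in the report; comments are mine):

  # keep one-minute samples for this many hours
  hoursToKeepSamples=24
  # hour of day at which the delete job should start
  runDeleteTime=3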
(In reply to Alex Krzos from comment #4)
> (In reply to Yaniv Dary from comment #3)
> > (In reply to Alex Krzos from comment #2)
> > > The ovirt-engine-dwhd service had not been restarted during this time
> > > period. Is there anything else that would prevent the deletion of samples?
> >
> > It could be that you checked while it was still deleting, since we only
> > delete 1000 rows per run and then redo the delete after 5 minutes until
> > everything has been deleted, for up to 2 hours. Also check that the default
> > time to keep samples was not changed on that system, since that can also be
> > the reason.
>
> I have captured the number of samples associated with a single VM's disk
> and I've seen more than 4300 samples for that particular disk. It also
> appears that the deletion job is only running every other day, and not
> exactly at the time indicated in the Defaults.properties file. Could it be
> that once a threshold number of VMs has been surpassed, the deletion job
> takes so long to execute that it actually cannot get to a particular VM's
> samples until an entire day later?

We only delete 1000 rows per run, starting at the set time, and the job reruns every 5 seconds on task end until all is deleted. It will not happen in one go; deletes are very database-heavy, so we divide them. Does that answer your question? Can I close this bug?

Yaniv
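A minimal sketch of the divided-delete approach described above, assuming PostgreSQL and the same vm_samples_history/history_datetime names as earlier (the real dwhd job is implemented in its ETL layer, so this is illustrative only):

  -- Delete at most 1000 expired rows per pass; the job repeats this
  -- statement every few seconds until it affects zero rows.
  DELETE FROM vm_samples_history
  WHERE ctid IN (
      SELECT ctid
      FROM vm_samples_history
      WHERE history_datetime < CURRENT_TIMESTAMP - interval '24 hours'
      LIMIT 1000
  );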
(In reply to Yaniv Dary from comment #5)
> (In reply to Alex Krzos from comment #4)
> > (In reply to Yaniv Dary from comment #3)
> > > (In reply to Alex Krzos from comment #2)
> > > > The ovirt-engine-dwhd service had not been restarted during this time
> > > > period. Is there anything else that would prevent the deletion of samples?
> > >
> > > It could be that you checked while it was still deleting, since we only
> > > delete 1000 rows per run and then redo the delete after 5 minutes until
> > > everything has been deleted, for up to 2 hours. Also check that the default
> > > time to keep samples was not changed on that system, since that can also be
> > > the reason.
> >
> > I have captured the number of samples associated with a single VM's disk
> > and I've seen more than 4300 samples for that particular disk. It also
> > appears that the deletion job is only running every other day, and not
> > exactly at the time indicated in the Defaults.properties file. Could it be
> > that once a threshold number of VMs has been surpassed, the deletion job
> > takes so long to execute that it actually cannot get to a particular VM's
> > samples until an entire day later?
>
> We only delete 1000 rows per run, starting at the set time, and the job
> reruns every 5 seconds on task end until all is deleted. It will not happen
> in one go; deletes are very database-heavy, so we divide them. Does that
> answer your question? Can I close this bug?
>
> Yaniv

1000 deletes occurring every 5 seconds aligns much more closely with the rate of deletes I am seeing on my system. The only misunderstood parameter is when the deletes start. The config file's default setting indicates they should occur nightly at 3 (runDeleteTime=3), but the data shows the job running every 48 hours after the last completed run. For instance, my recorded sample counts for a single VM's disk show:

Sat May 3 04:55:51 EDT 2014  1442
Sat May 3 04:56:52 EDT 2014  * Deletes Completed  1441
Sat May 3 04:57:52 EDT 2014  1442
...
Mon May 5 04:54:04 EDT 2014  4318
Mon May 5 04:55:05 EDT 2014  * Deletes Starting  4319
Mon May 5 04:56:05 EDT 2014  4318
...
Mon May 5 07:21:45 EDT 2014  1443
Mon May 5 07:22:45 EDT 2014  * Deletes Completed  1441 *
Mon May 5 07:23:45 EDT 2014  1442
...
Wed May 7 07:20:01 EDT 2014  4318
Wed May 7 07:21:02 EDT 2014  * Deletes Starting  4319
Wed May 7 07:22:02 EDT 2014  4317

The delete process starts a full 48 hours after the last completed delete run. Depending on the time of day, a query without your suggested WHERE clause can return up to 72 hours' worth of sample data, rather than the 24-48 hours possible in the default out-of-box configuration. I agree completely that the process cannot be a single delete operation and does need to be divided.

-Alex
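A sketch of the per-disk count being polled once a minute in the log above (vm_disk_samples_history and its vm_disk_id column are assumed names for the disk sample view):

  -- Current number of retained samples for one VM disk; the log above
  -- records this value once per minute.
  SELECT COUNT(*)
  FROM vm_disk_samples_history
  WHERE vm_disk_id = '<disk-uuid>';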
Please check the sleep for the delete job and check whether the above problem is noticeable.

Yaniv
This is indeed a bug, and it also affects aggregation in 3.4. We will fix it for 3.5 and consider a 3.4 fix as well.

Yaniv
Will not we fixed for 3.4.

Yaniv
(In reply to Yaniv Dary from comment #10)
correction:
> Will not be fixed for 3.4.
>
> Yaniv
Verified in rhevm-dwh-3.5.0-7.el6ev.noarch, rhevm-3.5.0-0.23.beta.el6ev.noarch (vt13.1). There are no more than 1440 samples per metric per day. Currently my DWH instance lists 1243 samples of VM disk usage per day.
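A sketch of the verification check (same assumed vm_disk_samples_history view as above, with an assumed history_datetime timestamp column):

  -- Samples of VM disk usage retained for the last day; verification
  -- found 1243 rows, under the 1440 ceiling.
  SELECT COUNT(*)
  FROM vm_disk_samples_history
  WHERE history_datetime >= CURRENT_TIMESTAMP - interval '1 day';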
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2015-0177.html