Bug 1321517 - RHEV DWH database growing excessively
Summary: RHEV DWH database growing excessively
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-dwh
Version: 3.5.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.0.2
: 4.0.1
Assignee: Shirly Radco
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks: 1328709
 
Reported: 2016-03-28 07:20 UTC by Javier Ramirez
Modified: 2019-11-14 07:42 UTC
13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, in large environments the delete job took longer than it should have (around 5 hours). This put load on the sampling process, data was not deleted, and a backlog built up in large sample tables such as vm_interface_samples_history and vm_disk_samples_history. This release adds a mechanism that allows the samples deletion process to become progressively more aggressive, controlled by the following parameters: the number of times the deletion loop runs, the initial and increment values, and the number of iterations between increments. In addition, the DWH_LIMIT_ROWS parameter sets the number of rows deleted in each iteration of the hourly and daily deletion processes. The deletion process is now more dynamic and can be tuned to the environment, so it no longer takes more than 5 hours.
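
The batched, self-tuning deletion described above can be illustrated with a rough sketch. Everything here is an assumption for illustration: the table, the history_id key, the parameter names, and the batch sizes are placeholders, not the actual ovirt-engine-dwh implementation.

  -- Illustrative sketch only: delete expired samples in small batches so each
  -- iteration stays short and does not block sampling; grow the batch size
  -- periodically so the job "becomes more aggressive" when a backlog exists.
  DO $$
  DECLARE
    batch_limit integer := 1000;  -- initial rows per iteration (DWH_LIMIT_ROWS-style knob)
    increment   integer := 1000;  -- added to the limit every few iterations
    iterations  integer := 0;
    deleted     integer;
  BEGIN
    LOOP
      DELETE FROM vm_disk_samples_history
       WHERE history_id IN (
         SELECT history_id FROM vm_disk_samples_history
          WHERE history_datetime < now() - interval '48 hours'
          LIMIT batch_limit);
      GET DIAGNOSTICS deleted = ROW_COUNT;
      EXIT WHEN deleted = 0;
      iterations := iterations + 1;
      IF iterations % 10 = 0 THEN          -- iterations between increments
        batch_limit := batch_limit + increment;
      END IF;
    END LOOP;
  END $$;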
Clone Of:
: 1328709
Environment:
Last Closed: 2016-08-23 21:10:30 UTC
oVirt Team: Metrics
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:1691 0 normal SHIPPED_LIVE ovirt-engine-dwh bug fix and enhancement update for RHV 4.0 2016-09-02 21:56:19 UTC
oVirt gerrit 55690 0 'None' MERGED history: changed the limit in DeleteTimekeeping job 2020-08-19 09:21:28 UTC
oVirt gerrit 55716 0 'None' MERGED history: changed the limit in DeleteTimekeeping job 2020-08-19 09:21:28 UTC
oVirt gerrit 56066 0 'None' MERGED history: changed the limit in DeleteTimekeeping job 2020-08-19 09:21:28 UTC

Description Javier Ramirez 2016-03-28 07:20:57 UTC
Description of problem:
Customer ran out of disk space as a result of a large number of records in the _samples_ tables of the ovirt_engine_history database.

As you can see below two tables "vm_disk_samples_history" and "vm_interface_samples_history" contain a large number of rows (close to 70 million):

  ovirt_engine_history=# SELECT schemaname,relname,n_live_tup 
  ovirt_engine_history-#   FROM pg_stat_user_tables 
  ovirt_engine_history-#   ORDER BY n_live_tup DESC;
   schemaname |              relname              | n_live_tup 
  ------------+-----------------------------------+------------
*  public     | vm_disk_samples_history           |   69628149
*  public     | vm_interface_samples_history      |   67722453
   public     | vm_disk_hourly_history            |    2820919
   public     | vm_interface_hourly_history       |    2793968
   public     | host_interface_samples_history    |    2467320
   public     | host_interface_hourly_history     |    1793394
   public     | vm_disks_usage_samples_history    |    1574434
   public     | vm_samples_history                |    1574409
   public     | vm_disks_usage_hourly_history     |    1398195
   public     | vm_hourly_history                 |    1397808
   public     | statistics_vms_users_usage_hourly |    1392847
   public     | host_hourly_history               |     179189
   public     | host_samples_history              |     171479
   public     | vm_interface_daily_history        |     153946
   public     | vm_disk_daily_history             |     153720
   public     | host_interface_daily_history      |     104132
   public     | calendar                          |      87673
   public     | vm_daily_history                  |      77713
   public     | vm_disks_usage_daily_history      |      76114
   public     | statistics_vms_users_usage_daily  |      75551
   public     | host_interface_configuration      |      51479
   public     | storage_domain_hourly_history     |      29614
   public     | storage_domain_samples_history    |      24497
   public     | datacenter_hourly_history         |      22551
   public     | datacenter_samples_history        |      17292
   public     | vm_device_history                 |      16770
   public     | host_daily_history                |      11224
   public     | vm_interface_configuration        |       5408
   public     | vm_configuration                  |       2467
   public     | disks_vm_map                      |       2443
   public     | vm_disk_configuration             |       2401
   public     | storage_domain_daily_history      |       1963
   public     | datacenter_daily_history          |       1549
   public     | enum_translator                   |        699
   public     | host_configuration                |        471
   public     | datacenter_configuration          |        335
   public     | users_details_history             |         97
   public     | storage_domain_configuration      |         36
   public     | cluster_configuration             |         35
   public     | schema_version                    |         32
   public     | datacenter_storage_domain_map     |         24
   public     | history_configuration             |          6
   public     | tag_relations_history             |          0
   public     | tag_details                       |          0
  (44 rows)
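Row counts alone do not show how much disk space each table consumes. A standard PostgreSQL catalog query (using pg_total_relation_size, which includes indexes and TOAST data) can confirm that the two sample tables dominate on-disk usage:

  ovirt_engine_history=# SELECT relname,
                                pg_size_pretty(pg_total_relation_size(relid)) AS total_size
                           FROM pg_stat_user_tables
                          ORDER BY pg_total_relation_size(relid) DESC
                          LIMIT 10;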
  
According to https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Reports_and_Data_Warehouse_Guide/sect-History_Database.html#Recording_statistical_history :

  "Data is stored for every minute of the past 24 hours, at a minimum, but can be stored for as long as 48 hours depending on the last time a deletion job was run. Minute-by-minute data more than two hours old is aggregated into hourly data and stored for two months. Hourly data more than two days old is aggregated into daily data and stored for five years."
  

I understand that _samples_ tables should not contain samples older than 48 hours, but looking at the oldest record in all tables reveals that tables "vm_disk_samples_history" and "vm_interface_samples_history" have very old (one month old) samples:

  CREATE OR REPLACE FUNCTION list_dates() 
  RETURNS TABLE(name text, min text, max text) AS $$
  DECLARE
    i varchar;
  BEGIN
    for i in select table_name from information_schema.tables 
      where table_catalog='ovirt_engine_history' 
      and table_schema='public' 
      and table_type='BASE TABLE'
    loop
      BEGIN
        return query execute 'select ' || quote_literal(i) ||'::text as table_name, min(history_datetime)::text as min, max(history_datetime)::text as max from ' || quote_ident(i);
      EXCEPTION WHEN undefined_column THEN
        NULL;  -- skip tables without a history_datetime column
      END;
    end loop;
    RETURN;
  END;
  $$ LANGUAGE plpgsql;
  
  ovirt_engine_history=# select * from list_dates();
                 name                |            min             |            max             
  -----------------------------------+----------------------------+----------------------------
   datacenter_samples_history        | 2016-03-09 11:24:51.384+00 | 2016-03-10 11:25:01.929+00
   datacenter_hourly_history         | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   host_samples_history              | 2016-03-09 11:24:51.384+00 | 2016-03-10 11:25:01.929+00
   statistics_vms_users_usage_daily  | 2015-12-09 00:00:00+00     | 2016-03-08 00:00:00+00
   statistics_vms_users_usage_hourly | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   storage_domain_daily_history      | 2015-12-09                 | 2016-03-08
   storage_domain_hourly_history     | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   storage_domain_samples_history    | 2016-03-09 11:24:51.384+00 | 2016-03-10 11:25:01.929+00
   host_interface_samples_history    | 2016-03-09 01:42:30.403+00 | 2016-03-10 11:25:01.929+00
   vm_daily_history                  | 2015-12-09                 | 2016-03-08
   vm_disks_usage_daily_history      | 2015-12-09                 | 2016-03-08
   vm_interface_daily_history        | 2015-12-09                 | 2016-03-08
   vm_interface_hourly_history       | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   vm_disks_usage_hourly_history     | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   vm_hourly_history                 | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
*  vm_interface_samples_history      | 2016-02-12 01:59:19.506+00 | 2016-03-10 11:25:01.929+00
   vm_disks_usage_samples_history    | 2016-03-09 09:28:33.43+00  | 2016-03-10 11:25:01.929+00
   vm_samples_history                | 2016-03-09 09:28:33.43+00  | 2016-03-10 11:25:01.929+00
*  vm_disk_samples_history           | 2016-02-11 16:21:18.528+00 | 2016-03-10 11:25:01.929+00
   vm_disk_hourly_history            | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   datacenter_daily_history          | 2015-12-09                 | 2016-03-08
   host_daily_history                | 2015-12-09                 | 2016-03-08
   host_hourly_history               | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   host_interface_daily_history      | 2015-12-09                 | 2016-03-08
   host_interface_hourly_history     | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
   vm_disk_daily_history             | 2015-12-09                 | 2016-03-08
  (26 rows)

Summary of oldest records:

  - _daily_ tables:   2015-12-09
  - _hourly_ tables:  2016-01-10
  - _samples_ tables: 2016-03-09 mostly **but** 2016-02-11 for "vm_disk_samples_history" and "vm_interface_samples_history", which are the tables that have the highest number of records as per above.

There seems to be a bug preventing old samples from being deleted from the "vm_disk_samples_history" and "vm_interface_samples_history" tables.
This caused the database to run out of disk space, which in turn took down the engine service and caused an outage.

Version-Release number of selected component (if applicable):
rhevm-dwh-3.5.5-1.el6ev.noarch


Additional info:
The following supporting files have been provided by the customer:

01596143-engine.dump
  Dump of the "engine" database in custom format ( -Fc ).
C01596143-ovirt_engine_history.dump
  Dump of the "ovirt_engine_history" in custom format ( -Fc ).
  Beware this dump requires a large amount of disk space (53GB) to be restored.
sosreport-yyprdap20.C01596143-20160307132427.tar.xz
  sosreport of the machine where RHEV-M runs on.
sosreport-yyprddb20.C01596143-20160307132417.tar.xz
  sosreport of the machine where the PostgreSQL database runs on.
  
Those files are available at http://jentrena-xw8600.usersys.redhat.com/01596143/

Comment 7 mlehrer 2016-08-14 11:00:23 UTC
Moving to verified.

Able to successfully reduce the database's physical size on disk because the DWH delete job ran and marked outdated tuples as dead. After the 'Delete History' process finished, I ran VACUUM FULL with ovirt-engine-dwh stopped, allowing PostgreSQL to release the physical disk space back to the OS.
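
For reference, the reclaim step looks like the following (table names taken from the report above; note that VACUUM FULL takes an exclusive lock and rewrites the table, which is why the DWH service should be stopped first):

  ovirt_engine_history=# VACUUM FULL VERBOSE vm_disk_samples_history;
  ovirt_engine_history=# VACUUM FULL VERBOSE vm_interface_samples_history;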

Comment 10 errata-xmlrpc 2016-08-23 21:10:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1691.html

