Description of problem:

Customer ran out of disk space as a result of a large number of records in the _samples_ tables of the ovirt_engine_history database. As you can see below, two tables, "vm_disk_samples_history" and "vm_interface_samples_history", contain close to 70 million rows each:

ovirt_engine_history=# SELECT schemaname, relname, n_live_tup
ovirt_engine_history-# FROM pg_stat_user_tables
ovirt_engine_history-# ORDER BY n_live_tup DESC;
 schemaname |              relname              | n_live_tup
------------+-----------------------------------+------------
* public    | vm_disk_samples_history           |   69628149
* public    | vm_interface_samples_history      |   67722453
  public    | vm_disk_hourly_history            |    2820919
  public    | vm_interface_hourly_history       |    2793968
  public    | host_interface_samples_history    |    2467320
  public    | host_interface_hourly_history     |    1793394
  public    | vm_disks_usage_samples_history    |    1574434
  public    | vm_samples_history                |    1574409
  public    | vm_disks_usage_hourly_history     |    1398195
  public    | vm_hourly_history                 |    1397808
  public    | statistics_vms_users_usage_hourly |    1392847
  public    | host_hourly_history               |     179189
  public    | host_samples_history              |     171479
  public    | vm_interface_daily_history        |     153946
  public    | vm_disk_daily_history             |     153720
  public    | host_interface_daily_history      |     104132
  public    | calendar                          |      87673
  public    | vm_daily_history                  |      77713
  public    | vm_disks_usage_daily_history      |      76114
  public    | statistics_vms_users_usage_daily  |      75551
  public    | host_interface_configuration      |      51479
  public    | storage_domain_hourly_history     |      29614
  public    | storage_domain_samples_history    |      24497
  public    | datacenter_hourly_history         |      22551
  public    | datacenter_samples_history        |      17292
  public    | vm_device_history                 |      16770
  public    | host_daily_history                |      11224
  public    | vm_interface_configuration        |       5408
  public    | vm_configuration                  |       2467
  public    | disks_vm_map                      |       2443
  public    | vm_disk_configuration             |       2401
  public    | storage_domain_daily_history      |       1963
  public    | datacenter_daily_history          |       1549
  public    | enum_translator                   |        699
  public    | host_configuration                |        471
  public    | datacenter_configuration          |        335
  public    | users_details_history             |         97
  public    | storage_domain_configuration      |         36
  public    | cluster_configuration             |         35
  public    | schema_version                    |         32
  public    | datacenter_storage_domain_map     |         24
  public    | history_configuration             |          6
  public    | tag_relations_history             |          0
  public    | tag_details                       |          0
(44 rows)

According to https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.6/html/Reports_and_Data_Warehouse_Guide/sect-History_Database.html#Recording_statistical_history :

"Data is stored for every minute of the past 24 hours, at a minimum, but can be stored for as long as 48 hours depending on the last time a deletion job was run. Minute-by-minute data more than two hours old is aggregated into hourly data and stored for two months. Hourly data more than two days old is aggregated into daily data and stored for five years."
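Note that n_live_tup is only an estimate of the row count. To correlate those counts with actual disk consumption (indexes and TOAST included), something like the following can be run against ovirt_engine_history. This is a sketch using standard PostgreSQL admin functions, not output from the customer's system:

-- Sketch: on-disk footprint of the ten largest tables,
-- including their indexes and TOAST data.
SELECT relname,
       n_live_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;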
I understand that _samples_ tables should not contain samples older than 48 hours, but looking at the oldest record in all tables reveals that "vm_disk_samples_history" and "vm_interface_samples_history" hold samples that are about a month old:

CREATE OR REPLACE FUNCTION list_dates() RETURNS TABLE(name text, min text, max text) AS $$
DECLARE
    i varchar;
BEGIN
    -- Iterate over every base table of the history database and report
    -- the oldest and newest history_datetime found in each one.
    FOR i IN SELECT table_name FROM information_schema.tables
              WHERE table_catalog = 'ovirt_engine_history'
                AND table_schema = 'public'
                AND table_type = 'BASE TABLE'
    LOOP
        BEGIN
            RETURN QUERY EXECUTE 'select ' || quote_literal(i) || '::text as table_name,'
                || ' min(history_datetime)::text as min,'
                || ' max(history_datetime)::text as max from ' || quote_ident(i);
        EXCEPTION WHEN undefined_column THEN
            NULL;  -- skip tables that have no history_datetime column
        END;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

ovirt_engine_history=# select * from list_dates();
               name                |            min             |            max
-----------------------------------+----------------------------+----------------------------
  datacenter_samples_history       | 2016-03-09 11:24:51.384+00 | 2016-03-10 11:25:01.929+00
  datacenter_hourly_history        | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  host_samples_history             | 2016-03-09 11:24:51.384+00 | 2016-03-10 11:25:01.929+00
  statistics_vms_users_usage_daily | 2015-12-09 00:00:00+00     | 2016-03-08 00:00:00+00
  statistics_vms_users_usage_hourly| 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  storage_domain_daily_history     | 2015-12-09                 | 2016-03-08
  storage_domain_hourly_history    | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  storage_domain_samples_history   | 2016-03-09 11:24:51.384+00 | 2016-03-10 11:25:01.929+00
  host_interface_samples_history   | 2016-03-09 01:42:30.403+00 | 2016-03-10 11:25:01.929+00
  vm_daily_history                 | 2015-12-09                 | 2016-03-08
  vm_disks_usage_daily_history     | 2015-12-09                 | 2016-03-08
  vm_interface_daily_history       | 2015-12-09                 | 2016-03-08
  vm_interface_hourly_history      | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  vm_disks_usage_hourly_history    | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  vm_hourly_history                | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
* vm_interface_samples_history     | 2016-02-12 01:59:19.506+00 | 2016-03-10 11:25:01.929+00
  vm_disks_usage_samples_history   | 2016-03-09 09:28:33.43+00  | 2016-03-10 11:25:01.929+00
  vm_samples_history               | 2016-03-09 09:28:33.43+00  | 2016-03-10 11:25:01.929+00
* vm_disk_samples_history          | 2016-02-11 16:21:18.528+00 | 2016-03-10 11:25:01.929+00
  vm_disk_hourly_history           | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  datacenter_daily_history         | 2015-12-09                 | 2016-03-08
  host_daily_history               | 2015-12-09                 | 2016-03-08
  host_hourly_history              | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  host_interface_daily_history     | 2015-12-09                 | 2016-03-08
  host_interface_hourly_history    | 2016-01-10 12:00:00+00     | 2016-03-10 09:00:00+00
  vm_disk_daily_history            | 2015-12-09                 | 2016-03-08
(26 rows)

Summary of oldest records:
- _daily_ tables: 2015-12-09
- _hourly_ tables: 2016-01-10
- _samples_ tables: 2016-03-09 mostly, **but** 2016-02-11 for "vm_disk_samples_history" and 2016-02-12 for "vm_interface_samples_history", which are the two tables with the highest number of records as per above.

There seems to be a bug preventing old samples from being deleted from the "vm_disk_samples_history" and "vm_interface_samples_history" tables. This caused the database to run out of disk space, which in turn took down the engine service, causing an outage.
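Should a stopgap be needed before a fix is available, here is a minimal sketch of a manual prune of the two affected tables. This is my own suggestion rather than a documented procedure: stop ovirt-engine-dwh and take a backup first; the 48-hour cutoff matches the documented samples retention quoted above.

-- Sketch: manually prune samples older than the documented 48-hour
-- retention window. Stop ovirt-engine-dwh and take a backup first.
BEGIN;
DELETE FROM vm_disk_samples_history
 WHERE history_datetime < now() - interval '48 hours';
DELETE FROM vm_interface_samples_history
 WHERE history_datetime < now() - interval '48 hours';
COMMIT;

Note that a plain DELETE only marks tuples as dead; a subsequent VACUUM FULL is still needed before PostgreSQL returns the space to the OS.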
Version-Release number of selected component (if applicable):
rhevm-dwh-3.5.5-1.el6ev.noarch

Additional info:

The following supporting files have been provided by the customer:

- 01596143-engine.dump: dump of the "engine" database in custom format (-Fc).
- C01596143-ovirt_engine_history.dump: dump of the "ovirt_engine_history" database in custom format (-Fc). Beware: this dump requires a large amount of disk space (53 GB) to be restored.
- sosreport-yyprdap20.C01596143-20160307132427.tar.xz: sosreport of the machine where RHEV-M runs.
- sosreport-yyprddb20.C01596143-20160307132417.tar.xz: sosreport of the machine where the PostgreSQL database runs.

These files are available at http://jentrena-xw8600.usersys.redhat.com/01596143/
Moving to VERIFIED. I was able to successfully reduce the database's physical size on disk: the DWH delete job ran and marked the outdated tuples as dead. Once the 'Delete History' process had finished, I ran VACUUM FULL with ovirt-engine-dwh stopped, allowing PostgreSQL to release the physical disk space back to the OS.
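For reference, a minimal sketch of that space-reclaim step (table names taken from this bug; run while ovirt-engine-dwh is stopped, since VACUUM FULL rewrites each table under an exclusive lock):

-- Sketch: reclaim the space freed by the delete job.
-- Stop ovirt-engine-dwh first so no samples are written while the
-- tables are exclusively locked and rewritten.
VACUUM FULL VERBOSE vm_disk_samples_history;
VACUUM FULL VERBOSE vm_interface_samples_history;
-- Refresh planner statistics after the rewrite.
ANALYZE vm_disk_samples_history;
ANALYZE vm_interface_samples_history;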
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-1691.html