Bug 1922200 - Checking the Engine database consistency takes too long to complete
Summary: Checking the Engine database consistency takes too long to complete
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.3
Hardware: Unspecified
OS: Unspecified
low
high
Target Milestone: ovirt-4.4.5
: ---
Assignee: Eli Mesika
QA Contact: Guilherme Santos
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-01-29 12:56 UTC by Miguel Martin
Modified: 2021-04-14 11:40 UTC (History)
6 users (show)

Fixed In Version: ovirt-engine-4.4.5.6
Doc Type: Enhancement
Doc Text:
Previously, records in the `event_notification_hist` table were erased only during regular cleanup of the `audit_log` table By default `audit_log` table records that are older than 30 days are removed. With this update, records in the `event_notification_hist` table are kept for 7 days. You can override this limit by creating a custom configuration file `/etc/ovirt-engine/notifier/notifier.conf.d/history.conf` with the following content: DAYS_TO_KEEP_HISTORY=<number_of_days> Where <number_of_days> is the number of days to keep records in the `event_notification_hist` table. After adding this file the first time or after changing this value, you need to restart the ovirt-engine-notifier service: ---- # systemctl restart ovirt-engine-notifier ----
Clone Of:
Environment:
Last Closed: 2021-04-14 11:40:02 UTC
oVirt Team: Infra
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1323826 0 unspecified CLOSED engine-setup stage 'Setup validation' takes too long to complete 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2021:1169 0 None None None 2021-04-14 11:40:27 UTC
oVirt gerrit 113276 0 master MERGED Change notification config to save 7 days history 2021-02-17 16:04:40 UTC

Description Miguel Martin 2021-01-29 12:56:28 UTC
Description of problem:
engine-setup spends hours in checking the consistency when "audit_log" and "event_notification_hist" tables are big.

Version-Release number of selected component (if applicable):
4.4.3

How reproducible:
Always

Steps to Reproduce:
1. run "engine-setup --offline --accep-defaults" in an installation where the "audit_log" and "event_notification_hist" tables are big.

Actual results:
The engine is stuck for hours checking the consistency:
~~~
[ INFO  ] Cleaning async tasks and compensations
[ INFO  ] Unlocking existing entities
[ INFO  ] Checking the Engine database consistency
~~~

The CPU is close to 100% and the following SQL queries can be found in the database:
~~~
  pid   | usename  | datname |                                                                                 query                                                                                  |  client_addr  |    application_name

--------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+--------------------
----
 184378 | engine   | engine  | copy (                                                                                                                                                                +| ::1           | psql
        |          |         |                                 select fk_violation, fk_status                                                                                                        +|               |
        |          |         |                                 from fn_db_validate_fks(false, 0 != 0)                                                                                                +|               |
        |          |         |                                 where fk_status=1                                                                                                                     +|               |
        |          |         |                         ) to stdout with delimiter as '|';                                                                                                             |               |
 184379 | engine   | engine  | select  (audit_log_id)  from public.event_notification_hist where  (audit_log_id) IS NOT NULL and  (audit_log_id)  not in (select audit_log_id from public.audit_log); |               | psql
 184380 | engine   | engine  | select  (audit_log_id)  from public.event_notification_hist where  (audit_log_id) IS NOT NULL and  (audit_log_id)  not in (select audit_log_id from public.audit_log); |               | psql

~~~

Expected results:
The engine-setup command finishes in a reasonable amount of time.

Additional info:
Data table sizes
~~~
engine=# select count(*) from audit_log;
 count  
--------
 660608
(1 row)

engine=# select count(*) from event_notification_hist;
 count  
--------
 635110
(1 row)

~~~

Truncating the "event_notification_hist" table as advised in [1] makes the setup to finish in seconds:
~~~
su - postgres -c "pg_dump -Fc --data-only -d engine -t event_notification_hist > /tmp/event_notification_hist.sql"
engine-setup
su - postgres -c "psql engine -c 'truncate table event_notification_hist;'"
su - postgres -c "pg_restore --data-only -d engine -t event_notification_hist /tmp/event_notification_hist.sql"
~~~


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1323826#c13

Comment 1 Eli Mesika 2021-02-01 09:45:14 UTC
Can you please attach the dump of the tested database ?

Comment 3 Eli Mesika 2021-02-02 15:21:48 UTC
Well , the verification process simply verifies that when we remove an entity it will be removed on all dependant tables via the DB FK definition on the dependant table, this will be time consuming when dealing with large tables

Looking in the problematic table I see

\d event_notification_hist
                  Table "public.event_notification_hist"
    Column    |           Type           | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+---------
 event_name   | character varying(100)   |           | not null |
 audit_log_id | bigint                   |           | not null |
 method_type  | character(10)            |           | not null |
 sent_at      | timestamp with time zone |           | not null |
 status       | boolean                  |           | not null |
 reason       | character(255)           |           |          |
Indexes:
    "idx_event_notification_hist" btree (audit_log_id)
Foreign-key constraints:
    "fk_event_notification_hist_audit_log" FOREIGN KEY (audit_log_id) REFERENCES audit_log(audit_log_id) ON DELETE CASCADE

I came to the conclusion that there is no point to remove history data when an entity is removed and this may also violate the meaning of keeping the history

So, simple solution would be IMO , removing fk_event_notification_hist_audit_log from event_notification_hist table and since all insertions to this table are programmatic and not manual , I see no problem with not checking the passed audit_log_id

Comment 5 Eli Mesika 2021-02-03 11:08:01 UTC
connecting to the database in comment 2 , I found the following :

engine=# select count(*) from event_notification_hist 
;
 635110

So, history table is quite big 
Looking on the dates it keeps :

engine=# select sent_at  from event_notification_hist order by sent_at  desc limit 1;
 2021-01-27 14:08:49.183+00

engine=# select sent_at  from event_notification_hist order by sent_at  asc limit 1;
 2020-12-28 14:16:11.666+00

So, this table really keeps a 30 days of history 

Seems that audit_log table and event_notification_hist have a independent transaction mechanism set for 30 days 
In event_notification_hist it is controlled by DAYS_TO_KEEP_HISTORY while in audit_log it is controlled by AuditLogAgingThreshold 

However , seems we have another issue here regarding audit_log table truncation :

engine=# select * from vdc_options where option_name= 'AuditLogAgingThreshold'; 
        13 | AuditLogAgingThreshold | 30           | general | 30

engine=# select log_time from audit_log order by log_time asc limit 1;
 2020-12-28 00:38:13.964+00

engine=# select log_time from audit_log order by log_time desc limit 1;
 2021-02-03 09:57:31.272+00


Seems that audit_log truncation is not working from some reason here and should be investigated separately 


Anyway, since audit_log and event_notification_hist tables have separate truncation mechanisms and configuration it is not make sense to tie them with FK that cascades deletes from audit_log to event_notification_hist

Comment 9 Guilherme Santos 2021-03-17 21:37:16 UTC
Verified on:
ovirt-engine-4.4.5.9-0.1.el8ev.noarch

Steps:
1. Run engine-setup with very large audit_log table data
engine=# select count(*) from audit_log;
 count  
--------
 673335
(1 row)

# time engine-setup --offline --accept-defaults
(...)

real    3m0.328s
user    1m19.096s
sys     0m29.415s


2. Run engine-setup with small/no audit_log table data
engine=# delete from audit_log;
engine=# select count(*) from audit_log;
 count  
--------
 0
(1 row)

# time engine-setup --offline --accept-defaults
(...)

real    2m41.927s
user    1m12.190s
sys     0m27.500s

Results:
Both executions finished in a reasonable and similar time

Comment 14 errata-xmlrpc 2021-04-14 11:40:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1169


Note You need to log in before you can comment on or make changes to this bug.