Bug 1388433 - [RFE] Change auto-vacuum configuration defaults to match oVirt db usage
Summary: [RFE] Change auto-vacuum configuration defaults to match oVirt db usage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Database.Core
Version: 4.0.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ovirt-4.1.2
: 4.1.2
Assignee: Roy Golan
QA Contact: Lucie Leistnerova
URL:
Whiteboard:
Depends On:
Blocks: 1393270 1411756 1447624
TreeView+ depends on / blocked
 
Reported: 2016-10-25 10:44 UTC by Eli Mesika
Modified: 2017-06-26 09:01 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Previously, the Manager’s PostgreSQL tables that were frequently updated would fill up with obsolete data, creating the risk of disk flooding and transaction ID wraparound issues. This release has introduced a more aggressive vacuum daemon configuration for collecting obsolete rows. This enables the Manager’s tables to remain healthy, and for disk space usage to be better correlated with the actual amount of kept data.
Clone Of:
Environment:
Last Closed: 2017-05-23 08:19:50 UTC
oVirt Team: Infra
Embargoed:
mgoldboi: ovirt-4.1+
mgoldboi: exception+
lsvaty: testing_plan_complete-
mgoldboi: planning_ack+
mperina: devel_ack+
pstehlik: testing_ack+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1393270 0 high CLOSED Database table, vm_dynamic, grows without limit until the disk fills up. 2021-08-30 13:34:21 UTC
oVirt gerrit 68676 0 'None' MERGED packaging: setup: More aggressive autovacuum configuration 2020-12-03 04:29:56 UTC
oVirt gerrit 70309 0 'None' MERGED packaging: setup: More aggressive autovacuum configuration 2020-12-03 04:29:56 UTC
oVirt gerrit 70422 0 'None' MERGED packaging: setup: DB configuration plugin 2020-12-03 04:29:56 UTC
oVirt gerrit 70454 0 'None' ABANDONED packaging: setup: report all misconfigured items at once 2020-12-03 04:29:56 UTC
oVirt gerrit 71889 0 'None' ABANDONED packaging: setup: Add INVALID_CONFIG_ITEMS 2020-12-03 04:29:56 UTC
oVirt gerrit 72111 0 'None' MERGED packaging: setup: DB configuration plugin 2020-12-03 04:29:56 UTC
oVirt gerrit 73818 0 'None' MERGED packaging: Update only invalid values in postgres.conf 2020-12-03 04:29:57 UTC
oVirt gerrit 73942 0 'None' MERGED packaging: Extract the value correctly from postgresql.conf 2020-12-03 04:29:57 UTC
oVirt gerrit 74316 0 'None' MERGED packaging: Introduce unit test 2020-12-03 04:29:58 UTC
oVirt gerrit 74572 0 'None' MERGED packaging: Update only invalid values in postgres.conf 2020-12-03 04:29:57 UTC
oVirt gerrit 74573 0 'None' MERGED packaging: Extract the value correctly from postgresql.conf 2020-12-03 04:29:58 UTC
oVirt gerrit 74574 0 'None' MERGED packaging: Introduce unit test 2020-12-03 04:29:57 UTC

Internal Links: 1393270

Description Eli Mesika 2016-10-25 10:44:18 UTC
Description of problem:
 Change default autovacuuming settings, by default it's execute too often (on 50 tuples update), which makes postgresql process very busy.
We should change the default for all newly created db in engine-setup and probably update doc how to do that for existing ones


Version-Release number of selected component (if applicable):


Actual results:

currently in postgresql.conf the defaults are:

autovacuum_vacuum_threshold = 50 
autovacuum_analyze_threshold = 50


Expected results:

should be changed to :
autovacuum_vacuum_threshold = 5000 
autovacuum_analyze_threshold = 10000



Additional info:

Comment 1 Martin Perina 2016-10-25 12:25:51 UTC
We have to change this for all new automatically provisioned databases, but I'd also add this to "db customization" documentation (if we have any) and also show warning about that during each engine-setup execution.

Comment 2 Michal Skrivanek 2016-11-03 10:06:04 UTC
Any issue doing that for existing dbs? I would expect large existing deployments actually suffer the most right now.

Comment 3 Yedidyah Bar David 2016-11-03 12:51:52 UTC
(In reply to Michal Skrivanek from comment #2)
> Any issue doing that for existing dbs? I would expect large existing
> deployments actually suffer the most right now.

Generally speaking, we do not change postgresql.conf on upgrade, only on provisioning. We try to behave as if postgresql (and its conf) is not owned by us, as if the "normal" behavior is using a remote db. So we confine changes to configuration to be only during provisioning. Also, we do not have currently code for "warning", although that's easier to add.

If we only do the minimum, we'll only change this during provisioning and error during upgrade. See as an example the change for bug 1331168. Of course, everything is possible...

Comment 4 Martin Perina 2016-11-04 10:24:43 UTC
I'd say that changing this pro newly provisioned db and display  a warning (or error?) during engine-setup execution, if not set properly, is enough (I'd assume that large customers may have already changed postgresql.conf to fit their loads by tuning some other parameters).

Comment 7 Roy Golan 2016-11-15 13:32:20 UTC
autovacuum is a very important process for the database and raising the global threshold is a no-go:
1. we have variance in our table usage - some update heavy, rest isn't. 
large global value will starve those tables and they won't be cleaned.
2. there is no single value which is good. The system can grow. We should be more dynamic
3. co-hosted DWH is contending on vacuum workers

What I suggest is a per-table tuning for the update-heavy tables:
- vds_cpu_statistics  (should be removed?)
- vds_interface_statistics
- vds_statistics
- vm_dynamic
- vm_statistics
- vm_interface_statistics

This will set the scale factor to 1:
```
ALTER table NAME SET (autovacuum_vacuum_scale_factor=1);
```

The formula [1] to determine if to run vacuum for a table is:
```
threshold = vacuum base threshold + vacuum scale factor * number of tuples
```

So using defaults threshold and scale factor (50 and 0.2 respectively) and vm table have 1000 records then the effective threshold is:
 50 + (1000 * 0.2) = 250

I took load factor =1 because most of the vms and host should be running then most of the records would be updated regularly.

[1] https://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#AUTOVACUUM

Comment 8 Martin Perina 2016-11-16 10:07:39 UTC
Moving back to infra, as discussed with Roy we can create an upgrade script for relevant tables and alter autovacuum_vacuum_scale_factor for them. Customer can adapt this setting later if needed

Comment 9 Roy Golan 2016-12-15 10:34:44 UTC
A recent change on a production system we made plus more research and tests I made  shows that the factor should be actually smaller. The smaller amount of work the vacuum has to do the better it behaves. The vacuum algorithm also has the cost limit factor, when it reached the worker stops working and waits for the next time with tuples waiting to be collected. 

The vacuum factor is going to set to 0.01 to be fairly aggressive and still support table growth. threshold will stay default.

Comment 12 Martin Perina 2017-01-14 20:36:43 UTC
Moving to 4.1.1 as we most probably won't be able to finish that before RC build

Comment 13 Yedidyah Bar David 2017-02-13 14:17:52 UTC
Removing 71889 because it's not really required for this bug. It (or a variation) will only be needed if/when dwh wants to add its own extra config items.

Comment 14 Lucie Leistnerova 2017-03-02 15:39:07 UTC
engine-setup checks by upgrade the postgres configuration and correctly writes changes that should be made

          Found the following problems in PostgreSQL configuration for the Engine database:
           autovacuum_vacuum_scale_factor required to be at most 0.01
           autovacuum_max_workers required to be at least 6
           maintenance_work_mem required to be at least 65536
          Please set:
           autovacuum_vacuum_scale_factor = 0.01
           autovacuum_max_workers = 6
           maintenance_work_mem = 65536

but when I chose to Fix automatically it also changed the setting that was already correct

#autovacuum_analyze_scale_factor = 0.1
...
autovacuum_analyze_scale_factor = 0.05

-->

#autovacuum_analyze_scale_factor = 0.1
...
# autovacuum_analyze_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.075


tested in ovirt-engine-setup-4.1.1.2-0.1.el7.noarch

Comment 15 Red Hat Bugzilla Rules Engine 2017-03-02 15:39:13 UTC
Target release should be placed once a package build is known to fix a issue. Since this bug is not modified, the target version has been reset. Please use target milestone to plan a fix for a oVirt release.

Comment 16 Martin Perina 2017-03-06 13:47:59 UTC
Roy, could you please take a look?

Comment 17 Yaniv Kaul 2017-03-08 19:33:07 UTC
(In reply to Martin Perina from comment #16)
> Roy, could you please take a look?

ping?

Comment 18 Oved Ourfali 2017-03-15 09:34:17 UTC
Roy - what's the status here?

Comment 19 Lucie Leistnerova 2017-05-02 14:34:08 UTC
Now it doesn't rewrite already good values, but there is a problem with maintenance_work_mem when value is set with units.

maintenance_work_mem = 16MB 

engine-setup ends with error
[ ERROR ] Failed to execute stage 'Misc configuration': invalid literal for int() with base 10: '16MB'

2017-05-02 16:22:49 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/db/configuration.py", line 222, in _updatePGConf
    transaction=localtransaction,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/postgres.py", line 284, in _updatePostgresConf
    needUpdate, content = dbovirtutils.getUpdatedPGConf(content)
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 1004, in getUpdatedPGConf
    expected=item['expected']
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/db/configuration.py", line 128, in <lambda>
    int(current) >= int(expected)
ValueError: invalid literal for int() with base 10: '16MB'


tested in ovirt-engine-setup-4.1.2-0.1.el7.noarch

Comment 20 Oved Ourfali 2017-05-02 16:34:49 UTC
Roy, can you please take a look?

Comment 21 Roy Golan 2017-05-03 09:12:43 UTC
(In reply to Lucie Leistnerova from comment #19)
> Now it doesn't rewrite already good values, but there is a problem with
> maintenance_work_mem when value is set with units.
> 
> maintenance_work_mem = 16MB 

We don't support units and our scripts never did - i.e you are exposing a limitation that was already there from the beginning. 

I suggest we should VERIFY this RFE and open a new bug for this specific issue since this RFE didn't introduce that.

Comment 22 Lucie Leistnerova 2017-05-03 11:03:35 UTC
Alright. I created Bug 1447624
verified in ovirt-engine-setup-4.1.2-0.1.el7.noarch


Note You need to log in before you can comment on or make changes to this bug.