Bug 1388433
| Summary: | [RFE] Change auto-vacuum configuration defaults to match oVirt db usage | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Eli Mesika <emesika> |
| Component: | Database.Core | Assignee: | Roy Golan <rgolan> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Lucie Leistnerova <lleistne> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.0.0 | CC: | bugs, didi, eheftman, jentrena, lleistne, lsvaty, mgoldboi, michal.skrivanek, mperina, oourfali, rgolan |
| Target Milestone: | ovirt-4.1.2 | Keywords: | FutureFeature |
| Target Release: | 4.1.2 | Flags: | mgoldboi: ovirt-4.1+, mgoldboi: exception+, lsvaty: testing_plan_complete-, mgoldboi: planning_ack+, mperina: devel_ack+, pstehlik: testing_ack+ |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Enhancement |
| Doc Text: | Previously, the Manager's frequently updated PostgreSQL tables would fill up with obsolete data, creating the risk of disk flooding and transaction ID wraparound issues. This release introduces a more aggressive vacuum daemon configuration for collecting obsolete rows, which keeps the Manager's tables healthy and better correlates disk space usage with the amount of data actually retained. | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-23 08:19:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1393270, 1411756, 1447624 | | |
Description (Eli Mesika, 2016-10-25 10:44:18 UTC)
We have to change this for all new automatically provisioned databases, but I'd also add this to the "db customization" documentation (if we have any) and show a warning about it during each engine-setup execution.

---

(Michal Skrivanek, comment #2)

Any issue doing that for existing dbs? I would expect large existing deployments actually suffer the most right now.

---

(In reply to Michal Skrivanek from comment #2)
> Any issue doing that for existing dbs? I would expect large existing
> deployments actually suffer the most right now.

Generally speaking, we do not change postgresql.conf on upgrade, only on provisioning. We try to behave as if postgresql (and its conf) is not owned by us, as if the "normal" behavior is using a remote db, so we confine configuration changes to provisioning only. Also, we currently have no code for a "warning", although that's easy to add. If we only do the minimum, we'll only change this during provisioning and error during upgrade. See the change for bug 1331168 as an example. Of course, everything is possible...

---

I'd say that changing this for newly provisioned dbs and displaying a warning (or error?) during engine-setup execution, if it is not set properly, is enough. (I'd assume that large customers may have already changed postgresql.conf to fit their loads by tuning some other parameters.)

---

autovacuum is a very important process for the database, and raising the global threshold is a no-go:

1. We have variance in our table usage - some tables are update-heavy, the rest are not. A large global value will starve those tables and they won't be cleaned.
2. There is no single value which is good; the system can grow. We should be more dynamic.
3. A co-hosted DWH contends for the vacuum workers.

What I suggest is per-table tuning for the update-heavy tables:

- vds_cpu_statistics (should be removed?)
- vds_interface_statistics
- vds_statistics
- vm_dynamic
- vm_statistics
- vm_interface_statistics

This will set the scale factor to 1:

```
ALTER TABLE name SET (autovacuum_vacuum_scale_factor = 1);
```

The formula [1] that determines whether to vacuum a table is:

```
threshold = vacuum base threshold + vacuum scale factor * number of tuples
```

So using the default threshold and scale factor (50 and 0.2 respectively), a vm table with 1000 records has an effective threshold of 50 + (1000 * 0.2) = 250. I chose a scale factor of 1 because most of the vms and hosts should be running, so most of the records would be updated regularly.

[1] https://www.postgresql.org/docs/9.1/static/routine-vacuuming.html#AUTOVACUUM

---

Moving back to infra. As discussed with Roy, we can create an upgrade script for the relevant tables and alter autovacuum_vacuum_scale_factor for them. Customers can adapt this setting later if needed.

---

A recent change we made on a production system, plus more research and testing, shows that the factor should actually be smaller: the less work the vacuum has to do, the better it behaves. The vacuum algorithm also has a cost limit; when it is reached, the worker stops and waits for the next run while tuples are still waiting to be collected. The vacuum scale factor is going to be set to 0.01, which is fairly aggressive while still supporting table growth. The threshold will stay at its default.

---

Moving to 4.1.1 as we most probably won't be able to finish this before the RC build.

---

Removing 71889 because it's not really required for this bug. It (or a variation) will only be needed if/when dwh wants to add its own extra config items.
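---

For illustration, here is a minimal sketch of the per-table tuning described above, in the spirit of the upgrade script mentioned in the comments. It assumes psycopg2 and a locally reachable engine database (the connection parameters are placeholders), applies the 0.01 factor that was ultimately chosen, and is not the actual oVirt upgrade script.

```
# Illustrative sketch only, not the actual oVirt upgrade script.
# Assumes psycopg2 and a locally reachable "engine" database.
import psycopg2

# Update-heavy tables listed in the comments above; vds_cpu_statistics
# is omitted here because its removal was being discussed.
UPDATE_HEAVY_TABLES = [
    "vds_interface_statistics",
    "vds_statistics",
    "vm_dynamic",
    "vm_statistics",
    "vm_interface_statistics",
]

# The value ultimately chosen in this bug; the base threshold (50)
# stays at its default.
SCALE_FACTOR = 0.01

conn = psycopg2.connect(dbname="engine", user="engine", host="localhost")
conn.autocommit = True  # let each ALTER TABLE commit on its own
with conn.cursor() as cur:
    for table in UPDATE_HEAVY_TABLES:
        # Identifiers cannot be bound as query parameters, so the names
        # are interpolated from the fixed whitelist above.
        cur.execute(
            "ALTER TABLE {0} SET (autovacuum_vacuum_scale_factor = {1})"
            .format(table, SCALE_FACTOR)
        )
conn.close()
```

ALTER TABLE ... SET only updates the table's storage parameters, so rerunning the script is harmless; autovacuum picks up the new factor on its next cycle.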
---

During upgrade, engine-setup checks the PostgreSQL configuration and correctly reports the changes that should be made:

```
Found the following problems in PostgreSQL configuration for the Engine database:
 autovacuum_vacuum_scale_factor required to be at most 0.01
 autovacuum_max_workers required to be at least 6
 maintenance_work_mem required to be at least 65536
Please set:
 autovacuum_vacuum_scale_factor = 0.01
 autovacuum_max_workers = 6
 maintenance_work_mem = 65536
```

But when I chose "Fix automatically", it also changed a setting that was already correct:

```
#autovacuum_analyze_scale_factor = 0.1
...
autovacuum_analyze_scale_factor = 0.05
```

became:

```
#autovacuum_analyze_scale_factor = 0.1
...
# autovacuum_analyze_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.075
```

Tested in ovirt-engine-setup-4.1.1.2-0.1.el7.noarch.

---

Target release should be placed once a package build is known to fix an issue. Since this bug is not modified, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.

---

(Martin Perina, comment #16)

Roy, could you please take a look?

---

(In reply to Martin Perina from comment #16)
> Roy, could you please take a look?

ping?

---

Roy - what's the status here?

---

(Lucie Leistnerova, comment #19)

Now it doesn't rewrite already-good values, but there is a problem with maintenance_work_mem when the value is set with units:

```
maintenance_work_mem = 16MB
```

engine-setup ends with an error:

```
[ ERROR ] Failed to execute stage 'Misc configuration': invalid literal for int() with base 10: '16MB'
2017-05-02 16:22:49 DEBUG otopi.context context._executeMethod:142 method exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/otopi/context.py", line 132, in _executeMethod
    method['method']()
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/db/configuration.py", line 222, in _updatePGConf
    transaction=localtransaction,
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/postgres.py", line 284, in _updatePostgresConf
    needUpdate, content = dbovirtutils.getUpdatedPGConf(content)
  File "/usr/share/ovirt-engine/setup/ovirt_engine_setup/engine_common/database.py", line 1004, in getUpdatedPGConf
    expected=item['expected']
  File "/usr/share/ovirt-engine/setup/bin/../plugins/ovirt-engine-setup/ovirt-engine/db/configuration.py", line 128, in <lambda>
    int(current) >= int(expected)
ValueError: invalid literal for int() with base 10: '16MB'
```

Tested in ovirt-engine-setup-4.1.2-0.1.el7.noarch.

---

Roy, can you please take a look?

---

(In reply to Lucie Leistnerova from comment #19)
> Now it doesn't rewrite already-good values, but there is a problem with
> maintenance_work_mem when the value is set with units.
>
> maintenance_work_mem = 16MB

We don't support units and our scripts never did, i.e. you are exposing a limitation that was there from the beginning. I suggest we VERIFY this RFE and open a new bug for this specific issue, since this RFE didn't introduce it.

---

Alright. I created Bug 1447624.

---

Verified in ovirt-engine-setup-4.1.2-0.1.el7.noarch.
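---

For context on the ValueError above: int() cannot parse values that carry postgresql.conf memory units such as '16MB'. The following is a hedged sketch of unit-aware parsing, assuming maintenance_work_mem is measured in kB when no suffix is given; the helper name is hypothetical and this is not the actual fix tracked in Bug 1447624.

```
# Hypothetical helper, not the actual fix for bug 1447624: convert a
# postgresql.conf memory setting into kB so numeric comparisons work.
# Only the kB/MB/GB suffixes are handled here, for brevity.
_UNITS_KB = {"kB": 1, "MB": 1024, "GB": 1024 ** 2}


def parse_pg_size_kb(value):
    """Parse a value such as '16MB' or '65536' into an int count of kB."""
    value = value.strip()
    for suffix, factor in _UNITS_KB.items():
        if value.endswith(suffix):
            return int(value[:-len(suffix)].strip()) * factor
    return int(value)  # bare number: maintenance_work_mem is in kB


# int('16MB') raises ValueError, as in the traceback above; a
# unit-aware comparison evaluates cleanly instead:
assert parse_pg_size_kb("16MB") == 16384
assert parse_pg_size_kb("16MB") < 65536  # 16MB fails the >= 65536 kB check
```

A check such as parse_pg_size_kb(current) >= parse_pg_size_kb(expected) would then treat '64MB' and '65536' interchangeably.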