Created attachment 1398274 [details] Changes to postgresql.conf for a large, active environment. Description of problem: When trying to purge data from the metrics tables we're running into timeouts. Additionally the following error gets written to the postgresql.log when purge or vacuum is run. ``` 2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:LOG: duration: 6547.944 ms execute <unnamed>: DELETE FROM "metrics" WHERE "metrics"."id" IN (SELECT "metrics"."id" FROM "metrics" WHERE ("metrics"."timestamp" <= '2018-02-15 22:46:20.611836') LIMIT $1) 2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:DETAIL: parameters: $1 = '1000' 2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:LOG: checkpoints are occurring too frequently (8 seconds apart) 2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:HINT: Consider increasing the configuration parameter "max_wal_size". ``` This occurs both in large and small deployments. Much more frequently in large environments. Small: (1 Provider, 1 Cluster, 1 Host, 6 Datastores, 28 VMs, 12 Templates) 48 Objects > Occurs 0-4 times per day Large: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 100-200 times per day [BEFORE Tuning] Large*: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 10-20 times per day [AFTER* Tuning] Version-Release number of selected component (if applicable): 5.8.2.3-3 How reproducible: This occurs every time a vacuum or purge metrics runs. Could also have additional scenarios that trigger this log to occur in PostgreSQL. Steps to Reproduce: 1. Setup a CFME Appliance with a VMware provider. 2. Configure db maintenance scripts 3. Run `tools/purge_metrics.rb` and `/usr/bin/periodic_vacuum_full_tables` 4. Watch `tail -f postgresql.log | grep -C2 checkpoints` Actual results: :LOG: checkpoints are occurring too frequently (8 seconds apart) Expected results: No log entry is written and the WAL is able to keep up with the changes. Additional info: Updated the `postgresql.conf` with the following changes reduce the occurrence of the log message from 146 entries to 10. See attached diff for changes. Note: These are only initial tuning options, more tuning may be necessary but functions as a good starting point.
I have opened a PR here: https://github.com/ManageIQ/manageiq-appliance/pull/181 These consist of all the tuning options we did. After about 1 hour of implementing these changes we have reduced our database footprint from 363GB to 284GB. We are continuing to reclaim space. Hope this PR helps! Ciao, Ron Valente
Additional PR opened for manageiq-appliance_console here: https://github.com/ManageIQ/manageiq-appliance_console/pull/33
*** This bug has been marked as a duplicate of bug 1537733 ***