Bug 1547161

Summary: PostgreSQL Checkpoints for WAL unable to keep up with a busy database
Product: Red Hat CloudForms Management Engine Reporter: Ron <rovalent>
Component: PerformanceAssignee: Pradeep Kumar Surisetty <psuriset>
Status: CLOSED DUPLICATE QA Contact: Alex Newman <anewman>
Severity: medium Docs Contact:
Priority: high    
Version: 5.8.0CC: anewman, cpelland, gtanzill, hroy, jhutar, ldixon, mfeifer, obarenbo, sbadhwar
Target Milestone: GA   
Target Release: 5.8.4   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-04-17 20:26:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: CFME Core Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1551709    
Attachments:
Description Flags
Changes to postgresql.conf for a large, active environment. none

Description Ron 2018-02-20 16:07:12 UTC
Created attachment 1398274 [details]
Changes to postgresql.conf for a large, active environment.

Description of problem:

When trying to purge data from the metrics tables we're running into timeouts. Additionally the following error gets written to the postgresql.log when purge or vacuum is run.

```
2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:LOG:  duration: 6547.944 ms  execute <unnamed>: DELETE FROM "metrics" WHERE "metrics"."id" IN (SELECT  "metrics"."id" FROM "metrics" WHERE ("metrics"."timestamp" <= '2018-02-15 22:46:20.611836') LIMIT $1)
2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:DETAIL:  parameters: $1 = '1000'
2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:LOG:  checkpoints are occurring too frequently (8 seconds apart)
2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:HINT:  Consider increasing the configuration parameter "max_wal_size".
```

This occurs both in large and small deployments.
Much more frequently in large environments.

Small: (1 Provider, 1 Cluster, 1 Host, 6 Datastores, 28 VMs, 12 Templates) 48 Objects > Occurs 0-4 times per day
Large: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 100-200 times per day [BEFORE Tuning]
Large*: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 10-20 times per day [AFTER* Tuning]

Version-Release number of selected component (if applicable):

5.8.2.3-3

How reproducible:

This occurs every time a vacuum or purge metrics runs. Could also have additional scenarios that trigger this log to occur in PostgreSQL.

Steps to Reproduce:
1. Setup a CFME Appliance with a VMware provider.
2. Configure db maintenance scripts
3. Run `tools/purge_metrics.rb` and `/usr/bin/periodic_vacuum_full_tables`
4. Watch `tail -f postgresql.log | grep -C2 checkpoints`

Actual results:

:LOG:  checkpoints are occurring too frequently (8 seconds apart)

Expected results:

No log entry is written and the WAL is able to keep up with the changes.

Additional info:

Updated the `postgresql.conf` with the following changes reduce the occurrence of the log message from 146 entries to 10.

See attached diff for changes. Note: These are only initial tuning options, more tuning may be necessary but functions as a good starting point.

Comment 6 Ron 2018-02-21 21:48:11 UTC
I have opened a PR here: https://github.com/ManageIQ/manageiq-appliance/pull/181

These consist of all the tuning options we did.

After about 1 hour of implementing these changes we have reduced our database footprint from 363GB to 284GB. We are continuing to reclaim space.

Hope this PR helps!

Ciao,
Ron Valente

Comment 7 Ron 2018-02-22 16:06:27 UTC
Additional PR opened for manageiq-appliance_console here: https://github.com/ManageIQ/manageiq-appliance_console/pull/33

Comment 15 Marianne Feifer 2018-04-17 20:26:17 UTC

*** This bug has been marked as a duplicate of bug 1537733 ***