1547161 – PostgreSQL Checkpoints for WAL unable to keep up with a busy database

Bug 1547161 - PostgreSQL Checkpoints for WAL unable to keep up with a busy database

Summary: PostgreSQL Checkpoints for WAL unable to keep up with a busy database

Keywords:
Status:	CLOSED DUPLICATE of bug 1537733
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Performance
Sub Component:
Version:	5.8.0
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	medium
Target Milestone:	GA
Target Release:	5.8.4
Assignee:	Pradeep Kumar Surisetty
QA Contact:	Alex Newman
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1551709
TreeView+	depends on / blocked

Reported:	2018-02-20 16:07 UTC by Ron
Modified:	2021-06-10 14:44 UTC (History)
CC List:	9 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-04-17 20:26:17 UTC
Category:	---
Cloudforms Team:	CFME Core
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Changes to postgresql.conf for a large, active environment. (865 bytes, patch) 2018-02-20 16:07 UTC, Ron	no flags	Details \| Diff
View All

Description Ron 2018-02-20 16:07:12 UTC

Created attachment 1398274 [details]
Changes to postgresql.conf for a large, active environment.

Description of problem:

When trying to purge data from the metrics tables we're running into timeouts. Additionally the following error gets written to the postgresql.log when purge or vacuum is run.

```
2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:LOG:  duration: 6547.944 ms  execute <unnamed>: DELETE FROM "metrics" WHERE "metrics"."id" IN (SELECT  "metrics"."id" FROM "metrics" WHERE ("metrics"."timestamp" <= '2018-02-15 22:46:20.611836') LIMIT $1)
2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:DETAIL:  parameters: $1 = '1000'
2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:LOG:  checkpoints are occurring too frequently (8 seconds apart)
2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:HINT:  Consider increasing the configuration parameter "max_wal_size".
```

This occurs both in large and small deployments.
Much more frequently in large environments.

Small: (1 Provider, 1 Cluster, 1 Host, 6 Datastores, 28 VMs, 12 Templates) 48 Objects > Occurs 0-4 times per day
Large: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 100-200 times per day [BEFORE Tuning]
Large*: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 10-20 times per day [AFTER* Tuning]

Version-Release number of selected component (if applicable):

5.8.2.3-3

How reproducible:

This occurs every time a vacuum or purge metrics runs. Could also have additional scenarios that trigger this log to occur in PostgreSQL.

Steps to Reproduce:
1. Setup a CFME Appliance with a VMware provider.
2. Configure db maintenance scripts
3. Run `tools/purge_metrics.rb` and `/usr/bin/periodic_vacuum_full_tables`
4. Watch `tail -f postgresql.log | grep -C2 checkpoints`

Actual results:

:LOG:  checkpoints are occurring too frequently (8 seconds apart)

Expected results:

No log entry is written and the WAL is able to keep up with the changes.

Additional info:

Updated the `postgresql.conf` with the following changes reduce the occurrence of the log message from 146 entries to 10.

See attached diff for changes. Note: These are only initial tuning options, more tuning may be necessary but functions as a good starting point.

Comment 6 Ron 2018-02-21 21:48:11 UTC

I have opened a PR here: https://github.com/ManageIQ/manageiq-appliance/pull/181

These consist of all the tuning options we did.

After about 1 hour of implementing these changes we have reduced our database footprint from 363GB to 284GB. We are continuing to reclaim space.

Hope this PR helps!

Ciao,
Ron Valente

Comment 7 Ron 2018-02-22 16:06:27 UTC

Additional PR opened for manageiq-appliance_console here: https://github.com/ManageIQ/manageiq-appliance_console/pull/33

Comment 15 Marianne Feifer 2018-04-17 20:26:17 UTC


*** This bug has been marked as a duplicate of bug 1537733 ***

Note You need to log in before you can comment on or make changes to this bug.