Bug 1547161

Summary:

PostgreSQL Checkpoints for WAL unable to keep up with a busy database

Product:

Red Hat CloudForms Management Engine

Reporter:

Ron <rovalent>

Component:

Performance

Assignee:

Pradeep Kumar Surisetty <psuriset>

Status:

CLOSED DUPLICATE

QA Contact:

Alex Newman <anewman>

Severity:

medium

Docs Contact:

Priority:

high

Version:

5.8.0

CC:

anewman, cpelland, gtanzill, hroy, jhutar, ldixon, mfeifer, obarenbo, sbadhwar

Target Milestone:

Target Release:

5.8.4

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-04-17 20:26:17 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

CFME Core

Target Upstream Version:

Embargoed:

Bug Depends On:

Bug Blocks:

1551709

Attachments:

Description	Flags
Changes to postgresql.conf for a large, active environment.	none

Description Ron 2018-02-20 16:07:12 UTC

Created attachment 1398274 [details]
Changes to postgresql.conf for a large, active environment.

Description of problem:

When trying to purge data from the metrics tables we're running into timeouts. Additionally the following error gets written to the postgresql.log when purge or vacuum is run.

```
2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:LOG:  duration: 6547.944 ms  execute <unnamed>: DELETE FROM "metrics" WHERE "metrics"."id" IN (SELECT  "metrics"."id" FROM "metrics" WHERE ("metrics"."timestamp" <= '2018-02-15 22:46:20.611836') LIMIT $1)
2018-02-16 03:02:42 GMT:10.122.156.9(49810):5a8570b1.7645:root@vmdb_production:[30277]:DETAIL:  parameters: $1 = '1000'
2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:LOG:  checkpoints are occurring too frequently (8 seconds apart)
2018-02-16 03:02:46 GMT::5a810349.10d0:@:[4304]:HINT:  Consider increasing the configuration parameter "max_wal_size".
```

This occurs both in large and small deployments.
Much more frequently in large environments.

Small: (1 Provider, 1 Cluster, 1 Host, 6 Datastores, 28 VMs, 12 Templates) 48 Objects > Occurs 0-4 times per day
Large: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 100-200 times per day [BEFORE Tuning]
Large*: (2 Providers, 3 Clusters, 50 Hosts, 119 Datastores, 7600 VMs, 957 Templates) 7772 Objects > 10-20 times per day [AFTER* Tuning]

Version-Release number of selected component (if applicable):

5.8.2.3-3

How reproducible:

This occurs every time a vacuum or purge metrics runs. Could also have additional scenarios that trigger this log to occur in PostgreSQL.

Steps to Reproduce:
1. Setup a CFME Appliance with a VMware provider.
2. Configure db maintenance scripts
3. Run `tools/purge_metrics.rb` and `/usr/bin/periodic_vacuum_full_tables`
4. Watch `tail -f postgresql.log | grep -C2 checkpoints`

Actual results:

:LOG:  checkpoints are occurring too frequently (8 seconds apart)

Expected results:

No log entry is written and the WAL is able to keep up with the changes.

Additional info:

Updated the `postgresql.conf` with the following changes reduce the occurrence of the log message from 146 entries to 10.

See attached diff for changes. Note: These are only initial tuning options, more tuning may be necessary but functions as a good starting point.

Comment 6 Ron 2018-02-21 21:48:11 UTC

I have opened a PR here: https://github.com/ManageIQ/manageiq-appliance/pull/181

These consist of all the tuning options we did.

After about 1 hour of implementing these changes we have reduced our database footprint from 363GB to 284GB. We are continuing to reclaim space.

Hope this PR helps!

Ciao,
Ron Valente

Comment 7 Ron 2018-02-22 16:06:27 UTC

Additional PR opened for manageiq-appliance_console here: https://github.com/ManageIQ/manageiq-appliance_console/pull/33

Comment 15 Marianne Feifer 2018-04-17 20:26:17 UTC


*** This bug has been marked as a duplicate of bug 1537733 ***