Bug 1442911 - On new DB master node, pg_xlog directory filled up
Summary: On new DB master node, pg_xlog directory filled up
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat CloudForms Management Engine
Classification: Red Hat
Component: Appliance
Version: 5.7.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: GA
Target Release: 5.9.0
Assignee: Nick Carboni
QA Contact: luke couzens
URL:
Whiteboard: HA:black
Depends On:
Blocks: 1445385 1450512
 
Reported: 2017-04-18 03:02 UTC by tachoi
Modified: 2023-09-14 03:56 UTC
CC List: 8 users

Fixed In Version: 5.9.0.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1445385 1450512
Environment:
Last Closed: 2018-03-06 15:58:27 UTC
Category: ---
Cloudforms Team: CFME Core
Target Upstream Version:
Embargoed:



Description tachoi 2017-04-18 03:02:11 UTC
Description of problem:
- In a CFME appliance HA DB configuration
- The 100 GB pg_xlog directory filled up, which resulted in the DB service going down.

Version-Release number of selected component (if applicable):
5.7.0.17

How reproducible:
NA

Steps to Reproduce:
1. The master DB failed and failover to the standby node occurred properly
2. The standby node was successfully promoted to master
3. The failed master node was not manually re-added to the cluster for more than a week
4. On the new master node, the WAL directory (pg_xlog) filled up and the DB service went down

Actual results:
1. The DB service went down with the WAL directory full

Expected results:
1. WAL is properly recycled on the new master node
2. A notification method to alert the customer when a DB failover occurs and manual intervention is required

Additional info:

Comment 7 Nick Carboni 2017-04-20 15:12:29 UTC
This is probably the same issue described in bug 1426769.

If the replication slots that were causing the xlog to be retained were not removed, it is likely that this customer will take another outage without intervention.

You can find the slots that are not being used by inspecting the pg_replication_slots view (specifically the "active" column) using `select slot_name, active from pg_replication_slots;`

Unused slots can be dropped using `select pg_drop_replication_slot(slot_name);`
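
For example, a session like the following shows how an inactive slot could be identified and dropped (the database name, slot name, and output here are illustrative, not taken from this case):

# psql -U postgres vmdb_production
vmdb_production=# select slot_name, active from pg_replication_slots;
   slot_name    | active
----------------+--------
 standby_1_slot | f
(1 row)

vmdb_production=# select pg_drop_replication_slot('standby_1_slot');

An "active" value of "f" means no standby is currently consuming from that slot, so the primary keeps accumulating WAL on its behalf.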

More information can be found in the postgresql docs (https://www.postgresql.org/docs/9.5/static/warm-standby.html#STREAMING-REPLICATION-SLOTS) if needed.

Comment 8 Nick Carboni 2017-04-20 19:50:55 UTC
Ah I see, I just read through the case.

It looks like the database was recreated on the master node after dumping a backup.

This would remove any existing replication slots and should prevent the problem from occurring again. Unfortunately, it also prevents us from diagnosing whether replication slots actually caused the xlog to be retained.

Comment 9 tachoi 2017-04-20 23:03:21 UTC
Hi Nick

As this customer is the largest telco in our region, I had to find and apply a workaround to resume their service as soon as possible.

I have a couple of questions from this case.

1. I see that we don't have any limit on the number of WAL files, so if something goes wrong on the network (or the standby node), WAL files will stack up indefinitely. This means an HA network issue or a standby node problem affects the availability of the actual master DB, which is not a good idea in production. Is there any way to prevent this kind of filesystem-full condition by properly recycling WAL on the master node?

[root@standby pg_xlog]# ls -l |wc -l
9
[root@master pg_xlog]# ls -l |wc -l
6282 <--- this is the cause of 100G being used up and space to be full

[root@standby data]# grep wal_keep_segments postgresql.conf
#wal_keep_segments = 0          # in logfile segments, 16MB each; 0 disables

[root@master data]# grep wal_keep_segments postgresql.conf
#wal_keep_segments = 0          # in logfile segments, 16MB each; 0 disables

2. From an operational point of view, we need a notification method to let the customer know that an HA failover incident has occurred, so that they can manually intervene and add the failed node back to the cluster (as this is a current limitation). Otherwise this kind of filesystem-full condition can be expected at any time for any customer, because the failed node is not ready to accept WAL.

Comment 10 Nick Carboni 2017-04-21 13:28:11 UTC
(In reply to tachoi from comment #9)
> 
> 1. I see that we don't have any limit on the number of WAL files, so if
> something goes wrong on the network (or the standby node), WAL files will
> stack up indefinitely.

This is the intended behavior of replication slots. If that WAL is removed, the standby that is consuming changes from the replication slot will have inconsistent data and will need to be re-cloned from the primary server.

> This means an HA network issue or a standby node problem affects the
> availability of the actual master DB, which is not a good idea in
> production. Is there any way to prevent this kind of filesystem-full
> condition by properly recycling WAL on the master node?

The PostgreSQL WAL is not a "log" in the common sense, so it doesn't get "rotated" as such. A good introduction to the concept can be found in the documentation (https://www.postgresql.org/docs/9.5/static/wal-intro.html).

> 
> [root@standby pg_xlog]# ls -l |wc -l
> 9
> [root@master pg_xlog]# ls -l |wc -l
> 6282 <--- this is the cause of 100G being used up and space to be full
> 
> [root@standby data]# grep wal_keep_segments postgresql.conf
> #wal_keep_segments = 0          # in logfile segments, 16MB each; 0 disables
> 
> [root@master data]# grep wal_keep_segments postgresql.conf
> #wal_keep_segments = 0          # in logfile segments, 16MB each; 0 disables

This value doesn't do what you seem to think it does. It is a *minimum* number of WAL segments to keep. It was used primarily before replication slots were introduced, to ensure that the WAL needed to replay active changes was still present when a new standby server was doing its initial clone of the primary database. Now you can create a replication slot at the start of the initial clone, and that will ensure that the WAL you need is still there when the clone finishes.
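
As a rough sketch of that flow (the slot name here is only an example, not what the appliance actually configures):

-- On the primary, before taking the base backup for the new standby:
SELECT pg_create_physical_replication_slot('new_standby_slot');

-- On the new standby, recovery.conf then references the slot so the primary
-- retains exactly the WAL this standby still needs:
--   primary_slot_name = 'new_standby_slot'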

Again, more information on this can be found in the PostgreSQL documentation https://www.postgresql.org/docs/9.5/static/runtime-config-replication.html

> 
> 2. From an operational point of view, we need a notification method to let
> the customer know that an HA failover incident has occurred, so that they
> can manually intervene and add the failed node back to the cluster (as this
> is a current limitation). Otherwise this kind of filesystem-full condition
> can be expected at any time for any customer, because the failed node is not
> ready to accept WAL.

We have this already, but it may not be properly documented.

Today, an event (EvmEvent) is raised for each server when a failover is executed. This can be used in automate to configure sending an email or whatever other notification is necessary. That was added to 5.7 (euwe upstream) here https://github.com/ManageIQ/manageiq/pull/12332

-----------------------

Given what happened here, I think another good enhancement might be to monitor replication slots directly and raise an event (or notification) when we see one that has not been active for some time. This can be done using SQL, so it is more flexible than monitoring disk space directly.
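
A query along these lines could drive such a check (a sketch against PostgreSQL 9.5; in PostgreSQL 10 and later the functions are named pg_current_wal_lsn and pg_wal_lsn_diff):

-- Slots with no connected consumer, and how much WAL each one is forcing
-- the server to retain
SELECT slot_name,
       pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;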

That said, I would still encourage monitoring disk usage on all machines that are vital to operations, but I feel like that should go without saying and is probably out of scope for our application.

-----------------------

After all that, this still seems to be a duplicate of bug 1426769 as I mentioned in comment 7. That bug is not a blocker though so I'm not sure what direction we want to take if we close this.

I'll leave it open for now until we get some feedback from PM.

Comment 12 CFME Bot 2017-04-24 20:34:34 UTC
New commit detected on ManageIQ/manageiq-gems-pending/master:
https://github.com/ManageIQ/manageiq-gems-pending/commit/63a179ea2b419007df07a0385989f8f20978ee8f

commit 63a179ea2b419007df07a0385989f8f20978ee8f
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Apr 19 17:43:17 2017 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Mon Apr 24 15:52:53 2017 -0400

    Offer to clear the data directory for new standby servers
    
    This will allow seamless reintegration of failed primary
    servers after a failover.
    
    When this happens the user will be given the option to clear
    the existing database and re-clone the new primary into this server
    and then continue to set up a standby as before.
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1426718
    https://bugzilla.redhat.com/show_bug.cgi?id=1426769
    https://bugzilla.redhat.com/show_bug.cgi?id=1442911

 .../database_replication_standby.rb                |  20 +--
 .../database_replication_standby_spec.rb           | 143 ++++++++++++++-------
 2 files changed, 112 insertions(+), 51 deletions(-)

Comment 13 Nick Carboni 2017-04-25 12:41:44 UTC
This should be fixed for the node re-introduction case.

I opened another issue [1] for generally monitoring replication slots to prevent this kind of situation in the future.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1445291

Comment 17 luke couzens 2017-10-12 12:44:25 UTC
Verified in 5.9.0.2

Comment 18 Red Hat Bugzilla 2023-09-14 03:56:30 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

