1616022 – pg_basebackup: could not get transaction log end position from server

Bug 1616022 - pg_basebackup: could not get transaction log end position from server

Summary: pg_basebackup: could not get transaction log end position from server

Keywords:
Status:	VERIFIED
Alias:	None
Product:	Red Hat CloudForms Management Engine
Classification:	Red Hat
Component:	Documentation
Sub Component:
Version:	5.9.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	GA
Target Release:	5.10.2
Assignee:	Chris Budzilowicz
QA Contact:	Red Hat CloudForms Documentation
Docs Contact:	Red Hat CloudForms Documentation
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2018-08-14 18:28 UTC by Jared Deubel
Modified:	2022-04-30 04:36 UTC (History)
CC List:	12 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Category:	---
Cloudforms Team:	Documentation
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
evm.log (6.92 MB, text/plain) 2018-08-14 18:28 UTC, Jared Deubel	no flags	Details
View All

Description Jared Deubel 2018-08-14 18:28:38 UTC

Created attachment 1475929 [details]
evm.log

Description of problem:
Cloudforms Scheduled Backup Jobs not running due to pg_basebackup wal segment being removed:


=========================================
[----] E, [2018-08-14T12:14:50.589103 #2201:a2b124] ERROR -- : AwesomeSpawn: pg_basebackup exit code: 1
[----] E, [2018-08-14T12:14:50.589997 #2201:a2b124] ERROR -- : AwesomeSpawn: pg_basebackup: could not get transaction log end position from server: ERROR:  requested WAL segment 000000010000021A0000002E has already been removed

[----] I, [2018-08-14T12:14:51.122275 #2201:a2b124]  INFO -- : MIQ(Class-disconnect) Disconnecting mount point: /tmp/miq_20180814-2201-bo19np
[----] I, [2018-08-14T12:14:51.209233 #2201:a2b124]  INFO -- : MIQ(Class-disconnect) Disconnecting mount point: /tmp/miq_20180814-2201-bo19np...Complete
[----] E, [2018-08-14T12:14:51.210831 #2201:a2b124] ERROR -- : MIQ(MiqQueue#deliver) Message id: [11000200723726], Error: [pg_basebackup exit code: 1]
[----] E, [2018-08-14T12:14:51.211679 #2201:a2b124] ERROR -- : [AwesomeSpawn::CommandResultError]: pg_basebackup exit code: 1  Method:[block in method_missing]
[----] E, [2018-08-14T12:14:51.213053 #2201:a2b124] ERROR -- : /opt/rh/cfme-gemset/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!'
=========================================
Version-Release number of selected component (if applicable):
5.9

Comment 3 Nick Carboni 2018-08-15 13:46:24 UTC

I think this is due to the way that we run pg_basebackup

It looks like we're running this command:
`pg_basebackup --username=root -x --format=t --xlog-method=fetch --pgdata=<datadir>`

The issue I have with this is the --xlog-method=fetch option. I've added an excerpt from the pg_basebackup man-page here:

----------------------------------------------------

--xlog-method=method
Includes the required transaction log files (WAL files) in the backup. This will include all transaction logs generated during the backup. If this option is specified, it is possible to start a postmaster directly in the extracted directory without the need to consult the log archive, thus making this a completely standalone backup.

The following methods for collecting the transaction logs are supported:

f
fetch
The transaction log files are collected at the end of the backup. Therefore, it is necessary for the wal_keep_segments parameter to be set high enough that the log is not removed before the end of the backup. If the log has been rotated when it's time to transfer it, the backup will fail and be unusable.

s
stream
Stream the transaction log while the backup is created. This will open a second connection to the server and start streaming the transaction log in parallel while running the backup. Therefore, it will use up two connections configured by the max_wal_senders parameter. As long as the client can keep up with transaction log received, using this mode requires no extra transaction logs to be saved on the master.

--------------------------------------------------

It seems like this bug is happening because we are using the "fetch" method, but not setting the wal_keep_segments parameter. Seeing as how we already use "--xlog-method=stream" for repmgr's initial sync backup[1], we should also be using it here.

[1] https://github.com/ManageIQ/manageiq-appliance_console/blob/master/lib/manageiq/appliance_console/database_replication.rb#L62

Comment 4 Nick Carboni 2018-08-15 13:56:11 UTC

https://github.com/ManageIQ/manageiq-gems-pending/pull/364 should fix this in the appliance and I also opened https://github.com/ManageIQ/container-postgresql/pull/12 to fix the same issue in the postgres container for good measure.

Comment 5 CFME Bot 2018-08-15 13:57:21 UTC

New commit detected on ManageIQ/manageiq-gems-pending/master:

https://github.com/ManageIQ/manageiq-gems-pending/commit/022d00154fbba2bbb6349c56dcb762ccac9a81fb
commit 022d00154fbba2bbb6349c56dcb762ccac9a81fb
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Aug 15 09:49:53 2018 -0400
Commit:     Nick Carboni <ncarboni>
CommitDate: Wed Aug 15 09:49:53 2018 -0400

    Use --xlog-method=stream for pg_basebackup

    Previously we were using the "fetch" method which would fail if
    we did not set wal_keep_segments to a high enough value. We were
    never setting a value for wal_keep_segments so this would (infrequently)
    fail.

    https://bugzilla.redhat.com/show_bug.cgi?id=1616022

 lib/gems/pending/util/postgres_admin.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 9 Nick LaMuro 2018-08-24 15:30:49 UTC

Ran into a problem with this fix:

https://github.com/ManageIQ/manageiq-gems-pending/issues/366

Effectively what happens now is we can't use this method (--xlog-method=stream) unless we switch to `--format=plain` and remove compression, and backups are completely broken using the rake tasks.

* * *

We basically have two options moving forward:

- Switch from `--format=tar` to `--format=plain`, which uses about 10x more disk space per backup, and doesn't allow us to stream the output anywhere.  This would require a much larger rewrite of a lot of things surrounding our rake evm:db:backup* tasks since we now are not working with generated tars, but full directory structures.
- Revert the change made, and configure `wal_keep_segments` in the postgres config to something reasonable (as a fix for this issue), or create documentation around making sure to set this in order to use the backup functionality.


Switching this back to `ON_DEV` until we get this resolved.

-Nick

Comment 12 Nick Carboni 2018-09-26 20:07:13 UTC

I have a proposal which will ensure that this won't occur while also continuing to use the "fetch" xlog-method (and by extension the tar output format), but it could cause the database to run out of space in the data directory if the backup is slow enough which will cause a crash.

The pitch is to create a replication slot before we start the backup then remove it after we finish. We won't ever expect anyone to consume from the slot, but it's a (relatively) clean way to ensure that we retain transaction log until the backup is finished.

In the worst case scenario the database is low on free space and the backup is being done at a high-traffic time so a ton of WAL log gets generated while the backup is in progress and is kept on disk. Running out of disk space will crash the PG server and is quite hard to recover from.

Before I try implementing this patch I'm curious what Dennis and maybe Brad think about this. If we don't go the replication slot route then I propose closing this as WONTFIX and suggesting to customers that hit this problem to find a value for wal_keep_segments which will allow them to complete their backups given their traffic and space constraints.

Comment 13 Sudhir Mallamprabhakara 2018-09-28 15:36:07 UTC

Dennis,

Based on https://bugzilla.redhat.com/show_bug.cgi?id=1616022#c10, can we change target release to 5.10 please?

- Sudhir

Comment 14 dmetzger 2018-10-09 14:36:46 UTC

Target release changed

Comment 16 Nick Carboni 2018-12-13 14:26:16 UTC

As a workaround you will have to choose a value for the `wal_keep_segments` configuration parameter that will allow the backup to finish.

You really want it to be as small as possible, but I would start with something like (<pg-volume-free-space>/4)/16MB and see if that allows the backup to finish.
Keep in mind that after setting this value, the transaction log will *always* take up at least a quarter of the current free space. So if that value works, it might make sense to decrease it a bit more to reclaim some space for the database.

You can set the value from psql using `ALTER SYSTEM SET wal_keep_segments = <value>;` and use `SELECT pg_reload_conf();` to make it take effect.

Comment 17 Nick Carboni 2018-12-13 14:27:43 UTC

Also still waiting on a verdict on the proposal in comment 12

Comment 18 dmetzger 2018-12-19 16:39:33 UTC

Going with the documentation option, needed information for documenting is in comment #16.

Comment 28 CFME Bot 2021-03-17 16:36:31 UTC

New commit detected on ManageIQ/manageiq-appliance_console/master:

https://github.com/ManageIQ/manageiq-appliance_console/commit/21de264f929abc0519640ad49b22c7a836e36adc
commit 21de264f929abc0519640ad49b22c7a836e36adc
Author:     Nick Carboni <ncarboni>
AuthorDate: Wed Aug 15 13:49:53 2018 +0000
Commit:     Nick LaMuro <nicklamuro>
CommitDate: Tue Mar 16 19:25:16 2021 +0000

    Use --xlog-method=stream for pg_basebackup

    Previously we were using the "fetch" method which would fail if
    we did not set wal_keep_segments to a high enough value. We were
    never setting a value for wal_keep_segments so this would (infrequently)
    fail.

    https://bugzilla.redhat.com/show_bug.cgi?id=1616022


    (transferred from ManageIQ/manageiq-gems-pending@022d00154fbba2bbb6349c56dcb762ccac9a81fb)
 lib/manageiq/appliance_console/postgres_admin.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Note You need to log in before you can comment on or make changes to this bug.