Created attachment 1475929 [details] evm.log Description of problem: Cloudforms Scheduled Backup Jobs not running due to pg_basebackup wal segment being removed: ========================================= [----] E, [2018-08-14T12:14:50.589103 #2201:a2b124] ERROR -- : AwesomeSpawn: pg_basebackup exit code: 1 [----] E, [2018-08-14T12:14:50.589997 #2201:a2b124] ERROR -- : AwesomeSpawn: pg_basebackup: could not get transaction log end position from server: ERROR: requested WAL segment 000000010000021A0000002E has already been removed [----] I, [2018-08-14T12:14:51.122275 #2201:a2b124] INFO -- : MIQ(Class-disconnect) Disconnecting mount point: /tmp/miq_20180814-2201-bo19np [----] I, [2018-08-14T12:14:51.209233 #2201:a2b124] INFO -- : MIQ(Class-disconnect) Disconnecting mount point: /tmp/miq_20180814-2201-bo19np...Complete [----] E, [2018-08-14T12:14:51.210831 #2201:a2b124] ERROR -- : MIQ(MiqQueue#deliver) Message id: [11000200723726], Error: [pg_basebackup exit code: 1] [----] E, [2018-08-14T12:14:51.211679 #2201:a2b124] ERROR -- : [AwesomeSpawn::CommandResultError]: pg_basebackup exit code: 1 Method:[block in method_missing] [----] E, [2018-08-14T12:14:51.213053 #2201:a2b124] ERROR -- : /opt/rh/cfme-gemset/gems/awesome_spawn-1.4.1/lib/awesome_spawn.rb:105:in `run!' ========================================= Version-Release number of selected component (if applicable): 5.9
I think this is due to the way that we run pg_basebackup It looks like we're running this command: `pg_basebackup --username=root -x --format=t --xlog-method=fetch --pgdata=<datadir>` The issue I have with this is the --xlog-method=fetch option. I've added an excerpt from the pg_basebackup man-page here: ---------------------------------------------------- --xlog-method=method Includes the required transaction log files (WAL files) in the backup. This will include all transaction logs generated during the backup. If this option is specified, it is possible to start a postmaster directly in the extracted directory without the need to consult the log archive, thus making this a completely standalone backup. The following methods for collecting the transaction logs are supported: f fetch The transaction log files are collected at the end of the backup. Therefore, it is necessary for the wal_keep_segments parameter to be set high enough that the log is not removed before the end of the backup. If the log has been rotated when it's time to transfer it, the backup will fail and be unusable. s stream Stream the transaction log while the backup is created. This will open a second connection to the server and start streaming the transaction log in parallel while running the backup. Therefore, it will use up two connections configured by the max_wal_senders parameter. As long as the client can keep up with transaction log received, using this mode requires no extra transaction logs to be saved on the master. -------------------------------------------------- It seems like this bug is happening because we are using the "fetch" method, but not setting the wal_keep_segments parameter. Seeing as how we already use "--xlog-method=stream" for repmgr's initial sync backup[1], we should also be using it here. [1] https://github.com/ManageIQ/manageiq-appliance_console/blob/master/lib/manageiq/appliance_console/database_replication.rb#L62
https://github.com/ManageIQ/manageiq-gems-pending/pull/364 should fix this in the appliance and I also opened https://github.com/ManageIQ/container-postgresql/pull/12 to fix the same issue in the postgres container for good measure.
New commit detected on ManageIQ/manageiq-gems-pending/master: https://github.com/ManageIQ/manageiq-gems-pending/commit/022d00154fbba2bbb6349c56dcb762ccac9a81fb commit 022d00154fbba2bbb6349c56dcb762ccac9a81fb Author: Nick Carboni <ncarboni> AuthorDate: Wed Aug 15 09:49:53 2018 -0400 Commit: Nick Carboni <ncarboni> CommitDate: Wed Aug 15 09:49:53 2018 -0400 Use --xlog-method=stream for pg_basebackup Previously we were using the "fetch" method which would fail if we did not set wal_keep_segments to a high enough value. We were never setting a value for wal_keep_segments so this would (infrequently) fail. https://bugzilla.redhat.com/show_bug.cgi?id=1616022 lib/gems/pending/util/postgres_admin.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Ran into a problem with this fix: https://github.com/ManageIQ/manageiq-gems-pending/issues/366 Effectively what happens now is we can't use this method (--xlog-method=stream) unless we switch to `--format=plain` and remove compression, and backups are completely broken using the rake tasks. * * * We basically have two options moving forward: - Switch from `--format=tar` to `--format=plain`, which uses about 10x more disk space per backup, and doesn't allow us to stream the output anywhere. This would require a much larger rewrite of a lot of things surrounding our rake evm:db:backup* tasks since we now are not working with generated tars, but full directory structures. - Revert the change made, and configure `wal_keep_segments` in the postgres config to something reasonable (as a fix for this issue), or create documentation around making sure to set this in order to use the backup functionality. Switching this back to `ON_DEV` until we get this resolved. -Nick
I have a proposal which will ensure that this won't occur while also continuing to use the "fetch" xlog-method (and by extension the tar output format), but it could cause the database to run out of space in the data directory if the backup is slow enough which will cause a crash. The pitch is to create a replication slot before we start the backup then remove it after we finish. We won't ever expect anyone to consume from the slot, but it's a (relatively) clean way to ensure that we retain transaction log until the backup is finished. In the worst case scenario the database is low on free space and the backup is being done at a high-traffic time so a ton of WAL log gets generated while the backup is in progress and is kept on disk. Running out of disk space will crash the PG server and is quite hard to recover from. Before I try implementing this patch I'm curious what Dennis and maybe Brad think about this. If we don't go the replication slot route then I propose closing this as WONTFIX and suggesting to customers that hit this problem to find a value for wal_keep_segments which will allow them to complete their backups given their traffic and space constraints.
Dennis, Based on https://bugzilla.redhat.com/show_bug.cgi?id=1616022#c10, can we change target release to 5.10 please? - Sudhir
Target release changed
As a workaround you will have to choose a value for the `wal_keep_segments` configuration parameter that will allow the backup to finish. You really want it to be as small as possible, but I would start with something like (<pg-volume-free-space>/4)/16MB and see if that allows the backup to finish. Keep in mind that after setting this value, the transaction log will *always* take up at least a quarter of the current free space. So if that value works, it might make sense to decrease it a bit more to reclaim some space for the database. You can set the value from psql using `ALTER SYSTEM SET wal_keep_segments = <value>;` and use `SELECT pg_reload_conf();` to make it take effect.
Also still waiting on a verdict on the proposal in comment 12
Going with the documentation option, needed information for documenting is in comment #16.
New commit detected on ManageIQ/manageiq-appliance_console/master: https://github.com/ManageIQ/manageiq-appliance_console/commit/21de264f929abc0519640ad49b22c7a836e36adc commit 21de264f929abc0519640ad49b22c7a836e36adc Author: Nick Carboni <ncarboni> AuthorDate: Wed Aug 15 13:49:53 2018 +0000 Commit: Nick LaMuro <nicklamuro> CommitDate: Tue Mar 16 19:25:16 2021 +0000 Use --xlog-method=stream for pg_basebackup Previously we were using the "fetch" method which would fail if we did not set wal_keep_segments to a high enough value. We were never setting a value for wal_keep_segments so this would (infrequently) fail. https://bugzilla.redhat.com/show_bug.cgi?id=1616022 (transferred from ManageIQ/manageiq-gems-pending@022d00154fbba2bbb6349c56dcb762ccac9a81fb) lib/manageiq/appliance_console/postgres_admin.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)