Bug 1284686

Summary: [RFE] Support use of snapshots in katello-backup to allow service to be restored quickly
Product: Red Hat Satellite
Reporter: Stuart Auchterlonie <sauchter>
Component: Backup & Restore
Assignee: Christine Fouant <cfouant>
Status: CLOSED ERRATA
QA Contact: Peter Ondrejka <pondrejk>
Severity: high
Docs Contact: Michaela Slaninkova <mslanink>
Priority: high
Version: 6.1.3
CC: awestbro, bbuckingham, bchardim, bkearney, cfouant, egolov, ehelms, fgarciad, mmccune, molasaga, mslanink, oshtaier, riehecky, sthirugn, xdmoon
Target Milestone: Unspecified
Keywords: FutureFeature, Performance, Triaged, UserExperience
Target Release: Unused
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tfm-rubygem-katello-3.4.2
Doc Type: Bug Fix
Last Closed: 2018-02-21 12:32:18 UTC
Type: Bug
Bug Blocks: 1317008, 1479962    

Description Stuart Auchterlonie 2015-11-23 22:12:45 UTC
Description of problem:

katello-backup is horrendously slow, and while the backup is being taken
the Satellite system is offline (since the services have been shut down).

My customer is reporting that their backups are now taking over 11 hours.

In order to minimize the time the services are down during the backup,
we should support taking LVM snapshots of the various volumes for
postgresql, pulp, and mongodb (and any other required data).

The technique is as follows, using SSM to manage the snapshots (a rough shell sketch follows the list):
- Shutdown services
- Create snapshots and mount them
- Restart services
- Backup snapshots (without -v and -z, see bz#1283578)
- Unmount and remove snapshots
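
A rough sketch of that flow in shell (illustrative only -- the VG/LV names, snapshot size, mount points and the plain-tar step are assumptions, and plain lvcreate/lvremove is shown instead of SSM for readability):

  #!/bin/bash
  # Illustrative sketch only; VG/LV names, sizes and paths are assumptions.
  set -euo pipefail

  katello-service stop                      # stop Satellite services

  # Create and mount read-only snapshots of the data volumes
  for lv in pulp mongodb pgsql; do
      lvcreate --snapshot --size 2G --name "${lv}-snap" "/dev/vg_data/${lv}"
      mkdir -p "/mnt/snap/${lv}"
      mount -o ro "/dev/vg_data/${lv}-snap" "/mnt/snap/${lv}"   # XFS may also need -o nouuid
  done

  katello-service start                     # services come back while the copy runs

  # Back up from the snapshots (no -v/-z, see bz#1283578)
  tar --create --file /backup/katello-data.tar -C /mnt/snap .

  # Unmount and remove the snapshots
  for lv in pulp mongodb pgsql; do
      umount "/mnt/snap/${lv}"
      lvremove --yes "/dev/vg_data/${lv}-snap"
  done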


Version-Release number of selected component (if applicable):

6.1.3


Actual results:

Without using this procedure, the backup speed remains unacceptable.

Expected results:

Restoration of the satellite services is much faster.
There is no need to wait for the backup to complete.


Additional info:

Comment 4 Bryan Kearney 2016-07-08 20:21:02 UTC
Per 6.3 planning, moving non-acked bugs out to the backlog.

Comment 6 Christine Fouant 2017-01-31 21:11:02 UTC
Created redmine issue http://projects.theforeman.org/issues/18329 from this bug

Comment 7 Christine Fouant 2017-01-31 22:31:58 UTC
*** Bug 1382002 has been marked as a duplicate of this bug. ***

Comment 8 Christine Fouant 2017-02-03 20:05:48 UTC
*** Bug 1354337 has been marked as a duplicate of this bug. ***

Comment 9 Bryan Kearney 2017-05-09 18:01:46 UTC
This did not make the 1.15/3.5 cut. I am pushing this out to sat-backlog.

Comment 10 Satellite Program 2017-06-13 16:02:23 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/18329 has been resolved.

Comment 13 Peter Ondrejka 2017-10-03 12:48:58 UTC
On satellite-6.3.0-19.0.beta.el7sat.noarch this fails due to https://bugzilla.redhat.com/show_bug.cgi?id=1497957

Christine, could you please give me some background on the feature, as I didn't find much information on how it's meant to be used (and tested), namely:

-- Are there any prerequisites to the procedure, and does the script check if it has what it needs?
-- Is it supposed to stop services?
-- Can it be combined with other katello-backup subcommands?
-- How does one restore from a backup created by this feature?

Cheers

Comment 14 Peter Ondrejka 2017-10-04 09:23:04 UTC
After working around the issue from 1497957:
~]# katello-backup --snapshot /var/tmp
Starting backup: 2017-10-04 04:44:07 -0400
Creating backup folder /var/tmp/katello-backup-20171004044409
Generating metadata ... 
Cannot create a temporary file: /var/tmp/scl3HIanE
Done.
Backing up config files... 
Done.
WARNING: This script will stop your services. Do you want to proceed(y/n)? y
Redirecting to /bin/systemctl stop foreman-tasks.service
Redirecting to /bin/systemctl stop httpd.service
Redirecting to /bin/systemctl stop pulp_celerybeat.service
Redirecting to /bin/systemctl stop pulp_streamer.service
Redirecting to /bin/systemctl stop pulp_resource_manager.service
Redirecting to /bin/systemctl stop pulp_workers.service
Redirecting to /bin/systemctl stop tomcat.service
Redirecting to /bin/systemctl stop postgresql.service
Redirecting to /bin/systemctl stop mongod.service
Creating pulp snapshot
  Volume group "rhel_sgi-uv20-01" has insufficient free space (0 extents): 512 required.
Failed 'lvcreate -npulp-snap -L2G -s /dev/mapper/rhel_sgi--uv20--01-root' with exit code 5
Cleaning up backup folder and starting any stopped services... 
/usr/share/ruby/fileutils.rb:125: warning: conflicting chdir during another chdir block
Redirecting to /bin/systemctl start mongod.service
Redirecting to /bin/systemctl start postgresql.service
Redirecting to /bin/systemctl start tomcat.service
Redirecting to /bin/systemctl start pulp_workers.service
Redirecting to /bin/systemctl start pulp_resource_manager.service
Redirecting to /bin/systemctl start pulp_streamer.service
Redirecting to /bin/systemctl start pulp_celerybeat.service
Redirecting to /bin/systemctl start httpd.service

Not sure why I get "Cannot create a temporary file: /var/tmp/scl3HIanE" or why it is needed. Obviously there is a prerequisite of having free extents in the volume group, but that should be documented, at least in the script.
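
For reference, the volume group can be checked up front (VG name taken from the error above):

  # Free space / free extents in the VG
  vgs -o vg_name,vg_free,vg_free_count rhel_sgi-uv20-01

  # Existing LVs and snapshots that could clash with the names the script uses
  lvs -o lv_name,lv_size,origin rhel_sgi-uv20-01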

I'd like to be able to set the LV name for cases like:

Creating pulp snapshot
  Logical Volume "pulp-snap" already exists in volume group "rhel_sgi-uv20-01"
Failed 'lvcreate -npulp-snap -L2G -s /dev/mapper/rhel_sgi--uv20--01-root' with exit code 5
Cleaning up backup folder and starting any stopped services...

It also seems that the snapshot LVs are not cleaned up after a failure.
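
Until that is fixed, a leftover snapshot apparently has to be removed by hand, e.g. (name taken from the error above):

  # Remove a snapshot LV left behind by a failed run
  lvremove -y /dev/rhel_sgi-uv20-01/pulp-snap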

Comment 15 Brad Buckingham 2017-10-13 14:53:29 UTC
Moving to POST since upstream redmine is merged.

Comment 16 Peter Ondrejka 2017-10-24 13:45:09 UTC
Hi Christine, could you please take a look at my questions from comment #13? I also wonder what the expected behavior of this feature is on a Capsule, and what the prerequisites for successful usage are.

I created a documentation bug for this feature in https://bugzilla.redhat.com/show_bug.cgi?id=1505890

Comment 17 Christine Fouant 2017-10-27 14:24:27 UTC
(In reply to Peter Ondrejka from comment #13)
> On satellite-6.3.0-19.0.beta.el7sat.noarch this fails due to
> https://bugzilla.redhat.com/show_bug.cgi?id=1497957
> 
> Christine, cold you please give me some background of the feature, as I
> didn't find much information on how it's meant to be used (and tested),
> namely:
> 
> -- Are there any prerequisites to the procedure, and does the script check
> if it has what it needs?
Prerequisites are that the filesystems must be on LVM, and there must be enough free space in the volume group in which to create the snapshots. The script fails the backup if these are not in place, giving a message with the exit code.
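
A rough pre-flight check along those lines (the directory list and commands are illustrative, not what the script itself runs) could be:

  # Are the data directories backed by LVM logical volumes?
  for dir in /var/lib/pulp /var/lib/mongodb /var/lib/pgsql; do
      dev=$(findmnt -n -o SOURCE --target "$dir")
      if lvs "$dev" >/dev/null 2>&1; then
          echo "$dir -> $dev (LVM)"
      else
          echo "$dir -> $dev (NOT LVM)"
      fi
  done

  # Is there room in the volume group(s) for the snapshots?
  vgs -o vg_name,vg_free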

> -- Is it supposed to stop services?
It must stop services, but only momentarily. There is no way to do snapshots with an online backup. The services will be down only for the amount of time it takes to create the snapshots.

> -- Can it be combined with other katello-backup subcommands? 
That should be fine, except with online-backup.

> -- How to restore from backup created by this feature?
The same way you restore any other backup: # katello-restore /path/to/backup/folder

> 
> Cheers

@Evgeni - could you tell us if there are any more prerequisites to snapshots that I'm missing here?

Comment 18 Evgeni Golov 2017-11-02 10:16:54 UTC
If you woke me at 2 am and asked me for the requirements for working snapshots, you'd get the following:

* the system uses LVM for (at least) /var/lib/pulp, /var/lib/mongodb, /var/lib/pgsql
* the above-mentioned mount points are preferably (but not necessarily) on different LVs
* there is sufficient free space (3×snapshot_size; with the default of 2G, that is 6G) in the relevant VGs (a rough check is sketched below)
  * if all three mount points are on one VG, it has to have 3×snapshot_size (2G by default, = 6G) free
  * if they are spread differently, the free space has to match accordingly ;)
* the backup target is preferably not on a snapshotted LV, as that would mean the whole backup has to fit into the snapshot, raising the snapshot space requirement by orders of magnitude
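
A back-of-the-envelope check for the default 2G snapshot size (assuming all three mount points share one VG; adjust if they are spread across several):

  SNAP_SIZE_G=2
  NEEDED_G=$((3 * SNAP_SIZE_G))       # 3 snapshots x 2G = 6G when all share one VG
  vgs --units g -o vg_name,vg_free
  echo "Need at least ${NEEDED_G}G free in the VG(s) holding pulp, mongodb and pgsql"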

Comment 24 Christine Fouant 2018-01-29 16:55:51 UTC
Need to add a bypass of the logical volume validation.

Comment 25 Peter Ondrejka 2018-02-05 15:08:35 UTC
Hello, checked again on snap 35,

-- snapshot backup performs as expected; the above error was due to lack of space on the backup destination LV
-- snapshots seem to play well with other options (--incremental, --skip-pulp-content)
-- services are started after creating a snapshot as expected
-- restored from snapshot backups successfully 
-- checked both on server and capsule

Comment 28 errata-xmlrpc 2018-02-21 12:32:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:0336