799020 – Document a disaster recovery plan (or process)

Summary: Document a disaster recovery plan (or process)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	CloudForms Cloud Engine
Classification:	Retired
Component:	Documentation
Sub Component:
Version:	1.0.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	low
Target Milestone:	beta6
Assignee:	Dan Macpherson
QA Contact:	Giulio Fidente
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	865782
Depends On:
Blocks:

Reported:	2012-03-01 15:26 UTC by James Laska
Modified:	2013-09-02 07:01 UTC
CC List:	18 users
Fixed In Version:
Clone Of:	787184
Environment:
Last Closed:	2012-12-10 22:05:38 UTC
Embargoed:

Comment 1 James Laska 2012-03-01 15:27:15 UTC

This bug is intended to capture documenting a process (or procedure) for disaster recovery for Cloud Engine.  Bug#787184 tracks this change on the System Engine side.

Comment 3 Hugh Brock 2012-03-08 16:20:46 UTC

Mike will nominate some folks from his team to get a proper backup/restore procedure documented for Cloud Engine

Comment 4 Steve Linabery 2012-03-21 16:10:25 UTC

first revision of backup/recovery at https://www.aeolusproject.org/redmine/documents/86

Comment 5 wes hayutin 2012-03-30 18:57:48 UTC

assigning to rehana

Comment 6 wes hayutin 2012-04-03 13:56:57 UTC

assigning to Shveta and Aziza

Comment 7 James Laska 2012-04-03 14:09:19 UTC

Adding lbrindle to the cc list, requires_release_note has been requested, and a proposed procedure has been linked in comment#4.

Comment 8 James Laska 2012-04-03 14:09:19 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Docs team, please see comment#4 for an upstream draft of a backup+restore procedure.

Comment 9 Rehana 2012-04-05 14:30:25 UTC

I found an issue while following the restore steps:
Recovery:
1. Reinstall Cloud Engine and components
2. Run aeolus-configure
3. Stop all Cloud Engine services
4. Restore postgresql and mongodb databases
5. Restore configuration files
6. Restart Cloud Engine services

At step 2, after I run aeolus-configure, it automatically creates the 'conductor' database in PostgreSQL. So at step 4, when I try to restore my PostgreSQL backup file as described below, I get an error:

createdb -T template0 conductor
createdb: database creation failed: ERROR: database "conductor" already exists

URL:http://www.postgresql.org/docs/9.1/static/backup-dump.html


So it would be good to mention in the steps that the user should restore over the database created by aeolus-configure.
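
The collision described here can be detected before running createdb. A minimal sketch, assuming the `psql -lqt` listing format (one database per line, with the name in the first `|`-separated column); the function name `db_exists` is hypothetical and not part of any Aeolus tooling:

```shell
#!/bin/sh
# db_exists NAME: read `psql -lqt` output on stdin and succeed if NAME is listed.
# Only the first column (the database name) is compared, as an exact match.
db_exists() {
    cut -d '|' -f 1 | tr -d ' ' | grep -Fqx -- "$1"
}

# Usage sketch on a live system (assumes a user allowed to list databases):
#   if psql -lqt | db_exists conductor; then
#       echo "conductor already exists; restore over it or drop it first"
#   fi
```

The function only reads a listing, so it can be tried against saved `psql -lqt` output without touching a database.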

Additional info:

rpm -qa | grep aeolus
aeolus-conductor-daemons-0.8.7-1.el6.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-0.8.7-1.el6.noarch
rubygem-aeolus-cli-0.3.1-1.el6.noarch
aeolus-conductor-doc-0.8.7-1.el6.noarch
aeolus-all-0.8.7-1.el6.noarch
aeolus-configure-2.5.2-1.el6.noarch

Comment 10 wes hayutin 2012-04-05 15:28:24 UTC

Please add the following steps to the recovery process

Recovery:
1. Reinstall Cloud Engine and components
2. Run aeolus-configure
3. Stop all Cloud Engine services
4. Restore postgresql and mongodb databases
4.1 drop database conductor;
4.2 drop ROLE aeolus;
4.3 CREATE USER aeolus WITH PASSWORD 'your-password' CREATEDB; (note: change the password)
4.4 create database conductor;
4.5 GRANT ALL PRIVILEGES ON DATABASE conductor to aeolus;
4.6 exit postgres cmdline
4.7 execute psql conductor < /$PATH/$BACKUP

Comment 11 Steve Linabery 2012-04-05 15:34:23 UTC

Minor edit to wes' comment follows:

Please add the following steps to the recovery process

Recovery:
1. Reinstall Cloud Engine and components
2. Run aeolus-configure
3. Stop all Cloud Engine services
4. Restore postgresql and mongodb databases
4.1 enter the PostgreSQL interactive terminal, psql, and issue the commands:
4.2 drop database conductor;
4.3 drop ROLE aeolus;
4.4 CREATE USER aeolus WITH PASSWORD 'your-password' CREATEDB; (note: change the password)
4.5 create database conductor;
4.6 GRANT ALL PRIVILEGES ON DATABASE conductor to aeolus;
4.7 exit psql
4.8 execute 'psql conductor < /$PATH/$BACKUP'
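
The psql commands listed above can be collected into one script. A minimal sketch, assuming psql is run as a PostgreSQL superuser and that the password and backup path are placeholders to replace; the function name `conductor_reset_sql` is hypothetical. Comment 24 later reports that these steps were not needed when following the published procedure.

```shell
#!/bin/sh
# Print the SQL that drops and recreates the conductor database and aeolus role.
# Nothing is executed until the output is piped to psql.
# $1 is the new password for the aeolus role (placeholder; it must not contain
# a single quote, since it is embedded in a quoted SQL string).
conductor_reset_sql() {
    cat <<EOF
DROP DATABASE conductor;
DROP ROLE aeolus;
CREATE USER aeolus WITH PASSWORD '$1' CREATEDB;
CREATE DATABASE conductor;
GRANT ALL PRIVILEGES ON DATABASE conductor TO aeolus;
EOF
}

# Usage sketch (assumption: run as the postgres OS user; the path is yours):
#   conductor_reset_sql 'your-password' | psql
#   psql conductor < /path/to/backup
```

Printing the statements first makes it possible to review them before anything is dropped.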

Comment 18 James Laska 2012-10-10 18:55:07 UTC

Giulio, good find!  Thanks for raising the alert in this bug.  Let's keep this bug for tracking completion of a documentation request.  

For the issue you found, I suspect code changes may be needed.  Would you mind filing this as a separate issue?  If it turns out that the documented procedure is incorrect or requires amendment, let's make a note in this bug.

Here are two potentially related command-line bugs.  I'm not sure if this captures the issue you're seeing.  If not, let's add a new bug to the pile :)

> https://bugzilla.redhat.com/show_bug.cgi?id=864190
> https://bugzilla.redhat.com/show_bug.cgi?id=864192

Comment 19 Rehana 2012-10-12 08:59:11 UTC

updating the observation

I have retested the backup and restore functionality; my observations are below.

1. Backup and restore on the same host

Result: All the information was restored.
Hostname :https://hp-dl380g6-01.rhts.eng.bos.redhat.com/conductor

2. Restored the information to a new machine

steps:
Install and configure aeolus on a fresh machine
moved the backup files from old machine to the fresh machine 
Performed restore operation

Result: I made the same observation as Giulio: a 403 error when clicking on the image tab and the application blueprint page, which says "Images missing from the Image Warehouse"

Hostname:https://intel-d3c69-01.rhts.eng.bos.redhat.com/conductor

Comment 20 Rehana 2012-10-12 09:22:29 UTC

So to reply to comment 18,

Regarding the bugs:

I'm not sure those two bugs cover the observation, because I executed the steps from those bug descriptions on both hosts (mentioned in comment 19) and observed the same results (it listed the targets but didn't display the providers and account list).

However, the image list results on the two machines are:


1.  result on machine which had backup and restore on same host

aeolus-image list --images
ID                                       Name            Environment     OS         OS Version     Arch       Description                   
------------------------------------     -----------     -----------     ------     ----------     ------     -------------------------     
c093e2e8-0d2b-11e2-919c-00237de23550     rhel62_rhem     default         RHEL-6     2              x86_64     RHEL62 x86_64 rhev audrey     
5de06055-b0f6-4083-a8e2-69b0f9736459     rhel            default                                                                         

2.result on machine which had restore on a different host

aeolus-image list --images

ERROR:  Service Temporarily Unavailable => Please check that Conductor is running.


James,

Can you please confirm do we support restore operation across host or not ?

Comment 21 James Laska 2012-10-12 11:26:28 UTC

(In reply to comment #20)
> Can you please confirm do we support restore operation across host or not ?

The disaster recovery plan is intended to document the procedures for customers to bring their engine back online after a catastrophic failure.  I envision this would involve moving data from a failed system, to a new system.  Therefore, I interpret this to mean that disaster recovery would include backing up data from one system, and restoring it to another.

Any software problems encountered during backup/restore are likely going to be lost if added to this Documentation bug.  Please file issues separately so that we can prioritize them individually.

Comment 22 Steve Linabery 2012-10-12 17:38:38 UTC

(In reply to comment #20)
> 2.result on machine which had restore on a different host
> 
> aeolus-image list --images
> 
> ERROR:  Service Temporarily Unavailable => Please check that Conductor is
> running.
> 

1) What versions of imagefactory are installed on each host?
2) Is imagefactory running on the host where you get the ERROR?

I recently had the same error; it was caused by having a bad build of imagefactory. `service imagefactory start` looked successful, but factory was stopping right away.
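
Because a start can report success while the daemon exits moments later, it helps to re-check the status after a short delay. A minimal sketch; the helper name `stays_up`, the 5-second delay, and the assumption that the imagefactory init script supports a `status` action are illustrative and not confirmed in this bug:

```shell
#!/bin/sh
# stays_up DELAY CMD...: wait DELAY seconds, then run CMD as a status check.
# Returns CMD's exit status, so 0 means the check still passed after the delay.
stays_up() {
    delay=$1
    shift
    sleep "$delay"
    "$@"
}

# Usage sketch on the affected host (assumes a SysV-style init script):
#   service imagefactory start && stays_up 5 service imagefactory status \
#       || echo "imagefactory failed to start or stopped shortly afterwards"
```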

Comment 23 Giulio Fidente 2012-10-15 14:07:03 UTC

*** Bug 865782 has been marked as a duplicate of this bug. ***

Comment 24 Giulio Fidente 2012-10-15 14:16:56 UTC

I got the backup/restore to work, restoring on a different host, here are my notes:

1. no need to go through what is suggested in comment #11

2. follow the doc as per http://documentation-devel.engineering.redhat.com/docs/en-US/CloudForms/1.1/html/Cloud_Engine_User_Guide/chap-Maintenance.html#Cloud_Engine_Backup_Procedure1 , except for the following notes:

1. no need to change the umask, but /backup must be writable by group postgres

2. we don't want to archive /etc/aeolus-conductor as that only contains symlinks and is restored correctly; we should not dereference the symlinks either as the app reads its config from /usr/share/aeolus-conductor/config ; those are the files we need to archive

3. when extracting, we should use 'xvf' instead of 'xvzf' as the archive is not gzipped

4. I'd --exclude etc/fstab when extracting the archive, as we don't want the new system fstab to be replaced
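
Notes 3 and 4 can be tried safely in a scratch directory before touching a real system. A minimal sketch, assuming GNU tar; the sandbox layout and file contents are made up for illustration, and only the `-xf` (no `z`) and `--exclude etc/fstab` usage come from the notes above:

```shell
#!/bin/sh
set -e
# Build a throwaway "old host" tree and an empty "new host" root.
work=$(mktemp -d)
mkdir -p "$work/old/etc/imagefactory" "$work/new"
echo "old fstab" > "$work/old/etc/fstab"
echo "factory config" > "$work/old/etc/imagefactory/imagefactory.conf"

# Back up with plain -cf: the archive is an uncompressed tar, not gzip.
tar -C "$work/old" -cf "$work/ce-backup.tar" etc/fstab etc/imagefactory

# Restore without -z, skipping etc/fstab so the new host keeps its own copy.
tar --exclude etc/fstab -xf "$work/ce-backup.tar" -C "$work/new"

# Show what was restored; etc/fstab should be absent from the listing.
ls -R "$work/new"
# Remove "$work" when done.
```

The fstab stays inside the archive, so it remains available for reference even though it is not extracted.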

Comment 25 Giulio Fidente 2012-10-15 14:22:46 UTC

also, while the documentation does not mention the need to copy the images from /var/lib/iwhd onto the new host, that is a _required_ step to get the new system to work

Comment 26 James Laska 2012-10-15 14:23:34 UTC

Moving back to ASSIGNED pending adjustments raised by Giulio in comment#24 and comment#25

Comment 27 Julie 2012-10-17 01:15:17 UTC

Hi Giulio,
   Want to double check with you the commands for documentation update.
1. Remove "# umask 0027" from the guide.
2. Change  
"# tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure /etc/imagefactory /etc/iwhd /etc/fstab"

to 

# tar -cf ce-backup.tar /usr/share/aeolus-conductor/config /etc/imagefactory  /var/lib/iwhd

3. Is your third point in comment 24 referring to 12.1.2. Cloud Engine Restore Procedure step 5 # tar --selinux -xzvf ce-backup.tar -C /

and change it to-->  # tar --selinux -xvf ce-backup.tar -C /

4. Anything else I missed?

Many thanks,
Julie

Comment 28 Giulio Fidente 2012-10-17 08:37:11 UTC

1.

remove 'umask 0027' and add, after 'chgrp postgres /backup', this 'chmod g+w /backup'

2.

/etc/fstab remains, we add more paths:

tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure /etc/imagefactory /etc/iwhd /etc/fstab /usr/share/aeolus-conductor/config/{database.yml,development.rb,oauth.json,production.rb,settings.yml,test.rb}

as we want to keep a backup copy of the /etc/fstab but, when restoring, we don't want to overwrite the /etc/fstab , see next bullet

3.

change 'tar --selinux -xzvf ce-backup.tar -C /' to 'tar --selinux -xvf ce-backup.tar -C / --exclude etc/fstab'

at the same time, we should add a notice telling the user that, if they use RHEV, they also need to add to /etc/fstab the line needed to mount the remote NFS export domain, and that a copy of the old /etc/fstab is backed up in the tar archive

note that we already have a notice in the early steps telling the user that /etc/fstab needs to be backed up if using RHEV

4.

we need to add a whole new bullet in both the backup and restore procedures telling the user to archive/restore the /var/lib/iwhd directory too, eg:

tar -cf ce-images-backup.tar /var/lib/iwhd

and

tar --selinux -xvf ce-images-backup.tar -C /
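
For the RHEV notice in item 3, the old fstab can be read straight out of the archive without extracting it over the new one. A minimal sketch, assuming a tar that supports `-O` (extract to standard output, as GNU tar does) and that the NFS export domain line contains the string `nfs`; the helper name `old_fstab_nfs_lines` is hypothetical:

```shell
#!/bin/sh
# old_fstab_nfs_lines ARCHIVE: print lines mentioning nfs from the backed-up fstab.
# tar strips the leading "/" when archiving, so the member is named etc/fstab.
old_fstab_nfs_lines() {
    tar -xOf "$1" etc/fstab | grep nfs
}

# Usage sketch: review the output, then add the needed line to /etc/fstab by hand.
#   old_fstab_nfs_lines ce-backup.tar
```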

Comment 32 Giulio Fidente 2012-10-25 09:15:16 UTC

the command in step 4. of the backup procedure should be changed again into:

tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure /etc/imagefactory /etc/iwhd /etc/fstab /usr/share/aeolus-conductor/config/{database.yml,environments/development.rb,oauth.json,environments/production.rb,settings.yml,environments/test.rb}

also this is a one line command unless the paths are separated by \:

tar -cf ce-backup.tar /etc/aeolus-conductor \
/etc/aeolus-configure \
/etc/imagefactory \
/etc/iwhd /etc/fstab \
/usr/share/aeolus-conductor/config/{database.yml,environments/development.rb,oauth.json,environments/production.rb,settings.yml,environments/test.rb}
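
The brace list is expanded by the shell (bash), not by tar, into six separate paths. For shells without brace expansion, the same list can be built explicitly. A minimal sketch; the helper name `conductor_config_paths` is hypothetical, while the file names are the ones given above:

```shell
#!/bin/sh
# Print the six conductor config files named in the brace list, one per line.
conductor_config_paths() {
    base=/usr/share/aeolus-conductor/config
    for f in database.yml environments/development.rb oauth.json \
             environments/production.rb settings.yml environments/test.rb
    do
        printf '%s/%s\n' "$base" "$f"
    done
}

# Usage sketch (the paths contain no spaces, so unquoted expansion is safe here):
#   tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure \
#       /etc/imagefactory /etc/iwhd /etc/fstab $(conductor_config_paths)
```

Listing the files one per line also makes it easy to check that each exists before archiving.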

Comment 33 Dan Macpherson 2012-10-29 02:58:53 UTC

Modified step 4 to include the new tar directories:

http://documentation-devel.engineering.redhat.com/docs/en-US/CloudForms/1.1/html/Cloud_Engine_User_Guide/chap-Maintenance.html#sect-Disaster_Recovery

Have kept the command as one line.

Comment 34 Lana Brindley 2012-11-19 02:44:32 UTC

This documentation has now been dropped to translation ahead of publication. For any further issues, please open a new bug.

LKB

Comment 35 Lana Brindley 2012-12-10 22:05:38 UTC

This document is now publicly available on access.redhat.com. For any further issues, please raise a new bug.

LKB
