This bug tracks the documentation of a disaster recovery procedure for Cloud Engine. Bug#787184 tracks the same change on the System Engine side.
Mike will nominate some people from his team to get a proper backup/restore procedure documented for Cloud Engine.
First revision of the backup/recovery procedure: https://www.aeolusproject.org/redmine/documents/86
assigning to rehana
assigning to Shveta and Aziza
Adding lbrindle to the cc list, requires_release_note has been requested, and a proposed procedure has been linked in comment#4.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents: Docs team, please see comment#4 for an upstream draft of a backup+restore procedure.
I found an issue while following the restore steps.

Recovery:
1. Reinstall Cloud Engine and components
2. Run aeolus-configure
3. Stop all Cloud Engine services
4. Restore postgresql and mongodb databases
5. Restore configuration files
6. Restart Cloud Engine services

At step 2, aeolus-configure automatically creates the 'conductor' database in postgresql. So at step 4, when I try to restore my postgresql backup file as described at the URL below, I get an error:

   createdb -T template0 conductor
   createdb: database creation failed: ERROR: database "conductor" already exists

URL: http://www.postgresql.org/docs/9.1/static/backup-dump.html

It would be good if the steps told the user to restore over the database created by aeolus-configure.

Additional info:
rpm -qa | grep aeolus
aeolus-conductor-daemons-0.8.7-1.el6.noarch
rubygem-aeolus-image-0.3.0-12.el6.noarch
aeolus-conductor-0.8.7-1.el6.noarch
rubygem-aeolus-cli-0.3.1-1.el6.noarch
aeolus-conductor-doc-0.8.7-1.el6.noarch
aeolus-all-0.8.7-1.el6.noarch
aeolus-configure-2.5.2-1.el6.noarch
Please add the following steps to the recovery process.

Recovery:
1. Reinstall Cloud Engine and components
2. Run aeolus-configure
3. Stop all Cloud Engine services
4. Restore postgresql and mongodb databases
   4.1 drop database conductor;
   4.2 drop ROLE aeolus;
   4.3 CREATE USER aeolus WITH PASSWORD 'your-password' CREATEDB; (note: change the password)
   4.4 create database conductor;
   4.5 GRANT ALL PRIVILEGES ON DATABASE conductor to aeolus;
   4.6 exit the postgres command line
   4.7 execute psql conductor < /$PATH/$BACKUP
Minor edit to wes' comment follows.

Please add the following steps to the recovery process.

Recovery:
1. Reinstall Cloud Engine and components
2. Run aeolus-configure
3. Stop all Cloud Engine services
4. Restore postgresql and mongodb databases
   4.1 enter the PostgreSQL interactive terminal, psql, and issue the commands:
   4.2 drop database conductor;
   4.3 drop ROLE aeolus;
   4.4 CREATE USER aeolus WITH PASSWORD 'your-password' CREATEDB; (note: change the password)
   4.5 create database conductor;
   4.6 GRANT ALL PRIVILEGES ON DATABASE conductor to aeolus;
   4.7 exit psql
   4.8 execute 'psql conductor < /$PATH/$BACKUP'
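For anyone trying the drop-and-recreate steps above, here is a minimal shell sketch. It only prints the SQL, so the statements can be reviewed before being piped into psql. The function name conductor_reset_sql is made up for illustration and is not an aeolus tool, and the sketch assumes psql is run as a PostgreSQL superuser with the Cloud Engine services stopped.

```shell
#!/bin/sh
# Sketch only: print the SQL from the steps above for a given password.
# conductor_reset_sql is a hypothetical helper name, not an aeolus tool.
conductor_reset_sql() {
    # Double any single quotes so the password is a valid SQL string literal.
    pw=$(printf '%s' "$1" | sed "s/'/''/g")
    cat <<EOF
DROP DATABASE conductor;
DROP ROLE aeolus;
CREATE USER aeolus WITH PASSWORD '$pw' CREATEDB;
CREATE DATABASE conductor;
GRANT ALL PRIVILEGES ON DATABASE conductor TO aeolus;
EOF
}

# Example (assumption: run as a PostgreSQL superuser, services stopped):
#   conductor_reset_sql 'your-password' | psql
#   psql conductor < /$PATH/$BACKUP
```

Printing the SQL instead of executing it keeps the destructive part (DROP DATABASE) visible and under the operator's control.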
Giulio, good find! Thanks for raising the alert in this bug.

Let's keep this bug for tracking completion of a documentation request. For the issue you found, I suspect code changes may be needed. Would you mind filing this as a separate issue? If it turns out that the documented procedure is incorrect or requires amendment, let's make a note in this bug.

Here are two potentially related command-line bugs. I'm not sure if these capture the issue you're seeing. If not, let's add a new bug to the pile :)

> https://bugzilla.redhat.com/show_bug.cgi?id=864190
> https://bugzilla.redhat.com/show_bug.cgi?id=864192
Updating the observation. I have retested the backup and restore functionality; my observations are below.

1. Backup and restore on the same host
   Result: All the information was restored.
   Hostname: https://hp-dl380g6-01.rhts.eng.bos.redhat.com/conductor

2. Restored the information to a new machine
   Steps:
      Install and configure aeolus on a fresh machine
      Move the backup files from the old machine to the fresh machine
      Perform the restore operation
   Result: Made the same observation as Giulio. Saw a 403 error when clicking on the image tab and the application blueprint page; it says "Images missing from the Image Warehouse".
   Hostname: https://intel-d3c69-01.rhts.eng.bos.redhat.com/conductor
To reply to comment 18 regarding the bugs: I'm not sure those two bugs cover the observation. I executed the steps from those bug descriptions on both hosts (mentioned in comment 19) and observed the same results (it listed the targets but didn't display the providers and account list). However, the image list results on the two machines are:

1. Result on the machine which had backup and restore on the same host

   aeolus-image list --images
   ID                                   Name        Environment OS     OS Version Arch   Description
   ------------------------------------ ----------- ----------- ------ ---------- ------ -------------------------
   c093e2e8-0d2b-11e2-919c-00237de23550 rhel62_rhem default     RHEL-6 2          x86_64 RHEL62 x86_64 rhev audrey
   5de06055-b0f6-4083-a8e2-69b0f9736459 rhel        default

2. Result on the machine which had the restore from a different host

   aeolus-image list --images
   ERROR: Service Temporarily Unavailable => Please check that Conductor is running.

James, can you please confirm whether or not we support the restore operation across hosts?
(In reply to comment #20)
> Can you please confirm do we support restore operation across host or not ?

The disaster recovery plan is intended to document the procedures for customers to bring their engine back online after a catastrophic failure. I envision this would involve moving data from a failed system to a new system. Therefore, I interpret this to mean that disaster recovery would include backing up data from one system and restoring it to another.

Any software problems encountered during backup/restore are likely to be lost if added to this documentation bug. Please file those issues separately so that we can prioritize them individually.
(In reply to comment #20)
> 2.result on machine which had restore on a different host
>
> aeolus-image list --images
>
> ERROR: Service Temporarily Unavailable => Please check that Conductor is
> running.

1) What versions of imagefactory are installed on each host?
2) Is imagefactory running on the host where you get the ERROR?

I recently had the same error; it was caused by having a bad build of imagefactory. `service imagefactory start` looked successful, but factory was stopping right away.
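Since a start that "looks successful" can hide a daemon that exits immediately, a quick way to check is to start the service, wait a few seconds, and then ask for its status. The sketch below assumes an LSB-style init script whose status action exits non-zero when the service is not running; start_and_verify is a hypothetical helper name and the 5-second default is arbitrary.

```shell
#!/bin/sh
# Sketch only: start a service, wait, then check that it is still running.
# start_and_verify is a hypothetical helper name; the default delay is arbitrary.
start_and_verify() {
    svc=$1
    delay=${2:-5}
    service "$svc" start || return 1
    sleep "$delay"
    # Assumption: 'service <name> status' exits non-zero when not running.
    service "$svc" status
}

# Example:
#   rpm -q imagefactory            # installed version, to compare between hosts
#   start_and_verify imagefactory  # non-zero exit if it stopped after starting
```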
*** Bug 865782 has been marked as a duplicate of this bug. ***
I got the backup/restore to work when restoring on a different host. Here are my notes:

1. No need to go through what is suggested in comment #11.

2. Follow the doc as per http://documentation-devel.engineering.redhat.com/docs/en-US/CloudForms/1.1/html/Cloud_Engine_User_Guide/chap-Maintenance.html#Cloud_Engine_Backup_Procedure1 , except for the following notes:

   1. No need to change the umask, but /backup must be writable by group postgres.
   2. We don't want to archive /etc/aeolus-conductor, as that only contains symlinks and is restored correctly. We should not dereference the symlinks either, as the app reads its config from /usr/share/aeolus-conductor/config ; those are the files we need to archive.
   3. When extracting, we should use 'xvf' instead of 'xvzf', as the archive is not gzipped.
   4. I'd --exclude etc/fstab when extracting the archive, as we don't want the new system's fstab to be replaced.
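To illustrate the 'xvf' versus 'xvzf' point: an archive created with 'tar -cf' is a plain tar file, and the 'z' flag only applies to gzip-compressed archives. A small sketch that picks the flags by testing the file is shown below; extract_flags is a hypothetical helper name, and the approach (asking gzip to test the file) is just one way to check.

```shell
#!/bin/sh
# Sketch only: choose tar extract flags based on whether the archive is gzipped.
# extract_flags is a hypothetical helper name.
extract_flags() {
    if gzip -t < "$1" 2>/dev/null; then
        echo xvzf   # gzip-compressed archive
    else
        echo xvf    # plain tar archive, as produced by 'tar -cf'
    fi
}

# Example:
#   tar --selinux -"$(extract_flags ce-backup.tar)" ce-backup.tar -C /
```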
also, while the documentation does not mention the need to copy the images from /var/lib/iwhd onto the new host, that is a _required_ step to get the new system to work
Moving back to ASSIGNED pending adjustments raised by Giulio in comment#24 and comment#25
Hi Giulio,

I want to double-check the commands for the documentation update with you.

1. Remove "# umask 0027" from the guide.

2. Change
   # tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure /etc/imagefactory /etc/iwhd /etc/fstab
   to
   # tar -cf ce-backup.tar /usr/share/aeolus-conductor/config /etc/imagefactory /var/lib/iwhd

3. Is your third point in comment 24 referring to 12.1.2. Cloud Engine Restore Procedure, step 5?
   # tar --selinux -xzvf ce-backup.tar -C /
   That would change to:
   # tar --selinux -xvf ce-backup.tar -C /

4. Anything else I missed?

Many thanks,
Julie
1. Remove 'umask 0027' and add, after 'chgrp postgres /backup', the command 'chmod g+w /backup'.

2. /etc/fstab remains, and we add more paths:

   tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure /etc/imagefactory /etc/iwhd /etc/fstab /usr/share/aeolus-conductor/config/{database.yml,development.rb,oauth.json,production.rb,settings.yml,test.rb}

   We want to keep a backup copy of /etc/fstab but, when restoring, we don't want to overwrite /etc/fstab; see the next item.

3. Change 'tar --selinux -xzvf ce-backup.tar -C /' to 'tar --selinux -xvf ce-backup.tar -C / --exclude etc/fstab'.

   At the same time, we should add a notice telling users that, if they use RHEV, they also need to add to /etc/fstab the line needed to mount the remote NFS export domain, and that a copy of the old /etc/fstab is backed up in the tar archive. Note that we already have a notice in the early steps telling the user that /etc/fstab needs to be backed up if using RHEV.

4. We need to add a whole new step to both the backup and restore procedures telling the user to archive/restore the /var/lib/iwhd directory too, e.g.:

   tar -cf ce-images-backup.tar /var/lib/iwhd

   and

   tar --selinux -xvf ce-images-backup.tar -C /
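The restore side of these changes could be wrapped as in the sketch below, which extracts an archive into a target directory while leaving the target's fstab alone. restore_archive is a hypothetical helper name; extra tar options such as --selinux are passed through rather than hard-coded, since not every tar build accepts them.

```shell
#!/bin/sh
# Sketch only: extract a backup archive without overwriting the target's fstab.
# restore_archive is a hypothetical helper name. Extra tar options, such as
# --selinux, can be passed after the target directory.
restore_archive() {
    archive=$1
    target=$2
    shift 2
    # tar stores /etc/fstab as etc/fstab (leading slash stripped), so that is
    # the member name to exclude.
    tar "$@" --exclude etc/fstab -xvf "$archive" -C "$target"
}

# Example, mirroring the commands above:
#   restore_archive ce-backup.tar / --selinux
#   restore_archive ce-images-backup.tar / --selinux
```

The old fstab is still inside the archive, so it can be extracted separately to a scratch directory if the NFS export domain line needs to be copied over.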
The command in step 4 of the backup procedure should be changed again to:

   tar -cf ce-backup.tar /etc/aeolus-conductor /etc/aeolus-configure /etc/imagefactory /etc/iwhd /etc/fstab /usr/share/aeolus-conductor/config/{database.yml,environments/development.rb,oauth.json,environments/production.rb,settings.yml,environments/test.rb}

Also, this is a one-line command unless the paths are separated by \:

   tar -cf ce-backup.tar /etc/aeolus-conductor \
   /etc/aeolus-configure \
   /etc/imagefactory \
   /etc/iwhd /etc/fstab \
   /usr/share/aeolus-conductor/config/{database.yml,environments/development.rb,oauth.json,environments/production.rb,settings.yml,environments/test.rb}
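Because tar reports an error if any listed path does not exist, it can help to check the path list before archiving. The sketch below spells out the paths from the command above (with the brace list expanded by hand) and prints any that are missing. list_missing and BACKUP_PATHS are hypothetical names, and the optional root prefix argument exists only so the check can be tried against a scratch directory.

```shell
#!/bin/sh
# Sketch only: report which backup paths are missing before running tar.
# list_missing and BACKUP_PATHS are hypothetical names. The optional argument
# is a root prefix (empty for the live system), there only to ease testing.
BACKUP_PATHS="
/etc/aeolus-conductor
/etc/aeolus-configure
/etc/imagefactory
/etc/iwhd
/etc/fstab
/usr/share/aeolus-conductor/config/database.yml
/usr/share/aeolus-conductor/config/environments/development.rb
/usr/share/aeolus-conductor/config/oauth.json
/usr/share/aeolus-conductor/config/environments/production.rb
/usr/share/aeolus-conductor/config/settings.yml
/usr/share/aeolus-conductor/config/environments/test.rb
"

list_missing() {
    for p in $BACKUP_PATHS; do
        [ -e "${1:-}$p" ] || echo "$p"
    done
}

# Example: only archive when nothing is missing.
#   [ -z "$(list_missing)" ] && tar -cf ce-backup.tar $BACKUP_PATHS
```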
Modified step 4 to include the new tar directories: http://documentation-devel.engineering.redhat.com/docs/en-US/CloudForms/1.1/html/Cloud_Engine_User_Guide/chap-Maintenance.html#sect-Disaster_Recovery Have kept the command as one line.
This documentation has now been dropped to translation ahead of publication. For any further issues, please open a new bug. LKB
This document is now publicly available on access.redhat.com. For any further issues, please raise a new bug. LKB