Bug 1557794 - [Docs][Back Up and Restore] Update Back Up and Restore the Director Undercloud for RHOSP 13 (fails in a RHOSP 13 env when following the procedure and steps as it was documented for RHOSP 12) [NEEDINFO]
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 13.0 (Queens)
Hardware: x86_64 Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 13.0 (Queens)
Assigned To: Dan Macpherson
Julie
docs-accepted
Keywords: Documentation, Reopened, Triaged, ZStream
Depends On: 1594279
Blocks: 1597920
Reported: 2018-03-18 13:32 EDT by Omri Hochman
Modified: 2018-09-06 17:34 EDT

Doc Type: Known Issue
Doc Text:
A regression was identified in the procedure for backing up and restoring the director undercloud. As a result, the procedure requires modification and verification before it can be published. The book 'Back Up and Restore the Director Undercloud' is therefore not available with the general availability of Red Hat OpenStack Platform 13. The procedure will be updated as a priority after the general availability release, and published as soon as it is verified.
Last Closed: 2018-08-06 02:45:34 EDT
Type: Bug
dmacpher: needinfo? (knoha)


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
OpenStack gerrit | 558429 | None | master: MERGED | tripleo-docs: [Backup/Restore] Fix nits in docs and clarification note. (I1bfcf5566511284cf78422aada29f45b2628bf4b) | 2018-07-18 21:36 EDT
OpenStack gerrit | 570554 | None | master: MERGED | tripleo-docs: Fix docs for Undercloud backup and restore (I398d1a33109ac88dcb63582517754893735a1b05) | 2018-07-18 21:35 EDT
OpenStack gerrit | 572800 | None | master: MERGED | tripleo-docs: Add keystone credential-keys and fernet-keys clarification when restoring the Undercloud (Ib9274f28e916375... | 2018-07-18 21:35 EDT

Description Omri Hochman 2018-03-18 13:32:40 EDT
[OSP13] Undercloud backup and restore fails when following the procedure as documented for OSP12 (fails on an OSP13 environment).

Environment:
------------
OSP13 Puddle 2018-03-16.1

Description:
---------------
Undercloud backup and restore fails when following the procedure as documented for OSP12 (it fails on an OSP13 environment).


Results:
---------
(1) /etc/my.cnf.d/server.cnf is missing (on the OSP13 environment).
(2) When mariadb-server.cnf was used instead of server.cnf, the command "cat /root/undercloud-all-databases.sql | mysql" failed with an error.

Docs link:
-----------
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/back_up_and_restore_the_director_undercloud/restore

 

List of files that exist under /etc/my.cnf.d/
-----------------------------------------
[root@undercloud75 ~]# cd /etc/my.cnf.d/
[root@undercloud75 my.cnf.d]# ls
auth_gssapi.cnf  client.cnf  enable_encryption.preset  galera.cnf  mariadb-server.cnf  mysql-clients.cnf  tokudb.cnf
[root@undercloud75 my.cnf.d]# ls -ls
total 28
4 -rw-r--r--. 1 root root   41 Feb  1 12:42 auth_gssapi.cnf
4 -rw-r--r--. 1 root root  295 Dec 14  2016 client.cnf
4 -rw-r--r--. 1 root root  763 Dec 14  2016 enable_encryption.preset
4 -rw-r--r--. 1 root root  999 Feb 20 15:01 galera.cnf
4 -rw-r--r--. 1 root root 1462 Feb  1 12:42 mariadb-server.cnf
4 -rw-r--r--. 1 root root  232 Dec 14  2016 mysql-clients.cnf
4 -rw-r--r--. 1 root root  285 Dec 14  2016 tokudb.cnf



When attempting to use mariadb-server.cnf instead of server.cnf:
("cat /root/undercloud-all-databases.sql | mysql" -> FAILED)
-----------------------------------------------------------------
[root@undercloud75 ~]#   tar -xzC / -f undercloud-backup-*.tar.gz etc/my.cnf.d/mariadb-server.cnf
[root@undercloud75 ~]#   tar -xzC / -f undercloud-backup-*.tar.gz root/undercloud-all-databases.sql
tar: root/undercloud-all-databases.sql: time stamp 2018-02-20 19:31:31.598594491 is 15509.047497842 s in the future
[root@undercloud75 ~]#   systemctl start mariadb
[root@undercloud75 ~]#   mysql -uroot -e"set global max_allowed_packet = 16777216;"
[root@undercloud75 ~]#   cat /root/undercloud-all-databases.sql | mysql
ERROR 1911 (HY000) at line 3992: Unknown option 'STATS_PERSISTENT'
Comment 2 Carlos Camacho 2018-03-22 04:57:29 EDT
Hey Omri,

Can you clarify the steps you executed from the docs?
I'm trying to follow your steps in my environment, but I can't reproduce them correctly.
Comment 3 Carlos Camacho 2018-03-26 05:06:59 EDT
Hello Omri,

We are working on a refactor of the backup/restore procedures. This is the upstream review: https://review.openstack.org/#/c/544975/
Would you mind checking it?


Cheers,
Carlos.
Comment 4 Omri Hochman 2018-03-28 03:48:52 EDT
(In reply to Carlos Camacho from comment #2)
> Hey Omri,
> 
> Can you clarify the steps executed from the docs?
> I'm trying to follow up your steps in my environment but I'm not getting them
> correctly.

Hi Carlos, I was following the official Red Hat documentation for OSP12, specifically the section that explains undercloud backup and restore (link in the BZ description).


(In reply to Carlos Camacho from comment #3)
> Hello Omri,
> 
> We are working on a new refactor for the backup/restore procedures, this is
> the upstream review https://review.openstack.org/#/c/544975/ do you mind to
> check it?
> 

That's great. I'll try to follow those steps and report back here.


> 
> Cheers,
> Carlos.
Comment 6 Omri Hochman 2018-03-29 04:20:11 EDT
(In reply to Carlos Camacho from comment #3)
> Hello Omri,
> 
> We are working on a new refactor for the backup/restore procedures, this is
> the upstream review https://review.openstack.org/#/c/544975/ do you mind to
> check it?
> 
> 
> Cheers,
> Carlos.

When running the manual commands suggested in the documentation patch https://review.openstack.org/#/c/544975/,
there is again a step to tar /etc/my.cnf.d/server.cnf,
a file that no longer exists in an OSP13 environment (as reported in the BZ description).
Comment 7 Carlos Camacho 2018-04-03 04:50:35 EDT
Hello Omri,

You are right: depending on the MySQL version, the configuration file can have a different name.

I pushed a fix for the latest docs in https://review.openstack.org/#/c/558429
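
A minimal sketch of how a restore script could pick whichever server config file is present instead of hard-coding one name. It assumes the only two candidates are the names seen in this bug (server.cnf in the OSP12 docs, mariadb-server.cnf on the OSP13 environment); the function name find_mariadb_server_cnf is hypothetical and not part of any documented procedure.

```shell
#!/bin/bash
# Hypothetical helper: print the MariaDB server config file found in a
# my.cnf.d directory. The two candidate names are the ones reported in
# this bug; other MariaDB packages may ship different names.
find_mariadb_server_cnf() {
    local dir="${1:-/etc/my.cnf.d}"
    local name
    for name in server.cnf mariadb-server.cnf; do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    echo "no known MariaDB server config found in $dir" >&2
    return 1
}
```

Archiving the whole /etc/my.cnf.d directory, as the tar command in the next comment does, avoids the file name question on the backup side.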
Comment 8 Omri Hochman 2018-04-03 22:35:55 EDT
(In reply to Carlos Camacho from comment #7)
> Hello Omri,
> 
> You are right there, depending on the MySQL version we can have a different
> configuration file name.
> 
> I pushed a fix for the latest docs in https://review.openstack.org/#/c/558429


Trying to follow the steps on an OSP13 environment:
-------------------------------------------
[root@undercloud75 ~]# tar --ignore-failed-read -czf \
>       undercloud-backup-`date +%F`.tar.gz \
>       /root/undercloud-all-databases.sql \
>       /etc/my.cnf.d \
>       /var/lib/glance/images \
>       /srv/node \
>       /home/stack \
>       /etc/pki \
>       /opt/stack
tar: Removing leading `/' from member names
tar: /root/undercloud-all-databases.sql: Warning: Cannot stat: No such file or directory

It seems that undercloud-all-databases.sql is missing:

[root@undercloud75 ~]# find / -name undercloud-all-databases.sql
[root@undercloud75 ~]# 
[stack@undercloud75 ~]$ sudo find / -name  *.sql
/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/migrate_repo/versions/246_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/003_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/006_mysql_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/006_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/045_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/011_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/037_sqlite_upgrade.sql
/usr/share/mariadb/mroonga/install.sql
/usr/share/mariadb/mroonga/uninstall.sql
/usr/share/mariadb/fill_help_tables.sql
/usr/share/mariadb/install_spider.sql
/usr/share/mariadb/maria_add_gis_sp.sql
/usr/share/mariadb/maria_add_gis_sp_bootstrap.sql
/usr/share/mariadb/mysql_performance_tables.sql
/usr/share/mariadb/mysql_system_tables.sql
/usr/share/mariadb/mysql_system_tables_data.sql
/usr/share/mariadb/mysql_test_data_timezone.sql
/usr/share/mariadb/mysql_to_mariadb.sql
/usr/share/puppet/ext/dbfix.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/01_HyperScale.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/02_HyperScaleStatsSchema.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/03_HyperScaleWorkflow.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/51_HyperScaleAlertsDescription.sql
Comment 9 Carlos Camacho 2018-04-04 03:36:50 EDT
Hello,

Omri, did you run the previous step? It is this one:

  mysqldump --opt --single-transaction --all-databases > /root/undercloud-all-databases.sql

If you don't create the DB dump you won't be able to zip it.
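
A small guard along these lines could catch the missing dump before tar runs. This is only a sketch: the function name require_dump is hypothetical, and the default path is the one used in the commands above.

```shell
#!/bin/bash
# Hypothetical guard: succeed only if the database dump exists and is
# not empty, so that a backup is never archived without it.
require_dump() {
    local dump="${1:-/root/undercloud-all-databases.sql}"
    if [ ! -s "$dump" ]; then
        echo "missing or empty database dump: $dump (run mysqldump first)" >&2
        return 1
    fi
}
```

Calling require_dump before the tar step matters because tar with --ignore-failed-read only prints a warning for the missing file (as in comment 8) and still writes an archive without the dump.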
Comment 10 Omri Hochman 2018-04-04 10:36:01 EDT
(undercloud) [stack@undercloud75 ~]$ openstack undercloud backup --add-path /etc/hosts \
>                               --add-path /var/log/ \
>                               --add-path /var/lib/glance/images/ \
>                               --add-path /srv/node/ \
>                               --add-path /etc/

The Swift API fails during the backup:
-----------------------------------------

                          u'version': u'2.0'},
                u'updated_at': u'2018-02-22 15:08:57'},
 u'message': {u'msg': u'Object PUT failed: https://192.168.0.2:13808/v1/AUTH_5b0b3efc458a4fc4af8e21487556aeca/undercloud-backups/UC-backup-20180222101308.tar 413 Request Entity Too Large  [first 60 chars of response] <html><h1>Request Entity Too Large</h1><p>The body of your r'},
 u'status': u'FAILED'}

[root@undercloud75 undercloud-backup-JjgPDr]# ls -lah
total 8.2G
drwx------.  2 mistral mistral   86 Feb 22 10:09 .
drwxrwxrwt. 14 root    root    4.0K Feb 22 10:08 ..
-rw-r--r--.  1 mistral mistral  35M Feb 22 10:09 all-databases-20180222100900.sql.gz
-rw-r--r--.  1 mistral mistral 8.2G Feb 22 10:13 filesystem-20180222100929.tar
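
The 413 response was returned for an 8.2G filesystem tar. One possible explanation, not confirmed in this bug, is a per-object size limit in Swift. The sketch below checks a file against such a limit before upload. The default of 5368709122 bytes is an assumption about Swift's stock max_file_size and should be verified against the actual Swift configuration; the function name fits_in_object is hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-upload check: succeed if the file is no larger than a
# byte limit. The default limit is ASSUMED to match Swift's stock
# max_file_size; confirm it on the undercloud before relying on it.
# Returns 2 if the file does not exist.
fits_in_object() {
    local file="$1"
    local limit="${2:-5368709122}"
    local size
    [ -f "$file" ] || return 2
    size=$(wc -c < "$file")
    [ "$((size))" -le "$limit" ]
}
```

Under that assumption, running fits_in_object on the filesystem tar before the upload would flag the oversized archive locally rather than after the PUT fails.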
Comment 11 Omri Hochman 2018-04-04 13:06:41 EDT
(In reply to Omri Hochman from comment #10)
> (undercloud) [stack@undercloud75 ~]$ openstack undercloud backup --add-path
> /etc/hosts \
> >                               --add-path /var/log/ \
> >                               --add-path /var/lib/glance/images/ \
> >                               --add-path /srv/node/ \
> >                               --add-path /etc/
> 

I opened the following BZ for the "openstack undercloud backup" command: https://bugzilla.redhat.com/show_bug.cgi?id=1563783

We'll track the "manual backup steps" in this ticket.
Comment 12 mathieu bultel 2018-04-11 06:57:52 EDT
Hi Omri,
Can you set the pm_ack for this BZ?

Thank you,
Mathieu
Comment 13 Omri Hochman 2018-04-16 10:16:11 EDT
(In reply to mathieu bultel from comment #12)
> Hi Omri,
> Can you set the pm_ack for this BZ.
> 
> Thank you,
> Mathieu

The PM_ACK should be set by the PM. Adding Jarda.
Comment 14 Jaromir Coufal 2018-04-16 10:26:53 EDT
granted
Comment 15 Omri Hochman 2018-04-16 18:51:14 EDT
I attempted to restore by running the steps manually and encountered the following issues:

[root@undercloud75 ~]# cat /root/undercloud-all-databases.sql | mysql
ERROR 2006 (HY000) at line 3153: MySQL server has gone away

[root@undercloud75 ~]# for i in ceilometer glance heat ironic keystone neutron nova;do mysql -e "drop user $i";done
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'ceilometer'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'glance'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'heat'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'ironic'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'keystone'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'neutron'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'nova'@'%'


Later I attempted to continue despite the errors, which resulted in an
undercloud install failure.

RuntimeError: os-refresh-config failed. See log for details.
2018-02-20 16:22:05,237 ERROR:
#############################################################################
Undercloud install failed.
Comment 18 Carlos Camacho 2018-05-14 06:07:14 EDT
After applying the following patches (use the backports if you want to run this on 13):
https://review.openstack.org/#/c/568245/
https://review.openstack.org/#/c/564784/

The following commands should reinstall the UC from the backup (tested just now):


source ~/stackrc
rm -rf /var/tmp/test_uc_backup
mkdir -p /var/tmp/test_uc_backup
cd /var/tmp/test_uc_backup
openstack container delete undercloud-backups --recursive
openstack undercloud backup --exclude-path /home/stack/
openstack container save undercloud-backups
tar -xvf *.tar
gunzip *.gz
cat all-databases-*.sql | sudo mysql
for i in ceilometer glance heat ironic keystone neutron nova;do sudo mysql -e "drop user $i" || true;done
sudo mysql -e 'flush privileges'
openstack undercloud install


If you run the manual workflow instead, just make sure the DB dump is created correctly.
Comment 20 Carlos Camacho 2018-05-16 07:00:39 EDT
Hello, 

The latest docs version available is here: http://tripleo.org/install/controlplane_backup_restore/00_index.html
Comment 21 Omri Hochman 2018-05-16 13:46:10 EDT
(In reply to Carlos Camacho from comment #20)
> Hello, 
> 
> The latest docs version available is here:
> http://tripleo.org/install/controlplane_backup_restore/00_index.html

We attempted to follow the docs and failed on a conflict: /root/.my.cnf was conflicting with the install.
We would probably need to add a patch to deal with this.
Comment 22 Carlos Camacho 2018-05-17 10:17:11 EDT
OK, I'm working on a few Ansible playbooks so that we can agree on the actual docs test for this feature.

The playbooks will:
 1) Create the backup.
 2) Destroy the Undercloud node (remove packages, DB server and config files).
 3) Restore the Undercloud.
Comment 23 Carlos Camacho 2018-05-18 08:34:49 EDT
Hi, here are some playbooks for testing the backup/restore:

https://github.com/ccamacho/tripleo-ansible/tree/master/undercloud-backup-restore-check

I'll push this upstream in the docs.
Comment 25 Mike Orazi 2018-05-21 10:30:15 EDT
Omri,

Are you able to test with the playbooks Carlos has linked and provide some feedback?
Comment 26 Omri Hochman 2018-05-21 11:36:25 EDT
Carlos, correct me if I'm wrong, but AFAIK the last time the playbooks only partially succeeded, because we had an issue connecting to the endpoints after the undercloud was restored.

We can run them again when they are ready. I think it would also be great to verify the steps from the documentation.
Comment 27 Carlos Camacho 2018-05-22 09:37:22 EDT
Hey Omri, the issue is how we are testing it. I was trying to make it work on the same Undercloud used to launch the upgrade, removing the files and DB server and running it again (which currently works).

1) The first check worked using the same Undercloud node; that's what the playbooks currently do (tested in both my environment and yours).

2) What we tried on your env last Friday was to use another clean machine with the undercloud installed (but not configured). After the restore and the reinstall finished, "openstack stack list" worked but "openstack server list" failed with a keystone issue (probably because we are not restoring another config file).

The restore workflow works, as it's the same one we have been using all along [1]. The only new addition is the way we create the backup: running 'openstack undercloud backup', which creates the DB dump and copies the files.

I'm trying to reproduce this using Quickstart to integrate the playbooks upstream, but creating an Undercloud snapshot before deploying the Overcloud does not seem to be easy.

[1]: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/back_up_and_restore_the_director_undercloud/restore

I don't think this should be a blocker, as we are able to create the backup correctly and we can also test it using the same Undercloud node.
Comment 28 Omri Hochman 2018-05-22 10:41:50 EDT
(In reply to Carlos Camacho from comment #27)
> 
> I don't think this should be a blocker as we are able to create the backup
> correctly, also we can test it using the same Undercloud node.

We were using the same undercloud node, but it was cleaned (reverted to a snapshot) and re-installed (as the instructions suggest).

Removing the blocker should be a PM decision. In the past, AFAIR, we tested backup and restore with the following scenario:

(1) deploy UC + OC
(2) back up the undercloud
(3) copy the backup files to a side folder
(4) revert the undercloud machine to a clean RHEL
(5) re-install the undercloud
(6) copy the backup files back to the undercloud machine
(7) attempt to restore the undercloud from the backup -> it works

With the latest OSP13, this scenario did not pass for us.
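
Steps (3) and (6) move the backup files off the undercloud and back again. A sketch of a copy step that also confirms the copy is intact, so that a damaged archive can be ruled out when a restore fails. It assumes sha256sum from coreutils is available; the function name copy_and_verify is hypothetical and the destination is expected to be a file path, not a directory.

```shell
#!/bin/bash
# Hypothetical helper: copy a backup file and confirm the copy is intact
# by comparing SHA-256 digests of the source and the destination.
copy_and_verify() {
    local src="$1" dst="$2"
    local src_sum dst_sum
    cp -- "$src" "$dst" || return 1
    src_sum=$(sha256sum < "$src" | cut -d' ' -f1) || return 1
    dst_sum=$(sha256sum < "$dst" | cut -d' ' -f1) || return 1
    if [ "$src_sum" != "$dst_sum" ]; then
        echo "checksum mismatch for $dst" >&2
        return 1
    fi
}
```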
Comment 29 Carlos Camacho 2018-05-22 12:25:45 EDT
Let's get some feedback from the PM. In any case, this should be a documentation bug.
Comment 30 Marius Cornea 2018-05-22 13:31:04 EDT
I think Omri's use case is the most common backup/restore test case: the ability to restore backed-up data from a remote location on a clean undercloud machine. This covers the scenario where the undercloud machine is lost (due to a hardware failure, for example). To bring the undercloud machine back online, we should be able to restore the previously backed-up data on a clean RHEL undercloud machine.

I think the ability to restore the undercloud from backup after a failed upgrade attempt is a different test case but it should work in addition to Omri's test.
Comment 34 Carlos Camacho 2018-05-23 08:39:54 EDT
Hey Omri, you are right. The .pem files were failing; I was able to restore them, and now I'm having issues with glance (I didn't restore the glance images on the filesystem).

I'm collecting all the steps here for verification: https://gist.github.com/ccamacho/f22037dcb305d326182b81d5be61b279

Just let me finish restoring all the data to verify this docs amendment.

I'll try to put all the steps together upstream in an Ansible playbook, to test the steps in an upstream CI job (https://review.openstack.org/#/c/569991/).
Comment 35 Omri Hochman 2018-05-23 10:59:12 EDT
(In reply to Carlos Camacho from comment #34)
> Hey Omri you are right here, the .pem was failing I was able to restore them
> and now I'm having issues with glance (didn't restore the glance images on
> fs).

Hi Carlos, yes, we're testing with SSL enabled on the undercloud, so the .pem was one issue. The other issue, as you mentioned, was the glance images.

> 
> I'm taking all the steps here for verification
> https://gist.github.com/ccamacho/f22037dcb305d326182b81d5be61b279
> 
> Just let me finish to restore all the data to verify these docs amend.
> 
> I'll try to put all the steps together upstream, in an ansible playbook to
> test the steps in an upstream CI Job
> (https://review.openstack.org/#/c/569991/)

Looks like a good plan:
https://gist.github.com/ccamacho/f22037dcb305d326182b81d5be61b279

It's also great to have it tested with playbooks upstream. To verify this bug, we would need the updated/fixed steps from the playbooks to be in the documentation, so that QE can validate that the steps work for OSP13 RC downstream puddles.
Comment 37 Carlos Camacho 2018-05-28 07:34:13 EDT
Hello,

Last Friday I verified this on ohochman's env. The issue was that the Undercloud was using SSL to retrieve endpoint data, but the docs had no reference to how to update the certificates, so the post-install steps failed when installing the Undercloud.

I'm waiting for the QE people in Canada to wake up so I can show them how to run the steps and verify this BZ.
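
Both restore failures discussed above (the .pem files and the glance images) came from data that was not restored. A sketch of a check that lists which expected paths have no entry in a backup archive. The function name missing_from_archive is hypothetical, and the paths to check are up to the caller; where the SSL certificates live on a given undercloud is an assumption to confirm.

```shell
#!/bin/bash
# Hypothetical check: print each given path prefix that has no entry in
# a tar archive; return 1 if any is missing, 2 if the archive is unreadable.
# Prefixes are written without the leading '/', because tar strips it
# from member names (see the tar output in comment 8).
missing_from_archive() {
    local archive="$1"; shift
    local listing prefix rc=0
    listing=$(tar -tf "$archive") || return 2
    for prefix in "$@"; do
        # index() == 1 is a literal "starts with" match, no regex involved.
        if ! printf '%s\n' "$listing" |
            awk -v p="$prefix" 'index($0, p) == 1 { found = 1 } END { exit !found }'; then
            printf '%s\n' "$prefix"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, calling it with var/lib/glance/images and etc/pki (two of the paths in the manual tar command from comment 8) would report whichever of them is absent from the archive.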
Comment 38 Carlos Camacho 2018-05-29 03:15:41 EDT
I'll move this to MODIFIED when https://review.openstack.org/#/c/570554 is merged. Those steps were verified with QE yesterday.
Comment 39 Carlos Camacho 2018-05-29 04:58:25 EDT
Docs merged.
Comment 43 nlevinki 2018-06-18 10:03:34 EDT
Hi Derek,
Can you check this ticket and explain how we can close it?
Thanks
Comment 46 Lucy Bopf 2018-06-20 21:11:50 EDT
Moving to NEW, because this work has not been accepted or assigned yet by the RHOSP docs team.
Comment 49 Lucy Bopf 2018-07-04 03:46:46 EDT
Formally accepting this work into the RHOSP 13 z program and assigning to Dan for review.

Dan, we've since spoken about what you think is required for this update, and I believe it renders comment 48 obsolete.
Comment 58 Carlos Camacho 2018-08-02 05:09:07 EDT
We also have this info in https://docs.openstack.org/tripleo-docs/latest/install/controlplane_backup_restore/01_undercloud_backup.html

Which is tested in the Jenkins job from Omri.


Maybe this just needs a QA verification but should be fine.
Comment 59 Dan Macpherson 2018-08-06 02:45:14 EDT
I think the original issue has been resolved. Since we're up to 58 comments here, I'll close this BZ down so we can refocus on the "openstack undercloud backup" command in this new BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1612697
