Bug 1557794 - [Docs][Back Up and Restore] Update Back Up and Restore the Director Undercloud for RHOSP 13 (fails in a RHOSP 13 env when following the procedure and steps as it was documented for RHOSP 12)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 13.0 (Queens)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: Dan Macpherson
QA Contact: Julie
URL:
Whiteboard: docs-accepted
Depends On: 1594279
Blocks: 1597920
Reported: 2018-03-18 17:32 UTC by Omri Hochman
Modified: 2022-08-02 17:19 UTC (History)
CC List: 23 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
A regression was identified in the procedure for backing up and restoring the director undercloud. As a result, the procedure requires modification and verification before it can be published. The book 'Back Up and Restore the Director Undercloud' is therefore not available with the general availability of Red Hat OpenStack Platform 13. The procedure will be updated as a priority after the general availability release, and published as soon as it is verified.
Clone Of:
Environment:
Last Closed: 2018-08-06 06:45:34 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 558429 | 0 | 'None' | MERGED | [Backup/Restore] Fix nits in docs and clarification note. | 2020-11-12 04:38:32 UTC
OpenStack gerrit | 570554 | 0 | 'None' | MERGED | Fix docs for Undercloud backup and restore | 2020-11-12 04:38:32 UTC
OpenStack gerrit | 572800 | 0 | 'None' | MERGED | Add keystone credential-keys and fernet-keys clarification when restoring the Undercloud | 2020-11-12 04:38:32 UTC
Red Hat Bugzilla | 1612697 | 0 | unspecified | CLOSED | [Docs] Create module for "openstack undercloud backup" command | 2021-02-22 00:41:40 UTC
Red Hat Issue Tracker | OSP-5001 | 0 | None | None | None | 2022-08-02 17:19:53 UTC
Red Hat Issue Tracker | UPG-2245 | 0 | None | None | None | 2021-09-09 13:28:31 UTC

Internal Links: 1612697

Description Omri Hochman 2018-03-18 17:32:40 UTC
[OSP13] Undercloud backup and restore fails when following the procedure as documented for OSP12 (fails on an OSP13 environment).

Environment:
------------
OSP13 Puddle 2018-03-16.1

Description : 
---------------
Undercloud backup and restore fails when following the procedure and steps as documented for OSP12 (it fails on an OSP13 environment).


Results : 
---------
(1) /etc/my.cnf.d/server.cnf is missing (on the OSP13 environment).
(2) When attempting to use mariadb-server.cnf (instead of server.cnf), the command "cat /root/undercloud-all-databases.sql | mysql" failed with an ERROR.

Docs link:
-----------
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/back_up_and_restore_the_director_undercloud/restore

 

List of files that exist under /etc/my.cnf.d/:
-----------------------------------------
[root@undercloud75 ~]# cd /etc/my.cnf.d/
[root@undercloud75 my.cnf.d]# ls
auth_gssapi.cnf  client.cnf  enable_encryption.preset  galera.cnf  mariadb-server.cnf  mysql-clients.cnf  tokudb.cnf
[root@undercloud75 my.cnf.d]# ls -ls
total 28
4 -rw-r--r--. 1 root root   41 Feb  1 12:42 auth_gssapi.cnf
4 -rw-r--r--. 1 root root  295 Dec 14  2016 client.cnf
4 -rw-r--r--. 1 root root  763 Dec 14  2016 enable_encryption.preset
4 -rw-r--r--. 1 root root  999 Feb 20 15:01 galera.cnf
4 -rw-r--r--. 1 root root 1462 Feb  1 12:42 mariadb-server.cnf
4 -rw-r--r--. 1 root root  232 Dec 14  2016 mysql-clients.cnf
4 -rw-r--r--. 1 root root  285 Dec 14  2016 tokudb.cnf



When attempting to use mariadb-server.cnf instead of server.cnf:
("cat /root/undercloud-all-databases.sql | mysql" -> FAILED)
-----------------------------------------------------------------
[root@undercloud75 ~]#   tar -xzC / -f undercloud-backup-*.tar.gz etc/my.cnf.d/mariadb-server.cnf
[root@undercloud75 ~]#   tar -xzC / -f undercloud-backup-*.tar.gz root/undercloud-all-databases.sql
tar: root/undercloud-all-databases.sql: time stamp 2018-02-20 19:31:31.598594491 is 15509.047497842 s in the future
[root@undercloud75 ~]#   systemctl start mariadb
[root@undercloud75 ~]#   mysql -uroot -e"set global max_allowed_packet = 16777216;"
[root@undercloud75 ~]#   cat /root/undercloud-all-databases.sql | mysql
ERROR 1911 (HY000) at line 3992: Unknown option 'STATS_PERSISTENT'
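
To see which statement the import stopped on, it can help to print the dump lines around the line number that mysql reports (3992 here). The following is a minimal sketch and not part of the documented procedure; the helper name dump_context and its radius argument are hypothetical.

```shell
#!/bin/sh
# dump_context FILE LINE [RADIUS]
# Print the lines of FILE from LINE-RADIUS to LINE+RADIUS (RADIUS defaults to 2).
# Hypothetical helper for inspecting the statement a mysql import error points at.
dump_context() {
    file="$1"
    line="$2"
    radius="${3:-2}"
    start=$((line - radius))
    end=$((line + radius))
    # sed line addresses start at 1, so clamp the lower bound.
    if [ "$start" -lt 1 ]; then
        start=1
    fi
    sed -n "${start},${end}p" "$file"
}
```

For example, `dump_context /root/undercloud-all-databases.sql 3992` would print lines 3990 to 3994 of the dump.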

Comment 2 Carlos Camacho 2018-03-22 08:57:29 UTC
Hey Omri,

Can you clarify the steps executed from the docs?
I'm trying to follow your steps in my environment, but I'm not getting them right.

Comment 3 Carlos Camacho 2018-03-26 09:06:59 UTC
Hello Omri,

We are working on a new refactor of the backup/restore procedures; would you mind checking the upstream review? https://review.openstack.org/#/c/544975/


Cheers,
Carlos.

Comment 4 Omri Hochman 2018-03-28 07:48:52 UTC
(In reply to Carlos Camacho from comment #2)
> Hey Omri,
> 
> Can you clarify the steps executed from the docs?
> I'm trying to follow your steps in my environment, but I'm not getting them
> right.

Hi Carlos, I was following the official RH documentation for OSP12, specifically the section that explains how to back up and restore the undercloud (link in the BZ body).


 (In reply to Carlos Camacho from comment #3)
> Hello Omri,
> 
> We are working on a new refactor of the backup/restore procedures; would you
> mind checking the upstream review?
> https://review.openstack.org/#/c/544975/
> 

That's great. I'll try to follow those steps and report back here.


> 
> Cheers,
> Carlos.

Comment 6 Omri Hochman 2018-03-29 08:20:11 UTC
(In reply to Carlos Camacho from comment #3)
> Hello Omri,
> 
> We are working on a new refactor of the backup/restore procedures; would you
> mind checking the upstream review?
> https://review.openstack.org/#/c/544975/
> 
> 
> Cheers,
> Carlos.

When attempting to run the manual commands suggested in the documentation patch https://review.openstack.org/#/c/544975/,
it seems that, again, there is a request to tar /etc/my.cnf.d/server.cnf,
which is a file that no longer exists in the OSP13 environment (as reported in the BZ body).

Comment 7 Carlos Camacho 2018-04-03 08:50:35 UTC
Hello Omri,

You are right there: depending on the MySQL version, the configuration file can have a different name.

I pushed a fix for the latest docs in https://review.openstack.org/#/c/558429
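
One way to cope with the differing file name is to probe for whichever candidate exists before adding it to the archive. The following is a minimal sketch; the function name find_server_cnf is hypothetical, and the two candidate names are only the ones mentioned in this report (server.cnf from the OSP12 docs, mariadb-server.cnf seen on the OSP13 undercloud).

```shell
#!/bin/sh
# find_server_cnf [DIR]
# Print the path of the first MariaDB server config file found in DIR
# (default: /etc/my.cnf.d). Candidate names are taken from this report only;
# other versions may use different names.
find_server_cnf() {
    conf_dir="${1:-/etc/my.cnf.d}"
    for name in server.cnf mariadb-server.cnf; do
        if [ -f "$conf_dir/$name" ]; then
            printf '%s\n' "$conf_dir/$name"
            return 0
        fi
    done
    echo "no MariaDB server config found in $conf_dir" >&2
    return 1
}
```

The printed path could then be passed to tar in place of the hard-coded /etc/my.cnf.d/server.cnf. Archiving the whole /etc/my.cnf.d directory avoids the lookup entirely.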

Comment 8 Omri Hochman 2018-04-04 02:35:55 UTC
(In reply to Carlos Camacho from comment #7)
> Hello Omri,
> 
> You are right there, depending on the MySQL version we can have a different
> configuration file name.
> 
> I pushed a fix for the latest docs in https://review.openstack.org/#/c/558429


Trying to follow the steps on the OSP13 environment:
-------------------------------------------
[root@undercloud75 ~]# tar --ignore-failed-read -czf \
>       undercloud-backup-`date +%F`.tar.gz \
>       /root/undercloud-all-databases.sql \
>       /etc/my.cnf.d \
>       /var/lib/glance/images \
>       /srv/node \
>       /home/stack \
>       /etc/pki \
>       /opt/stack
tar: Removing leading `/' from member names
tar: /root/undercloud-all-databases.sql: Warning: Cannot stat: No such file or directory

It seems we're missing undercloud-all-databases.sql:

[root@undercloud75 ~]# find / -name undercloud-all-databases.sql
[root@undercloud75 ~]# 
[stack@undercloud75 ~]$ sudo find / -name  *.sql
/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/migrate_repo/versions/246_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/003_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/006_mysql_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/006_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/045_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/011_sqlite_upgrade.sql
/usr/lib/python2.7/site-packages/glance/db/sqlalchemy/migrate_repo/versions/037_sqlite_upgrade.sql
/usr/share/mariadb/mroonga/install.sql
/usr/share/mariadb/mroonga/uninstall.sql
/usr/share/mariadb/fill_help_tables.sql
/usr/share/mariadb/install_spider.sql
/usr/share/mariadb/maria_add_gis_sp.sql
/usr/share/mariadb/maria_add_gis_sp_bootstrap.sql
/usr/share/mariadb/mysql_performance_tables.sql
/usr/share/mariadb/mysql_system_tables.sql
/usr/share/mariadb/mysql_system_tables_data.sql
/usr/share/mariadb/mysql_test_data_timezone.sql
/usr/share/mariadb/mysql_to_mariadb.sql
/usr/share/puppet/ext/dbfix.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/01_HyperScale.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/02_HyperScaleStatsSchema.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/03_HyperScaleWorkflow.sql
/usr/share/openstack-puppet/modules/veritas_hyperscale/files/scripts/db/51_HyperScaleAlertsDescription.sql

Comment 9 Carlos Camacho 2018-04-04 07:36:50 UTC
Hello,

Omri, did you run the previous step? It is this one:

  mysqldump --opt --single-transaction --all-databases > /root/undercloud-all-databases.sql

If you don't create the DB dump you won't be able to zip it.
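
Because tar is run with --ignore-failed-read, a missing dump only produces a warning and the archive is still created. A small guard can make that failure explicit. This is a minimal sketch; the function name require_dump is hypothetical, and the default path is the one used in the steps above.

```shell
#!/bin/sh
# require_dump [FILE]
# Succeed only if the database dump exists and is not empty.
# FILE defaults to the dump path used in the manual backup steps.
require_dump() {
    dump="${1:-/root/undercloud-all-databases.sql}"
    if [ ! -s "$dump" ]; then
        echo "missing or empty dump: $dump (run mysqldump first)" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the archive step, for example `require_dump && tar --ignore-failed-read -czf ...`, so that the backup stops instead of silently omitting the database.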

Comment 10 Omri Hochman 2018-04-04 14:36:01 UTC
(undercloud) [stack@undercloud75 ~]$ openstack undercloud backup --add-path /etc/hosts \
>                               --add-path /var/log/ \
>                               --add-path /var/lib/glance/images/ \
>                               --add-path /srv/node/ \
>                               --add-path /etc/

The Swift API fails during the backup:
-----------------------------------------

                          u'version': u'2.0'},
                u'updated_at': u'2018-02-22 15:08:57'},
 u'message': {u'msg': u'Object PUT failed: https://192.168.0.2:13808/v1/AUTH_5b0b3efc458a4fc4af8e21487556aeca/undercloud-backups/UC-backup-20180222101308.tar 413 Request Entity Too Large  [first 60 chars of response] <html><h1>Request Entity Too Large</h1><p>The body of your r'},
 u'status': u'FAILED'}

[root@undercloud75 undercloud-backup-JjgPDr]# ls -lah
total 8.2G
drwx------.  2 mistral mistral   86 Feb 22 10:09 .
drwxrwxrwt. 14 root    root    4.0K Feb 22 10:08 ..
-rw-r--r--.  1 mistral mistral  35M Feb 22 10:09 all-databases-20180222100900.sql.gz
-rw-r--r--.  1 mistral mistral 8.2G Feb 22 10:13 filesystem-20180222100929.tar
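
The 413 response is consistent with the 8.2G filesystem tarball exceeding Swift's single-object size limit. Below is a minimal sketch of a size check that could be run before uploading. It assumes the limit is Swift's default max_file_size of 5368709122 bytes, since the value configured on this undercloud was not reported; the function name fits_single_object is hypothetical.

```shell
#!/bin/sh
# fits_single_object FILE [LIMIT_BYTES]
# Return 0 if FILE is no larger than LIMIT_BYTES, 1 if it is larger,
# and 2 if FILE does not exist.
# LIMIT_BYTES defaults to 5368709122, Swift's default max_file_size;
# the limit actually in effect on a given undercloud is an assumption here.
fits_single_object() {
    file="$1"
    limit="${2:-5368709122}"
    if [ ! -f "$file" ]; then
        return 2
    fi
    size=$(wc -c < "$file")
    [ "$size" -le "$limit" ]
}
```

With the default limit, a check such as `fits_single_object filesystem-20180222100929.tar` would fail for the 8.2G tarball listed above, which points at excluding large paths from the backup.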

Comment 11 Omri Hochman 2018-04-04 17:06:41 UTC
(In reply to Omri Hochman from comment #10)
> (undercloud) [stack@undercloud75 ~]$ openstack undercloud backup --add-path
> /etc/hosts \
> >                               --add-path /var/log/ \
> >                               --add-path /var/lib/glance/images/ \
> >                               --add-path /srv/node/ \
> >                               --add-path /etc/
> 

I opened the following BZ for the "openstack undercloud backup" command: https://bugzilla.redhat.com/show_bug.cgi?id=1563783

We'll track the "manual backup steps" on this ticket.

Comment 12 mathieu bultel 2018-04-11 10:57:52 UTC
Hi Omri,
Can you set the pm_ack for this BZ?

Thank you,
Mathieu

Comment 13 Omri Hochman 2018-04-16 14:16:11 UTC
(In reply to mathieu bultel from comment #12)
> Hi Omri,
> Can you set the pm_ack for this BZ.
> 
> Thank you,
> Mathieu

The PM_ACK should be set by the PM. Adding Jarda.

Comment 14 Jaromir Coufal 2018-04-16 14:26:53 UTC
granted

Comment 15 Omri Hochman 2018-04-16 22:51:14 UTC
I attempted to restore by running the steps manually and encountered the following issue:

[root@undercloud75 ~]# cat /root/undercloud-all-databases.sql | mysql
ERROR 2006 (HY000) at line 3153: MySQL server has gone away

 [root@undercloud75 ~]# for i in ceilometer glance heat ironic keystone neutron nova;do mysql -e "drop user $i";done
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'ceilometer'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'glance'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'heat'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'ironic'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'keystone'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'neutron'@'%'
ERROR 1396 (HY000) at line 1: Operation DROP USER failed for 'nova'@'%'


Later, I attempted to continue despite the errors, which resulted in an
undercloud install failure.

RuntimeError: os-refresh-config failed. See log for details.
2018-02-20 16:22:05,237 ERROR:
#############################################################################
Undercloud install failed.

Comment 18 Carlos Camacho 2018-05-14 10:07:14 UTC
After applying the following patches (if you want to run this on 13, use the backports):
https://review.openstack.org/#/c/568245/
https://review.openstack.org/#/c/564784/

The following commands should reinstall the UC from the backup (tested just now):


source ~/stackrc
rm -rf /var/tmp/test_uc_backup
mkdir -p /var/tmp/test_uc_backup
cd /var/tmp/test_uc_backup
openstack container delete undercloud-backups --recursive
openstack undercloud backup --exclude-path /home/stack/
openstack container save undercloud-backups
tar -xvf *.tar
gunzip *.gz
cat all-databases-*.sql | sudo mysql
for i in ceilometer glance heat ironic keystone neutron nova;do sudo mysql -e "drop user $i" || true;done
sudo mysql -e 'flush privileges'
openstack undercloud install


In this case, if you are running the manual workflow, just be sure you create the DB dump correctly.

Comment 20 Carlos Camacho 2018-05-16 11:00:39 UTC
Hello, 

The latest docs version available is here: http://tripleo.org/install/controlplane_backup_restore/00_index.html

Comment 21 Omri Hochman 2018-05-16 17:46:10 UTC
(In reply to Carlos Camacho from comment #20)
> Hello, 
> 
> The latest docs version available is here:
> http://tripleo.org/install/controlplane_backup_restore/00_index.html

I attempted to follow the docs, but the install failed on a conflict: /root/.my.cnf was conflicting with the install. We would probably need to add a patch to deal with it.

Comment 22 Carlos Camacho 2018-05-17 14:17:11 UTC
Ok, I'm working on a few ansible playbooks to agree on the actual docs test for this feature.

The playbooks will:
 1) Create the backup.
 2) Destroy the Undercloud node (remove packages, DB server and config files).
 3) Restore the Undercloud.

Comment 23 Carlos Camacho 2018-05-18 12:34:49 UTC
Hi, here are some playbooks for testing the backup/restore:

https://github.com/ccamacho/tripleo-ansible/tree/master/undercloud-backup-restore-check

I'll push this upstream in docs.

Comment 25 Mike Orazi 2018-05-21 14:30:15 UTC
Omri,

Are you able to test with the playbooks Carlos has linked and provide some feedback?

Comment 26 Omri Hochman 2018-05-21 15:36:25 UTC
Carlos, correct me if I'm wrong: AFAIK, the last time the playbooks only partially succeeded, as we had an issue connecting to the endpoints after the undercloud was restored?

We can run them again when they're ready. I think it would also be great to verify the steps from the documentation.

Comment 27 Carlos Camacho 2018-05-22 13:37:22 UTC
Hey Omri, the thing is how we are testing it. The way I was trying to make it work was on the same Undercloud used to launch the upgrade: removing the files and DB server and running it again (which currently works).

1) The first check worked using the same Undercloud node. That's what the playbooks do currently (tested in both my environment and yours).

2) What we tried on your env last Friday was to use another clean machine with the undercloud installed (but not configured). After the restore and the reinstall finished, "openstack stack list" worked but "openstack server list" failed with a keystone issue (probably because we are skipping the restore of another config file).

The workflow for the restore works, as it's the same one we have been using all along[1]. The only new addition is the way we create the backup: running 'openstack undercloud backup', which creates the DB dump and copies the files.

I'm trying to reproduce this using Quickstart to integrate the playbooks upstream, but creating an Undercloud snapshot before deploying the Overcloud does not seem to be easy to do.

[1]: https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/back_up_and_restore_the_director_undercloud/restore

I don't think this should be a blocker, as we are able to create the backup correctly, and we can also test it using the same Undercloud node.

Comment 28 Omri Hochman 2018-05-22 14:41:50 UTC
(In reply to Carlos Camacho from comment #27)
> 
> I don't think this should be a blocker as we are able to create the backup
> correctly, also we can test it using the same Undercloud node.

We were using the same undercloud node, but it was cleaned (reverted to a snapshot) and re-installed (as the instructions suggest).

Removing the blocker should be a PM decision. In the past, AFAIR, we tested backup and restore with the following scenario:

(1) deploy UC + OC
(2) back up the undercloud
(3) copy the backup files to a side folder
(4) revert the undercloud machine to a clean RHEL
(5) re-install the undercloud
(6) copy the backup files back to the undercloud machine
(7) attempt to restore the undercloud from the backup -> it works

With the latest OSP13, this scenario did not PASS for us.

Comment 29 Carlos Camacho 2018-05-22 16:25:45 UTC
Let's have some feedback from PM, in any case this should be a documentation bug.

Comment 30 Marius Cornea 2018-05-22 17:31:04 UTC
I think Omri's use case is the most common backup/restore test case: the ability to restore backed-up data from a remote location on a clean undercloud machine. This covers the scenario where the undercloud machine is lost (due to a hardware failure, for example). In order to bring the undercloud machine back online, we should be able to restore the previously backed-up data on a clean RHEL undercloud machine.

I think the ability to restore the undercloud from backup after a failed upgrade attempt is a different test case but it should work in addition to Omri's test.

Comment 34 Carlos Camacho 2018-05-23 12:39:54 UTC
Hey Omri, you are right here: the .pem was failing. I was able to restore the .pem files, and now I'm having issues with glance (I didn't restore the glance images on the filesystem).

I'm recording all the steps here for verification: https://gist.github.com/ccamacho/f22037dcb305d326182b81d5be61b279

Just let me finish restoring all the data to verify this docs amendment.

I'll try to put all the steps together upstream in an ansible playbook, to test the steps in an upstream CI job (https://review.openstack.org/#/c/569991/).

Comment 35 Omri Hochman 2018-05-23 14:59:12 UTC
(In reply to Carlos Camacho from comment #34)
> Hey Omri you are right here, the .pem was failing I was able to restore them
> and now I'm having issues with glance (didn't restore the glance images on
> fs).

Hi Carlos. Yes, we're testing with SSL enabled on the undercloud, so the .pem was one issue; the other issue, as you mentioned, was the glance images.

> 
> I'm taking all the steps here for verification
> https://gist.github.com/ccamacho/f22037dcb305d326182b81d5be61b279
> 
> Just let me finish to restore all the data to verify these docs amend.
> 
> I'll try to put all the steps together upstream, in an ansible playbook to
> test the steps in an upstream CI Job
> (https://review.openstack.org/#/c/569991/)

Looks like a good plan:
https://gist.github.com/ccamacho/f22037dcb305d326182b81d5be61b279

And it's great to have it tested with playbooks upstream. To verify this bug, we would need the updated/fixed steps from the playbooks to be in the documentation, so that in QE we can validate that the steps work for OSP13 RC downstream puddles.

Comment 37 Carlos Camacho 2018-05-28 11:34:13 UTC
Hello,

Last Friday I verified this on ohochman's env. The issue was that the Undercloud was using SSL to retrieve endpoint data, but there was no docs reference about how to update the certificates; thus, the post-install steps fail when installing the Undercloud.

I'm waiting for the QE people from Canada to wake up so I can show them how to run the steps and verify this BZ.

Comment 38 Carlos Camacho 2018-05-29 07:15:41 UTC
I'll move this to MODIFIED when https://review.openstack.org/#/c/570554 is merged. Those steps were verified with QE yesterday.

Comment 39 Carlos Camacho 2018-05-29 08:58:25 UTC
Docs merged.

Comment 43 nlevinki 2018-06-18 14:03:34 UTC
Hi Derek,
Can you check this ticket and explain how we can close it?
Thanks

Comment 46 Lucy Bopf 2018-06-21 01:11:50 UTC
Moving to NEW, because this work has not been accepted or assigned yet by the RHOSP docs team.

Comment 49 Lucy Bopf 2018-07-04 07:46:46 UTC
Formally accepting this work into the RHOSP 13 z program and assigning to Dan for review.

Dan, we've since spoken about what you think is required for this update, and I believe it renders comment 48 obsolete.

Comment 58 Carlos Camacho 2018-08-02 09:09:07 UTC
We also have this info in https://docs.openstack.org/tripleo-docs/latest/install/controlplane_backup_restore/01_undercloud_backup.html

Which is tested in the Jenkins job from Omri.


Maybe this just needs a QA verification but should be fine.

Comment 59 Dan Macpherson 2018-08-06 06:45:14 UTC
I think the original issue has been resolved. Since we're up to 58 comments here, I'll close this BZ down so we can refocus on the "openstack undercloud backup" command in this new BZ:

https://bugzilla.redhat.com/show_bug.cgi?id=1612697

