Description of problem:
Initial deployment went without issue. An update of the overcloud was attempted and failed. A suggestion was made to comment out network-isolation.yaml in answers.yaml. This caused the undercloud neutron and heat databases to become corrupted during another update of the overcloud. Command used:
openstack overcloud deploy -r /home/stack/templates/roles_data.yaml --answers-file ~/answers.yaml --ntp-server a.ntp.br,b.ntp.br,c.ntp.br,pool.ntp.br
The following line was commented out in the answers.yaml file:
# - /home/stack/openstack-tripleo-heat-templates/environments/network-isolation.yaml
Removing this line caused the system to attempt a redeploy of the neutron networks. New network UUIDs were created in the undercloud neutron database and the UUIDs from the original install were removed. The heat templates that are created from the db had no knowledge of the existing networks. The overcloud deploy attempted to re-create the networks and found they already existed. This caused the deploy to fail and left the undercloud neutron and heat databases corrupted with new UUIDs for already-deployed networks.
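The failure described above starts with one environment file silently dropping out of the deploy. A minimal sketch of a pre-deploy check, assuming the answers file lists its environments one per line as `- /path/to/file.yaml`. The helper names `active_environments` and `missing_environments`, and the sample file contents, are hypothetical and not part of TripleO:

```python
def active_environments(answers_text):
    """Return environment paths listed as '- path' that are not commented out."""
    active = []
    for line in answers_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out entries are ignored by the deploy
        if stripped.startswith("- "):
            active.append(stripped[2:].strip())
    return active


def missing_environments(answers_text, required_basenames):
    """Return the required file names that no active entry provides."""
    active = {path.rsplit("/", 1)[-1] for path in active_environments(answers_text)}
    return sorted(name for name in required_basenames if name not in active)


# Hypothetical answers file with the network isolation entry commented out.
SAMPLE = """\
templates: /home/stack/openstack-tripleo-heat-templates
environments:
  # - /home/stack/openstack-tripleo-heat-templates/environments/network-isolation.yaml
  - /home/stack/templates/example-env.yaml
"""

print(missing_environments(SAMPLE, ["network-isolation.yaml"]))
```

A check like this only reports that an environment used in the original deploy is no longer active; it does not decide whether removing it is safe.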
Database backups were taken before the issue and after the corruption. However, the pre-corruption backup is missing 3 critical ceph nodes, so the database cannot simply be restored from it, even though its neutron and heat information is correct.
The two neutron databases were diffed using mysqldbcompare. Example: mysqldbcompare --server1=root@localhost --server2=root@backup_host:3310 neutron2:neutron. The customer is going to use the information collected in this procedure to merge the two databases and repair the neutron network UUID issue.
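To illustrate the kind of information such a comparison yields for the merge, here is a minimal sketch that compares name-to-UUID maps taken from the two copies of a table. It assumes the rows have already been exported into plain dictionaries; the function `diff_uuids`, the network names, and the placeholder IDs are all hypothetical:

```python
def diff_uuids(before, after):
    """Compare {name: uuid} maps taken from two copies of the same table."""
    shared = before.keys() & after.keys()
    # Same name in both copies, but the UUID was replaced.
    changed = {
        name: (before[name], after[name])
        for name in shared
        if before[name] != after[name]
    }
    only_before = sorted(before.keys() - after.keys())
    only_after = sorted(after.keys() - before.keys())
    return changed, only_before, only_after


# Placeholder data, not taken from the customer databases.
before_backup = {"storage": "old-uuid-1", "tenant": "old-uuid-2"}
after_corruption = {"storage": "new-uuid-1", "tenant": "old-uuid-2", "external": "new-uuid-3"}

changed, only_before, only_after = diff_uuids(before_backup, after_corruption)
print(changed)
print(only_before, only_after)
```

Rows reported as changed are the candidates for restoring the original UUID; rows present on only one side need a manual decision.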
The production stack cannot be updated until the undercloud db is fixed.
Version-Release number of selected component (if applicable):
Every time the network-isolation.yaml is commented out in answers.yaml
Steps to Reproduce:
Corrupted undercloud db
Cannot attach undercloud backups and sosreport to bug because they are too large.
Several things here:
1) "A update of the overcloud was attempted and failed"
We'll need a dedicated BZ for this update failure so it can be investigated properly.
2) "A suggestion was made to comment out the network-isolation.yaml from answers.yaml"
Who made the suggestion? Can you also share your templates?
3) "This caused the undercloud neutron and heat db's corrupted during another update of the overcloud"
I'm not sure about that statement. What made you think the database was corrupted? What was the symptom?
Before going to the next steps, I would like answers to these questions, so we can efficiently help you.
Created attachment 1427695
Answer to 1:
In contact with Andrew Ludwar about the redeploy failure; opening another bug, 1572686.
Answer to 2: Provided; templates uploaded to the bug.
Answer to 3:
The redeploy created new UUIDs for the subnets and attempted to build new subnets. The subnets already existed, so creation failed, but the new UUIDs replaced the original UUIDs from the first deploy in the neutron and heat databases. That is the core of the issue.
I cannot upload the databases due to size limitations. I will attempt to update the bug with a diff of the databases using mysqldbcompare.
I cannot reproduce, in a lab instance of an OSP 12 stack, the condition where removing network-isolation.yaml from the deploy creates new networks. I could reproduce the first deployment error from bug 1572686. I then did a deploy without network-isolation.yaml and it worked without changing any network or subnet UUIDs. I will continue trying to replicate the issue.
Any updates on how to help the customer recover from this? They need to add more storage and compute nodes.
Is there a way to get the data needed to reconstruct the heat db from the overcloud databases using SQL? Could we extract the data and reconstruct it into a database we could recover into the undercloud db?
I am sure this is not the last time this situation will happen so having a solution would be very helpful in the future.
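As a rough illustration of the SQL extraction asked about above, the sketch below pulls network and subnet IDs with a single join. It uses an in-memory sqlite3 database as a stand-in for MySQL, and the simplified `networks` and `subnets` tables, their columns, and the sample rows are assumptions for illustration rather than the real neutron schema:

```python
import sqlite3

# In-memory stand-in for whichever database still holds the original UUIDs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE networks (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE subnets (id TEXT PRIMARY KEY, name TEXT, network_id TEXT, cidr TEXT);
INSERT INTO networks VALUES ('net-uuid-1', 'storage');
INSERT INTO subnets VALUES ('sub-uuid-1', 'storage_subnet', 'net-uuid-1', '192.0.2.0/24');
""")


def network_subnet_ids(connection):
    """Return (network name, network id, subnet name, subnet id, cidr) rows."""
    return connection.execute(
        "SELECT n.name, n.id, s.name, s.id, s.cidr "
        "FROM networks AS n JOIN subnets AS s ON s.network_id = n.id "
        "ORDER BY n.name, s.name"
    ).fetchall()


for row in network_subnet_ids(conn):
    print(row)
```

The extracted rows would still have to be matched against the heat resources by hand or by a separate script; this sketch covers only the extraction step.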
Created attachment 1441229
Production overcloud db of running stack
Possible source of the network UUIDs needed for heat template db recovery
Thanks, Thomas. Who do we need to look at the neutron database?
Is this still an issue, and is there any way I can help?
Sorry, maybe my question should have been: given there are two neutron DBs linked here, what are the problems starting up, or what are the differences we need to investigate? I'm guessing we have to make sure things are synced correctly based on the old heat template?
Please let me know if we need to set up a call, as it's still not clear to me what needs to be done with the DBs.
We fixed the database issue. Now we're getting some issues with Ceph, but hopefully they are close to being handled.