Bug 1572017

Summary: Undercloud neutron and heat db corrupted after commenting out network-isolation.yaml from answers.yaml and doing overcloud deploy
Product: Red Hat OpenStack
Component: openstack-heat
Version: 12.0 (Pike)
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Target Milestone: ---
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Reporter: Stan Toporek <stoporek>
Assignee: Thomas Hervé <therve>
QA Contact: Ronnie Rasouli <rrasouli>
CC: bfournie, bhaley, emacchi, mbayer, mburns, sbaker, shardy, srevivo, stoporek, therve
Keywords: Triaged, ZStream
Type: Bug
Last Closed: 2018-10-03 06:56:48 UTC

Description Stan Toporek 2018-04-26 00:58:30 UTC
Description of problem:

The initial deployment went without issue. An update of the overcloud was then attempted and failed. A suggestion was made to comment out network-isolation.yaml in answers.yaml. This caused the undercloud neutron and heat databases to become corrupted during another update of the overcloud. Command used:

openstack overcloud deploy -r /home/stack/templates/roles_data.yaml --answers-file ~/answers.yaml --ntp-server a.ntp.br,b.ntp.br,c.ntp.br,pool.ntp.br

The following line was commented out in the answers.yaml file:

#  - /home/stack/openstack-tripleo-heat-templates/environments/network-isolation.yaml

Removal of this line caused the system to attempt a redeploy of the neutron networks. New network UUIDs were created in the undercloud neutron database and the UUIDs from the original install were removed. The heat templates that are created from the db had no knowledge of the existing networks. The overcloud deploy attempted to re-create the networks and found that they already existed. This caused the deploy to fail and left the undercloud neutron and heat databases corrupted with new UUIDs for already deployed networks.
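One way to confirm this kind of UUID drift is to save a name/UUID listing of the undercloud networks before and after a deploy and compare the two. The following is a minimal sketch, not a supported tool. It assumes each listing is plain text with one "<uuid> <name>" pair per line (for example, output saved from something like `openstack network list -f value -c ID -c Name`). The function names and the sample values in the usage note are hypothetical.

```python
def parse_listing(text):
    """Parse lines of '<uuid> <name>' into a {name: uuid} dict.

    Assumes one network per line, with the UUID first and the name after it.
    Blank lines are ignored.
    """
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        uuid, name = line.split(None, 1)
        mapping[name] = uuid
    return mapping


def changed_uuids(before, after):
    """Return {name: (old_uuid, new_uuid)} for networks whose UUID changed.

    Only names present in both listings are compared.
    """
    return {
        name: (before[name], after[name])
        for name in sorted(before.keys() & after.keys())
        if before[name] != after[name]
    }
```

For example, if "storage" is listed with UUID "aaa" before the deploy and "ccc" afterwards, `changed_uuids` reports `{"storage": ("aaa", "ccc")}`. An empty result means that no network present in both listings changed its UUID.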

Database backups were taken both before the issue and after the corruption. However, the earlier backup is missing 3 ceph nodes, so the database cannot simply be restored to its state before the corruption. That backup has the correct neutron and heat information but lacks the 3 critical ceph nodes.

The two neutron databases were compared using mysqldbcompare. Example: mysqldbcompare --server1=root@localhost --server2=root@backup_host:3310 neutron2:neutron. The customer is going to use the output of this comparison to merge the two databases and repair the neutron network UUID issue.
 
The production stack cannot be updated until the undercloud db is fixed.


Version-Release number of selected component (if applicable):


How reproducible:

Every time the network-isolation.yaml is commented out in answers.yaml
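Since the trigger is a commented-out environment entry, a pre-deploy check could flag such entries before `openstack overcloud deploy` is run. This is a minimal sketch under stated assumptions: it scans the answers file as plain text rather than parsing YAML, it only recognises list entries of the form "# - <path>.yaml", and the function name is hypothetical.

```python
import re

# Matches a commented-out YAML list entry such as:
#   #  - /path/to/environments/network-isolation.yaml
COMMENTED_ENV = re.compile(r"^\s*#\s*-\s*(\S+\.ya?ml)\s*$")


def commented_out_environments(answers_text):
    """Return the paths of environment files that are commented out."""
    found = []
    for line in answers_text.splitlines():
        match = COMMENTED_ENV.match(line)
        if match:
            found.append(match.group(1))
    return found
```

Given the commented line quoted in this report, the function returns a list containing the network-isolation.yaml path. A wrapper script could then refuse to deploy, or ask for confirmation, when the list is not empty.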

Steps to Reproduce:
1.
2.
3.

Actual results:

Corrupted undercloud db

Expected results:

Additional info:

Cannot attach undercloud backups and sosreport to bug because they are too large.

Comment 1 Emilien Macchi 2018-04-26 19:03:57 UTC
Several things here:

1) "A update of the overcloud was attempted and failed"

We'll need a dedicated BZ about this update failure and investigate properly.

2) "A suggestion was made to comment out the network-isolation.yaml from answers.yaml"

Who made the suggestion? Can you also share your templates?

3) "This caused the undercloud neutron and heat db's corrupted during another update of the overcloud"

I'm not sure about that statement. What made you think the database was corrupted? What was the symptom?



Before going to the next steps, I would like answers to these questions, so we can efficiently help you.

Comment 2 Stan Toporek 2018-04-27 14:31:17 UTC
Created attachment 1427695 [details]
Heat Templates

Comment 4 Stan Toporek 2018-04-27 15:25:41 UTC
Answer to 1:

I am in contact with Andrew Ludwar about the redeploy failure and am opening a separate bug for it, bug 1572686.

Answer to 2: Provided; the templates are attached to this bug.

Answer to 3: 

The redeploy created new UUIDs for the subnets and attempted to build new subnets. The subnets already existed, so creation failed, but the new UUIDs replaced the original UUIDs from the first deploy in the neutron and heat databases. That is the core of the issue.

I cannot upload databases due to size limitations. I will attempt to update bug with a diff of the databases using mysqldbcompare.

Comment 7 Stan Toporek 2018-05-07 15:21:10 UTC
Any updates?

Comment 11 Stan Toporek 2018-05-19 21:31:05 UTC
I cannot recreate the condition in a lab instance of the OSP 12 stack: removing network-isolation.yaml from the deploy does not create new networks there. I could recreate the first deployment error from bug 1572686. I then ran a deploy without network-isolation.yaml, and it worked without changing any network or subnet UUIDs. I will continue to try to replicate the issue.

Any updates on how to help the customer recover from this? They need to add more storage and compute nodes.

Comment 14 Stan Toporek 2018-05-24 16:35:59 UTC
Is there a way to get the data needed to reconstruct the heat db from the overcloud databases using SQL? Could we extract the data and reconstruct it into a database that we could restore into the undercloud db?

I am sure this is not the last time this situation will happen so having a solution would be very helpful in the future.

Comment 15 Stan Toporek 2018-05-24 19:21:49 UTC
Created attachment 1441229 [details]
Production overcloud db of running stack

Possible source of the network UUIDs needed for heat template db recovery

Comment 29 Stan Toporek 2018-06-08 15:59:31 UTC
Thanks, Thomas. Who do we need to look at the neutron database?

Comment 32 Brian Haley 2018-06-19 13:49:05 UTC
Is this still an issue, and is there any way I can help?

Comment 33 Brian Haley 2018-06-19 14:36:41 UTC
Sorry, maybe my question should have been: given there are two neutron DBs linked here, what are the problems starting up, or what are the differences we need to investigate? I'm guessing we have to make sure things are synced correctly based on the old heat template?

Comment 35 Brian Haley 2018-06-20 19:58:52 UTC
Please let me know if we need to set up a call, as it's still not clear to me what needs to be done with the DBs.

Comment 49 Thomas Hervé 2018-07-23 13:19:08 UTC
We fixed the database issue. Now we're getting some issues with Ceph, but hopefully they are close to being resolved.