Bug 1383700

Summary: After upgrade from OSP8 to OSP9 neutron server fails to start on 2/3 nodes
Product: Red Hat OpenStack
Reporter: Jeremy <jmelvin>
Component: openstack-neutron
Assignee: Bernard Cafarelli <bcafarel>
Status: CLOSED WONTFIX
QA Contact: Toni Freger <tfreger>
Severity: high
Docs Contact:
Priority: high
Version: 9.0 (Mitaka)
CC: amuller, bcafarel, bhaley, chrisw, jmelvin, nyechiel, oblaut, srevivo
Target Milestone: async
Keywords: Triaged, ZStream
Target Release: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-05-14 15:10:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: ovs_neutorn output from overcloud controller
Flags: none

Description Jeremy 2016-10-11 13:29:37 UTC
Description of problem: neutron-server fails to start; pacemaker shows it started on only 1 controller. When I check neutron/server.log on the failed node, I see errors related to mysql on controller-1:

2016-10-10 18:01:13.602 6966 ERROR neutron DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry 'datacentre-1' for key 'PRIMARY'") [SQL: u'INSERT INTO ml2_vlan_allocations (physical_network, vlan_id, allocated) VALUES (%s, %s, %s)'] [parameters: (('datacentre', 1, 0), ('datacentre', 2, 0), ('datacentre', 3, 0), ('datacentre', 4, 0), ('datacentre', 5, 0), ('datacentre', 6, 0), ('datacentre', 7, 0), ('datacentre', 8, 0)  ... displaying 10 of 999 total bound parameter sets ...  ('datacentre', 998, 0), ('datacentre', 999, 0))]
2016-10-10 18:01:13.602 6966 ERROR neutron




Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-5.el7ost.noarch 


How reproducible:
unknown

Steps to Reproduce:
1.upgrade from osp8 to osp9
2.notice neutron server fails to start on all 3 controllers.
3.

Actual results:
neutron fails to start

Expected results:
neutron and other neutron services start

Additional info:
From pcs status:

 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Comment 2 Assaf Muller 2016-10-11 14:11:55 UTC
(In reply to Jeremy from comment #0)

[...]

> 1.upgrade from osp8 to osp9
> 2.notice neutron server fails to start on all 3 controllers.

[...]

> 
>  Clone Set: neutron-server-clone [neutron-server]
>      Started: [ overcloud-controller-0 ]
>      Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Looking at the pcs output and the neutron-server log, it seems that only neutron-server on controllers 1 and 2 failed to start, while neutron-server on controller 0 is up and running. Just making sure we have the same understanding of the situation.

Comment 3 Jeremy 2016-10-11 14:15:31 UTC
correct

Comment 4 Assaf Muller 2016-10-11 14:23:33 UTC
Can you please show output of:

mysql
use neutron;
select * from ml2_vlan_allocations;

Comment 5 Jeremy 2016-10-11 22:16:50 UTC
MariaDB [neutron]> select * from ml2_vlan_allocations;
+------------------+---------+-----------+
| physical_network | vlan_id | allocated |
+------------------+---------+-----------+
| physnet1         |    1000 |         0 |
| physnet1         |    1001 |         0 |
| physnet1         |    1002 |         0 |
| physnet1         |    1003 |         0 |
.........................................
| physnet1         |    2995 |         0 |
| physnet1         |    2996 |         0 |
| physnet1         |    2997 |         0 |
| physnet1         |    2998 |         0 |
| physnet1         |    2999 |         0 |
+------------------+---------+-----------+

Comment 6 Jeremy 2016-10-12 18:53:44 UTC
Created attachment 1209709 [details]
ovs_neutorn output from overcloud controller

Comment 8 Bernard Cafarelli 2016-10-13 14:23:05 UTC
OK, so the output matches what the other nodes try to add (physical_network="datacentre", vlan_id from 1 to 1000).
Looking at the server.log timestamps on the 3 nodes, all 3 were started at the same time. This looks like a race condition in the VlanTypeDriver initialization, in the _sync_vlan_allocations() function, with all nodes trying to insert the same dataset at the same time.

As a temporary workaround, restarting the neutron-server service on the failed nodes should work (the code only adds the missing allocatable VLANs to the table, so it should do nothing, as controller 0 has already started and done the job).

I will be looking for the root cause in the meantime
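
To illustrate the suspected race, here is a minimal simulation. It is a sketch under stated assumptions, not Neutron's actual code: an in-memory sqlite3 table stands in for the MariaDB/Galera ml2_vlan_allocations table, the helper names (make_db, missing_vlans, insert_vlans) are hypothetical, and the concurrent startup is simulated by letting all three controllers run their read phase before any of them writes. The VLAN range 1 to 999 is taken from the logged INSERT.

```python
import sqlite3

PHYSNET = "datacentre"
VLAN_MIN, VLAN_MAX = 1, 999  # range seen in the logged INSERT


def make_db():
    """Create a stand-in table with the same composite primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ml2_vlan_allocations ("
        " physical_network TEXT NOT NULL,"
        " vlan_id INTEGER NOT NULL,"
        " allocated INTEGER NOT NULL,"
        " PRIMARY KEY (physical_network, vlan_id))"
    )
    return conn


def missing_vlans(conn):
    """Read phase: list allocatable VLANs that are not in the table yet."""
    rows = conn.execute(
        "SELECT vlan_id FROM ml2_vlan_allocations WHERE physical_network = ?",
        (PHYSNET,),
    )
    present = {vlan_id for (vlan_id,) in rows}
    return [v for v in range(VLAN_MIN, VLAN_MAX + 1) if v not in present]


def insert_vlans(conn, vlan_ids):
    """Write phase: bulk-insert the rows computed earlier (rolls back on error)."""
    with conn:
        conn.executemany(
            "INSERT INTO ml2_vlan_allocations VALUES (?, ?, 0)",
            [(PHYSNET, v) for v in vlan_ids],
        )


conn = make_db()
controllers = ("controller-0", "controller-1", "controller-2")

# All three controllers read the still-empty table before anyone writes,
# so each one plans to insert the full range.
plans = {name: missing_vlans(conn) for name in controllers}

# The first writer succeeds; the others hit the primary-key constraint.
first_attempt = {}
for name in controllers:
    try:
        insert_vlans(conn, plans[name])
        first_attempt[name] = "started"
    except sqlite3.IntegrityError:
        first_attempt[name] = "failed"

# "Restart" the failed nodes: the read phase now sees every row,
# so there is nothing left to insert and startup succeeds.
after_restart = {}
for name, status in first_attempt.items():
    if status == "failed":
        plan = missing_vlans(conn)
        insert_vlans(conn, plan)
        after_restart[name] = len(plan)  # number of rows still missing

print(first_attempt)
print(after_restart)
```

In this simulation the duplicate-key failure only occurs when the read phases overlap, which is consistent with a plain restart of the failed nodes being enough.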

Comment 9 Bernard Cafarelli 2016-10-13 16:13:32 UTC
This definitely looks like upstream https://bugs.launchpad.net/neutron/+bug/1421626, though that bug mentions it should work with a Galera setup (triggering deadlocks instead of the DBDuplicateEntry errors we have here).

Comment 10 Jeremy 2016-10-13 18:22:07 UTC
Thanks for the workaround. I was able to run the following on ctl1 and ctl2:

     pcs resource unmanage neutron-server; systemctl restart neutron-server

Then, to get everything working correctly again:

     pcs resource manage neutron-server; pcs resource cleanup neutron-server

Comment 11 Bernard Cafarelli 2016-10-14 09:07:07 UTC
Nice, so we have a workaround in the meantime!

The servers starting up correctly confirms this is a race condition, most probably the linked upstream issue.

Comment 14 Brian Haley 2018-05-14 15:10:50 UTC
Customer case closed, closing.