Description of problem:
neutron-server fails to start, pacemaker shows started on only 1 controller. When I check neutron/server.log on the failed node I see errors related to mysql.

On controller-1:
2016-10-10 18:01:13.602 6966 ERROR neutron DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry 'datacentre-1' for key 'PRIMARY'") [SQL: u'INSERT INTO ml2_vlan_allocations (physical_network, vlan_id, allocated) VALUES (%s, %s, %s)'] [parameters: (('datacentre', 1, 0), ('datacentre', 2, 0), ('datacentre', 3, 0), ('datacentre', 4, 0), ('datacentre', 5, 0), ('datacentre', 6, 0), ('datacentre', 7, 0), ('datacentre', 8, 0) ... displaying 10 of 999 total bound parameter sets ... ('datacentre', 998, 0), ('datacentre', 999, 0))]
2016-10-10 18:01:13.602 6966 ERROR neutron

Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-5.el7ost.noarch

How reproducible:
unknown

Steps to Reproduce:
1. upgrade from osp8 to osp9
2. notice neutron server fails to start on all 3 controllers.

Actual results:
neutron fails to start

Expected results:
neutron and other neutron services start

Additional info:
From pcs status:
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
(In reply to Jeremy from comment #0)
[...]
> 1.upgrade from osp8 to osp9
> 2.notice neutron server fails to start on all 3 controllers.
[...]
>
> Clone Set: neutron-server-clone [neutron-server]
> Started: [ overcloud-controller-0 ]
> Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Looking at the pcs output and the neutron server log, it seems that only neutron-server on controller-1 and controller-2 failed to start, while neutron-server on controller-0 is up and running fine. Just making sure we have the same understanding of the situation.
correct
Can you please show the output of:

mysql
use neutron;
select * from ml2_vlan_allocations;
MariaDB [neutron]> select * from ml2_vlan_allocations;
+------------------+---------+-----------+
| physical_network | vlan_id | allocated |
+------------------+---------+-----------+
| physnet1         |    1000 |         0 |
| physnet1         |    1001 |         0 |
| physnet1         |    1002 |         0 |
| physnet1         |    1003 |         0 |
.........................................
| physnet1         |    2995 |         0 |
| physnet1         |    2996 |         0 |
| physnet1         |    2997 |         0 |
| physnet1         |    2998 |         0 |
| physnet1         |    2999 |         0 |
+------------------+---------+-----------+
Created attachment 1209709: ovs_neutorn output from overcloud controller
OK, so the output matches what the other nodes try to add (physical_network="datacentre", vlan_id from 1 to 1000).

Looking at the server.log timestamps on the 3 nodes, all 3 were started at the same time. This looks like a race condition in the VlanTypeDriver initialization, in the _sync_vlan_allocations() function, with all nodes trying to insert the same dataset at the same time.

As a temporary workaround, restarting the neutron-server service on the failed nodes should work: the code only adds the missing allocatable VLANs to the table, so it should do nothing now that controller-0 has started and done the job.

I will be looking for the root cause in the meantime.
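To illustrate the suspected race and why a restart is harmless, here is a minimal sketch. It is not neutron's actual code: it uses an in-memory sqlite3 table as a stand-in for the MariaDB ml2_vlan_allocations table, and the helper names (make_db, missing_rows, insert_rows) and the retry loop are hypothetical. The only details taken from this report are the column names, the composite primary key implied by the 1062 error, and the "add only the missing VLANs" behaviour.

```python
import sqlite3

# Stand-in schema (assumption): composite primary key, as implied by
# "Duplicate entry 'datacentre-1' for key 'PRIMARY'".
SCHEMA = """
CREATE TABLE ml2_vlan_allocations (
    physical_network TEXT NOT NULL,
    vlan_id INTEGER NOT NULL,
    allocated INTEGER NOT NULL,
    PRIMARY KEY (physical_network, vlan_id)
)
"""


def make_db():
    """Create an in-memory database with an empty allocations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def missing_rows(conn, physical_network, vlan_ids):
    """Return the rows that are allocatable but not yet in the table."""
    cur = conn.execute(
        "SELECT vlan_id FROM ml2_vlan_allocations WHERE physical_network = ?",
        (physical_network,),
    )
    existing = {row[0] for row in cur}
    return [(physical_network, v, 0) for v in vlan_ids if v not in existing]


def insert_rows(conn, rows):
    """Bulk insert in one transaction; a duplicate key rolls back the batch
    and raises sqlite3.IntegrityError (the analogue of MySQL error 1062)."""
    with conn:
        conn.executemany(
            "INSERT INTO ml2_vlan_allocations "
            "(physical_network, vlan_id, allocated) VALUES (?, ?, ?)",
            rows,
        )


def sync_vlan_allocations(conn, physical_network, vlan_ids, attempts=3):
    """Add only the missing rows. If another writer inserted them between
    our read and our write, re-read and try again instead of failing."""
    vlan_ids = list(vlan_ids)
    for _ in range(attempts):
        rows = missing_rows(conn, physical_network, vlan_ids)
        if not rows:
            return 0  # nothing to do, someone else already did the job
        try:
            insert_rows(conn, rows)
            return len(rows)
        except sqlite3.IntegrityError:
            continue  # lost the race; the next pass sees the new rows
    raise RuntimeError("could not sync VLAN allocations")
```

In this sketch, two callers that both compute their missing rows from an empty table and then both insert will reproduce the duplicate-key failure on the second insert. A later call to sync_vlan_allocations() finds nothing missing and returns 0, which corresponds to the restart workaround.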
This definitely looks like upstream https://bugs.launchpad.net/neutron/+bug/1421626, though that bug mentions it should work with a Galera setup (triggering deadlocks instead of the DBDuplicateEntry errors we have here).
Thanks for the workaround. I was able to do the following on ctl1, ctl2:

pcs resource unmanage neutron-server
systemctl restart neutron-server

Then, to get everything working correctly again:

pcs resource manage neutron-server
pcs resource cleanup neutron-server
Nice, so we have a workaround in the meantime! The servers starting up correctly confirms this is a race condition, most probably the linked upstream issue.
Customer case closed, closing.