Bug 1383700 - After upgrade from OSP8 to OSP9 neutron server fails to start on 2/3 nodes
Summary: After upgrade from OSP8 to OSP9 neutron server fails to start on 2/3 nodes
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 9.0 (Mitaka)
Assignee: Bernard Cafarelli
QA Contact: Toni Freger
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-11 13:29 UTC by Jeremy
Modified: 2019-12-16 07:04 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-14 15:10:50 UTC
Target Upstream Version:
Embargoed:


Attachments
ovs_neutorn output from overcloud controller (16.41 KB, text/plain)
2016-10-12 18:53 UTC, Jeremy


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1421626 0 None None None 2016-10-14 09:07:06 UTC

Description Jeremy 2016-10-11 13:29:37 UTC
Description of problem: neutron-server fails to start; pacemaker shows it started on only 1 controller. When I check neutron/server.log on the failed node, I see errors related to MySQL on controller-1:

2016-10-10 18:01:13.602 6966 ERROR neutron DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry 'datacentre-1' for key 'PRIMARY'") [SQL: u'INSERT INTO ml2_vlan_allocations (physical_network, vlan_id, allocated) VALUES (%s, %s, %s)'] [parameters: (('datacentre', 1, 0), ('datacentre', 2, 0), ('datacentre', 3, 0), ('datacentre', 4, 0), ('datacentre', 5, 0), ('datacentre', 6, 0), ('datacentre', 7, 0), ('datacentre', 8, 0)  ... displaying 10 of 999 total bound parameter sets ...  ('datacentre', 998, 0), ('datacentre', 999, 0))]
2016-10-10 18:01:13.602 6966 ERROR neutron




Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-5.el7ost.noarch 


How reproducible:
unknown

Steps to Reproduce:
1.upgrade from osp8 to osp9
2.notice neutron server fails to start on all 3 controllers.
3.

Actual results:
neutron fails to start

Expected results:
neutron and other neutron services start

Additional info:
From pcs status:

 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Comment 2 Assaf Muller 2016-10-11 14:11:55 UTC
(In reply to Jeremy from comment #0)

[...]

> 1.upgrade from osp8 to osp9
> 2.notice neutron server fails to start on all 3 controllers.

[...]

> 
>  Clone Set: neutron-server-clone [neutron-server]
>      Started: [ overcloud-controller-0 ]
>      Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Looking at pcs output and neutron server log it seems like only neutron-server on controller 1 and 2 failed to start but the neutron-server on controller0 is up and dandy. Just making sure we have the same understanding of the situation.

Comment 3 Jeremy 2016-10-11 14:15:31 UTC
correct

Comment 4 Assaf Muller 2016-10-11 14:23:33 UTC
Can you please show output of:

mysql
use neutron;
select * from ml2_vlan_allocations;

Comment 5 Jeremy 2016-10-11 22:16:50 UTC
MariaDB [neutron]> select * from ml2_vlan_allocations;
+------------------+---------+-----------+
| physical_network | vlan_id | allocated |
+------------------+---------+-----------+
| physnet1         |    1000 |         0 |
| physnet1         |    1001 |         0 |
| physnet1         |    1002 |         0 |
| physnet1         |    1003 |         0 |
.........................................
| physnet1         |    2995 |         0 |
| physnet1         |    2996 |         0 |
| physnet1         |    2997 |         0 |
| physnet1         |    2998 |         0 |
| physnet1         |    2999 |         0 |
+------------------+---------+-----------+

Comment 6 Jeremy 2016-10-12 18:53:44 UTC
Created attachment 1209709
ovs_neutorn output from overcloud controller

Comment 8 Bernard Cafarelli 2016-10-13 14:23:05 UTC
OK, so the output matches what the other nodes try to add (physical_network="datacentre", vlan_id from 1 to 1000).
Looking at the server.log timestamps on the 3 nodes, all 3 were started at the same time. This looks like a race condition in the VlanTypeDriver initialization, in the _sync_vlan_allocations() function, with all nodes trying to insert the same dataset at the same time.

As a temporary workaround, restarting the neutron-server service on the failed nodes should work (the code only adds the missing allocatable VLANs to the table, so it should do nothing, as controller 0 has already started and done the job).

I will be looking for the root cause in the meantime
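
To illustrate the suspected race and why a restart helps, here is a minimal sketch. It is not the neutron code: it uses an in-memory SQLite table in place of the MariaDB one, and make_db(), missing_vlans(), insert_vlans() and simulate_race() are hypothetical helpers written for this example. Two "servers" both read the empty table, both decide that every VLAN is missing, and the second insert hits the primary key. On a retry the missing set is empty, so nothing is inserted.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the ml2_vlan_allocations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ml2_vlan_allocations ("
        " physical_network TEXT NOT NULL,"
        " vlan_id INTEGER NOT NULL,"
        " allocated INTEGER NOT NULL,"
        " PRIMARY KEY (physical_network, vlan_id))"
    )
    return conn


def missing_vlans(conn, physnet, vlan_ids):
    """Return the configured VLAN ids that have no row yet."""
    rows = conn.execute(
        "SELECT vlan_id FROM ml2_vlan_allocations WHERE physical_network = ?",
        (physnet,),
    )
    present = {vlan_id for (vlan_id,) in rows}
    return [v for v in vlan_ids if v not in present]


def insert_vlans(conn, physnet, vlan_ids):
    """Insert one unallocated row per VLAN id; return how many were inserted."""
    with conn:  # commits on success, rolls back if the insert fails
        conn.executemany(
            "INSERT INTO ml2_vlan_allocations VALUES (?, ?, 0)",
            [(physnet, v) for v in vlan_ids],
        )
    return len(vlan_ids)


def simulate_race():
    conn = make_db()
    vlans = range(1, 1000)  # 999 ids, as in the logged INSERT

    # Both servers read the (still empty) table before either one writes.
    todo_a = missing_vlans(conn, "datacentre", vlans)
    todo_b = missing_vlans(conn, "datacentre", vlans)

    inserted_a = insert_vlans(conn, "datacentre", todo_a)
    try:
        insert_vlans(conn, "datacentre", todo_b)
        b_failed = False
    except sqlite3.IntegrityError:  # duplicate primary key
        b_failed = True

    # "Restart" of the failed server: the missing set is now empty.
    retry = missing_vlans(conn, "datacentre", vlans)
    inserted_retry = insert_vlans(conn, "datacentre", retry)
    return inserted_a, b_failed, inserted_retry
```

Calling simulate_race() returns the number of rows the first server inserted, whether the second server failed, and the number of rows inserted on the retry.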

Comment 9 Bernard Cafarelli 2016-10-13 16:13:32 UTC
This definitely looks like upstream https://bugs.launchpad.net/neutron/+bug/1421626 though the bug mentions it should work with a Galera setup (triggering deadlocks instead of the DBDuplicateEntry items we have here)
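
One way a startup sync could tolerate the duplicate insert seen here is to let the database skip rows that already exist. The sketch below only shows the idea and is not the upstream fix for the linked bug: it uses SQLite's INSERT OR IGNORE on an in-memory table (MySQL/MariaDB has a similar INSERT IGNORE), and sync_vlans_tolerant() and demo_tolerant_sync() are hypothetical names.

```python
import sqlite3


def sync_vlans_tolerant(conn, physnet, vlan_ids):
    """Insert allocation rows, silently skipping ones that already exist."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO ml2_vlan_allocations VALUES (?, ?, 0)",
            [(physnet, v) for v in vlan_ids],
        )


def demo_tolerant_sync():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ml2_vlan_allocations ("
        " physical_network TEXT NOT NULL,"
        " vlan_id INTEGER NOT NULL,"
        " allocated INTEGER NOT NULL,"
        " PRIMARY KEY (physical_network, vlan_id))"
    )
    vlans = range(1, 1000)
    # Two servers insert the same dataset; the second call raises nothing.
    sync_vlans_tolerant(conn, "datacentre", vlans)
    sync_vlans_tolerant(conn, "datacentre", vlans)
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM ml2_vlan_allocations"
    ).fetchone()
    return count
```

The trade-off is that ignoring conflicts also hides genuinely unexpected duplicates, so this is only appropriate where an existing row is known to be equivalent to the one being inserted.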

Comment 10 Jeremy 2016-10-13 18:22:07 UTC
Thanks for the workaround. On ctl1 and ctl2 I ran pcs resource unmanage neutron-server, then systemctl restart neutron-server. After that, pcs resource manage neutron-server and pcs resource cleanup neutron-server got everything working correctly again.

Comment 11 Bernard Cafarelli 2016-10-14 09:07:07 UTC
Nice, so we have a workaround in the meantime!

The servers starting up correctly on restart confirms this is a race condition, most probably the linked upstream issue.

Comment 14 Brian Haley 2018-05-14 15:10:50 UTC
Customer case closed, closing.

