Bug 1383700
Summary: | After upgrade from OSP8 to OSP9 neutron server fails to start on 2/3 nodes | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Jeremy <jmelvin>
Component: | openstack-neutron | Assignee: | Bernard Cafarelli <bcafarel>
Status: | CLOSED WONTFIX | QA Contact: | Toni Freger <tfreger>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 9.0 (Mitaka) | CC: | amuller, bcafarel, bhaley, chrisw, jmelvin, nyechiel, oblaut, srevivo
Target Milestone: | async | Keywords: | Triaged, ZStream
Target Release: | 9.0 (Mitaka) | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-05-14 15:10:50 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Jeremy
2016-10-11 13:29:37 UTC
(In reply to Jeremy from comment #0)
> [...]
> 1.upgrade from osp8 to osp9
> 2.notice neutron server fails to start on all 3 controllers.
> [...]
>
> Clone Set: neutron-server-clone [neutron-server]
> Started: [ overcloud-controller-0 ]
> Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Looking at pcs output and neutron server log it seems like only neutron-server on controller 1 and 2 failed to start but the neutron-server on controller0 is up and dandy. Just making sure we have the same understanding of the situation.

correct

Can you please show output of:
mysql
use neutron;
select * from ml2_vlan_allocations;

MariaDB [neutron]> select * from ml2_vlan_allocations;
+------------------+---------+-----------+
| physical_network | vlan_id | allocated |
+------------------+---------+-----------+
| physnet1         |    1000 |         0 |
| physnet1         |    1001 |         0 |
| physnet1         |    1002 |         0 |
| physnet1         |    1003 |         0 |
.........................................
| physnet1         |    2995 |         0 |
| physnet1         |    2996 |         0 |
| physnet1         |    2997 |         0 |
| physnet1         |    2998 |         0 |
| physnet1         |    2999 |         0 |
+------------------+---------+-----------+

Created attachment 1209709 [details]
ovs_neutorn output from overcloud controller
OK, so the output matches what the other nodes try to add (physical_network="datacentre", vlan_id from 1 to 1000). Looking at the server.log timestamps on the 3 nodes, all 3 were started at the same time. This looks like a race condition in the VlanTypeDriver initialization, in the _sync_vlan_allocations() function, with all nodes trying to insert the same dataset at the same time.

As a temporary workaround, restarting the neutron-server service on the failed nodes should work (the code only adds the missing allocatable vlans to the table, so it should do nothing now that controller 0 has started and done the job). I will be looking for the root cause in the meantime.

This definitely looks like upstream https://bugs.launchpad.net/neutron/+bug/1421626, though the bug mentions it should work with a Galera setup (triggering deadlocks instead of the DBDuplicateEntry items we have here).

Thanks for the workaround. I was able to run pcs resource unmanage neutron-server; systemctl restart neutron-server on ctl1 and ctl2, then pcs resource manage neutron-server; pcs resource cleanup neutron-server to get everything working correctly again.

Nice, so we have a workaround in the meantime! The servers starting up correctly confirms this is a race condition, most probably the linked upstream issue.

Customer case closed, closing.
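
For anyone landing here later, the race discussed in this thread can be reproduced in miniature. The sketch below is not neutron's code: it uses an in-memory sqlite3 database as a stand-in for the MariaDB/Galera backend, and the helper names (missing_vlans, insert_vlans) and the small VLAN range are made up for illustration. It only shows the pattern: two nodes compute the same list of missing rows from the same snapshot, the second insert hits a duplicate key, and a later re-run finds nothing left to add.

```python
import sqlite3

# Hypothetical stand-in for the ml2_vlan_allocations table. sqlite3 replaces
# the real MariaDB/Galera backend purely for illustration.
SCHEMA = """
CREATE TABLE ml2_vlan_allocations (
    physical_network TEXT NOT NULL,
    vlan_id INTEGER NOT NULL,
    allocated INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (physical_network, vlan_id)
)
"""


def missing_vlans(conn, physnet, vlan_range):
    """Return the VLAN ids in vlan_range that have no row yet (illustrative helper)."""
    rows = conn.execute(
        "SELECT vlan_id FROM ml2_vlan_allocations WHERE physical_network = ?",
        (physnet,),
    )
    present = {vlan_id for (vlan_id,) in rows}
    return [v for v in vlan_range if v not in present]


def insert_vlans(conn, physnet, vlan_ids):
    """Insert all rows in one transaction; a duplicate key rolls back and raises."""
    with conn:
        conn.executemany(
            "INSERT INTO ml2_vlan_allocations (physical_network, vlan_id) "
            "VALUES (?, ?)",
            [(physnet, v) for v in vlan_ids],
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
vlan_range = range(1000, 1005)  # small assumed range, just for the demo

# Both "nodes" take their snapshot before either one has written anything.
snapshot_a = missing_vlans(conn, "physnet1", vlan_range)
snapshot_b = missing_vlans(conn, "physnet1", vlan_range)

# Node A wins the race and inserts every row.
insert_vlans(conn, "physnet1", snapshot_a)

# Node B inserts from its stale snapshot and hits the primary key.
duplicate_seen = False
try:
    insert_vlans(conn, "physnet1", snapshot_b)
except sqlite3.IntegrityError:
    duplicate_seen = True

# A restart of node B recomputes the missing set, which is now empty.
remaining = missing_vlans(conn, "physnet1", vlan_range)
insert_vlans(conn, "physnet1", remaining)
print(duplicate_seen, remaining)
```

This is consistent with why restarting the failed servers worked: by the time they start again, the first node has already populated the table, so the sync step has nothing to insert.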