Bug 1383700

Summary:

After upgrade from OSP8 to OSP9 neutron server fails to start on 2/3 nodes

Product:

Red Hat OpenStack

Reporter:

Jeremy <jmelvin>

Component:

openstack-neutron

Assignee:

Bernard Cafarelli <bcafarel>

Status:

CLOSED WONTFIX

QA Contact:

Toni Freger <tfreger>

Severity:

high

Priority:

high

Version:

9.0 (Mitaka)

CC:

amuller, bcafarel, bhaley, chrisw, jmelvin, nyechiel, oblaut, srevivo

Target Milestone:

async

Keywords:

Triaged, ZStream

Target Release:

9.0 (Mitaka)

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2018-05-14 15:10:50 UTC

Type:

Bug

Attachments:

Description	Flags
ovs_neutorn output from overcloud controller	none

Description Jeremy 2016-10-11 13:29:37 UTC

Description of problem: neutron-server fails to start; pacemaker shows it started on only 1 controller. When I check neutron/server.log on the failed node, I see errors related to mysql on controller-1:

2016-10-10 18:01:13.602 6966 ERROR neutron DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, u"Duplicate entry 'datacentre-1' for key 'PRIMARY'") [SQL: u'INSERT INTO ml2_vlan_allocations (physical_network, vlan_id, allocated) VALUES (%s, %s, %s)'] [parameters: (('datacentre', 1, 0), ('datacentre', 2, 0), ('datacentre', 3, 0), ('datacentre', 4, 0), ('datacentre', 5, 0), ('datacentre', 6, 0), ('datacentre', 7, 0), ('datacentre', 8, 0)  ... displaying 10 of 999 total bound parameter sets ...  ('datacentre', 998, 0), ('datacentre', 999, 0))]
2016-10-10 18:01:13.602 6966 ERROR neutron




Version-Release number of selected component (if applicable):
openstack-neutron-8.1.2-5.el7ost.noarch 


How reproducible:
unknown

Steps to Reproduce:
1. Upgrade from OSP8 to OSP9.
2. Notice neutron server fails to start on all 3 controllers.

Actual results:
neutron fails to start

Expected results:
neutron and other neutron services start

Additional info:
From pcs status:

 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Comment 2 Assaf Muller 2016-10-11 14:11:55 UTC

(In reply to Jeremy from comment #0)

[...]

> 1.upgrade from osp8 to osp9
> 2.notice neutron server fails to start on all 3 controllers.

[...]

> 
>  Clone Set: neutron-server-clone [neutron-server]
>      Started: [ overcloud-controller-0 ]
>      Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Looking at the pcs output and the neutron server log, it seems that neutron-server failed to start only on controllers 1 and 2, while neutron-server on controller 0 is up and running. Just making sure we have the same understanding of the situation.

Comment 3 Jeremy 2016-10-11 14:15:31 UTC

correct

Comment 4 Assaf Muller 2016-10-11 14:23:33 UTC

Can you please show output of:

mysql
use neutron;
select * from ml2_vlan_allocations;

Comment 5 Jeremy 2016-10-11 22:16:50 UTC

MariaDB [neutron]> select * from ml2_vlan_allocations;
+------------------+---------+-----------+
| physical_network | vlan_id | allocated |
+------------------+---------+-----------+
| physnet1         |    1000 |         0 |
| physnet1         |    1001 |         0 |
| physnet1         |    1002 |         0 |
| physnet1         |    1003 |         0 |
.........................................
| physnet1         |    2995 |         0 |
| physnet1         |    2996 |         0 |
| physnet1         |    2997 |         0 |
| physnet1         |    2998 |         0 |
| physnet1         |    2999 |         0 |
+------------------+---------+-----------+

Comment 6 Jeremy 2016-10-12 18:53:44 UTC

Created attachment 1209709
ovs_neutorn output from overcloud controller

Comment 8 Bernard Cafarelli 2016-10-13 14:23:05 UTC

OK, so the output matches what the other nodes try to add (physical_network="datacentre", vlan_id from 1 to 1000).
Looking at the server.log timestamps on the 3 nodes, all 3 were started at the same time. This looks like a race condition in the VlanTypeDriver initialization, in the _sync_vlan_allocations() function, with all nodes trying to insert the same dataset at the same time.

As a temporary workaround, restarting the neutron-server service on the failed nodes should work (the code only adds the missing allocatable vlans to the table, so it should do nothing, as controller 0 has started and done the job).

I will be looking for the root cause in the meantime
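
A minimal sketch of the race described above, not neutron's actual code. It assumes an in-memory SQLite table as a stand-in for the shared MariaDB database, and the helper names (make_db, missing_rows, bulk_insert, sync_with_retry) are hypothetical. Two "nodes" compute the missing rows from the same empty table; the second bulk insert then violates the primary key, much like the DBDuplicateEntry in the log. Re-reading the table before inserting, which is effectively what a restart does, leaves nothing to add.

```python
import sqlite3


def make_db():
    # Assumption: in-memory SQLite stands in for the shared MariaDB database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ml2_vlan_allocations ("
        " physical_network TEXT NOT NULL,"
        " vlan_id INTEGER NOT NULL,"
        " allocated INTEGER NOT NULL,"
        " PRIMARY KEY (physical_network, vlan_id))"
    )
    return conn


def missing_rows(conn, physnet, vlan_ids):
    # Rows that are allocatable according to config but absent from the table.
    existing = {
        row[0]
        for row in conn.execute(
            "SELECT vlan_id FROM ml2_vlan_allocations"
            " WHERE physical_network = ?",
            (physnet,),
        )
    }
    return [(physnet, v, 0) for v in vlan_ids if v not in existing]


def bulk_insert(conn, rows):
    # One transaction for the whole batch; rolled back if any row collides.
    with conn:
        conn.executemany(
            "INSERT INTO ml2_vlan_allocations"
            " (physical_network, vlan_id, allocated) VALUES (?, ?, ?)",
            rows,
        )


def sync_with_retry(conn, physnet, vlan_ids, attempts=3):
    # Hypothetical mitigation: on a duplicate, re-read the table and retry.
    for _ in range(attempts):
        rows = missing_rows(conn, physnet, vlan_ids)
        if not rows:
            return 0
        try:
            bulk_insert(conn, rows)
            return len(rows)
        except sqlite3.IntegrityError:
            continue
    raise RuntimeError("could not sync VLAN allocations")


conn = make_db()
vlans = range(1, 1000)  # 999 ids, matching the 999 parameter sets in the log

# Both nodes take their snapshot before either one has inserted anything.
rows_a = missing_rows(conn, "datacentre", vlans)
rows_b = missing_rows(conn, "datacentre", vlans)

bulk_insert(conn, rows_a)  # the first node wins
try:
    bulk_insert(conn, rows_b)  # the second node inserts the same rows
    race_error = None
except sqlite3.IntegrityError as exc:
    race_error = exc

print("second insert failed:", race_error is not None)
# A later start re-reads the table and finds nothing left to add.
print("rows added on restart:", sync_with_retry(conn, "datacentre", vlans))
```

The retry loop is only one possible mitigation, shown here for illustration; the source does not say how the issue was addressed upstream.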

Comment 9 Bernard Cafarelli 2016-10-13 16:13:32 UTC

This definitely looks like upstream https://bugs.launchpad.net/neutron/+bug/1421626, though that bug mentions it should work with a Galera setup (triggering deadlocks instead of the DBDuplicateEntry errors we have here).

Comment 10 Jeremy 2016-10-13 18:22:07 UTC

Thanks for the workaround. I was able to run pcs resource unmanage neutron-server, then systemctl restart neutron-server on ctl1 and ctl2, then pcs resource manage neutron-server and pcs resource cleanup neutron-server to get everything working correctly again.
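
The order of the steps above can be sketched as a small dry-run helper. The function name workaround_plan is hypothetical, and the "any controller" label for the pcs steps is an assumption (the comment does not say where they were run). It only builds the command list and executes nothing.

```python
def workaround_plan(failed_nodes, resource="neutron-server"):
    """Return (where, command) pairs in the order used in the comment above."""
    # Assumption: the pcs commands are cluster-wide and are run once.
    plan = [("any controller", f"pcs resource unmanage {resource}")]
    # The restart is run on each node where the service failed to start.
    plan += [(node, f"systemctl restart {resource}") for node in failed_nodes]
    plan += [
        ("any controller", f"pcs resource manage {resource}"),
        ("any controller", f"pcs resource cleanup {resource}"),
    ]
    return plan


plan = workaround_plan(["overcloud-controller-1", "overcloud-controller-2"])
for where, command in plan:
    print(f"[{where}] {command}")
```

Unmanaging the resource first keeps pacemaker from reacting while the service is restarted by hand; the cleanup at the end clears the recorded failures.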

Comment 11 Bernard Cafarelli 2016-10-14 09:07:07 UTC

Nice, so we have a workaround in the meantime!

The servers starting up correctly confirms this is a race condition, most probably the linked upstream issue.

Comment 14 Brian Haley 2018-05-14 15:10:50 UTC

Customer case closed, closing.