Bug 1233061

Summary: rhel-osp-director: HA deployment - neutron services fail to start on one controller
Product: Red Hat OpenStack
Reporter: Ofer Blaut <oblaut>
Component: openstack-tripleo-heat-templates
Assignee: Giulio Fidente <gfidente>
Status: CLOSED ERRATA
QA Contact: Ofer Blaut <oblaut>
Severity: high
Docs Contact:
Priority: urgent
Version: Director
CC: calfonso, fdinitto, gfidente, kbasil, majopela, mandreou, mburns, mcornea, oblaut, ohochman, rbiba, rhel-osp-director-maint, yeylon
Target Milestone: beta
Keywords: Regression
Target Release: Director
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-15.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, a race condition occurred during initialization of the neutron database when neutron-server was first run. The race was hit when two controllers happened to start neutron-server simultaneously. The startup of neutron-server and its agents then failed on the controller node that lost the race, so Neutron services failed to start on the affected controller nodes. Errors in the logs look like the following:

DBDuplicateEntry: (IntegrityError) (1062, "Duplicate entry 'datacentre-1' for key 'PRIMARY'") 'INSERT INTO ml2_vlan_allocations (physical_network, vlan_id, allocated) VALUES (%s, %s, %s)' (('datacentre', 1, 0),

With this release, the Neutron server is momentarily started and then stopped on one node, the pacemaker master, so that this initial database setup can complete before the rest of the puppet or pacemaker configuration proceeds. As a result, Neutron services are brought up on all controller nodes without error.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-05 13:54:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
controller logs none

Description Ofer Blaut 2015-06-18 07:21:56 UTC
Created attachment 1040290
controller logs

Description of problem:

Neutron services failed to start on one controller (overcloud-controller-0 in the output below).

[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Thu Jun 18 02:56:09 2015
Last change: Wed Jun 17 12:57:19 2015
Stack: corosync
Current DC: overcloud-controller-0 (1) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
109 Resources configured


Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-192.0.2.12	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-0 
 ip-192.0.2.14	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-1 
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.2.13	(ocf::heartbeat:IPaddr2):	Started overcloud-controller-2 
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-scale-clone [neutron-scale] (unique)
     neutron-scale:0	(ocf::neutron:NeutronScale):	Stopped 
     neutron-scale:1	(ocf::neutron:NeutronScale):	Started overcloud-controller-1 
     neutron-scale:2	(ocf::neutron:NeutronScale):	Started overcloud-controller-2 
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     neutron-openvswitch-agent	(systemd:neutron-openvswitch-agent):	FAILED overcloud-controller-0 (unmanaged) 
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started overcloud-controller-0 
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed actions:
    neutron-openvswitch-agent_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=253, status=Timed Out, exit-reason='none', last-rc-change='Wed Jun 17 12:56:14 2015', queued=12ms, exec=2ms
    neutron-openvswitch-agent_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=253, status=Timed Out, exit-reason='none', last-rc-change='Wed Jun 17 12:56:14 2015', queued=12ms, exec=2ms
    openstack-nova-novncproxy_start_0 on overcloud-controller-0 'not running' (7): call=259, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2001ms, exec=2ms
    neutron-server_start_0 on overcloud-controller-0 'not running' (7): call=257, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2001ms, exec=3ms
    openstack-ceilometer-api_monitor_60000 on overcloud-controller-2 'not running' (7): call=137, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:57:25 2015', queued=0ms, exec=0ms
    openstack-ceilometer-notification_start_0 on overcloud-controller-2 'not running' (7): call=216, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:36 2015', queued=2001ms, exec=4ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=232, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:56:43 2015', queued=0ms, exec=0ms
    openstack-nova-novncproxy_start_0 on overcloud-controller-2 'not running' (7): call=255, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2001ms, exec=2ms
    openstack-ceilometer-central_start_0 on overcloud-controller-2 'not running' (7): call=254, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:56 2015', queued=2000ms, exec=5ms
    openstack-ceilometer-notification_start_0 on overcloud-controller-1 'not running' (7): call=217, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:36 2015', queued=2001ms, exec=3ms
    openstack-nova-novncproxy_start_0 on overcloud-controller-1 'not running' (7): call=256, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2000ms, exec=3ms
    openstack-ceilometer-central_start_0 on overcloud-controller-1 'not running' (7): call=255, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:56 2015', queued=2001ms, exec=4ms


PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Fabio Massimo Di Nitto 2015-06-22 15:28:55 UTC
Please attach sosreports and crm_report from all nodes.

Please be aware that, due to a bug in sosreport, /etc/neutron, /var/lib/neutron, and /var/log/neutron have to be collected manually.
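
For anyone gathering those directories by hand, the collection could be scripted roughly as below. This is only a sketch: the helper name collect_neutron_dirs and the output path are hypothetical, and only the three directory paths come from the comment above.

```python
import os
import tarfile

# Directories named in the comment above as needing manual collection.
NEUTRON_DIRS = ("/etc/neutron", "/var/lib/neutron", "/var/log/neutron")


def collect_neutron_dirs(output_path, dirs=NEUTRON_DIRS):
    """Archive the given directories into a gzip-compressed tarball.

    Directories that do not exist on this node are skipped.
    Returns the list of directories that were actually archived.
    """
    archived = []
    with tarfile.open(output_path, "w:gz") as tar:
        for path in dirs:
            if os.path.isdir(path):
                # Store members relative to / so the archive unpacks
                # into a local tree instead of absolute paths.
                tar.add(path, arcname=path.lstrip("/"))
                archived.append(path)
    return archived
```

It would need to be run as root on each controller, for example collect_neutron_dirs("/tmp/neutron-manual.tar.gz"), with the resulting file attached alongside the sosreport.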

Comment 3 Omri Hochman 2015-06-22 21:54:57 UTC
Couldn't reproduce with the latest puddle, dated 2015-06-22:

instack-undercloud-2.1.2-1.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-heat-engine-2015.1.0-3.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-3.el7ost.noarch
openstack-heat-api-2015.1.0-3.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-heat-common-2015.1.0-3.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-13.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-3.el7ost.noarch

Comment 4 Ofer Blaut 2015-06-23 06:09:19 UTC
*** Bug 1234631 has been marked as a duplicate of this bug. ***

Comment 5 Giulio Fidente 2015-06-23 10:56:59 UTC
As per conversation with Fabio on IRC, we might need to add a start/sleep/stop sequence for neutron-server service to be executed only from one node, before the normal pacemaker initialization.

Comment 6 Omri Hochman 2015-06-23 13:09:15 UTC
Reproduced with the puddle: RHEL-OSP director puddle 7.0 RC - 2015-06-22.1


pcs status:
------------
Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     neutron-openvswitch-agent  (systemd:neutron-openvswitch-agent):    FAILED overcloud-controller-0 (unmanaged) 
     Started: [ overcloud-controller-2 ]
     Stopped: [ overcloud-controller-1 ]



Neutron server.log:
---------------------
2015-06-22 18:40:40.502 48360 TRACE oslo_messaging.rpc.dispatcher AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=overcloud-controller-0.localdomain could not be found
2015-06-22 18:40:40.502 48360 TRACE oslo_messaging.rpc.dispatcher 
2015-06-22 18:40:40.525 48360 ERROR oslo_messaging._drivers.common [req-80b2ebe8-b6e3-46c5-a63f-6e863ef7bf35 ] Returning exception Agent with agent_type=L3 agent and host=overcloud-controller-0.localdomain could not be found to caller
2015-06-22 18:40:40.525 48360 ERROR oslo_messaging._drivers.common [req-80b2ebe8-b6e3-46c5-a63f-6e863ef7bf35 ] ['Traceback (most recent call last):\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n    executor_callback))\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n    executor_callback)\n', '  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch\n    result = func(ctxt, **new_args)\n', '  File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 81, in sync_routers\n    context, host, router_ids))\n', '  File "/usr/lib/python2.7/site-packages/neutron/db/l3_agentschedulers_db.py", line 290, in list_active_sync_routers_on_active_l3_agent\n    context, constants.AGENT_TYPE_L3, host)\n', '  File "/usr/lib/python2.7/site-packages/neutron/db/agents_db.py", line 197, in _get_agent_by_type_and_host\n    host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=overcloud-controller-0.localdomain could not be found\n']

Comment 7 Omri Hochman 2015-06-23 13:10:42 UTC
For some reason, this issue reproduced for me only on a bare-metal environment.

Comment 8 Marios Andreou 2015-06-23 16:44:08 UTC
So v3 of the review at [1] should fix this. In the absence of a solid repro we can't be sure (I have yet to hit this in a virt env, and don't have bare metal).

With v3 applied I was able to at least get the overcloud deployed and neutron-* running on the controllers (pcs status ok).

The fixup basically starts neutron-server, sleeps for 5 seconds, then stops it, and then lets the rest of the normal neutron-* startup happen.
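
The start/sleep/stop sequence described above can be sketched as follows. This is not the actual patch under review at [1]; the function name, the is_bootstrap_node flag, and the injectable run/sleep parameters are assumptions made for illustration. Only the service name and the 5-second pause come from this thread.

```python
import subprocess
import time


def prime_neutron_db(is_bootstrap_node, delay=5,
                     run=subprocess.check_call, sleep=time.sleep):
    """Briefly start neutron-server on exactly one node.

    The first run of neutron-server performs the initial database setup,
    so doing it on a single node keeps two controllers from racing on
    the same inserts. Returns True if the sequence ran on this node.
    """
    if not is_bootstrap_node:
        # Every other controller skips this step and waits for the
        # normal pacemaker-managed startup.
        return False
    run(["systemctl", "start", "neutron-server"])
    try:
        sleep(delay)  # give the server time to populate its tables
    finally:
        # Always stop again so pacemaker can take over the service.
        run(["systemctl", "stop", "neutron-server"])
    return True
```

A fixed sleep is a pragmatic choice rather than a guarantee; it assumes the initial database setup finishes within the delay.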


[1] https://review.openstack.org/#/c/194610/ Add special handling of neutron-server service startup to fix race

Comment 9 Marios Andreou 2015-06-23 16:46:53 UTC
Upstream bug report: https://bugs.launchpad.net/tripleo/+bug/1467904

Comment 11 Ofer Blaut 2015-06-25 11:06:10 UTC
Bug is ON_QA. I didn't have a working setup to attach logs; let's see if the issue reproduces again.

Comment 12 Ofer Blaut 2015-06-25 11:37:57 UTC
Not reproduced on latest build

[stack@puma42 ~]$ rpm -qa | grep triple
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-15.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-1.el7ost.noarch

Comment 14 errata-xmlrpc 2015-08-05 13:54:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549