Bug 1233061
Summary: | rhel-osp-director: HA deployment - neutron services fail to start on one controller | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Ofer Blaut <oblaut> | ||||
Component: | openstack-tripleo-heat-templates | Assignee: | Giulio Fidente <gfidente> | ||||
Status: | CLOSED ERRATA | QA Contact: | Ofer Blaut <oblaut> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | urgent | ||||||
Version: | Director | CC: | calfonso, fdinitto, gfidente, kbasil, majopela, mandreou, mburns, mcornea, oblaut, ohochman, rbiba, rhel-osp-director-maint, yeylon | ||||
Target Milestone: | beta | Keywords: | Regression | ||||
Target Release: | Director | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | openstack-tripleo-heat-templates-0.8.6-15.el7ost | Doc Type: | Bug Fix | ||||
Doc Text: |
Previously, a race condition occurred during the initialization of the neutron database when neutron-server was first run. This error was seen when two controllers happened to start neutron-server simultaneously. Subsequently, the startup of neutron-server and agents failed on the controller node that lost the race, and as a consequence, Neutron services failed to start on the affected controller nodes. Errors in the logs look like the following:
DBDuplicateEntry: (IntegrityError) (1062, "Duplicate entry 'datacentre-1' for key 'PRIMARY'") 'INSERT INTO ml2_vlan_allocations (physical_network, vlan_id, allocated) VALUES (%s, %s, %s)' (('datacentre', 1, 0),
With this release, the Neutron server is momentarily started and then stopped on one node, the pacemaker master, allowing this initial database setup to complete before the rest of the puppet or pacemaker configuration proceeds. As a result, Neutron services are brought up on all controller nodes without error.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2015-08-05 13:54:32 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Please attach sosreports and crm_report from all nodes. Please be aware that due to a bug in sosreport, /etc/neutron, /var/lib/neutron and /var/log/neutron have to be collected manually.

Couldn't reproduce with the latest puddle from 2015-06-22:

instack-undercloud-2.1.2-1.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-heat-engine-2015.1.0-3.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-3.el7ost.noarch
openstack-heat-api-2015.1.0-3.el7ost.noarch
heat-cfntools-1.2.8-2.el7.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-heat-common-2015.1.0-3.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-13.el7ost.noarch
python-heatclient-0.6.0-1.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-3.el7ost.noarch

*** Bug 1234631 has been marked as a duplicate of this bug. ***

As per conversation with Fabio on IRC, we might need to add a start/sleep/stop sequence for the neutron-server service, executed from only one node, before the normal pacemaker initialization.

Reproduced with the puddle: RHEL-OSP director puddle 7.0 RC - 2015-06-22.1

pcs status:
------------
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     neutron-openvswitch-agent (systemd:neutron-openvswitch-agent): FAILED overcloud-controller-0 (unmanaged)
     Started: [ overcloud-controller-2 ]
     Stopped: [ overcloud-controller-1 ]

Neutron server.log:
---------------------
2015-06-22 18:40:40.502 48360 TRACE oslo_messaging.rpc.dispatcher AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=overcloud-controller-0.localdomain could not be found
2015-06-22 18:40:40.502 48360 TRACE oslo_messaging.rpc.dispatcher
2015-06-22 18:40:40.525 48360 ERROR oslo_messaging._drivers.common [req-80b2ebe8-b6e3-46c5-a63f-6e863ef7bf35 ] Returning exception Agent with agent_type=L3 agent and host=overcloud-controller-0.localdomain could not be found to caller
2015-06-22 18:40:40.525 48360 ERROR oslo_messaging._drivers.common [req-80b2ebe8-b6e3-46c5-a63f-6e863ef7bf35 ] ['Traceback (most recent call last):\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n executor_callback))\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n executor_callback)\n', ' File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch\n result = func(ctxt, **new_args)\n', ' File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 81, in sync_routers\n context, host, router_ids))\n', ' File "/usr/lib/python2.7/site-packages/neutron/db/l3_agentschedulers_db.py", line 290, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', ' File "/usr/lib/python2.7/site-packages/neutron/db/agents_db.py", line 197, in _get_agent_by_type_and_host\n host=host)\n', 'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=overcloud-controller-0.localdomain could not be found\n']

For some reason this issue reproduced for me only on the bare-metal environment.

v3 of the review @ [1] should fix this. In the absence of a solid repro we can't be sure (I have yet to hit this in a virt env, and don't have bare metal). With v3 applied I was able to at least get the overcloud deployed and neutron-* running on the controllers (pcs status ok). The fixup basically starts neutron-server, sleeps 5, then stops it, then lets the rest of the normal neutron-* startup happen.
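That start/sleep/stop fixup can be sketched roughly as follows. This is a hypothetical illustration only: the real change lives in the tripleo-heat-templates review, and `bootstrap_neutron_db`, its `runner` hook, and the `delay` parameter are made up for this sketch so it can be exercised without touching real services.

```python
import subprocess
import time


def bootstrap_neutron_db(runner=None, delay=5):
    """Briefly run neutron-server on one node (the pacemaker master) so it
    performs the one-time database population, then stop it again before
    pacemaker takes over management of the neutron services.

    runner: optional callable used instead of actually invoking systemctl,
    so the sequence can be demonstrated without real services.
    """
    run = runner if runner is not None else (
        lambda cmd: subprocess.run(cmd, check=True))
    run(["systemctl", "start", "neutron-server"])  # triggers the initial DB setup
    time.sleep(delay)                              # let neutron-server write the initial rows
    run(["systemctl", "stop", "neutron-server"])   # hand control back to pacemaker
```

Running this on only one node avoids the race entirely: by the time the other controllers start neutron-server under pacemaker, the database is already populated.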
[1] https://review.openstack.org/#/c/194610/

Add special handling of neutron-server service startup to fix the race. Upstream bug report @ https://bugs.launchpad.net/tripleo/+bug/1467904

Bug is ON_QA; I didn't have a working setup to attach logs, let's see if the issue reproduces again.

Not reproduced on the latest build:

[stack@puma42 ~]$ rpm -qa | grep triple
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-15.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-1.el7ost.noarch

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2015:1549 |
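The DBDuplicateEntry race this fix works around (see the Doc Text above) can be illustrated with a minimal sketch. It substitutes an in-memory sqlite3 table for neutron's MySQL ml2_vlan_allocations table (same column names, simplified types), and `simulate_vlan_allocation_race` is a made-up name for the demonstration.

```python
import sqlite3


def simulate_vlan_allocation_race():
    """Two neutron-server instances both try to seed ml2_vlan_allocations;
    the loser of the race hits the primary-key constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ml2_vlan_allocations (
               physical_network TEXT,
               vlan_id INTEGER,
               allocated INTEGER,
               PRIMARY KEY (physical_network, vlan_id))"""
    )
    insert = "INSERT INTO ml2_vlan_allocations VALUES ('datacentre', 1, 0)"
    conn.execute(insert)           # first server wins the race and seeds the table
    try:
        conn.execute(insert)       # second server repeats the same insert ...
    except sqlite3.IntegrityError as exc:
        return str(exc)            # ... and fails, like the DBDuplicateEntry in the log
    return None
```

In the real deployment the failed insert aborted neutron-server startup on the losing controller, which is why serializing the first start on a single node resolves the issue.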
Created attachment 1040290 [details]
controller logs

Description of problem:

neutron services failed to start on one controller

[root@overcloud-controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Last updated: Thu Jun 18 02:56:09 2015
Last change: Wed Jun 17 12:57:19 2015
Stack: corosync
Current DC: overcloud-controller-0 (1) - partition with quorum
Version: 1.1.12-a14efad
3 Nodes configured
109 Resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-192.0.2.12 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0
 ip-192.0.2.14 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-192.0.2.13 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-0 ]
     Slaves: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-scale-clone [neutron-scale] (unique)
     neutron-scale:0 (ocf::neutron:NeutronScale): Stopped
     neutron-scale:1 (ocf::neutron:NeutronScale): Started overcloud-controller-1
     neutron-scale:2 (ocf::neutron:NeutronScale): Started overcloud-controller-2
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     neutron-openvswitch-agent (systemd:neutron-openvswitch-agent): FAILED overcloud-controller-0 (unmanaged)
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-1 overcloud-controller-2 ]
     Stopped: [ overcloud-controller-0 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-evaluator-clone [openstack-ceilometer-alarm-evaluator]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Failed actions:
    neutron-openvswitch-agent_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=253, status=Timed Out, exit-reason='none', last-rc-change='Wed Jun 17 12:56:14 2015', queued=12ms, exec=2ms
    neutron-openvswitch-agent_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=253, status=Timed Out, exit-reason='none', last-rc-change='Wed Jun 17 12:56:14 2015', queued=12ms, exec=2ms
    openstack-nova-novncproxy_start_0 on overcloud-controller-0 'not running' (7): call=259, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2001ms, exec=2ms
    neutron-server_start_0 on overcloud-controller-0 'not running' (7): call=257, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2001ms, exec=3ms
    openstack-ceilometer-api_monitor_60000 on overcloud-controller-2 'not running' (7): call=137, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:57:25 2015', queued=0ms, exec=0ms
    openstack-ceilometer-notification_start_0 on overcloud-controller-2 'not running' (7): call=216, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:36 2015', queued=2001ms, exec=4ms
    neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=232, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:56:43 2015', queued=0ms, exec=0ms
    openstack-nova-novncproxy_start_0 on overcloud-controller-2 'not running' (7): call=255, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2001ms, exec=2ms
    openstack-ceilometer-central_start_0 on overcloud-controller-2 'not running' (7): call=254, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:56 2015', queued=2000ms, exec=5ms
    openstack-ceilometer-notification_start_0 on overcloud-controller-1 'not running' (7): call=217, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:36 2015', queued=2001ms, exec=3ms
    openstack-nova-novncproxy_start_0 on overcloud-controller-1 'not running' (7): call=256, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:58 2015', queued=2000ms, exec=3ms
    openstack-ceilometer-central_start_0 on overcloud-controller-1 'not running' (7): call=255, status=complete, exit-reason='none', last-rc-change='Wed Jun 17 12:55:56 2015', queued=2001ms, exec=4ms

PCSD Status:
  overcloud-controller-0: Online
  overcloud-controller-1: Online
  overcloud-controller-2: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info: