Bug 1236578 - HA deploys on virt create stack but error: "ERROR: openstack Not enough l3 agents available to ensure HA."
Summary: HA deploys on virt create stack but error: "ERROR: openstack Not enough l3 agents available to ensure HA."
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: ga
: Director
Assignee: Marios Andreou
QA Contact: Leonid Natapov
URL:
Whiteboard:
Depends On: 1238117
Blocks:
 
Reported: 2015-06-29 13:10 UTC by Ronelle Landy
Modified: 2015-08-05 13:57 UTC
CC: 12 users

Fixed In Version: openstack-tripleo-heat-templates-0.8.6-40.el7ost python-rdomanager-oscplugin-0.0.8-37.el7ost
Doc Type: Bug Fix
Doc Text:
The NeutronScale resource renamed neutron agents on Controller nodes. This caused an inconsistency in the "neutron agent-list" output and, as a result, Neutron reported errors of not having enough L3 agents for L3 HA. This fix removes the NeutronScale resource from the Overcloud Heat templates and plans. The NeutronScale-renamed agents no longer appear in "neutron agent-list" and Neutron reports no errors.
Clone Of:
Environment:
Last Closed: 2015-08-05 13:57:35 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 238087 0 None None None Never
Gerrithub.io 238097 0 None None None Never
Gerrithub.io 238320 0 None None None Never
Gerrithub.io 238893 0 None None None Never
OpenStack gerrit 198016 0 None None None Never
OpenStack gerrit 199102 0 None None None Never
Red Hat Product Errata RHEA-2015:1549 0 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform director Release 2015-08-05 17:49:10 UTC

Description Ronelle Landy 2015-06-29 13:10:21 UTC
Description of problem:

Deploying the overcloud with HA fails with "ERROR: openstack Not enough l3 agents available to ensure HA.". This error causes CI to fail but afaict, the stack is successfully deployed:

*** Adding notes from email thread:

* (from morazi):
>> I wanted to add a few bits details that may not be totally obvious from
>> the links above:
>>
>> 1.  This is an HA job
>> 2.  We have seen this HA job pass on the same puddle
>> 3.  This is the virt version of the job.
>> 4.  the heat stack did not actually seem to fail at any point in time.
>> 5.  When checking the deployment after the fact l3 agent was running on
>> all 3 controllers despite the error message.

* (from marios):
I hit this on Friday on a fresh virt setup and +1 to all of these things,
plus the fact that pcs status on a controller was reporting fine
(deployment was --control-scale 3 --ceph-storage-scale 1). Also, as Mike says, neutron
seemed to be working ok: [stack@instack ~]$ neutron agent-list
| grep neutron-l3 gave me the list of three agents, one on each of the
controllers.

* (from rlandy)
> This error "ERROR: openstack Not enough l3 agents available to ensure HA. Minimum required 2, available 0." is fairly consistent in HA jobs. It looks similar to: https://bugs.launchpad.net/neutron/+bug/1420117.
> I tried manually applying this change, restarting neutron services and redeploying the overcloud. Similar story though - the overcloud did deploy successfully although this error was visible at the *end* of the output:
>
> /home/stack/.ssh/known_hosts updated.
> Original contents retained as /home/stack/.ssh/known_hosts.old
> PKI initialization in init-keystone is deprecated and will be removed.
> Warning: Permanently added '192.0.2.14' (ECDSA) to the list of known hosts.
> The following cert files already exist, use --rebuild to remove the existing files before regenerating:
> /etc/keystone/ssl/certs/ca.pem already exists
> /etc/keystone/ssl/private/signing_key.pem already exists
> /etc/keystone/ssl/certs/signing_cert.pem already exists
> Connection to 192.0.2.14 closed.
> ERROR: openstack Not enough l3 agents available to ensure HA. Minimum required 2, available 0.
>
> Is this error something that can be ignored? If not, it's blocking HA.

Not sure yet. Trouble is, it is another intermittent one (so we don't
have any solid repro at this point, right?) that seems to be virt
specific. On my second run on the same box (just overcloud deploy, so
using the same images, templates, roles etc.) with the exact same params (3
control, 1 ceph, 1 compute) it deployed and ran postconfig without fuss.

It isn't an issue with Heat (our params, config, initialisation etc.,
afaics) since the stack reaches CREATE_COMPLETE. In our code we are hitting it
when we initialise neutron, towards the end of def _deploy_postconfig
[1], so really it is happening inside the neutron client/server. I wonder if
this is another case of needing to sleep for a second because something is
happening too fast in a virt env (like we do for vm
discovery/introspection, for example).


Version-Release number of selected component (if applicable):

$ rpm -qa | grep openstack
openstack-ceilometer-alarm-2015.1.0-6.el7ost.noarch
openstack-keystone-2015.1.0-1.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch
openstack-utils-2014.2-1.el7ost.noarch
openstack-heat-api-cloudwatch-2015.1.0-4.el7ost.noarch
openstack-tuskar-0.4.18-3.el7ost.noarch
openstack-swift-2.3.0-1.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-13.el7ost.noarch
openstack-swift-object-2.3.0-1.el7ost.noarch
redhat-access-plugin-openstack-7.0.0-0.el7ost.noarch
openstack-heat-api-2015.1.0-4.el7ost.noarch
openstack-ceilometer-central-2015.1.0-6.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-0.git49b57eb.el7ost.noarch
openstack-neutron-openvswitch-2015.1.0-10.el7ost.noarch
openstack-nova-api-2015.1.0-13.el7ost.noarch
openstack-nova-common-2015.1.0-13.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-4.el7ost.noarch
openstack-ceilometer-notification-2015.1.0-6.el7ost.noarch
openstack-ceilometer-collector-2015.1.0-6.el7ost.noarch
openstack-ironic-common-2015.1.0-7.el7ost.noarch
openstack-nova-compute-2015.1.0-13.el7ost.noarch
openstack-nova-conductor-2015.1.0-13.el7ost.noarch
openstack-swift-account-2.3.0-1.el7ost.noarch
openstack-swift-proxy-2.3.0-1.el7ost.noarch
openstack-dashboard-theme-2015.1.0-10.el7ost.noarch
openstack-tuskar-ui-extras-0.0.4-1.el7ost.noarch
openstack-nova-console-2015.1.0-13.el7ost.noarch
openstack-neutron-common-2015.1.0-10.el7ost.noarch
openstack-neutron-2015.1.0-10.el7ost.noarch
openstack-heat-engine-2015.1.0-4.el7ost.noarch
openstack-ceilometer-common-2015.1.0-6.el7ost.noarch
openstack-heat-api-cfn-2015.1.0-4.el7ost.noarch
openstack-ironic-conductor-2015.1.0-7.el7ost.noarch
openstack-selinux-0.6.35-1.el7ost.noarch
openstack-ceilometer-api-2015.1.0-6.el7ost.noarch
openstack-ironic-api-2015.1.0-7.el7ost.noarch
openstack-swift-plugin-swift3-1.7-3.el7ost.noarch
openstack-puppet-modules-2015.1.7-5.el7ost.noarch
openstack-dashboard-2015.1.0-10.el7ost.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch
openstack-tempest-kilo-20150507.2.el7ost.noarch
openstack-neutron-ml2-2015.1.0-10.el7ost.noarch
openstack-nova-scheduler-2015.1.0-13.el7ost.noarch
openstack-nova-cert-2015.1.0-13.el7ost.noarch
python-django-openstack-auth-1.2.0-3.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-19.el7ost.noarch
openstack-glance-2015.1.0-6.el7ost.noarch
python-openstackclient-1.0.3-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-4.el7ost.noarch
openstack-swift-container-2.3.0-1.el7ost.noarch
openstack-tuskar-ui-0.3.0-6.el7ost.noarch
openstack-heat-common-2015.1.0-4.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch

How reproducible:
Fairly common  - although not completely consistent. Shows up almost always in CI - which might strengthen the "sleep" argument.

Steps to Reproduce:
1. Install openstack bits from latest poodle/puddle on virt env with enough VMs for HA
2. Deploy overcloud with HA (three controllers)
3. See error from deploy command

Actual results:

"ERROR: openstack Not enough l3 agents available to ensure HA."
but the stack is in CREATE_COMPLETE

Expected results:

no errors on deploy

Additional info:

Comment 3 Marios Andreou 2015-06-29 13:25:57 UTC
rlandy, thanks for filing the bug. I poked a bit more today but basically:

the fail is happening exactly at [1] and is passed back through os-cloud-config, which actually invokes the neutron client [2]. I hit this again this morning and I verified that neutron reports all l3 agents ok (source overcloudrc; neutron agent-list | grep -ni neutron-l3), and pcs status etc. on the controllers is fine, like [3]

I am wondering if for now we can band-aid it: catch this error, and try to call initialize_neutron again.
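Very roughly, such a band aid could look something like the sketch below (the initialize_neutron signature, import path and exception type here are assumptions for illustration, not the actual os-cloud-config/oscplugin code):

import time

from neutronclient.common import exceptions as neutron_exceptions
from os_cloud_config import neutron as neutron_setup  # assumed import path


def initialize_neutron_with_retry(network_desc, neutron_client, keystone_client,
                                  attempts=3, delay=20):
    """Retry initialize_neutron when neutron reports too few l3 agents."""
    for attempt in range(attempts):
        try:
            # assumed signature, shown only to illustrate the retry idea
            return neutron_setup.initialize_neutron(
                network_desc,
                neutron_client=neutron_client,
                keystone_client=keystone_client)
        except neutron_exceptions.NeutronClientException as exc:
            if 'Not enough l3 agents' not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(delay)  # give the agents a chance to register, then retry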



[1] https://github.com/openstack/neutron/blob/master/neutron/db/l3_hamode_db.py#L278
[2] https://github.com/openstack/os-cloud-config/blob/master/os_cloud_config/neutron.py#L23
[3] 
[root@overcloud-controller-1 ~]# pcs status | grep -A 3 -B 1 -i neutron
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-scale-clone [neutron-scale] (unique)
     neutron-scale:0	(ocf::neutron:NeutronScale):	Started overcloud-controller-0 
     neutron-scale:1	(ocf::neutron:NeutronScale):	Started overcloud-controller-1 
     neutron-scale:2	(ocf::neutron:NeutronScale):	Started overcloud-controller-2 
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
[root@overcloud-controller-1 ~]# service neutron-l3-agent status
Redirecting to /bin/systemctl status  neutron-l3-agent.service
neutron-l3-agent.service - Cluster Controlled neutron-l3-agent
   Loaded: loaded (/usr/lib/systemd/system/neutron-l3-agent.service; disabled)
  Drop-In: /run/systemd/system/neutron-l3-agent.service.d
           └─50-pacemaker.conf
   Active: active (running) since Mon 2015-06-29 04:37:44 EDT; 15min ago
 Main PID: 27293 (neutron-l3-agen)
   CGroup: /system.slice/neutron-l3-agent.service
           └─27293 /usr/bin/python2 /usr/bin/neutron-l3-agent --config-file /usr/share/neutro...

Jun 29 04:37:44 overcloud-controller-1.localdomain systemd[1]: Starting Cluster Controlled n....
Jun 29 04:37:44 overcloud-controller-1.localdomain systemd[1]: Started Cluster Controlled ne....
Hint: Some lines were ellipsized, use -l to show in full.

Comment 4 Marios Andreou 2015-06-29 15:17:54 UTC
An alternative band aid: instead of trying to implement a retry, we can just sleep a bit after stack create and before post config, like @ https://github.com/rdo-management/python-rdomanager-oscplugin/blob/master/rdomanager_oscplugin/v1/overcloud_deploy.py#L677

At least we can see if it fixes it (fairly common it seems, on virt at least; I hit it Friday and today). Unless someone beats me to it or there are better ideas, I will revisit tomorrow morning with a patch to do that.
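For illustration, the simplest form of that stopgap is just an unconditional pause between stack create and post config (a sketch, not the actual overcloud_deploy.py change; the 60 second value is an assumption):

import time

# ... heat stack create has returned CREATE_COMPLETE at this point ...

# Crude stopgap: give the freshly started neutron agents time to register
# before post config calls initialize_neutron.
time.sleep(60)  # assumed value, long enough for agents to report in on virt

# ... continue with _deploy_postconfig ...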

Comment 5 Ronelle Landy 2015-06-29 15:24:14 UTC
Marios, would that be just a naked sleep @ https://github.com/rdo-management/python-rdomanager-oscplugin/blob/master/rdomanager_oscplugin/v1/overcloud_deploy.py#L677

or will we be polling for something deterministic? If not, how will you sleep the code?

Comment 6 Marios Andreou 2015-06-30 07:44:17 UTC
rlandy that is a good question. 

Initialize neutron is just setting up/initialising the tenant overcloud, so we can expect at this point that the l3 agents (and all others) are up and working ok.
Ideally we want to query the l3 agents, like the shell command 'neutron agent-list | grep neutron-l3' - I think that would get messy quickly though.

Review at https://review.gerrithub.io/#/c/238087 for now, let's see if it helps.

Comment 7 Marios Andreou 2015-06-30 09:57:38 UTC
So dmathews landed the patch, so at least we get some testing. Poking some more, it seems we can be a bit smarter about this - neutron_client.list_agents() is a thing... am playing with that now, but let's see how the sleep gets on.

Comment 8 Marios Andreou 2015-06-30 11:12:39 UTC
Since the original review is merged, I opened another @ https://review.gerrithub.io/#/c/238097/ which tries to be a bit smarter about this (a conditional sleep based on the number of l3 agents). Really the big advantage of this approach is that we get some debug info when it is hit, i.e. when there really aren't enough l3 agents.
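Roughly, the conditional version polls the agent list instead of sleeping blindly; a sketch only (client construction is omitted, and the attempt count, delay and minimum of 2 agents are assumptions taken from the error message and logs in this bug):

import time


def wait_for_l3_agents(neutron_client, minimum=2, attempts=6, delay=20):
    """Poll neutron until at least `minimum` l3 agents are registered."""
    agents, agent_ids = [], []
    for attempt in range(attempts):
        agents = neutron_client.list_agents()['agents']
        agent_ids = [agent['id'] for agent in agents
                     if agent['agent_type'] == 'L3 agent']
        if len(agent_ids) >= minimum:
            return agent_ids
        print("Warning not enough l3 agents (attempt %d of %d). "
              "Retrying in %d seconds. Agent ids: %s"
              % (attempt + 1, attempts, delay, agent_ids))
        time.sleep(delay)
    print("Warning can't get enough l3 agents. Giving up, agents are %s" % agents)
    return agent_ids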

Comment 9 Marios Andreou 2015-07-01 11:50:31 UTC
I just hit this again on a local setup - just updating to say it really isn't that rare. I applied the fix I have at https://review.gerrithub.io/#/c/238097/2/rdomanager_oscplugin/v1/overcloud_deploy.py (the second fix) and am going again.

Meanwhile, has the sleep helped as a stopgap (the first fix)? I am kind of hoping that not hearing anything at all is a good thing.

Comment 10 Jay Dobies 2015-07-01 15:15:28 UTC
This is possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1238117

Comment 11 Marios Andreou 2015-07-02 09:03:18 UTC
I hit this again today. It seems the 60s is not quite enough. I increased it to 120s (a 20 second sleep per run) and it passed OK. Also, I am not quite sure where the log.debug output is going, but I couldn't find it. I will change this to a print statement so it is at least useful. For example, on my second run I got:

Connection to 192.0.2.15 closed.
Warning can't get enough l3 agents. Retrying in 20 seconds. Agent ids: [] 
Warning can't get enough l3 agents. Retrying in 20 seconds. Agent ids: [] 
Warning can't get enough l3 agents. Retrying in 20 seconds. Agent ids: [] 
Overcloud Endpoint: http://192.0.2.15:5000/v2.0/
Overcloud Deployed
[stack@instack ~]$ 

So it was about a minute until the l3 agents were ok. Giulio's theory that this and https://bugzilla.redhat.com/show_bug.cgi?id=1238117 are related is gaining ground, I think. Am looking into this more today.

Comment 12 Dougal Matthews 2015-07-02 09:23:17 UTC
You need to pass --debug to the command to have it log to stdout. I don't think it is saved to a file by default.

Comment 13 Marios Andreou 2015-07-02 10:06:47 UTC
thanks Dougal. I have an updated review @ https://review.gerrithub.io/238320 

On my second run on the same box I got it once:

Connection to 192.0.2.17 closed.
Warning can't get enough l3 agents. Retrying in 20 seconds. Agent ids: [] 
Overcloud Endpoint: http://192.0.2.17:5000/v2.0/
Overcloud Deployed

Am going to continue poking at the root cause, but for now I think this should help (my setup is 5 vms: 3 control, 1 compute and 1 ceph); at least it does for me.
(Root cause: the current theory is NeutronScale changing the host in /etc/neutron/neutron.conf, as per bug https://bugzilla.redhat.com/show_bug.cgi?id=1238117 )

Comment 14 Marios Andreou 2015-07-02 15:20:29 UTC
[Updating here based on work I've done today on the related bug 1238117]

The review at https://review.openstack.org/#/c/198016/ removes the NeutronScale resource from the overcloud pacemaker puppet manifest. NeutronScale seems to be the root cause of the not-enough-agents issue here... details are on the other bug, but basically: NeutronScale starts up, all the neutron agents get a host entry like 'neutron-n-0', 'neutron-n-1' etc., and there is a momentary inconsistency in the agent states - hence this bug, and the behaviour described at https://bugzilla.redhat.com/show_bug.cgi?id=1238117

If we do remove NeutronScale then the fixes here (the sleep) should be harmless anyway. If we are to retain NeutronScale, then we can improve on the fix here: rather than just checking for 'enough' l3 agents, we can check for the naming scheme used by NeutronScale, like 'neutron-n-0', in the list of agents before continuing with initialize_neutron.
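If NeutronScale were retained, that improved check would look for the renamed hosts rather than just counting agents. A sketch: only the 'neutron-n-<N>' naming scheme and the minimum of 2 agents come from this bug; the function and its other details are illustrative assumptions.

import re

NEUTRON_SCALE_HOST = re.compile(r'^neutron-n-\d+$')


def neutron_scale_l3_agents_ready(neutron_client, minimum=2):
    """Return True once enough l3 agents report the NeutronScale host names."""
    agents = neutron_client.list_agents()['agents']
    renamed = [agent['id'] for agent in agents
               if agent['agent_type'] == 'L3 agent'
               and NEUTRON_SCALE_HOST.match(agent['host'])]
    return len(renamed) >= minimum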

Comment 16 Marios Andreou 2015-07-06 10:12:22 UTC
Update again. The review, still at https://review.gerrithub.io/#/c/238320/8 (it's the same one, but like a third pass at it), has explicitly added grepping for 'neutron-n-?' so we can get enough of the NeutronScale l3 agents before initialize.

There are a few related bugs now; this one, https://bugzilla.redhat.com/show_bug.cgi?id=1238750, is important I think since it shows the agent inconsistency also happens on BM (bare metal).

Comment 17 Jay Dobies 2015-07-06 19:41:56 UTC
Moving back to ON_DEV and making it depend on 1238117. This *should* be fixed by the fix for that, so this should be moved back to ON_QA along with 1238117.

Comment 19 Udi Shkalim 2015-07-13 14:13:18 UTC
Failed QE. Problem happened again in a non-HA setup:
- Virtual environment
- 1 Controller, 2 Computes, 1 Ceph
- l3_ha=False on neutron.conf 
L3, DHCP and openvswitch agents are down.


From Installation:
Warning not enough l3 agents (attempt 1 of 6). Retrying in 20 seconds. Agent ids: [] 
Warning not enough l3 agents (attempt 2 of 6). Retrying in 20 seconds. Agent ids: [] 
Warning not enough l3 agents (attempt 3 of 6). Retrying in 20 seconds. Agent ids: [] 
Warning not enough l3 agents (attempt 4 of 6). Retrying in 20 seconds. Agent ids: [] 
Warning not enough l3 agents (attempt 5 of 6). Retrying in 20 seconds. Agent ids: [] 
Warning not enough l3 agents (attempt 6 of 6). Retrying in 20 seconds. Agent ids: [] 
Warning can't get enough l3 agents. Giving up, agents are {u'agents': [{u'binary': u'neutron-dhcp-agent', u'description': None, u'admin_state_up': True, u'heartbeat_timestamp': u'2015-07-13 12:31:11', u'alive': False, u'id': u'3781bed8-763f-4587-9117-43d0f411a455', u'topic': u'dhcp_agent', u'host': u'overcloud-controller-0.localdomain', u'agent_type': u'DHCP agent', u'started_at': u'2015-07-13 12:30:11', u'created_at': u'2015-07-13 12:30:11', u'configurations': {u'subnets': 0, u'use_namespaces': True, u'dhcp_lease_duration': 86400, u'dhcp_driver': u'neutron.agent.linux.dhcp.Dnsmasq', u'networks': 0, u'ports': 0}}, {u'binary': u'neutron-openvswitch-agent', u'description': None, u'admin_state_up': True, u'heartbeat_timestamp': u'2015-07-13 12:37:05', u'alive': True, u'id': u'3854dd3b-a95c-48a3-a425-76f9c0dd8743', u'topic': u'N/A', u'host': u'overcloud-compute-0.localdomain', u'agent_type': u'Open vSwitch agent', u'started_at': u'2015-07-13 12:31:05', u'created_at': u'2015-07-13 12:31:05', u'configurations': {u'in_distributed_mode': False, u'arp_responder_enabled': False, u'tunneling_ip': u'192.0.2.12', u'devices': 0, u'l2_population': False, u'tunnel_types': [u'gre'], u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex'}}}, {u'binary': u'neutron-openvswitch-agent', u'description': None, u'admin_state_up': True, u'heartbeat_timestamp': u'2015-07-13 12:37:05', u'alive': True, u'id': u'4ff5f64a-2611-46c6-aeb4-257424490e50', u'topic': u'N/A', u'host': u'overcloud-compute-1.localdomain', u'agent_type': u'Open vSwitch agent', u'started_at': u'2015-07-13 12:31:05', u'created_at': u'2015-07-13 12:31:05', u'configurations': {u'in_distributed_mode': False, u'arp_responder_enabled': False, u'tunneling_ip': u'192.0.2.10', u'devices': 0, u'l2_population': False, u'tunnel_types': [u'gre'], u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex'}}}, {u'binary': u'neutron-openvswitch-agent', u'description': None, u'admin_state_up': True, u'heartbeat_timestamp': u'2015-07-13 12:33:38', u'alive': False, u'id': u'9a01d083-b9be-4722-b03a-6d10b883af39', u'topic': u'N/A', u'host': u'neutron-n-0', u'agent_type': u'Open vSwitch agent', u'started_at': u'2015-07-13 12:33:38', u'created_at': u'2015-07-13 12:31:10', u'configurations': {u'in_distributed_mode': False, u'arp_responder_enabled': False, u'tunneling_ip': u'192.0.2.11', u'devices': 0, u'l2_population': False, u'tunnel_types': [u'gre'], u'enable_distributed_routing': False, u'bridge_mappings': {u'datacentre': u'br-ex'}}}, {u'binary': u'neutron-l3-agent', u'description': None, u'admin_state_up': True, u'heartbeat_timestamp': u'2015-07-13 12:31:05', u'alive': False, u'id': u'b421ede7-a73b-412d-b07b-b6368976eaa3', u'topic': u'l3_agent', u'host': u'overcloud-controller-0.localdomain', u'agent_type': u'L3 agent', u'started_at': u'2015-07-13 12:31:05', u'created_at': u'2015-07-13 12:31:05', u'configurations': {u'router_id': u'', u'agent_mode': u'legacy', u'gateway_external_network_id': u'', u'handle_internal_only_routers': True, u'use_namespaces': True, u'routers': 0, u'interfaces': 0, u'floating_ips': 0, u'interface_driver': u'neutron.agent.linux.interface.OVSInterfaceDriver', u'external_network_bridge': u'br-ex', u'ex_gw_ports': 0}}]}: 
Overcloud Endpoint: http://192.0.2.7:5000/v2.0/
Overcloud Deployed


[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 3781bed8-763f-4587-9117-43d0f411a455 | DHCP agent         | overcloud-controller-0.localdomain | xxx   | True           | neutron-dhcp-agent        |
| 3854dd3b-a95c-48a3-a425-76f9c0dd8743 | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 4ff5f64a-2611-46c6-aeb4-257424490e50 | Open vSwitch agent | overcloud-compute-1.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 9a01d083-b9be-4722-b03a-6d10b883af39 | Open vSwitch agent | neutron-n-0                        | xxx   | True           | neutron-openvswitch-agent |
| b421ede7-a73b-412d-b07b-b6368976eaa3 | L3 agent           | overcloud-controller-0.localdomain | xxx   | True           | neutron-l3-agent          |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Comment 20 Marios Andreou 2015-07-13 14:37:45 UTC
just to clarify, where are you setting:

- l3_ha=False on neutron.conf 


Also, when you say non-ha do you mean 1 controller (cos pacemaker will still be running)? I am confused by the agent list though. Not really sure how to proceed; I will try to deploy as above (1 control, 2 compute, 1 ceph) with the change applied from https://review.gerrithub.io/#/c/238320/8/rdomanager_oscplugin/v1/overcloud_deploy.py (which must have been applied here given the output above).

Comment 21 Marios Andreou 2015-07-14 06:54:32 UTC
I can confirm that I hit this when doing 

openstack  overcloud deploy --plan-uuid 7a5cdb96-6c41-40fe-af2f-d66ac9ae70c0  --control-scale 1 --ceph-storage-scale 1 --compute-scale 2

I tested previously with 1 controller, trying to find out what is going wrong

Comment 22 Marios Andreou 2015-07-14 13:34:07 UTC
So: tl;dr NeutronScale must go, as discussed in the bug this one depends on at [1]. It is the root cause of this. Once it is removed, you can successfully deploy with e.g. --control-scale 1 --ceph-storage-scale 2 --compute-scale 2.

With NeutronScale in place, even the sleep hack we have as a temp fix at [2] (and which is failing here) doesn't help. I can reliably reproduce the issue described in the original report above every time by setting:

--control-scale 1 --ceph-storage-scale 1 --compute-scale 2

What happens is that neutron-openvswitch-agent fails to start on the controller (still not clear why; something to do with the message queue, which could make sense given the host name change from neutron-scale, which afaik is used in the message topics) and becomes unmanaged by pcs. Once that isn't up, neither is anything below it in the pacemaker constraints chain [3] (so no l3, metadata or dhcp agents) on the single controller.

I removed NeutronScale, like the review @ [4]. I also edited the rdo-manager plugin like [5] (it would otherwise fail without NeutronScale - [UPDATE: I -1'd that review since [6] landed, so that sleep code is completely gone now, hence [5] is obsoleted]). I was able to do the --control-scale 1 --ceph-storage-scale 1 --compute-scale 2 deployment which otherwise fails here. I also did:

 --control-scale 1 --ceph-storage-scale 2 --compute-scale 2
 --control-scale 1 --ceph-storage-scale 1 --compute-scale 1
 --control-scale 3 --ceph-storage-scale 1 --compute-scale 1
 --control-scale 3 --compute-scale 2

all completed without fuss, neutron agent-list seems good etc.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c18
[2] https://review.gerrithub.io/#/c/238320/8 Increase the sleep time while trying to get neutron l3 agents
[3] https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/manifests/overcloud_controller_pacemaker.pp#L995
[4] https://review.openstack.org/#/c/198016/ Removes the NeutronScale resource from controller pcmk manifest
[5] https://review.gerrithub.io/239450 Remove search for l3_agent name since NeutronScale is gone
[6] https://review.gerrithub.io/#/c/239833/1

Comment 23 Marios Andreou 2015-07-15 14:46:44 UTC
I wanted to sanity check this on a fresh poodle setup today. Everything needed for safe removal of NeutronScale has now landed upstream. The NeutronDhcpAgentsPerNetwork template-side change landed @ [1] and was wired into the osc plugin @ [2]. The removal of NeutronScale from the templates has landed at [3].

I tested NeutronScale removal on 2 boxes today, one slightly older (few days old) and another from today, poodle.

On older box:
I couldn't reproduce the bug with --control-scale 1 --compute-scale 2. 

Then I applied the removal of NeutronScale, reloaded the roles and tried --control-scale 1 --compute-scale 2 and also --control-scale 2 --compute-scale 1.
The 2-controller case looks like [4] (the difference from today being that the controller hosts aren't renamed into neutron-n-0). Also sanity checked --control-scale 1 --compute-scale 1, no drama.

On today's box:
I couldn't repro the bug with the --control-scale 1 --compute-scale 2 --ceph-storage-scale 1 (which was reliable for me in yesterday's environment).

I applied removal of NeutronScale, reloaded and did:
 --control-scale 3 --compute-scale 2 --ceph-storage-scale 1. No drama and again the agents look like [5] (I also ran --control-scale 1 --compute-scale 2 --ceph-storage-scale 1).
 
In both cases above I applied the NeutronScale removal to the current downstream tripleo heat templates (what we use in poodle).


[1] https://review.openstack.org/#/c/199102/ Adds the NeutronDhcpAgentsPerNetwork parameter


[2] https://review.gerrithub.io/238893  Wires up NeutronDhcpAgentsPerNetwork parameter to deploy


[3] https://review.openstack.org/#/c/198016/ Removes the NeutronScale resource from controller pcmk manifest


[4] [stack@instack ~]$ neutron agent-list

+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 15f71d83-7a2b-4c20-8d6b-96e81bd687ac | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 242fe32e-49c8-49a3-8ff7-460240570c1e | Open vSwitch agent | overcloud-controller-1.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 9b7694eb-6518-45c9-a730-eaa3c388a790 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| c46e735f-03a1-42ce-a3ce-58660ec3954b | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| e130f1a0-7778-4104-bf65-9d807f05f59c | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| e1a82bc6-35d0-4d35-bf03-a56130528a84 | DHCP agent         | overcloud-controller-1.localdomain | :-)   | True           | neutron-dhcp-agent        |
| e212145c-029b-4bbe-a213-d659543b33f9 | Metadata agent     | overcloud-controller-1.localdomain | :-)   | True           | neutron-metadata-agent    |
| e6d9aba0-f328-4fe5-ac93-de27f801a8bc | L3 agent           | overcloud-controller-1.localdomain | :-)   | True           | neutron-l3-agent          |
| f8a90c0d-35e3-4bb7-a589-51060dc5df76 | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+


[5] [stack@instack ~]$ neutron agent-list

+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 05948d01-a6fa-42df-83a6-f7e667784334 | DHCP agent         | overcloud-controller-2.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 1a962825-9a42-4199-a24b-f5612fb10cea | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 1aa4f414-7754-4522-b735-54a23ed29b7b | Metadata agent     | overcloud-controller-2.localdomain | :-)   | True           | neutron-metadata-agent    |
| 21dac728-887b-423f-8ee8-860af6c08848 | Open vSwitch agent | overcloud-compute-1.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 222afb12-7d64-453a-8c53-dd1b2e726824 | DHCP agent         | overcloud-controller-1.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 51de4430-0992-46a0-afce-cea1714684de | Metadata agent     | overcloud-controller-1.localdomain | :-)   | True           | neutron-metadata-agent    |
| 683347ac-aef1-4b9a-8ad4-b7c5ee5df303 | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 6b07e49f-ef53-47f8-bca0-586bb9509a86 | L3 agent           | overcloud-controller-1.localdomain | :-)   | True           | neutron-l3-agent          |
| 8498f5d5-0a9d-4bf6-b19f-05e46c5dfa1f | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 9a707c2a-abf8-44ab-9aaa-b3c91b9cd6e9 | Open vSwitch agent | overcloud-controller-1.localdomain | :-)   | True           | neutron-openvswitch-agent |
| a60e67b3-1c0b-4038-8526-70683c46c922 | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
| c2be17a1-d171-4271-a49c-ad43ae7cc449 | Open vSwitch agent | overcloud-controller-2.localdomain | :-)   | True           | neutron-openvswitch-agent |
| d24536ec-3327-4169-8bf7-49584167f7c9 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| d52657cb-f8c5-4573-afa6-c27c3d2f98d5 | L3 agent           | overcloud-controller-2.localdomain | :-)   | True           | neutron-l3-agent          |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Comment 24 Marios Andreou 2015-07-16 13:18:19 UTC
Confirmed today, as per https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c26: python-rdomanager-oscplugin-0.0.8-38.el7ost.noarch includes the NeutronScale removal and should fix this bug.

Comment 26 Ofer Blaut 2015-07-19 09:33:10 UTC
Issue is not seen in python-rdomanager-oscplugin-0.0.8-41.el7ost.noarch

Comment 28 errata-xmlrpc 2015-08-05 13:57:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549

