Bug 1238117 - Possible race condition causing neutron to have bad configuration state
Summary: Possible race condition causing neutron to have bad configuration state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: unspecified
Target Milestone: ga
Sub Component: Director
Assignee: Marios Andreou
QA Contact: Itzik Brown
URL:
Whiteboard:
Depends On:
Blocks: 1236578 1237144 1238750
 
Reported: 2015-07-01 08:58 UTC by Graeme Gillies
Modified: 2023-02-22 23:02 UTC
12 users

Fixed In Version: python-rdomanager-oscplugin-0.0.8-37.el7ost openstack-tripleo-heat-templates-0.8.6-40.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, OpenStack enabled the NeutronScale puppet/pacemaker resource on controller nodes to rewrite the neutron agents' "host" entries to values like "neutron-n-0" on controller 0 or "neutron-n-1" on controller 1. This renaming happened toward the end of the deployment, when pacemaker started the corresponding neutron-scale resource. Mostly in VM environments, neutron would then complain that there were not enough L3 agents for L3 HA (the default minimum is 2), and the "neutron agent-list" command on the overcloud would show inconsistent agents; for example, duplicate entries for each agent, with the original agent on host "overcloud-controller-1.localdomain" (typically dead, shown as "xxx") alongside the "newer" agent on host "neutron-n-1" (alive, ":-)", at least eventually). In other cases, the renaming would cause one of the neutron agents, openvswitch, to fail when there was only one controller; the rest of the agents chained after it would then also fail to start, leaving no L3, metadata, or DHCP agents.
This problem has been fixed by using the native neutron L3 High Availability, and by enabling enough DHCP agents per network for native neutron HA. The latter was a needed addition, as the DHCP agent count was previously statically set to two in all cases; it is now a configurable parameter in the tripleo heat templates, with a default value of '3', and is wired up to deploy in the oscplugin. The NeutronScale resource itself has been removed from the tripleo heat templates, where the overcloud controller puppet manifest is kept. As a result, deployments made after this fix will not have the neutron-scale resource on controller nodes, which can be verified by the following commands:
1. On a controller node: # pcs status | grep -n neutron -A 1
   You should not see any "neutron-scale" clone set or resource definition.
2. On the undercloud: $ source overcloudrc ; $ neutron agent-list
   All the neutron agents should be reported on hosts with names like "overcloud-controller-0.localdomain" or "overcloud-controller-2.localdomain", not "neutron-n-0" or "neutron-n-2".
Clone Of:
Environment:
Last Closed: 2015-08-05 13:57:58 UTC
Target Upstream Version:
Embargoed:


Attachments
email conversation (about neutron scale, our setup, and if we can remove it) for context (5.73 KB, text/plain)
2015-07-10 14:35 UTC, Marios Andreou


Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 238320 0 None None None Never
Red Hat Product Errata RHEA-2015:1549 0 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform director Release 2015-08-05 17:49:10 UTC

Description Graeme Gillies 2015-07-01 08:58:41 UTC
Hi,

Doing a deployment with the latest poodle 2015-06-29-1, I noticed that my neutron configuration was broken (ports were in VLAN 4095). I then noticed this:

$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 2bb808cc-ef7c-4770-b408-c482aaaf8d99 | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 717a010d-6a08-4640-b9b1-7827634371b0 | L3 agent           | overcloud-controller-0.localdomain | xxx   | True           | neutron-l3-agent          |
| 9700a8c3-6825-4aa1-a5b8-d1a17f9efc90 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| a31fc769-5271-4119-a224-8fc5e0155c85 | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| db9f6feb-f08c-41c7-a535-2108c4a34b00 | L3 agent           | neutron-n-0                        | :-)   | True           | neutron-l3-agent          |
| f1bdad7e-359b-4a3e-9a49-90b2e9f2fc3e | Open vSwitch agent | neutron-n-0                        | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

It looks like there is some sort of race condition in the configuration of neutron. Initially the services get started with the /etc/neutron/neutron.conf "host" setting either blank or set to controller-0.localdomain; then something comes along, changes that setting to neutron-n-0, and bounces the service. This leaves phantom agents and a broken state, as the original and the renamed L3 agents don't match.
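
For reference, a quick way to see which host value each service actually picked up (a sketch using openstack-config, the same tool the resource agent uses; the file list assumes the standard agent config files):

for f in neutron.conf l3_agent.ini dhcp_agent.ini metadata_agent.ini; do
    echo -n "/etc/neutron/$f: "
    openstack-config --get /etc/neutron/$f DEFAULT host 2>/dev/null || echo "(unset)"
done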

ovs shows the broken ports:

# ovs-vsctl show
7e9b65b1-e2d4-4884-9f1d-9de7f98c8eab
    Bridge br-int
        fail_mode: secure
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}
        Port "qr-e7ccec8c-fc"
            tag: 1
            Interface "qr-e7ccec8c-fc"
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port br-int
            Interface br-int
                type: internal
        Port "tap8d869a83-70"
            tag: 4095
            Interface "tap8d869a83-70"
                type: internal
    Bridge br-ex
        Port "qg-296ff137-58"
            tag: 4095
            Interface "qg-296ff137-58"
                type: internal
        Port "eth1"
            Interface "eth1"
        Port phy-br-ex
            Interface phy-br-ex
                type: patch
                options: {peer=int-br-ex}
        Port br-ex
            Interface br-ex
                type: internal
    Bridge br-tun
        fail_mode: secure
        Port "gre-c0000210"
            Interface "gre-c0000210"
                type: gre
                options: {df_default="true", in_key=flow, local_ip="192.0.2.15", out_key=flow, remote_ip="192.0.2.16"}
        Port patch-int
            Interface patch-int
                type: patch
                options: {peer=patch-tun}
        Port br-tun
            Interface br-tun
                type: internal
    ovs_version: "2.3.1-git3282e51"
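
Tag 4095 is the OVS agent's "dead VLAN", where it parks ports it cannot wire up. A one-liner sketch to list just the affected ports:

# ovs-vsctl --columns=name,tag find Port tag=4095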

Note I am doing this on the world's slowest hardware (Intel NUC Celeron units), so it could be some sort of race condition we don't normally see due to most deployments being on faster machines.

It looks like pacemaker is doing the change (looking at the NeutronScale resource agent)

neutron_scale_start() {
        # hostbasename is "neutron-n"; CRM_meta_clone is the clone instance number,
        # so hostid becomes e.g. "neutron-n-0" on clone instance 0
        hostid=${OCF_RESKEY_hostbasename}-${OCF_RESKEY_CRM_meta_clone}
        for i in $neutronconfigfiles; do
                if [ -f "/etc/neutron/$i" ]; then
                        # rewrite the host entry in every neutron config file
                        openstack-config --set /etc/neutron/$i DEFAULT host $hostid
                fi
        done
        # (remainder of the resource agent elided)


but it is puzzling why neutron was started before this, or how things went pear-shaped in the first place

Regards,

Graeme

Comment 3 Giulio Fidente 2015-07-01 09:39:37 UTC
NeutronScale is editing the neutron host setting, but this might happen after some of the agents were already started (due to the start constraints being provisioned after the resource is created)

Comment 5 Marios Andreou 2015-07-01 13:42:31 UTC
Giulio, if this is a case where neutron was started before pacemaker, how about the ordering ("->") we added to get over the neutron-server startup race?

Comment 6 Marios Andreou 2015-07-02 13:16:54 UTC
I don't think this is a bug. I mean, I don't think there is a problem here with the naming of the l3 agents in neutron agent-list or with (eventually) the state of overcloud neutron in general. If I am wrong there is a path forward but I need that feedback asap, details below. Thanks!

We are using NeutronScale. As Graeme points out, its sole purpose is to change the host entry in the various neutron.conf/ini files on a given host [5]. I *think* as long as all agents have the same value there, you can pretty much set it to whatever you want [1][2] (it certainly doesn't have to be a fqdn, consider what we have before NeutronScale, like "overcloud-compute-0.localdomain" on a compute host for example).

If you waited a minute, the agents would all switch over and all would be well again, like the example output below at [7] - note only the compute openvswitch agent retains the original host setting, since it isn't running NeutronScale. NeutronScale is important not just for setting the agents' host entry but because of the enforced constraints in the relevant pacemaker manifest [3] - it goes keystone -> neutron-server -> neutron-scale -> neutron-ovs-cleanup -> neutron-netns-cleanup -> openvswitch-agent -> dhcp -> l3 -> metadata - so neutron-scale comes right after the server in the chain that starts up all the agents (colocations etc).
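
For illustration, that chain corresponds to ordering constraints roughly like the following (a sketch in pcs syntax; the real constraints are provisioned by the puppet manifest at [3], the keystone -> neutron-server link is omitted, and the clone names are inferred from the pcs status output later in this bug):

pcs constraint order start neutron-server-clone then start neutron-scale-clone
pcs constraint order start neutron-scale-clone then start neutron-ovs-cleanup-clone
pcs constraint order start neutron-ovs-cleanup-clone then start neutron-netns-cleanup-clone
pcs constraint order start neutron-netns-cleanup-clone then start neutron-openvswitch-agent-clone
pcs constraint order start neutron-openvswitch-agent-clone then start neutron-dhcp-agent-clone
pcs constraint order start neutron-dhcp-agent-clone then start neutron-l3-agent-clone
pcs constraint order start neutron-l3-agent-clone then start neutron-metadata-agent-clone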

As I said above, I *think* all is well, assuming you've waited long enough for the agents to settle down after deploy (and critically for NeutronScale to start up on all the controllers, and then the agents), usually within a minute (see for example the output at the related bug [6], comments #11 and #13, which I think show the inconsistent state the agents find themselves in while neutron-scale starts up). I believe once that happens, overcloud neutron is functioning correctly. I haven't poked too much (I was able to do basic operations); somebody *please correct me* if this is not the case.

If using the NeutronScale-given host names for agents does not cause any problems, then perhaps the fix at [4] will help anyway; it sleeps a bit to avoid a (possibly) related bug [6]. At least at the point when we declare Overcloud Deployed (and postconfig), we can have a homogeneous list in the neutron agents' host entries (we can even improve that patch to grep for the NeutronScale-specific naming like 'neutron-n-0' before initialising neutron).

Alternatively, we stop using NeutronScale. In fact we don't really need it, since we already have very good control over the hostnames and they should be safe enough for scaling ("overcloud-controller-0.localdomain", "overcloud-controller-1.localdomain"). Note that this would involve changing the startup constraints (probably neutron-server -> neutron-ovs-cleanup, just skipping scale; see the sketch below). I mention this since we should only do it at this stage if we really need to (I guess this should be considered a significant change). Speaking of which, I still can't work out where we are pulling this particular resource agent in from, so I have pasted it in full at [5] (from one of my controllers) for reference if you are interested.
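
Dropping scale would reduce the head of that chain to something like this (same sketch caveats as above):

pcs constraint order start neutron-server-clone then start neutron-ovs-cleanup-clone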

Grateful for feedback and thoughts, thanks

[1] http://docs.openstack.org/kilo/config-reference/content/section_networking-options-reference.html # "Hostname to be used by the neutron server, agents and services running on this machine. All the agents and services running on this machine must use the same host value."

[2] ICEHOUSE http://docs.openstack.org/icehouse/config-reference/content/section_neutron.conf.html # "# host = myhost.com"

[3] https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/manifests/overcloud_controller_pacemaker.pp#L884
 
[4] https://review.gerrithub.io/#/c/238320/2

[5] http://pastebin.test.redhat.com/294389

[6] https://bugzilla.redhat.com/show_bug.cgi?id=1236578

[7]
[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+---------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                            | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+---------------------------------+-------+----------------+---------------------------+
| 0227a273-fa2b-4cdb-86d9-e523cc63c0e7 | Metadata agent     | neutron-n-1                     | :-)   | True           | neutron-metadata-agent    |
| 03bf3be1-cf8f-4815-be0e-ea604b777581 | Open vSwitch agent | neutron-n-0                     | :-)   | True           | neutron-openvswitch-agent |
| 1cc1ed4c-0aa3-45d6-898b-c444d9f5de4e | Open vSwitch agent | neutron-n-1                     | :-)   | True           | neutron-openvswitch-agent |
| 23fb7353-ffcb-410a-b979-c40a416227c0 | DHCP agent         | neutron-n-2                     | :-)   | True           | neutron-dhcp-agent        |
| 2ae1d115-2160-41bd-b1c1-543f06dcadd2 | Metadata agent     | neutron-n-2                     | :-)   | True           | neutron-metadata-agent    |
| 5c6df943-46e3-4311-95ac-aea39e2406e5 | Open vSwitch agent | neutron-n-2                     | :-)   | True           | neutron-openvswitch-agent |
| 5fa8ac50-6fcf-41e1-9785-099d3eb7ee3b | L3 agent           | neutron-n-0                     | :-)   | True           | neutron-l3-agent          |
| 6bf6b59d-6c3b-4745-a056-d78503e6f5c6 | DHCP agent         | neutron-n-0                     | :-)   | True           | neutron-dhcp-agent        |
| 7f62c11e-9ab1-426b-886d-c9568b62eb66 | DHCP agent         | neutron-n-1                     | :-)   | True           | neutron-dhcp-agent        |
| 8382f230-4d95-4707-bad7-fc992a99ad6e | L3 agent           | neutron-n-2                     | :-)   | True           | neutron-l3-agent          |
| ba9bcfd1-2373-440a-bdb9-5cf167f6c936 | Metadata agent     | neutron-n-0                     | :-)   | True           | neutron-metadata-agent    |
| bcc09edc-38fd-4c57-aef0-f6eed6706052 | Open vSwitch agent | overcloud-compute-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| e70945e8-3b6b-40b0-8ccb-c76de718d0cb | L3 agent           | neutron-n-1                     | :-)   | True           | neutron-l3-agent          |

Comment 7 Giulio Fidente 2015-07-02 13:58:40 UTC
Marios, I agree with you; it looks like we could try to drop NeutronScale. We should check with the Neutron team.

Comment 8 Marios Andreou 2015-07-02 14:04:52 UTC
thanks gfidente

to which end I am about to get a review out in case we need it (for the templates; I want to get rid of scale and deploy it to make sure we have the exact syntax etc. ready to go)

Comment 9 Marios Andreou 2015-07-02 15:13:45 UTC
so the review at https://review.openstack.org/#/c/198016 "Removes the NeutronScale resource from controller pcmk manifest" does what it claims. I tested this locally and things did not explode... so if we can and do want to go with removing NeutronScale, then that is the way to do it. I applied it to the current downstream tripleo heat templates and got the overcloud deployed (no postconfig complaints wrt 1236578, though that doesn't happen every time) and:

[root@overcloud-controller-0 ~]# pcs status | grep neutron -A 2
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier]
--
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
--
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
--
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
--
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: httpd-clone [httpd]


[stack@instack ~]$ . overcloudrc 
[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 0463b2c7-4ab0-40cf-a105-10c96629265a | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
| 203505b4-6284-485d-8aec-ca2d89c75033 | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 25d7f078-99ee-4cbd-b4a3-61f216738b5b | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 2e7d891d-f91f-436d-a37a-2e559ca8de3b | L3 agent           | overcloud-controller-2.localdomain | :-)   | True           | neutron-l3-agent          |
| 54d5082b-a6b8-40ce-a9ee-b3013337f298 | Open vSwitch agent | overcloud-controller-2.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 6b6a4496-c81d-40b2-9bde-23706740f59e | Metadata agent     | overcloud-controller-2.localdomain | :-)   | True           | neutron-metadata-agent    |
| 7048b446-2684-463e-b72d-4b4f7c1bfbe9 | Metadata agent     | overcloud-controller-1.localdomain | :-)   | True           | neutron-metadata-agent    |
| 791e1269-ecc1-4af1-9c74-fbb929f2c00b | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 8c3347ac-c0b3-43f2-a123-ca83a2e3fdb0 | Open vSwitch agent | overcloud-controller-1.localdomain | :-)   | True           | neutron-openvswitch-agent |
| a27ed414-42a8-4020-9f9b-d6884485569c | L3 agent           | overcloud-controller-1.localdomain | :-)   | True           | neutron-l3-agent          |
| a2ddb44a-705f-47ba-a353-64fb4fee7b5a | DHCP agent         | overcloud-controller-2.localdomain | :-)   | True           | neutron-dhcp-agent        |
| aaf0f96a-c921-4f74-ae7e-d9a31dba9b45 | DHCP agent         | overcloud-controller-1.localdomain | :-)   | True           | neutron-dhcp-agent        |
| e40b635c-ddd3-4f9d-a21b-ddcaf492d63a | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Comment 10 Graeme Gillies 2015-07-03 02:27:22 UTC
Hi Marios,

I would make sure to check with the HA/Pacemaker team first before making this change. The reason iirc that the NeutronScale resource agent exists is that the hostname on all controller nodes needs to be the same in neutron. That way when we fail over an agent, it thinks it's the same l2/l3 agent and not a new one. In your example above, each l3 agent has a different name, which I believe will be problematic when it comes time for failover (though I am not an expert).

It's worth noting that in my environment things didn't fix themselves either (it wasn't a case of not having finished settling). I had time to upload an image into my overcloud, boot an instance from it, and notice the network wasn't working, before I discovered the agents were wrong.

However, one thing that might be relevant: when I was doing this testing I had no NTP enabled in the environment, and I did notice the clock on my controller was a bit off. Not sure if this is related, but it could be causing funniness with the heat orchestration steps, maybe?

Regards,

Graeme

Comment 11 Giulio Fidente 2015-07-03 06:16:29 UTC
hi Graeme, I think it does the contrary: it ensures every cloned instance has a unique id (eg. neutron-n-{X,Y,Z}) in order to uniquely identify the instance; this is why using hostnames seemed safe (we get overcloud-controller-{X,Y,Z}).

There could be other reasons why we can't rely on hostnames though, conversation is ongoing with HA and Neutron teams.

Comment 12 Graeme Gillies 2015-07-03 06:19:41 UTC
(In reply to Giulio Fidente from comment #11)
> hi Graeme, I think it does the contrary, it ensures every cloned instance
> has a unique id (eg. neutron-n-{X,Y,Z}) in order to uniquely identify the
> instance; this is why using hostnames seemed safe (we get
> overcloud-controller-{X,Y,Z}).
> 
> There could be other reasons why we can't rely on hostnames though,
> conversation is ongoing with HA and Neutron teams.

Oh ok my understanding must be out of date. Thank you for the correction

Comment 13 Marios Andreou 2015-07-03 10:34:04 UTC
(In reply to Graeme Gillies from comment #12)
> (In reply to Giulio Fidente from comment #11)
> > hi Graeme, I think it does the contrary, it ensures every cloned instance
> > has a unique id (eg. neutron-n-{X,Y,Z}) in order to uniquely identify the
> > instance; this is why using hostnames seemed safe (we get
> > overcloud-controller-{X,Y,Z}).
> > 
> > There could be other reasons why we can't rely on hostnames though,
> > conversation is ongoing with HA and Neutron teams.
> 
> Oh ok my understanding must be out of date. Thank you for the correction

+ thanks Graeme & Giulio. Sorry for not responding, Graeme; I just agreed with what you said about checking with the folks that wrote NeutronScale and used it in the first place (jayg reached out last night; waiting to hear back).

Comment 14 Marios Andreou 2015-07-03 12:41:11 UTC
If we continue to use NeutronScale, then the fixup at https://review.gerrithub.io/#/c/238320/6 (which is meant to address the related https://bugzilla.redhat.com/show_bug.cgi?id=1236578) should help to alleviate some of the pain - the idea is to get at least two l3 agents with hosts that match the 'neutron-n-?' pattern
 - though given Graeme's comment we may want to up the sleep time (currently ~2 mins total, which doesn't factor in the time to invoke the neutron client and get a response)
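
In spirit, the retry in that fixup goes something like this (bash sketch only; the actual change lives in the oscplugin's Python, and the counts and timings here are illustrative):

# poll until at least two L3 agents report a 'neutron-n-?' style host; ~2 mins total
for i in $(seq 1 12); do
    n=$(neutron agent-list 2>/dev/null | grep 'L3 agent' | grep -c 'neutron-n-')
    [ "$n" -ge 2 ] && break
    sleep 10
done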

Comment 15 Marios Andreou 2015-07-06 15:30:37 UTC
grateful for any review of the proposed fixup for this @ https://review.gerrithub.io/#/c/238320/8

In particular we can tweak the various parameters there if we continue to see this issue with the fix applied.

thanks

Comment 17 Marios Andreou 2015-07-10 14:35:39 UTC
Created attachment 1050706 [details]
email conversation (about neutron scale, our setup, and if we can remove it) for context

Comment 18 Marios Andreou 2015-07-10 14:37:16 UTC
The root cause, and the best fix here, really is to remove NeutronScale, since we have (neutron) native HA - the bit we were missing was control over dhcp_agents_per_network (and setting a minimum of 3), but we have reviews for that @ [1][2] (a hypothetical usage sketch follows the links below). Once those land we can then safely land [3], which removes NeutronScale. HOWEVER, before we land that, and since we landed [4] as a fix here in the meantime, we need [5] to revert it; otherwise we will time out grepping on 'neutron-n-?' (but there is no NeutronScale so...). Thanks very much to Assaf for his advice; I copy/paste the email conversation (about neutron scale, our setup, and if we can remove it) for context at [6] as an attachment to this bug.


[1] https://review.openstack.org/#/c/199102/ Adds the NeutronDhcpAgentsPerNetwork parameter, oscplugin 
[2] https://review.gerrithub.io/238893  Wires up NeutronDhcpAgentsPerNetwork parameter to deploy
[3] https://review.openstack.org/#/c/198016/ Removes the NeutronScale resource from controller pcmk manifest
[4] https://review.gerrithub.io/#/c/238320/8 Increase the sleep time while trying to get neutron l3 agents
[5] https://review.gerrithub.io/239450 Remove search for l3_agent name since NeutronScale is gone
[6] https://bugzilla.redhat.com/attachment.cgi?id=1050706
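
As a purely hypothetical illustration of [1] and [2] (the oscplugin wires the parameter up at deploy time, so overriding it by hand via an environment file is an assumption on my part, not the shipped mechanism; the parameter name comes from the review titles above):

# sketch only: pin the DHCP agent count and pass it at deploy time
cat > ~/dhcp-agents.yaml <<'EOF'
parameter_defaults:
  NeutronDhcpAgentsPerNetwork: 3
EOF
openstack overcloud deploy --templates -e ~/dhcp-agents.yaml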

Comment 19 Marios Andreou 2015-07-10 14:57:11 UTC
I just finished a run through with all the reviews from comment 18 applied (fix up dhcp_agents_per_network, remove neutron scale, remove the explicit grep for neutron-n-? in l3 agents). It deployed OK, and:


[root@overcloud-controller-1 ~]# grep -ni 'dhcp_agents' /etc/neutron/*
grep: /etc/neutron/conf.d: Is a directory
/etc/neutron/neutron.conf:242:# dhcp_agents_per_network = 1
/etc/neutron/neutron.conf:243:dhcp_agents_per_network = 3
grep: /etc/neutron/plugins: Is a directory


[root@overcloud-controller-1 ~]# pcs status | grep neutron -A 1
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

[stack@instack ~]$ . overcloudrc 
[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 0295688d-8fae-4174-9e9c-5082d4c713e4 | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 17cece0a-c3e9-444a-be8a-322da0927fb7 | L3 agent           | overcloud-controller-2.localdomain | :-)   | True           | neutron-l3-agent          |
| 1ff418ea-7acb-4327-b426-8258fa45ee83 | L3 agent           | overcloud-controller-1.localdomain | :-)   | True           | neutron-l3-agent          |
| 2c786688-d4f4-47cd-ac4b-5ebe5eae30d3 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| 36a2e098-283e-49e0-931a-93be248a47d2 | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 52c60e25-05da-46fa-972b-28c7d47b1016 | DHCP agent         | overcloud-controller-2.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 63247b37-df3f-44ed-9f1a-79e8b428ead0 | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| a5724a0f-55ba-4909-925f-f5e850acac7c | Metadata agent     | overcloud-controller-2.localdomain | :-)   | True           | neutron-metadata-agent    |
| a5ceb4cd-cd48-45fc-96ea-e58ed3d9968d | Open vSwitch agent | overcloud-controller-1.localdomain | :-)   | True           | neutron-openvswitch-agent |
| bb3f4d92-d685-491b-b0ad-508825338f86 | DHCP agent         | overcloud-controller-1.localdomain | :-)   | True           | neutron-dhcp-agent        |
| bf94215a-a197-48fe-a91b-6bc1ab282da8 | Metadata agent     | overcloud-controller-1.localdomain | :-)   | True           | neutron-metadata-agent    |
| c51b72c9-48b1-4e1d-bba5-f4b01ac60ca4 | Open vSwitch agent | overcloud-controller-2.localdomain | :-)   | True           | neutron-openvswitch-agent |
| e928210c-8ea9-41c9-a1ec-f01f245d4d2e | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Comment 20 Marios Andreou 2015-07-14 13:33:12 UTC
as an update to comment #18: since we are now not doing the overcloud network postconfig [1], we don't need to revert the sleep in the oscplugin, so I -1'd the relevant review @ https://review.gerrithub.io/#/c/239450/

[1] https://review.gerrithub.io/#/c/239833/1

Comment 21 Itzik Brown 2015-07-15 12:12:22 UTC
With python-rdomanager-oscplugin-0.0.8-32.el7ost.noarch I get the following:

+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 208b3dfa-331a-4bc3-b9b2-dc6e8ade5eae | DHCP agent         | overcloud-controller-0.localdomain | xxx   | True           | neutron-dhcp-agent        |
| 86562501-06f4-4555-9b76-56c537e8c999 | DHCP agent         | neutron-n-0                        | :-)   | True           | neutron-dhcp-agent        |
| 96a0a04c-a336-4e0a-91bf-2fda613fe417 | L3 agent           | neutron-n-0                        | :-)   | True           | neutron-l3-agent          |
| 9c5e7886-afce-4eed-8591-306b89f9099f | Metadata agent     | neutron-n-0                        | xxx   | True           | neutron-metadata-agent    |
| a821979e-ae00-4a52-923a-c881673d6a7f | L3 agent           | overcloud-controller-0.localdomain | xxx   | True           | neutron-l3-agent          |
| b50e6df3-2098-4461-bffd-c2294eedbed5 | Open vSwitch agent | neutron-n-0                        | :-)   | True           | neutron-openvswitch-agent |
| cf46086c-0183-4132-b9da-0c5eab82900b | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| fd31250b-a517-453f-a542-06bb3bd183e1 | Open vSwitch agent | overcloud-compute-1.localdomain    | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Comment 22 Marios Andreou 2015-07-15 14:49:46 UTC
thanks Itzik, that looks like what I get from today's poodle setup and it looks OK to me - except the metadata agent (not sure why it's xxx there? did the stack create complete?).

In any case, the real fix remains removal of neutronscale as discussed above in comment 18 and also more recently in the dependent bug @ https://bugzilla.redhat.com/show_bug.cgi?id=1236578#c23

Comment 23 Itzik Brown 2015-07-16 07:36:34 UTC
After talking with marios and hewbrocca, waiting for
openstack-tripleo-heat-templates-0.8.6-40.el7ost and python-rdomanager-oscplugin-0.0.8-37.el7ost to verify

Comment 24 Marios Andreou 2015-07-16 07:50:55 UTC
thanks Itzik - I am still deploying today's poodle, but fyi I get:

[root@instack rdomanager_oscplugin]# rpm -qa | grep rdomanager
python-rdomanager-oscplugin-0.0.9-dev11.el7.centos.noarch

[root@instack rdomanager_oscplugin]# rpm -qa | grep tripleo-heat
openstack-tripleo-heat-templates-0.8.6-41.el7ost.noarch

The osc-plugin version above (0.0.9) actually doesn't have the dhcp agent change yet (it landed at https://review.gerrithub.io/#/c/238893/ ), so I am not sure what mburns had in mind with the "python-rdomanager-oscplugin-0.0.8-37.el7ost" requirement (perhaps that is where we dropped the temp sleep fix).

In any case, the good news is that the heat templates version above does have the required removal of neutronscale, so that alone should be enough to get a clear run here and in the dependent bugs. I expect the final bit (making dhcp agents per network default to 3) should appear soon enough in a build; will ping mburns later

thanks

Comment 25 Mike Burns 2015-07-16 11:48:07 UTC
(In reply to marios from comment #24)
> thanks Itzik - I am still deploying today's poodle, but fyi I get:
> 
> [root@instack rdomanager_oscplugin]# rpm -qa | grep rdomanager
> python-rdomanager-oscplugin-0.0.9-dev11.el7.centos.noarch
> 
> [root@instack rdomanager_oscplugin]# rpm -qa | grep tripleo-heat
> openstack-tripleo-heat-templates-0.8.6-41.el7ost.noarch
> 
> The osc-plugin version above (0.0.9) actually doesn't have the dhcp agent
> change yet (it landed at https://review.gerrithub.io/#/c/238893/ ) so not
> sure what mburns had in mind with the
> "python-rdomanager-oscplugin-0.0.8-37.el7ost" requirement (perhaps that is
> where we dropped the temp sleep fix).
> 
> In any case, the good news is that the heat templates version above does
> have the required removal of neutronscale, so that alone should be enough to
> get a clear run here and the dependent bugs. I expect the final bit (making
> dhcp agents per network default to 3) should appear soon enough in a build,
> will ping mburns later
> 
> thanks

The patch is included in the most recent builds. As long as you have 0.0.8-37 or newer, it's in there.  

The .centos build is not valid.

Comment 26 Marios Andreou 2015-07-16 12:21:41 UTC
yeah thanks Mike, I was mistakenly enabling the extra repos... once I did it properly I got python-rdomanager-oscplugin-0.0.8-38.el7ost.noarch and can confirm it has all the things

Comment 28 Itzik Brown 2015-07-20 12:25:46 UTC
$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 29c89c83-1788-4bcc-8e4b-5802c2c6b524 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| 2efeee7c-a860-4b6a-8c34-d5fb553da0ec | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 3195a1d3-1fee-458c-95c6-3e47a84b6b1d | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
| 4c600774-e509-4124-b314-4a185e67f900 | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 93dcfea5-188f-4c15-946d-dbac76452c8d | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| c901a51a-84d7-48a1-9cdb-552f612041e9 | Open vSwitch agent | overcloud-compute-1.localdomain    | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Checked with:
python-rdomanager-oscplugin-0.0.8-41.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-44.el7ost.noarch

Comment 30 errata-xmlrpc 2015-08-05 13:57:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549

