Bug 1237144

Summary: Neutron l3-agent active on all 3 controllers when using network isolation
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rhosp-director
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: high
Docs Contact:
Priority: high
Version: 7.0 (Kilo)
CC: amuller, calfonso, dmacpher, jason.dobies, mburns, mcornea, rhel-osp-director-maint
Target Milestone: ga
Keywords: TestOnly, Triaged
Target Release: Director
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-40.el7ost python-rdomanager-oscplugin-0.0.8-37.el7ost
Doc Type: Bug Fix
Doc Text:
The NeutronScale resource renamed neutron agents on Controller nodes. This caused an inconsistency in the output of "neutron agent-list" and, as a result, Neutron reported errors about not having enough L3 agents for L3 HA. This fix removes the NeutronScale resource from Overcloud Heat templates and plans. NeutronScale does not appear in "neutron agent-list" and Neutron reports no errors.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-08-05 13:57:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1238117    
Bug Blocks:    
Attachments:
Description Flags
deployment results none

Description Marius Cornea 2015-06-30 13:21:41 UTC
Description of problem:
Neutron l3-agent is active on all 3 controllers when running a baremetal HA setup with network isolation.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-19.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-2.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy a 3 controller overcloud with network isolation configuration
2. Run 'neutron l3-agent-list-hosting-router default-router' against the overcloud

Actual results:
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| b35d90a3-44cc-49ce-9802-57ccbd606d8d | overcloud-controller-1.localdomain | True           | :-)   | active   |
| 107ef1e1-255b-4bb4-857a-f02ea40e853b | overcloud-controller-2.localdomain | True           | :-)   | active   |
| c28e98be-ac9d-4735-94f8-473ee1a69d4b | overcloud-controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Expected results:
Only one agent should show as active; the other two should be in standby mode.

Additional info:

Comment 4 Mike Burns 2015-06-30 17:31:50 UTC
Assaf, can you look at this? This sounds like VRRP/L3_HA working as designed.

Comment 5 Assaf Muller 2015-06-30 19:39:06 UTC
The router should be scheduled to all three nodes, but it should be active on only one (see the last column, ha_state, of the 'neutron l3-agent-list-hosting-router default-router' output). If the router is active on all three nodes, the router namespaces cannot ping each other on their HA interfaces, which almost always means there is no tenant network connectivity between the nodes.
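
The check described here can be sketched in Python. This is only an illustrative sketch, not part of any fix discussed in this bug: it assumes the ASCII table layout shown in the description, the helper names parse_table and ha_state_summary are hypothetical, and the ids in the sample are shortened placeholders.

```python
from collections import Counter


def parse_table(text):
    """Parse an ASCII table as printed by the neutron CLI into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines (+---+), prompts, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first pipe-delimited line holds column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def ha_state_summary(text):
    """Count agents per ha_state value."""
    return Counter(row["ha_state"] for row in parse_table(text))


# Sample modeled on the output in the description (ids are placeholders).
sample = """
+----+------------------------------------+----------------+-------+----------+
| id | host                               | admin_state_up | alive | ha_state |
+----+------------------------------------+----------------+-------+----------+
| a  | overcloud-controller-1.localdomain | True           | :-)   | active   |
| b  | overcloud-controller-2.localdomain | True           | :-)   | active   |
| c  | overcloud-controller-0.localdomain | True           | :-)   | active   |
+----+------------------------------------+----------------+-------+----------+
"""

summary = ha_state_summary(sample)
if summary["active"] != 1:
    print("expected 1 active agent, found %d" % summary["active"])
```

With the sample above, the check flags three active agents, which is the symptom reported in this bug.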

Comment 6 Marios Andreou 2015-07-03 10:59:07 UTC
I will poke at this some more today. What I find strange about the output in the description is the names of the l3 agent hosts. Was this run very quickly (immediately?) after the deploy? Once NeutronScale starts up (see https://bugzilla.redhat.com/show_bug.cgi?id=1238117 ), the host entries will be updated for all the neutron agents. It should eventually look like:

[stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| affa289a-ad7b-43eb-a33e-f071e376b9bd | neutron-n-1 | True           | :-)   | active   |
| 21ece9eb-4754-44d4-ae33-f66a46c9481d | neutron-n-2 | True           | :-)   | standby  |
| 00b503fa-9933-4b56-88bb-e815f1edf104 | neutron-n-0 | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+


and, as Assaf says, the other two are in standby (note the reported agent host names). I will deploy with network isolation and see what happens. Is this reproducible at will? Virt or BM env?

Comment 7 Marios Andreou 2015-07-03 10:59:34 UTC
(Sorry, I just read "baremetal" in the description.)

Comment 8 Marius Cornea 2015-07-03 11:19:57 UTC
It was probably run within minutes after the deployment finished. Here is the output for a deployment that has been up for a couple of hours (baremetal with network isolation):

[stack@puma42 ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 6142b3e8-cf10-44b0-859e-3a269e03be86 | overcloud-controller-0.localdomain | True           | xxx   | standby  |
| 28cbcade-0717-4906-a796-eca2e143a4de | neutron-n-1                        | True           | :-)   | active   |
| 7f57cfbe-5bd4-4eee-9bc6-a30ade30c948 | overcloud-controller-1.localdomain | True           | xxx   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Comment 9 Marios Andreou 2015-07-03 15:49:25 UTC
Thanks for the update. I will revisit on Monday but wanted to update before I go. With respect to the l3 agents: given that we have NeutronScale (and the discussion at https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c6), I am not so concerned about the stale references to "overcloud-controller-1.localdomain" or "overcloud-compute-0.localdomain" that you have above, since they were 'things' at some point, as long as they are not deemed to be alive (i.e. they show xxx).
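
To illustrate the distinction between live and stale entries, here is a small Python sketch. It is not code from any of the reviews mentioned in this bug: the function name split_agents is hypothetical, and it assumes the agent rows have already been extracted into dicts with the same column names the CLI prints. The sample rows mirror the output in comment 8.

```python
def split_agents(rows):
    """Separate agent rows into live and stale lists using the 'alive' column."""
    live = [row for row in rows if row["alive"] == ":-)"]
    stale = [row for row in rows if row["alive"] == "xxx"]
    return live, stale


# Rows modeled on the l3-agent-list-hosting-router output in comment 8.
rows = [
    {"host": "overcloud-controller-0.localdomain", "alive": "xxx", "ha_state": "standby"},
    {"host": "neutron-n-1", "alive": ":-)", "ha_state": "active"},
    {"host": "overcloud-controller-1.localdomain", "alive": "xxx", "ha_state": "standby"},
]

live, stale = split_agents(rows)
print("live:", [row["host"] for row in live])
print("stale:", [row["host"] for row in stale])
```

In that sample only neutron-n-1 is live, so the router has a single usable agent even though three are listed.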

What I am concerned about is the timely appearance of the correct agents (as per NeutronScale, again see that bug), i.e. neutron-n-0, neutron-n-1, etc. To this end, the review at https://review.gerrithub.io/#/c/238320/6 (especially v6) adds a check for 'enough' l3 agents that have the NeutronScale pattern in their name before going on to neutron initialization and finally declaring Overcloud Deployed.

"Matter of minutes": the current timeout in that review is roughly 2 minutes, but that is just to get at least the minimum (2) l3 agents with the neutron-n-? pattern in their name. We could revisit those params in general if we need to.
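
The kind of wait loop described above could look roughly like the following Python sketch. It is not the code from the review: the function name wait_for_l3_agents, the regular expression, the 120-second default, and the 5-second polling interval are all assumptions made for illustration. Only the minimum of 2 agents and the "about 2 minutes" timeout come from this comment. The host list is supplied by a caller-provided function so the sketch does not depend on any particular client API.

```python
import re
import time

# Assumed form of the NeutronScale naming pattern ("neutron-n-?" in the comment).
SCALE_NAME = re.compile(r"^neutron-n-\d+$")


def wait_for_l3_agents(list_hosts, minimum=2, timeout=120.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `minimum` hosts match the NeutronScale pattern.

    list_hosts is a caller-supplied function returning current L3 agent hosts.
    Returns True on success, False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        matching = [host for host in list_hosts() if SCALE_NAME.match(host)]
        if len(matching) >= minimum:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated polling results: the renamed agents appear on the third poll.
snapshots = iter([
    ["overcloud-controller-0.localdomain"],
    ["overcloud-controller-0.localdomain", "neutron-n-1"],
    ["neutron-n-0", "neutron-n-1", "neutron-n-2"],
])
ok = wait_for_l3_agents(lambda: next(snapshots), sleep=lambda seconds: None)
print(ok)
```

Passing sleep and clock as parameters is a design choice for this sketch only; it lets the loop be exercised without real delays.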

thanks!

Comment 10 Marios Andreou 2015-07-06 15:29:13 UTC
I ran this today on two boxes: one had only 2 controllers, but the other had 3 (plus 1 ceph and 1 compute, 5 nodes total), and on both I deployed with network isolation. For clarity (VM setup), the deploy looked like:

openstack overcloud deploy --plan-uuid $plan_id -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml   --control-scale 3 --ceph-storage-scale 1

I had my fixup from https://review.gerrithub.io/#/c/238320/8 applied locally. After the deploy I tried 'neutron agent-list' and also 'neutron l3-agent-list-hosting-router default-router'; in all cases the output looked good to me, as below. I was unable to reproduce. I'd be grateful for any feedback on that fix; in particular, we can tweak the sleep time or even the number of agents we wait for (currently 2, as the minimum).

thanks

[stack@instack ~]$ . overcloudrc 
[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 1d9dc657-333d-4ee4-99eb-e96f86876aea | L3 agent           | neutron-n-1                        | :-)   | True           | neutron-l3-agent          |
| 2296fc44-fa51-4c4b-be4f-f11184113134 | DHCP agent         | neutron-n-1                        | :-)   | True           | neutron-dhcp-agent        |
| 2772c7f0-ac4b-4fae-a8f4-26ded0d31064 | L3 agent           | overcloud-controller-1.localdomain | xxx   | True           | neutron-l3-agent          |
| 40aaba2a-7e03-47b5-a48d-93d1cd6139aa | L3 agent           | neutron-n-2                        | :-)   | True           | neutron-l3-agent          |
| 4d58bd1d-971e-4546-9e30-457b4bbd7f05 | Metadata agent     | neutron-n-2                        | :-)   | True           | neutron-metadata-agent    |
| 58f3980e-1b15-4335-9784-9aba776905d0 | DHCP agent         | neutron-n-0                        | :-)   | True           | neutron-dhcp-agent        |
| 5a62eeed-b29a-4e26-ab9a-5358d9bf9ab8 | L3 agent           | neutron-n-0                        | :-)   | True           | neutron-l3-agent          |
| 6da8fcd7-f99f-4ce6-86f2-cb4969e6feb9 | Metadata agent     | neutron-n-1                        | :-)   | True           | neutron-metadata-agent    |
| 77d3b479-1fdc-4a9e-a314-69b6e624f529 | Metadata agent     | neutron-n-0                        | :-)   | True           | neutron-metadata-agent    |
| 93343573-81ad-40d3-a26c-80f2cbaa0c66 | L3 agent           | overcloud-controller-0.localdomain | xxx   | True           | neutron-l3-agent          |
| b0297156-2174-4117-98aa-cf5aba525ba5 | Open vSwitch agent | neutron-n-0                        | :-)   | True           | neutron-openvswitch-agent |
| b793f6e7-69dc-42d0-879e-6bad3d0e5b4d | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| d689bd74-88a9-404d-a128-0063eed315ac | Open vSwitch agent | neutron-n-1                        | :-)   | True           | neutron-openvswitch-agent |
| de59c391-e2a6-4c47-bc5b-943b22e09a6b | DHCP agent         | neutron-n-2                        | :-)   | True           | neutron-dhcp-agent        |
| ef4f9139-7bf2-4066-9142-ea89c33d1b24 | Open vSwitch agent | neutron-n-2                        | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

[stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| 1d9dc657-333d-4ee4-99eb-e96f86876aea | neutron-n-1 | True           | :-)   | active   |
| 40aaba2a-7e03-47b5-a48d-93d1cd6139aa | neutron-n-2 | True           | :-)   | standby  |
| 5a62eeed-b29a-4e26-ab9a-5358d9bf9ab8 | neutron-n-0 | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+

Comment 11 Jay Dobies 2015-07-06 19:46:14 UTC
Moving back to ON_DEV and making it depend on 1238117. This *should* be fixed by the fix for that, so this should be moved back to ON_QA along with 1238117.

Comment 12 Marius Cornea 2015-07-15 13:21:40 UTC
Created attachment 1052360 [details]
deployment results

Attaching deployment results and l3 agents status.

Comment 13 Marios Andreou 2015-07-15 13:59:26 UTC
Thanks, Marius. As discussed on IRC, the code that outputs that is gone now (midstream, at https://review.gerrithub.io/#/c/239833/1/rdomanager_oscplugin/v1/overcloud_deploy.py ).

The real fix, again, is the removal of NeutronScale, as discussed in the root bug at https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c18 and also https://bugzilla.redhat.com/show_bug.cgi?id=1236578#c22 (another dependent one). The reviews for all of those have landed upstream. I am finishing testing and will update those in a while.

Comment 14 Marios Andreou 2015-07-15 14:51:37 UTC
FYI, confirmation that the sleep code shown in Marius's output is gone from python-rdomanager-oscplugin-0.0.8-32.el7ost.noarch (see the comment at https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c21 ).

Comment 15 Marios Andreou 2015-07-17 13:30:13 UTC


I tested again today (as part of poking at BZ 1236136) now that NeutronScale is no more. The deploy was like:

-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml   --control-scale 3 --ceph-storage-scale 1

(Note: my environment also had the fixes applied as explained at https://bugzilla.redhat.com/show_bug.cgi?id=1236136#c27, which are to do with network isolation, not the neutron agents that I am verifying here.)

neutron agent-list looks like:

[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 10139073-97c4-4264-990e-cb7ab35266ce | Open vSwitch agent | overcloud-controller-1.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 1cc3e9e8-3b3f-4e3f-bd1a-8a6a54a1c7f7 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| 26506ec7-1ef3-4b9a-b6f0-7f1c25875d10 | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 28762878-c2d8-4815-b45f-6b05498591fc | Open vSwitch agent | overcloud-compute-1.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 2aaa19c2-6e8f-41cc-9ac4-9ff3a1017190 | L3 agent           | overcloud-controller-1.localdomain | :-)   | True           | neutron-l3-agent          |
| 371e5ef4-1c69-4557-8d42-1742b78770aa | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
| 3b3f9716-4e87-4f40-af02-0747e6e20d51 | Open vSwitch agent | overcloud-controller-2.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 4804577a-c0ae-44cd-832d-3212ddfec58e | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 5be49eb0-90c9-4f2f-8838-2845585a1bb2 | L3 agent           | overcloud-controller-2.localdomain | :-)   | True           | neutron-l3-agent          |
| 78c8b7e8-3f27-461d-9283-cbe84a1dbaf8 | DHCP agent         | overcloud-controller-2.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 7bdc9b1f-7839-4025-aecc-720e753e7d08 | Metadata agent     | overcloud-controller-2.localdomain | :-)   | True           | neutron-metadata-agent    |
| b01c40c4-00d7-4ded-9ced-7660c8049158 | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| cd9feb73-8041-4cfe-9f69-a5831c642d9a | Metadata agent     | overcloud-controller-1.localdomain | :-)   | True           | neutron-metadata-agent    |
| e0439331-e263-41b3-afb1-d70eb273e826 | DHCP agent         | overcloud-controller-1.localdomain | :-)   | True           | neutron-dhcp-agent        |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

Comment 16 Assaf Muller 2015-07-17 15:12:05 UTC
How about 'neutron l3-agent-list-hosting-router default-router'? Is the router active on one agent and standby on the other two?

Comment 17 Marios Andreou 2015-07-17 15:29:14 UTC
Yes, see comment 10.

Comment 19 Marius Cornea 2015-07-21 08:27:54 UTC
[stack@instack ~]$ source overcloudrc 
[stack@instack ~]$  neutron l3-agent-list-hosting-router tenant-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2d6c126b-6921-4d7e-9cd0-47f096a974d6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| 1448244b-69ae-49cc-a00a-683011486f4a | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 02c78d59-015f-4615-9e5c-af54d40aa37e | overcloud-controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Comment 21 errata-xmlrpc 2015-08-05 13:57:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549