Bug 1237144
Summary: Neutron l3-agent active on all 3 controllers when using network isolation

Product: Red Hat OpenStack
Component: rhosp-director
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 7.0 (Kilo)
Target Milestone: ga
Target Release: Director
Hardware: Unspecified
OS: Unspecified
Reporter: Marius Cornea <mcornea>
Assignee: Marios Andreou <mandreou>
QA Contact: Marius Cornea <mcornea>
CC: amuller, calfonso, dmacpher, jason.dobies, mburns, mcornea, rhel-osp-director-maint
Keywords: TestOnly, Triaged
Fixed In Version: openstack-tripleo-heat-templates-0.8.6-40.el7ost python-rdomanager-oscplugin-0.0.8-37.el7ost
Doc Type: Bug Fix
Doc Text:
The NeutronScale resource renamed the neutron agents on Controller nodes. This caused an inconsistency in the "neutron agent-list" output, and as a result Neutron reported errors about not having enough L3 agents for L3 HA. This fix removes the NeutronScale resource from the Overcloud Heat templates and plans; NeutronScale no longer appears in "neutron agent-list" and Neutron reports no errors.
Last Closed: 2015-08-05 13:57:46 UTC
Type: Bug
Bug Depends On: 1238117
Description
Marius Cornea
2015-06-30 13:21:41 UTC
Assaf, can you look at this?

This sounds like it's VRRP/L3_HA and working as designed. The router should be scheduled to all three nodes, but it should only be active on one (the last column of the neutron l3-agent-list-hosting-router default-router output). If the router is active on all three nodes, that means the router namespaces can't ping each other on their HA interfaces, which pretty much always means that you do not have tenant networking connectivity between the nodes. I will poke at this some more today.

What I find strange about the output in the description is the names of the l3 agent hosts - was this done very quickly (immediately?) after deploy? Once NeutronScale starts up (see https://bugzilla.redhat.com/show_bug.cgi?id=1238117 ) the host entries will be updated for all the neutron agents. It should eventually look like:

[stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| affa289a-ad7b-43eb-a33e-f071e376b9bd | neutron-n-1 | True           | :-)   | active   |
| 21ece9eb-4754-44d4-ae33-f66a46c9481d | neutron-n-2 | True           | :-)   | standby  |
| 00b503fa-9933-4b56-88bb-e815f1edf104 | neutron-n-0 | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+

and as Assaf says the two are in standby (note the reported agent host names). I will deploy with network isolation and see what happens - is this reproducible at will? Virt or BM env?

(sorry, read baremetal in the description)

It was probably run within minutes after deployment finished.
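The healthy state Assaf describes (router scheduled on all agents, active on exactly one, standby on the rest) can be checked mechanically against the CLI output. Below is a minimal sketch that parses the raw table text from neutron l3-agent-list-hosting-router; the function names and the parsing approach are illustrative, not part of any fix discussed here:

```python
import re

def ha_states(agent_table):
    """Extract the ha_state column from 'neutron l3-agent-list-hosting-router' output."""
    states = []
    for line in agent_table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # data rows have 5 columns and a 36-char UUID in the first one;
        # header and +---+ separator lines fail this check
        if len(cells) == 5 and re.fullmatch(r"[0-9a-f-]{36}", cells[0]):
            states.append(cells[4])
    return states

def router_ha_ok(agent_table):
    """L3 HA looks healthy when exactly one agent is active and the rest are standby."""
    states = ha_states(agent_table)
    return states.count("active") == 1 and all(
        s in ("active", "standby") for s in states
    )
```

Fed the table above, `router_ha_ok` would return True; fed the broken output from the bug title (active on all three controllers), it would return False.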
Here is the output for a deployment that has been up for a couple of hours (baremetal with network isolation):

[stack@puma42 ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 6142b3e8-cf10-44b0-859e-3a269e03be86 | overcloud-controller-0.localdomain | True           | xxx   | standby  |
| 28cbcade-0717-4906-a796-eca2e143a4de | neutron-n-1                        | True           | :-)   | active   |
| 7f57cfbe-5bd4-4eee-9bc6-a30ade30c948 | overcloud-controller-1.localdomain | True           | xxx   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Thanks for the update. I will revisit on Monday but wanted to update before I go. With respect to the l3 agents - given that we have NeutronScale (and the discussion at bug https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c6 ), I am not so concerned with the presence of the stale references to "overcloud-controller-1.localdomain" or "overcloud-compute-0.localdomain" as you have above, since they were 'things' at some point, as long as they are not deemed to be alive (xxx). What I am concerned about is a timely appearance of the correct agents (as per NeutronScale, again see that bug), i.e. neutron-n-0, neutron-n-1, etc. To this end the review @ https://review.gerrithub.io/#/c/238320/6 (especially v6) adds a check for 'enough' l3 agents that have the NeutronScale pattern in their name before going on to neutron initialization and finally declaring Overcloud Deployed. "Matter of minutes"... the current timeout at that review is 2 mins-ish, but that is just to get at least the minimum (2) l3 agents with the neutron-n-? pattern in their name. We could revisit those params in general if we need to. Thanks!
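The check described above - wait a bounded time for 'enough' alive L3 agents whose host matches the NeutronScale naming pattern before declaring the Overcloud deployed - amounts to a polling loop. A rough sketch of the idea follows; the function names, the dict-shaped agent records, and the defaults are illustrative only, not the actual rdomanager-oscplugin code from that review (which was later removed along with NeutronScale itself):

```python
import re
import time

# NeutronScale renames agent hosts to the neutron-n-<N> pattern
SCALE_HOST = re.compile(r"^neutron-n-\d+$")

def enough_scaled_l3_agents(agents, minimum=2):
    """True once at least `minimum` alive L3 agents carry a NeutronScale-style host name."""
    matching = sum(
        1
        for a in agents
        if a["agent_type"] == "L3 agent" and a["alive"] and SCALE_HOST.match(a["host"])
    )
    return matching >= minimum

def wait_for_l3_agents(list_agents, timeout=120, interval=10):
    """Poll `list_agents` (e.g. a neutron client call) until enough renamed agents appear.

    Returns True on success, False if the timeout (2 minutes by default,
    mirroring the review's '2 mins-ish') expires first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if enough_scaled_l3_agents(list_agents()):
            return True
        time.sleep(interval)
    return False
```

Separating the predicate from the polling keeps the 'enough agents' rule testable on its own and makes the timeout and minimum easy to revisit, as the comment suggests.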
So I ran this today on two boxes - one had only 2 controllers but the other had 3 (and 1 ceph, 1 compute, 5 total) - and on both I deployed with network isolation. For clarity (vms setup), the deploy looked like:

openstack overcloud deploy --plan-uuid $plan_id -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml --control-scale 3 --ceph-storage-scale 1

I had my fixup from https://review.gerrithub.io/#/c/238320/8 applied locally. After deploy I tried neutron agent-list and also neutron l3-agent-list-hosting-router default-router; in all cases it lgtm, like below. I was unable to reproduce. I'd be grateful for any feedback wrt that fix - in particular we can tweak the sleep time or even the number of agents we wait for (currently 2, as the minimum). Thanks.

[stack@instack ~]$ . overcloudrc
[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 1d9dc657-333d-4ee4-99eb-e96f86876aea | L3 agent           | neutron-n-1                        | :-)   | True           | neutron-l3-agent          |
| 2296fc44-fa51-4c4b-be4f-f11184113134 | DHCP agent         | neutron-n-1                        | :-)   | True           | neutron-dhcp-agent        |
| 2772c7f0-ac4b-4fae-a8f4-26ded0d31064 | L3 agent           | overcloud-controller-1.localdomain | xxx   | True           | neutron-l3-agent          |
| 40aaba2a-7e03-47b5-a48d-93d1cd6139aa | L3 agent           | neutron-n-2                        | :-)   | True           | neutron-l3-agent          |
| 4d58bd1d-971e-4546-9e30-457b4bbd7f05 | Metadata agent     | neutron-n-2                        | :-)   | True           | neutron-metadata-agent    |
| 58f3980e-1b15-4335-9784-9aba776905d0 | DHCP agent         | neutron-n-0                        | :-)   | True           | neutron-dhcp-agent        |
| 5a62eeed-b29a-4e26-ab9a-5358d9bf9ab8 | L3 agent           | neutron-n-0                        | :-)   | True           | neutron-l3-agent          |
| 6da8fcd7-f99f-4ce6-86f2-cb4969e6feb9 | Metadata agent     | neutron-n-1                        | :-)   | True           | neutron-metadata-agent    |
| 77d3b479-1fdc-4a9e-a314-69b6e624f529 | Metadata agent     | neutron-n-0                        | :-)   | True           | neutron-metadata-agent    |
| 93343573-81ad-40d3-a26c-80f2cbaa0c66 | L3 agent           | overcloud-controller-0.localdomain | xxx   | True           | neutron-l3-agent          |
| b0297156-2174-4117-98aa-cf5aba525ba5 | Open vSwitch agent | neutron-n-0                        | :-)   | True           | neutron-openvswitch-agent |
| b793f6e7-69dc-42d0-879e-6bad3d0e5b4d | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| d689bd74-88a9-404d-a128-0063eed315ac | Open vSwitch agent | neutron-n-1                        | :-)   | True           | neutron-openvswitch-agent |
| de59c391-e2a6-4c47-bc5b-943b22e09a6b | DHCP agent         | neutron-n-2                        | :-)   | True           | neutron-dhcp-agent        |
| ef4f9139-7bf2-4066-9142-ea89c33d1b24 | Open vSwitch agent | neutron-n-2                        | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
[stack@instack ~]$ neutron l3-agent-list-hosting-router default-router
+--------------------------------------+-------------+----------------+-------+----------+
| id                                   | host        | admin_state_up | alive | ha_state |
+--------------------------------------+-------------+----------------+-------+----------+
| 1d9dc657-333d-4ee4-99eb-e96f86876aea | neutron-n-1 | True           | :-)   | active   |
| 40aaba2a-7e03-47b5-a48d-93d1cd6139aa | neutron-n-2 | True           | :-)   | standby  |
| 5a62eeed-b29a-4e26-ab9a-5358d9bf9ab8 | neutron-n-0 | True           | :-)   | standby  |
+--------------------------------------+-------------+----------------+-------+----------+

Moving back to ON_DEV and making it depend on 1238117. This *should* be fixed by the fix for that, so this should be moved back to ON_QA along with 1238117.

Created attachment 1052360 [details]
deployment results
Attaching deployment results and l3 agents status.
Thanks Marius - as discussed on IRC, the code that outputs that is gone now (midstream, at https://review.gerrithub.io/#/c/239833/1/rdomanager_oscplugin/v1/overcloud_deploy.py ). The real fix, again, is the removal of NeutronScale, as discussed in the root bug @ https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c18 and also https://bugzilla.redhat.com/show_bug.cgi?id=1236578#c22 (another dependent one). The reviews for all of those have landed upstream - I am finishing testing and will update those in a while.

FYI, confirmation that the sleep code Marius's output shows is gone from python-rdomanager-oscplugin-0.0.8-32.el7ost.noarch (see the comment @ https://bugzilla.redhat.com/show_bug.cgi?id=1238117#c21 ).

I tested again today (as part of poking at BZ 1236136) now that NeutronScale is no more. The deploy was like:

-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml --control-scale 3 --ceph-storage-scale 1

(note my environment also had the fixes applied as explained at https://bugzilla.redhat.com/show_bug.cgi?id=1236136#c27 - those are to do with the network isolation, not the neutron agents which I am verifying here)

neutron agent-list looks like:

[stack@instack ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+
| 10139073-97c4-4264-990e-cb7ab35266ce | Open vSwitch agent | overcloud-controller-1.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 1cc3e9e8-3b3f-4e3f-bd1a-8a6a54a1c7f7 | Metadata agent     | overcloud-controller-0.localdomain | :-)   | True           | neutron-metadata-agent    |
| 26506ec7-1ef3-4b9a-b6f0-7f1c25875d10 | Open vSwitch agent | overcloud-compute-0.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 28762878-c2d8-4815-b45f-6b05498591fc | Open vSwitch agent | overcloud-compute-1.localdomain    | :-)   | True           | neutron-openvswitch-agent |
| 2aaa19c2-6e8f-41cc-9ac4-9ff3a1017190 | L3 agent           | overcloud-controller-1.localdomain | :-)   | True           | neutron-l3-agent          |
| 371e5ef4-1c69-4557-8d42-1742b78770aa | L3 agent           | overcloud-controller-0.localdomain | :-)   | True           | neutron-l3-agent          |
| 3b3f9716-4e87-4f40-af02-0747e6e20d51 | Open vSwitch agent | overcloud-controller-2.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 4804577a-c0ae-44cd-832d-3212ddfec58e | Open vSwitch agent | overcloud-controller-0.localdomain | :-)   | True           | neutron-openvswitch-agent |
| 5be49eb0-90c9-4f2f-8838-2845585a1bb2 | L3 agent           | overcloud-controller-2.localdomain | :-)   | True           | neutron-l3-agent          |
| 78c8b7e8-3f27-461d-9283-cbe84a1dbaf8 | DHCP agent         | overcloud-controller-2.localdomain | :-)   | True           | neutron-dhcp-agent        |
| 7bdc9b1f-7839-4025-aecc-720e753e7d08 | Metadata agent     | overcloud-controller-2.localdomain | :-)   | True           | neutron-metadata-agent    |
| b01c40c4-00d7-4ded-9ced-7660c8049158 | DHCP agent         | overcloud-controller-0.localdomain | :-)   | True           | neutron-dhcp-agent        |
| cd9feb73-8041-4cfe-9f69-a5831c642d9a | Metadata agent     | overcloud-controller-1.localdomain | :-)   | True           | neutron-metadata-agent    |
| e0439331-e263-41b3-afb1-d70eb273e826 | DHCP agent         | overcloud-controller-1.localdomain | :-)   | True           | neutron-dhcp-agent        |
+--------------------------------------+--------------------+------------------------------------+-------+----------------+---------------------------+

How about neutron l3-agent-list-hosting-router default-router - is the router active on one agent and standby on the other two?
Yeah, see comment 10:

[stack@instack ~]$ source overcloudrc
[stack@instack ~]$ neutron l3-agent-list-hosting-router tenant-router
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| 2d6c126b-6921-4d7e-9cd0-47f096a974d6 | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| 1448244b-69ae-49cc-a00a-683011486f4a | overcloud-controller-2.localdomain | True           | :-)   | standby  |
| 02c78d59-015f-4615-9e5c-af54d40aa37e | overcloud-controller-0.localdomain | True           | :-)   | active   |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549