Bug 1372680 - On 3 controller nodes deployment Neutron L3 agent HA isn't configured
Summary: On 3 controller nodes deployment Neutron L3 agent HA isn't configured
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 10.0 (Newton)
Assignee: Brent Eagles
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-02 11:10 UTC by Marius Cornea
Modified: 2016-12-14 15:55 UTC (History)
CC: 11 users

Fixed In Version: openstack-tripleo-heat-templates-5.0.0-0.20160929150845.4cdc4fc.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-12-14 15:55:52 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2016:2948 normal SHIPPED_LIVE Red Hat OpenStack Platform 10 enhancement update 2016-12-14 19:55:27 UTC
OpenStack gerrit 371926 None None None 2016-09-22 12:37:13 UTC
Launchpad 1623155 None None None 2016-09-13 18:40:20 UTC

Description Marius Cornea 2016-09-02 11:10:38 UTC
Description of problem:
Deployed with 3 controller nodes, but the L3 agent is running on only a single controller:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/tls-endpoints-public-ip.yaml \
-e ~/templates/ssl-ports.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 1 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--log-file overcloud_deployment.log &> overcloud_install.log


[stack@undercloud ~]$ neutron l3-agent-list-hosting-router stack-89-tenant_net_ext_tagged-pid54aegtvto-router-5pnd642q37ng
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| ce93a7f5-2848-413b-b046-eff6ed4b5ca6 | overcloud-controller-0.localdomain | True           | :-)   |          |
+--------------------------------------+------------------------------------+----------------+-------+----------+

[root@overcloud-controller-0 heat-admin]# grep l3_ha /etc/neutron/neutron.conf  | grep -v ^#
l3_ha=False


Version-Release number of selected component (if applicable):
puppet-neutron-9.1.0-0.20160822221647.b20ea67.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160823140311.72404b.1.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud
2. Create router
3. Check l3-agent-list-hosting-router

Actual results:
The L3 agent is running on only a single controller.

Expected results:
The l3 agent is running on all 3 controllers.

Additional info:

Comment 2 Brent Eagles 2016-09-13 15:23:44 UTC
The default for L3 HA used to be set by python-tripleoclient when more than one controller was specified on deploy; that logic was moved to the pacemaker-specific neutron-server.yaml file. I wonder if this is a consequence of recent changes for the lightweight HA architecture.

Comment 3 Brent Eagles 2016-09-14 14:49:50 UTC
Here is the story so far:
- I've verified that changes made in newton moved the automatic enabling of L3 HA based on controller count out of python-tripleoclient and into a pacemaker-specific equivalent of neutron-api.yaml (the pacemaker version is actually named neutron-server.yaml - this seems like an oversight)
- due to the changeover to lightweight HA, the pacemaker variants are not included in our HA deployments

So basically, nothing has been setting l3_ha to true for a few weeks (at least). We could lobby for reverting the client change, but that might actually cause problems in our DVR story.

It's not clear to me yet what the most appropriate fix is. A workaround is to require passing an environment file for HA neutron, but that's problematic from a user's (and upgrader's) perspective.

One approach I'm investigating is keying off the controller count field in the resource group and, conditional on whether DVR is enabled, setting l3_ha to true. I'm not optimistic that this is doable.
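
In template terms, the approach above would look something like a Heat condition keyed on the controller count. This is a hypothetical sketch, not the actual template change; the parameter names ControllerCount and NeutronEnableDVR are assumptions following tripleo-heat-templates naming conventions, and the wiring into neutron-api.yaml is illustrative only:

```yaml
heat_template_version: 2016-10-14

parameters:
  ControllerCount:
    type: number
    default: 1
  NeutronEnableDVR:
    type: boolean
    default: false

conditions:
  # Enable L3 HA only when there is more than one controller
  # and DVR is not in use.
  enable_l3_ha:
    and:
      - not: {equals: [{get_param: ControllerCount}, 1]}
      - not: {equals: [{get_param: NeutronEnableDVR}, true]}
```

A condition like this could then drive the config value via the `if` function, e.g. `{if: [enable_l3_ha, true, false]}` wherever l3_ha is set. Note this depends on Heat conditions, which only became available with heat_template_version 2016-10-14 (newton).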

Comment 4 Brent Eagles 2016-09-14 17:42:16 UTC
More n(oise|ews):

- we cannot change the default for l3_ha because it will break single-controller setups
- the above rules out pulling in the pacemaker file as a solution
- AFAICT, there is no way for the controller count to affect the template where we set whether or not to enable l3_ha.

So IMO, we are left with two choices:

 1. revert the change to python-tripleoclient where this was conditionally set. I think this is safe because it sets the value prior to the processing of the environment files and won't break the case where we enable DVR. I'll test it, of course.

 2. create an environment file that enables it, and document that you need to include it
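
For reference, option 2 would amount to a small environment file like the following sketch. NeutronL3HA is the existing tripleo-heat-templates parameter; the filename is arbitrary:

```yaml
# neutron-l3-ha.yaml - explicitly enable HA (VRRP) routers
parameter_defaults:
  NeutronL3HA: true
```

The operator would then have to remember to add `-e ~/templates/neutron-l3-ha.yaml` to every deploy and upgrade invocation, which is exactly the usability concern noted above.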

Comment 5 Brent Eagles 2016-09-15 12:02:57 UTC
A possible 3rd choice came to mind that has a pretty decent chance of working. If it doesn't pan out, I'll propose a revert to the client downstream to cover us until we can figure out a more "proper" solution.

Comment 6 Marius Cornea 2016-09-19 13:18:23 UTC
With the new custom roles feature we're going to be able to create separate nodes for hosting the L3 agent, so one could deploy 1 controller + 2 networker nodes hosting the L3 agent. Would such a scenario be valid from the dataplane HA perspective?
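
The scenario described would be expressed through a custom roles_data.yaml. A hypothetical Networker role entry might look like the following; the service names follow the composable-services naming in tripleo-heat-templates, but this exact role definition is an illustration, not a tested configuration:

```yaml
- name: Networker
  CountDefault: 2
  ServicesDefault:
    - OS::TripleO::Services::NeutronOvsAgent
    - OS::TripleO::Services::NeutronL3Agent
    - OS::TripleO::Services::NeutronMetadataAgent
```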

Comment 9 Nir Yechiel 2016-09-22 12:37:14 UTC
Merged upstream.

Comment 11 Assaf Muller 2016-10-25 14:53:02 UTC
Is there anything that you need help with as far as QE verification?

Comment 12 Marius Cornea 2016-10-25 21:29:10 UTC
(In reply to Assaf Muller from comment #11)
> Is there anything that you need help with as far as QE verification?

It looks good:

openstack-tripleo-heat-templates-5.0.0-0.2016100801535 

neutron l3-agent-list-hosting-router stack-21-tenant_net_ext_tagged-kdeljrwfvqdh-router-kivnciq6tlrr

+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| def66ae8-6d6f-4a0a-be89-48abcbb6512c | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 98c5013b-6aa3-4888-93b3-c17a760b28fe | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| 4dba3138-d789-48ea-871e-d1aa05e7ac1b | overcloud-controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Comment 15 errata-xmlrpc 2016-12-14 15:55:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html

