Bug 1372680

Summary: On 3 controller nodes deployment Neutron L3 agent HA isn't configured
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Target Release: 10.0 (Newton)
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Status: CLOSED ERRATA
Keywords: Triaged
Reporter: Marius Cornea <mcornea>
Assignee: Brent Eagles <beagles>
QA Contact: Marius Cornea <mcornea>
CC: amuller, beagles, dbecker, jjoyce, jschluet, mburns, mcornea, morazi, nyechiel, rhel-osp-director-maint, tfreger
Fixed In Version: openstack-tripleo-heat-templates-5.0.0-0.20160929150845.4cdc4fc.el7ost
Last Closed: 2016-12-14 15:55:52 UTC
Type: Bug

Description Marius Cornea 2016-09-02 11:10:38 UTC
Description of problem:
After a 3-controller deployment, the L3 agent is running on only a single controller:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/tls-endpoints-public-ip.yaml \
-e ~/templates/ssl-ports.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 1 \
--ceph-storage-flavor ceph \
--ntp-server clock.redhat.com \
--log-file overcloud_deployment.log &> overcloud_install.log


[stack@undercloud ~]$ neutron l3-agent-list-hosting-router stack-89-tenant_net_ext_tagged-pid54aegtvto-router-5pnd642q37ng
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| ce93a7f5-2848-413b-b046-eff6ed4b5ca6 | overcloud-controller-0.localdomain | True           | :-)   |          |
+--------------------------------------+------------------------------------+----------------+-------+----------+

[root@overcloud-controller-0 heat-admin]# grep l3_ha /etc/neutron/neutron.conf  | grep -v ^#
l3_ha=False
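The grep above can miss spacing variations or commented-out entries. A minimal sketch of a more robust check, run here against a throwaway sample file for illustration (the real file lives at /etc/neutron/neutron.conf on each controller):

```shell
# Create a throwaway neutron.conf-style INI file to illustrate the parsing.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[DEFAULT]
l3_ha=False
max_l3_agents_per_router=3
EOF

# Extract the value of the last uncommented l3_ha assignment,
# tolerating optional whitespace around the '='.
value=$(awk -F'=' '/^l3_ha[ =]/ {gsub(/[[:space:]]/, "", $2); v=$2} END {print v}' "$conf")
echo "l3_ha is set to: $value"

rm -f "$conf"
```

On an affected deployment this reports `False` on every controller, matching the single-agent output of `l3-agent-list-hosting-router` above.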


Version-Release number of selected component (if applicable):
puppet-neutron-9.1.0-0.20160822221647.b20ea67.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160823140311.72404b.1.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud
2. Create router
3. Check l3-agent-list-hosting-router

Actual results:
The L3 agent is running on only a single controller.

Expected results:
The l3 agent is running on all 3 controllers.

Additional info:

Comment 2 Brent Eagles 2016-09-13 15:23:44 UTC
The logic that set the default value for L3 HA in python-tripleoclient when more than one controller was specified on deploy was moved to the pacemaker-specific neutron-server.yaml file. I wonder if this is a consequence of recent changes for the lightweight HA architecture.

Comment 3 Brent Eagles 2016-09-14 14:49:50 UTC
Here is the story so far:
- I've verified that changes in Newton moved the automatic enabling of L3 HA based on controller count out of python-tripleoclient and into a pacemaker-specific equivalent of neutron-api.yaml (the pacemaker version is actually named neutron-server.yaml - seems like an oversight)
- due to the changeover to lightweight HA, the pacemaker variants are not included in our HA deployments

So basically, nothing has been setting l3_ha to true for a few weeks (at least). We could lobby for reverting the client change, but that might actually cause problems in our DVR story.

It's not clear to me yet what the most appropriate fix is. A workaround is to require passing an environment file for HA Neutron, but that's problematic from a user's (and upgrader's) perspective.

One approach I'm investigating is keying off the controller count field in the resource group and, conditional on whether DVR is enabled, setting l3_ha to true. I'm not optimistic that this is doable.

Comment 4 Brent Eagles 2016-09-14 17:42:16 UTC
More n(oise|ews):

- we cannot change the default for l3_ha because it would break single-controller setups
- that rules out pulling in the pacemaker file as a solution
- AFAICT, there is no way for the controller count to affect the template where we set whether or not to enable l3_ha.

So IMO, we are left with two choices:

 1. revert the change to python-tripleoclient where this was conditionally set. I think this is safe because it sets the value prior to processing the environment files and won't break the case where we enable DVR. I'll test, of course.

 2. create an environment file that enables it, and document that you need to include it
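For option 2, the environment file could look something like the following sketch. The `NeutronL3HA` parameter name and the file path are assumptions based on how other TripleO knobs are exposed via parameter_defaults; the actual name would depend on the merged fix:

```yaml
# Hypothetical environment file, e.g. ~/templates/neutron-l3-ha.yaml
# (NeutronL3HA parameter name is an assumption)
parameter_defaults:
  NeutronL3HA: true
```

It would then be appended to the deploy command with an extra `-e ~/templates/neutron-l3-ha.yaml`, which is exactly the extra step that makes this option problematic for users and upgraders.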

Comment 5 Brent Eagles 2016-09-15 12:02:57 UTC
A possible 3rd choice came to mind that has a pretty decent chance of working. If it doesn't pan out, I'll propose a revert to the client downstream to cover us until we can figure out a more "proper" solution.

Comment 6 Marius Cornea 2016-09-19 13:18:23 UTC
With the new custom roles feature we're going to be able to create separate nodes for hosting the l3 agent so one could deploy 1 controller + 2 x networker nodes hosting the l3 agent. Would such a scenario be valid from the dataplane HA perspective?

Comment 9 Nir Yechiel 2016-09-22 12:37:14 UTC
Merged upstream.

Comment 11 Assaf Muller 2016-10-25 14:53:02 UTC
Is there anything that you need help with as far as QE verification?

Comment 12 Marius Cornea 2016-10-25 21:29:10 UTC
(In reply to Assaf Muller from comment #11)
> Is there anything that you need help with as far as QE verification?

It looks good:

openstack-tripleo-heat-templates-5.0.0-0.2016100801535 

neutron l3-agent-list-hosting-router stack-21-tenant_net_ext_tagged-kdeljrwfvqdh-router-kivnciq6tlrr

+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| def66ae8-6d6f-4a0a-be89-48abcbb6512c | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 98c5013b-6aa3-4888-93b3-c17a760b28fe | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| 4dba3138-d789-48ea-871e-d1aa05e7ac1b | overcloud-controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+

Comment 15 errata-xmlrpc 2016-12-14 15:55:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html