Description of problem:

Doing a 3-controller deployment, but the l3 agent is running on only a single controller:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
  -e $THT/environments/network-isolation.yaml \
  -e $THT/environments/network-management.yaml \
  -e ~/templates/network-environment.yaml \
  -e $THT/environments/storage-environment.yaml \
  -e ~/templates/disk-layout.yaml \
  -e ~/templates/wipe-disk-env.yaml \
  -e ~/templates/enable-tls.yaml \
  -e ~/templates/inject-trust-anchor.yaml \
  -e ~/templates/tls-endpoints-public-ip.yaml \
  -e ~/templates/ssl-ports.yaml \
  --control-scale 3 \
  --control-flavor controller \
  --compute-scale 1 \
  --compute-flavor compute \
  --ceph-storage-scale 1 \
  --ceph-storage-flavor ceph \
  --ntp-server clock.redhat.com \
  --log-file overcloud_deployment.log &> overcloud_install.log

[stack@undercloud ~]$ neutron l3-agent-list-hosting-router stack-89-tenant_net_ext_tagged-pid54aegtvto-router-5pnd642q37ng
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| ce93a7f5-2848-413b-b046-eff6ed4b5ca6 | overcloud-controller-0.localdomain | True           | :-)   |          |
+--------------------------------------+------------------------------------+----------------+-------+----------+

[root@overcloud-controller-0 heat-admin]# grep l3_ha /etc/neutron/neutron.conf | grep -v ^#
l3_ha=False

Version-Release number of selected component (if applicable):
puppet-neutron-9.1.0-0.20160822221647.b20ea67.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160823140311.72404b.1.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy the overcloud
2. Create a router
3. Check neutron l3-agent-list-hosting-router

Actual results:
The l3 agent is only running on a single controller.
Expected results:
The l3 agent is running on all 3 controllers.

Additional info:
The logic that set the default value for L3 HA in python-tripleoclient when more than one controller was specified on deploy was moved to the pacemaker-specific neutron-server.yaml file. I wonder if this is a consequence of recent changes for the lightweight HA architecture.
Here is the story so far:

- I've verified that changes were made in Newton that moved the automatic enabling of L3 HA based on controller count out of python-tripleoclient and placed it in a pacemaker-specific equivalent of neutron-api.yaml (the pacemaker version is actually named neutron-server.yaml - seems like an oversight).
- Due to the changeover to lightweight HA, the pacemaker variants are not included in our HA deployments.

So basically, nothing has been setting l3_ha to true for a few weeks (at least). We could lobby for reverting the client change, but that might actually cause problems in our DVR story. It's not clear to me yet what the most appropriate fix is. A workaround is to require passing an environment file for HA neutron, but that's problematic from a user's (and upgrader's) perspective.

One approach I'm investigating is seeing if we can key off of the controller count field in the resource group and, conditional on whether DVR is enabled, set l3_ha to true. I'm not optimistic that this is doable.
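For reference, the client-side behavior under discussion can be sketched roughly as follows. This is an illustration only, not the actual python-tripleoclient code; the function name and parameter dict are hypothetical:

```python
def default_l3_ha(controller_count, dvr_enabled):
    """Illustrative default: enable Neutron L3 HA only when there is
    more than one controller and DVR is not in use (L3 HA and DVR
    could not be combined on the same router at the time)."""
    return controller_count > 1 and not dvr_enabled


# The client computed this default before processing user-supplied
# environment files, so an explicit user setting would still win.
parameters = {'NeutronL3HA': default_l3_ha(controller_count=3,
                                           dvr_enabled=False)}
print(parameters)
```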
More n(oise|ews):

- We cannot change the default for l3_ha, because that would break single-controller setups.
- The above rules out pulling in the pacemaker file as a solution.
- AFAICT, there is no way for the controller count to affect the template where we set whether or not to enable l3_ha.

So IMO, we are left with two choices:

1. Revert the change to python-tripleoclient where this was conditionally set. I think this is safe because it seems to set it prior to the processing of the environment files, and it won't break the situation where we enable DVR. I'll test, though, of course.
2. Create an environment file that enables it, and document that you need to include it.
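Choice 2 would amount to shipping (or documenting) a small Heat environment file. A sketch, assuming the standard NeutronL3HA template parameter; the filename is hypothetical:

```yaml
# neutron-l3-ha.yaml - include with:
#   openstack overcloud deploy --templates ... -e neutron-l3-ha.yaml
parameter_defaults:
  NeutronL3HA: true
```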
A possible 3rd choice came to mind that has a pretty decent chance of working. If it doesn't pan out, I'll propose a revert to the client downstream to cover us until we can figure out a more "proper" solution.
With the new custom roles feature we're going to be able to create separate nodes for hosting the l3 agent, so one could deploy 1 controller + 2 networker nodes hosting the l3 agent. Would such a scenario be valid from the dataplane HA perspective?
Merged in upstream.
Is there anything that you need help with as far as QE verification?
(In reply to Assaf Muller from comment #11)
> Is there anything that you need help with as far as QE verification?

It looks good:

openstack-tripleo-heat-templates-5.0.0-0.2016100801535

neutron l3-agent-list-hosting-router stack-21-tenant_net_ext_tagged-kdeljrwfvqdh-router-kivnciq6tlrr
+--------------------------------------+------------------------------------+----------------+-------+----------+
| id                                   | host                               | admin_state_up | alive | ha_state |
+--------------------------------------+------------------------------------+----------------+-------+----------+
| def66ae8-6d6f-4a0a-be89-48abcbb6512c | overcloud-controller-2.localdomain | True           | :-)   | active   |
| 98c5013b-6aa3-4888-93b3-c17a760b28fe | overcloud-controller-1.localdomain | True           | :-)   | standby  |
| 4dba3138-d789-48ea-871e-d1aa05e7ac1b | overcloud-controller-0.localdomain | True           | :-)   | standby  |
+--------------------------------------+------------------------------------+----------------+-------+----------+
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-2948.html