Description of problem:
I deployed HA Neutron on bare metal with an 802.3ad bond using the latest staypuft puddle. When I checked the router state (VRRP), I saw that the state is master on all controllers.

[root@mac441ea173366b ~]# cat /var/lib/neutron/ha_confs/ed9f9ebf-ca42-4f61-9ea6-2369ef69c268/state
master
[root@mac441ea1733991 ~]# cat /var/lib/neutron/ha_confs/ed9f9ebf-ca42-4f61-9ea6-2369ef69c268/state
master
[root@mac441ea1733d43 ~]# cat /var/lib/neutron/ha_confs/ed9f9ebf-ca42-4f61-9ea6-2369ef69c268/state
master

How reproducible: 2/2

Expected results: Only one router should be in the master state; the others should be in backup mode.
This also happens on a setup without bonding.
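For anyone repeating the check above, here is a minimal sketch that counts how many state files report master. It assumes the state files have been copied from each controller to one machine; the function name count_masters is hypothetical and is not part of Neutron or staypuft.

```shell
# Count how many of the given keepalived state files report "master".
# Arguments: paths to state files, for example copies of
# /var/lib/neutron/ha_confs/<router-id>/state gathered from each controller.
count_masters() {
    masters=0
    for state_file in "$@"; do
        # The state file has no trailing newline, so read can return
        # non-zero even though it filled the variable; test the variable.
        state=""
        read -r state < "$state_file" || true
        if [ "$state" = "master" ]; then
            masters=$((masters + 1))
        fi
    done
    echo "$masters"
}
```

On a healthy HA router the count should be 1; a count equal to the number of controllers matches the symptom reported here.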
I poked around in the setup and this is what I found.

HA routers send VRRP traffic over what we call 'HA ports'. The HA routers are supposed to be able to ping each other over these interfaces, but in this case they can't. Thus there is no VRRP traffic and everyone is master.

It looks like binding has failed on 2 out of the 3 machines involved for these HA ports ('ovs-vsctl show' shows VLAN 4095). The Neutron server has warnings about binding failures for these ports, as would be expected.

One reason you'd get binding failures is a mismatch between the 'host' values in the different agents and the Neutron server within a single machine. I checked, and the host values are not configured properly.

Host 1: http://pastebin.com/fR6yUHAE
Host 2: http://pastebin.com/PSENgvkb
Host 3: http://pastebin.com/4BhJxX0h
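The host comparison described above can be sketched roughly as follows. This is only an illustration: get_host and hosts_consistent are hypothetical helper names, and the parsing is simplified to take the first "host =" line in each file regardless of which INI section it sits in.

```shell
# Print the value of the first "host = ..." line in an INI-style file.
# Simplification: the section the line belongs to is not checked.
get_host() {
    awk -F= '
        /^[ \t]*host[ \t]*=/ {
            value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding blanks
            print value
            exit
        }
    ' "$1"
}

# Return 0 if every given config file carries the same host value,
# otherwise report the first mismatch on stderr and return 1.
# Expects at least one file argument.
hosts_consistent() {
    expected=$(get_host "$1")
    for conf in "$@"; do
        current=$(get_host "$conf")
        if [ "$current" != "$expected" ]; then
            echo "mismatch: $conf has host '$current', expected '$expected'" >&2
            return 1
        fi
    done
    return 0
}
```

On a controller this could be run as, for example, hosts_consistent /etc/neutron/neutron.conf /etc/neutron/l3_agent.ini /etc/neutron/dhcp_agent.ini, which would be expected to fail on the affected machines.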
neutron-scale configures /etc/neutron/neutron.conf the same way it does for the other agents, so my guess is that some puppet module is also altering the host value in neutron.conf (setting host=neutron-n-0 instead of "host = neutron-n-$i"). Can we get the puppet agent log from the host to see what's happening?
find /var/log -name "*puppet*"
/var/log/puppet

ll /var/log/puppet shows it's empty.
OK, let's see whether disabling and re-enabling the resource rewrites /etc/neutron/neutron.conf to the proper value:

pcs resource disable neutron-scale; sleep 20; pcs resource enable neutron-scale

I'm trying to understand whether something runs afterwards and breaks the value.
Disabling and enabling neutron-scale via pcs wrote the proper host values to all Neutron config files. At this point, creating a new HA router succeeds. The question remains: how did this setup end up with the wrong host value written to neutron.conf?
Created attachment 997592 [details] logs from one host
This line appears a couple of times in the puppet logs Ofer attached:

Mar 3 09:02:20 mac441ea173366b puppet-agent[23903]: (/Stage[main]/Quickstack::Neutron::All/Neutron_config[DEFAULT/host]/value) value changed 'neutron-n-2' to 'neutron-n-0'

That shouldn't be happening.
As you said, some other puppet code changes the host after it has been set by NeutronScale:

3 12:03:35 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/dhcp_agent.ini
Mar 3 12:03:36 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/fwaas_driver.ini
Mar 3 12:03:36 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/l3_agent.ini
Mar 3 12:03:36 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/lbaas_agent.ini
Mar 3 12:03:36 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/metadata_agent.ini
Mar 3 12:03:36 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/neutron.conf
Mar 3 12:03:37 mac441ea1733991 NeutronScale(neutron-scale:1)[6954]: INFO: neutron-scale: host neutron-n-1 set for /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
Mar 3 12:03:47 mac441ea1733991 crmd[3013]: notice: process_lrm_event: Operation neutron-netns-cleanup_start_0: ok (node=pcmk-mac441ea1733991, call=350, rc=0, cib-update=118, confirmed=true)
Mar 3 12:03:47 mac441ea1733991 crmd[3013]: notice: process_lrm_event: Operation neutron-netns-cleanup_monitor_10000: ok (node=pcmk-mac441ea1733991, call=353, rc=0, cib-update=119, confirmed=false)
Mar 3 12:05:37 mac441ea1733991 puppet-agent[3073]: (/Stage[main]/Quickstack::Neutron::All/Neutron_config[DEFAULT/host]/value) value changed 'neutron-n-1' to 'neutron-n-0'
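To spot this overwrite on other hosts, something along these lines could be used to pull the relevant puppet-agent entries out of a log. It is a rough sketch: the function names are hypothetical, and the log file to pass in (for example the syslog file on the controller) is an assumption that depends on the local logging setup.

```shell
# Print every log line in which puppet-agent changed Neutron's DEFAULT/host.
# Arguments: one or more log files to search.
find_host_overwrites() {
    grep -F 'Neutron_config[DEFAULT/host]/value) value changed' "$@"
}

# Reduce each matching line to "old -> new" so the overwrite is easy to see.
summarize_host_overwrites() {
    find_host_overwrites "$@" |
        sed -n "s/.*value changed '\([^']*\)' to '\([^']*\)'.*/\1 -> \2/p"
}
```

Against the excerpt above, the summary would show the value going from neutron-n-1 to neutron-n-0, that is, the puppet run undoing what NeutronScale had just written.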
(In reply to Assaf Muller from comment #12)
> This line appears a couple of times in the puppet logs Ofer attached:
>
> Mar 3 09:02:20 mac441ea173366b puppet-agent[23903]:
> (/Stage[main]/Quickstack::Neutron::All/Neutron_config[DEFAULT/host]/value)
> value changed 'neutron-n-2' to 'neutron-n-0'
>
> That shouldn't be happening.

This is left over from OSP 5, when we did not have neutron scale to set the host value. I guess I missed removing it when we moved to OSP 6. I am moving this to A2 though, since A1 is done. It should be as simple as removing 3 lines from the manifest you reference here; I just need to test it to make sure things still get set up without them.
We won't delay the A1 release, but handle this as a single post update for A1
Patch posted: https://github.com/redhat-openstack/astapor/pull/484
Merged
Verified on A2. Not reproduced. Used the same deployment (3 controllers, 1 compute).

rhel-osp-installer-client-0.5.7-1.el7ost.noarch
foreman-installer-1.6.0-0.3.RC1.el7ost.noarch
openstack-foreman-installer-3.0.17-1.el7ost.noarch
rhel-osp-installer-0.5.7-1.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-0791.html