Description of problem:

- During scale-out in an OSP 10 environment, puppet-tripleo incorrectly sets the [DEFAULT]/host value to the short host name in nova.conf and neutron.conf on scale-out nodes. Nodes deployed during the initial deployment correctly use FQDN values.

- This issue occurs when overcloud nodes use short names as the hostname (the default). See my example below.

- It seems to be an issue with how the current_nova_host and current_neutron_host puppet facts are created when hiera('stack_action') == UPDATE (scale-up operation) and [DEFAULT]/host is empty (the initial config for scaled-out nodes).

- This becomes a critical issue in environments being upgraded. I traced this specific issue down because of problems during testing of a fast-forward upgrade to OSP 13. The scaled-up node gets re-registered with FQDN values, causing issues with the nova and neutron resources currently hosted on it.

Version-Release number of selected component (if applicable):

[root@overcloud-compute-0 ~]# rpm -q puppet-tripleo
puppet-tripleo-5.6.4-3.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:

1. Deploy current OSP 10 overcloud. Ensure dhcp_domain is empty in nova.conf on the undercloud (default config).

[stack@undercloud10 ~]$ sudo grep ^dhcp_domain /etc/nova/nova.conf
dhcp_domain=

2. Deploy overcloud (1 controller, 1 compute is fine).

3. Show config after deployment; everything is correct:

[root@overcloud-compute-0 ~]# hostname
overcloud-compute-0
[root@overcloud-compute-0 ~]# hostname -f
overcloud-compute-0.localdomain

[root@overcloud-controller-0 ~]# hostname
overcloud-controller-0
[root@overcloud-controller-0 ~]# hostname -f
overcloud-controller-0.localdomain
[root@overcloud-controller-0 ~]# hiera stack_action
CREATE
[root@overcloud-controller-0 ~]# set -o vi
[root@overcloud-controller-0 ~]# grep ^host /etc/nova/nova.conf
host=overcloud-controller-0.localdomain
[root@overcloud-controller-0 ~]# grep ^host /etc/neutron/neutron.conf
host=overcloud-controller-0.localdomain
[root@overcloud-controller-0 ~]# rpm -q puppet-tripleo
puppet-tripleo-5.6.8-23.el7ost.noarch

[stack@undercloud10 ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 06cbaefd-d6bf-4ea7-b14d-72efb5aaf441 | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 6b451601-7bd4-4194-9316-7405fabcac24 | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 81ec5bf6-9bb4-4288-bfcb-96f6ef82f71a | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 9afbb6db-0539-4592-899f-204c9034bbe3 | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 9bb32350-5bac-460b-9a7e-80f9d13b47f4 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+

[stack@undercloud10 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T21:07:39.000000 | -               |
| 4  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T21:07:38.000000 | -               |
| 5  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T21:07:32.000000 | -               |
| 6  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2019-06-12T21:07:36.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

4. Scale out one additional node (ComputeCount: 2).

5. Inspect the new node and see the short name.

[root@overcloud-compute-1 ~]# hostname
overcloud-compute-1
[root@overcloud-compute-1 ~]# hostname -f
overcloud-compute-1.localdomain
[root@overcloud-compute-1 ~]# hiera stack_action
UPDATE
[root@overcloud-compute-1 ~]# grep ^host /etc/nova/nova.conf
host=overcloud-compute-1
[root@overcloud-compute-1 ~]# grep ^host /etc/neutron/neutron.conf
host=overcloud-compute-1
[root@overcloud-compute-1 ~]# rpm -q puppet-tripleo
puppet-tripleo-5.6.8-23.el7ost.noarch

[stack@undercloud10 ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 06cbaefd-d6bf-4ea7-b14d-72efb5aaf441 | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 63c96701-b0e3-46a2-86e4-39f09df7f864 | Open vSwitch agent | overcloud-compute-1                |                   | :-)   | True           | neutron-openvswitch-agent |
| 6b451601-7bd4-4194-9316-7405fabcac24 | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 81ec5bf6-9bb4-4288-bfcb-96f6ef82f71a | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 9afbb6db-0539-4592-899f-204c9034bbe3 | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 9bb32350-5bac-460b-9a7e-80f9d13b47f4 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+

[stack@undercloud10 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T22:04:40.000000 | -               |
| 4  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T22:04:40.000000 | -               |
| 5  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T22:04:43.000000 | -               |
| 6  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2019-06-12T22:04:36.000000 | -               |
| 7  | nova-compute     | overcloud-compute-1                | nova     | enabled | up    | 2019-06-12T22:04:39.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

Actual results:
New nodes are set up with short names for the neutron and nova [DEFAULT]/host values.

Expected results:
The FQDN should be used, just as on nodes set up in the initial deployment.
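The manual checks in the steps above (grep ^host on the config file, compared against hostname -f) can be sketched as a small Ruby helper. This is only an illustration under assumptions: ini_value and host_status are hypothetical names that are not part of puppet-tripleo, and the parser handles only simple key=value lines. Unlike grep ^host, it reads the key from the [DEFAULT] section only.

```ruby
# Return the value of `key` in `section` of INI-style text, or nil when unset.
# Hypothetical helper; only plain "key = value" lines are understood.
def ini_value(text, section, key)
  current = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#', ';')

    if (m = line.match(/\A\[(.+)\]\z/))
      current = m[1]
    elsif current == section && (m = line.match(/\A([^=\s]+)\s*=\s*(.*)\z/))
      return m[2] if m[1] == key
    end
  end
  nil
end

# Classify the configured [DEFAULT]/host value against the node's FQDN.
def host_status(conf_text, fqdn)
  host = ini_value(conf_text, 'DEFAULT', 'host')
  return :unset if host.nil? || host.empty?
  return :fqdn if host == fqdn
  return :short if host == fqdn.split('.').first

  :other
end

conf = "[DEFAULT]\nhost=overcloud-compute-1\n"
puts host_status(conf, 'overcloud-compute-1.localdomain') # prints "short"
```

On the scaled-out node from step 5 this reports :short, while the nodes from the initial deployment would report :fqdn.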
*** Bug 1719732 has been marked as a duplicate of this bug. ***
The issue seems to be here in tripleo/lib/facter/current_config_hosts.rb:

def get_nova_live_value
  Tempfile.open('get-nova-host') do |nova_stdin|
    File.open(nova_stdin, 'w') do |nova_cmd|
      nova_cmd.puts("import nova.conf\nprint nova.conf.CONF.host")
    end
    Facter::Core::Execution.execute("nova-manage shell python 2>/dev/null < #{nova_stdin.path} | sed -e 's/^[> ]*//'")
  end
end

When manually running this code with an empty [DEFAULT]/host value:

[root@overcloud-compute-0 ~]# grep ^host /etc/nova/nova.conf
(nil)

[root@overcloud-compute-0 ~]# nova-manage shell python
Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Python 2.7.5 (default, Mar 26 2019, 22:13:06)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import nova.conf
>>> print nova.conf.CONF.host
overcloud-compute-0
>>>

It seems that the easy fix would be to change the logic in tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as host_real if [DEFAULT]/host is empty.

  if $step >= 4 or ($step >= 3 and $sync_db) {

    if hiera('stack_action', undef) == 'UPDATE' {
      if empty($::current_nova_host) {
        # We fail instead of blindly changing that value as it can
        # break the overcloud.
        fail("We couldn't get the live value of the nova agent, please contact support.")
      } else {
        $host_real = $::current_nova_host
      }
    } else {
      $host_real = hiera('nova::host')
    }
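To make the suggested change concrete, here is a rough Ruby model of the host selection with the extra "empty [DEFAULT]/host" case added. It is a sketch under assumptions, not the puppet-tripleo code: select_host and its keyword arguments are hypothetical names, configured_host stands for the value literally written in the config file (nil when the key is absent), and live_host stands for what the current_nova_host fact returns.

```ruby
# Hypothetical model of the proposed host selection; all names are illustrative.
def select_host(stack_action:, configured_host:, live_host:, hiera_host:)
  # Initial deployment (stack_action is not UPDATE): use the value from hiera.
  return hiera_host unless stack_action == 'UPDATE'

  # Proposed extra case: nothing is written to [DEFAULT]/host yet, so the
  # node is treated as new and also gets the hiera value.
  return hiera_host if configured_host.to_s.empty?

  # Existing behaviour: keep the live value, and fail if it can't be read.
  raise ArgumentError, "couldn't get the live host value" if live_host.to_s.empty?

  live_host
end

# Scale-out node: stack_action is UPDATE but the config file has no host yet.
puts select_host(stack_action: 'UPDATE',
                 configured_host: nil,
                 live_host: 'overcloud-compute-1',
                 hiera_host: 'overcloud-compute-1.localdomain')
```

The model also shows why the fact alone is not enough: with an empty config the live value is still non-empty (the short hostname), so the decision has to look at the file contents rather than at what nova reports.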
(In reply to Matt Flusche from comment #2)
> The issue seems to be here in tripleo/lib/facter/current_config_hosts.rb:
>
> def get_nova_live_value
>   Tempfile.open('get-nova-host') do |nova_stdin|
>     File.open(nova_stdin, 'w') do |nova_cmd|
>       nova_cmd.puts("import nova.conf\nprint nova.conf.CONF.host")
>     end
>     Facter::Core::Execution.execute("nova-manage shell python 2>/dev/null < #{nova_stdin.path} | sed -e 's/^[> ]*//'")
>   end
> end
>
> When manually running this code with an empty [DEFAULT]/host value:
>
> [root@overcloud-compute-0 ~]# grep ^host /etc/nova/nova.conf
> (nil)
>
> [root@overcloud-compute-0 ~]# nova-manage shell python
> Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its
> value may be silently ignored in the future.
> Python 2.7.5 (default, Mar 26 2019, 22:13:06)
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> (InteractiveConsole)
> >>> import nova.conf
> >>> print nova.conf.CONF.host
> overcloud-compute-0
> >>>
>
>
> It seems that the easy fix would be to change the logic in
> tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as
> host_real if [DEFAULT]/host is empty.
>
>
>   if $step >= 4 or ($step >= 3 and $sync_db) {
>
>     if hiera('stack_action', undef) == 'UPDATE' {

That ^^^ is the issue I think. It will be false for an initial deployment but true for a scale out. Chem WDYT?

>       if empty($::current_nova_host) {
>         # We fail instead of blindly changing that value as it can
>         # break the overcloud.
>         fail("We couldn't get the live value of the nova agent, please
> contact support.")
>       } else {
>         $host_real = $::current_nova_host
>       }
>     } else {
>       $host_real = hiera('nova::host')
>     }
Hi Oliver and Matt

>> if hiera('stack_action', undef) == 'UPDATE' {

> That ^^^ is the issue I think. It will be false for an initial deployment but true for a scale out. Chem WDYT?

So, the problem here was that there was no way to differentiate between an update/upgrade and a scale-out, which can lead to this type of issue.

But ... the initial problem (the scale-out node gets a short name) may be caused by a misconfiguration of the undercloud. Basically, what happens is that the parameters that make cloud-init set an FQDN during scale-out may have been overwritten.

So, Matt, could you check this knowledge base article[1]? Basically, this manifests in cloud-init.log as a short name being set on the scale-out node. Applying it should be enough to avoid the issue on scale-out nodes. As your reproducer suggests, this is the first issue.

Then the complete solution is to calculate all host parameters in heat instead of depending on the FQDN calculation on the host.

> It seems that the easy fix would be to change the logic in tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as host_real if [DEFAULT]/host is empty.

I don't want to go down the rabbit hole again and add another conditional here. The original problem was the osp9/10 upgrade, where that value could be empty because osp9 didn't set that parameter explicitly; thus we had to keep it that way, with whatever the code snippet you highlighted was returning.

Why we cannot change the host parameter lightly is detailed in this bugzilla comment[2]. Basically, changing this value on an already used compute node or networker requires a non-trivial manual procedure.

The good news is that the manual procedure is now complete[3]; we just need a little more time to verify it. Then all the patches attached to https://bugzilla.redhat.com/show_bug.cgi?id=1657692 will be merged and that issue will disappear altogether.

One of the patches removes all calculations from the puppet code/fact because, as it turned out, they were fine for the osp9/10 upgrade but in the end caused too much trouble to get right for all cases. Another set of patches checks whether we are susceptible to the issue during an update and fails, pointing to the KB article. Finally, the last patch sets the host parameter to what is calculated inside the template, shielding us from whatever the host believes its hostname is.

Sorry for the lengthy reply, I hope I was clear enough.

[1] https://access.redhat.com/solutions/2089051
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c21
[3] https://access.redhat.com/solutions/4066521
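The real patches are attached to bug 1657692 and are not reproduced here. As a rough Ruby sketch of the "check during update and fail, pointing to the KB article" idea only: check_host_before_update is a hypothetical name, live_host stands for the value the service is currently registered under, template_host for the value calculated in the template, and kb_url is passed in by the caller rather than assumed.

```ruby
# Hypothetical pre-update check; not the code from the actual patches.
# Returns true when the values agree, raises otherwise so the update stops
# before the service would re-register under a different host name.
def check_host_before_update(live_host, template_host, kb_url)
  return true if live_host == template_host

  raise "host is '#{live_host}' but the template calculates " \
        "'#{template_host}'; see #{kb_url} before updating"
end

begin
  check_host_before_update('overcloud-compute-1',
                           'overcloud-compute-1.localdomain',
                           'https://access.redhat.com/solutions/4066521')
rescue RuntimeError => e
  puts e.message
end
```

Failing early is the point of the design: silently switching the value would change the host that existing nova and neutron resources are tied to.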
(In reply to Sofer Athlan-Guyot from comment #4)
> Hi Oliver and Matt

Thanks for looking at this Oliver & Sofer!

>
> >> if hiera('stack_action', undef) == 'UPDATE' {
>
> > That ^^^ is the issue I think. It will be false for an initial deployment but true for a scale out. Chem WDYT?
>
> So, the problem here was that there was no way to differentiate between
> update/upgrade and scale out. Which can lead
> to this type of issue.
>

Correct, the logic here has issues as it doesn't consider scale-out situations correctly.

> But ... the initial problem (scale out node get short name) may be caused by
> a misconfiguration of the undercloud, basically
> what happen is that the proper parameters to get cloud init set a fqdn
> during scale out may have been overwritten.
>
> So, Matt, could you check that knowledge base article[1], basically this
> manifest in the cloud-init.log by setting a short
> name to the scale out node. This should be enough to not have the issue
> with scale out node. As your reproducer suggest, this
> is the first issue.

Correct, if dhcp_domain is set on the undercloud (nova.conf) then this is not an issue. However, as far as I know, this is not the default, and it is not validated or documented outside of this KCS. There are many deployments with the default configuration that run into this issue. The bug I described here will occur with the default configuration.

>
> Then the complete solution is to calculate all host parameters in heat
> instead of depending on the fqdn calculation on the host.
>

The fqdn calculation is correct; the issue is that it is not used during scale-up, due to the logic of the puppet code (if hiera('stack_action', undef) == 'UPDATE') and how the current_nova_host/current_neutron_host facts are created when [DEFAULT]/host is empty in nova.conf or neutron.conf during the initial config, as I described in comment #2.

> > It seems that the easy fix would be to change the logic in tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as host_real if [DEFAULT]/host is empty.
>
> I don't want to got into the rabbit hole again and try to add another
> conditional here. The original problem was osp9/10 upgrade
> where that value could be empty as osp9 didn't set that parameter
> explicitly, thus we had to keep it that way with whatever the
> code snippet you highlighted was returning.
>
> Why we cannot change the host parameter lightly is detailed in that
> bugzilla's comment[2]. Basically changing this value on already used compute
> node or networker requires a non trivial manual procedure.
>
> Good news is that the manual procedure is now complete[3] just need a little
> more time to verify it. Then all the patches attached
> to https://bugzilla.redhat.com/show_bug.cgi?id=1657692 will be merged and
> that issue will disappear altogether.
>
> One of the patches is to remove all calculations from the puppet code/fact
> because as it turned out it was fine for osp9/10 upgrade but it the end
> caused too much trouble to get it right for all cases. Another set of
> patches is to check if we are susceptible to get the issue during
> update and fail pointing to the kb article. Eventually the last patch set
> the host parameter to what is calculated inside the
> template, shielding us of whatever the host believe its hostname is.
>

Very good; this becomes a big issue with the 10->13 FFU, as the host parameter is changed there, which causes the many associated issues.

> Sorry for the lengthy reply, I hope I was clear enough.
>
> [1] https://access.redhat.com/solutions/2089051
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c21
> [3] https://access.redhat.com/solutions/4066521
*** Bug 1559366 has been marked as a duplicate of this bug. ***
Marking as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1657692, as both bugs are about scaled-out nodes that got [DEFAULT]/host set to the short name and both are fixed by the same set of packages. See https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c25 for more about the patches.

*** This bug has been marked as a duplicate of bug 1657692 ***