Bug 1720005 - [OSP 10] For scaled-out nodes, puppet-tripleo incorrectly sets [DEFAULT]/host to short name rather than fqdn
Status: CLOSED DUPLICATE of bug 1657692
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 10.0 (Newton)
Assignee: Rajesh Tailor
QA Contact: Sasha Smolyak
Duplicates: 1719732
Depends On: 1559366
Blocks: 1596760
Reported: 2019-06-12 22:08 UTC by Matt Flusche
Modified: 2023-09-07 20:08 UTC (History)

Fixed In Version: puppet-tripleo-5.6.8-29 openstack-tripleo-heat-templates-5.3.10-30.el7ost
Last Closed: 2019-09-18 15:55:41 UTC

Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 563650 | 0 | None | MERGED | Explicitly set nova/neutron/ceilometer host to expected fqdn | 2021-01-28 22:43:33 UTC
Red Hat Issue Tracker | OSP-11764 | 0 | None | None | None | 2021-12-10 18:35:33 UTC
Red Hat Issue Tracker | UPG-540 | 0 | None | None | None | 2021-12-10 18:35:28 UTC

Description Matt Flusche 2019-06-12 22:08:00 UTC
Description of problem:

- During scale-out in an OSP 10 environment, puppet-tripleo incorrectly sets the [DEFAULT]/host value to the short host name in nova.conf and neutron.conf on the scaled-out nodes.  Nodes deployed during the initial deployment correctly use FQDN values.

- This issue occurs when overcloud nodes use short names as their hostname, which is the default.  See my example below.

- This seems to be an issue with how the current_nova_host and current_neutron_host puppet facts are created when hiera('stack_action') == UPDATE (scale-up operation) and [DEFAULT]/host is empty (the initial config for scaled-out nodes).

- This becomes a critical issue in environments being upgraded.  I traced this specific issue down because of problems while testing a fast-forward upgrade to OSP 13.  The scaled-up node gets re-registered with FQDN values, causing issues with the nova and neutron resources currently hosted on it.


Version-Release number of selected component (if applicable):
[root@overcloud-compute-0 ~]# rpm -q puppet-tripleo
puppet-tripleo-5.6.4-3.el7ost.noarch


How reproducible:
100%


Steps to Reproduce:
1. Deploy a current OSP 10 undercloud.  Ensure dhcp_domain is empty in nova.conf on the undercloud (default config):
[stack@undercloud10 ~]$ sudo grep ^dhcp_domain /etc/nova/nova.conf
dhcp_domain=

2. Deploy the overcloud (1 controller and 1 compute are enough).

3. Inspect the config after deployment; everything is correct:

[root@overcloud-compute-0 ~]# hostname
overcloud-compute-0
[root@overcloud-compute-0 ~]# hostname -f
overcloud-compute-0.localdomain


[root@overcloud-controller-0 ~]# hostname
overcloud-controller-0
[root@overcloud-controller-0 ~]# hostname -f
overcloud-controller-0.localdomain


[root@overcloud-controller-0 ~]# hiera stack_action
CREATE
[root@overcloud-controller-0 ~]# set -o vi
[root@overcloud-controller-0 ~]# grep ^host /etc/nova/nova.conf 
host=overcloud-controller-0.localdomain
[root@overcloud-controller-0 ~]# grep ^host /etc/neutron/neutron.conf 
host=overcloud-controller-0.localdomain

[root@overcloud-controller-0 ~]# rpm -q puppet-tripleo
puppet-tripleo-5.6.8-23.el7ost.noarch



[stack@undercloud10 ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 06cbaefd-d6bf-4ea7-b14d-72efb5aaf441 | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 6b451601-7bd4-4194-9316-7405fabcac24 | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 81ec5bf6-9bb4-4288-bfcb-96f6ef82f71a | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 9afbb6db-0539-4592-899f-204c9034bbe3 | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 9bb32350-5bac-460b-9a7e-80f9d13b47f4 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
[stack@undercloud10 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T21:07:39.000000 | -               |
| 4  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T21:07:38.000000 | -               |
| 5  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T21:07:32.000000 | -               |
| 6  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2019-06-12T21:07:36.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

4. Scale out by one additional compute node (ComputeCount: 2).

5. Inspect the new node and see the short name.

[root@overcloud-compute-1 ~]# hostname
overcloud-compute-1
[root@overcloud-compute-1 ~]# hostname -f
overcloud-compute-1.localdomain
[root@overcloud-compute-1 ~]# hiera stack_action
UPDATE
[root@overcloud-compute-1 ~]# grep ^host /etc/nova/nova.conf 
host=overcloud-compute-1
[root@overcloud-compute-1 ~]# grep ^host /etc/neutron/neutron.conf
host=overcloud-compute-1
[root@overcloud-compute-1 ~]# rpm -q puppet-tripleo
puppet-tripleo-5.6.8-23.el7ost.noarch

[stack@undercloud10 ~]$ neutron agent-list
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| id                                   | agent_type         | host                               | availability_zone | alive | admin_state_up | binary                    |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
| 06cbaefd-d6bf-4ea7-b14d-72efb5aaf441 | Open vSwitch agent | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-openvswitch-agent |
| 63c96701-b0e3-46a2-86e4-39f09df7f864 | Open vSwitch agent | overcloud-compute-1                |                   | :-)   | True           | neutron-openvswitch-agent |
| 6b451601-7bd4-4194-9316-7405fabcac24 | DHCP agent         | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-dhcp-agent        |
| 81ec5bf6-9bb4-4288-bfcb-96f6ef82f71a | L3 agent           | overcloud-controller-0.localdomain | nova              | :-)   | True           | neutron-l3-agent          |
| 9afbb6db-0539-4592-899f-204c9034bbe3 | Metadata agent     | overcloud-controller-0.localdomain |                   | :-)   | True           | neutron-metadata-agent    |
| 9bb32350-5bac-460b-9a7e-80f9d13b47f4 | Open vSwitch agent | overcloud-compute-0.localdomain    |                   | :-)   | True           | neutron-openvswitch-agent |
+--------------------------------------+--------------------+------------------------------------+-------------------+-------+----------------+---------------------------+
[stack@undercloud10 ~]$ nova service-list
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary           | Host                               | Zone     | Status  | State | Updated_at                 | Disabled Reason |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+
| 3  | nova-consoleauth | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T22:04:40.000000 | -               |
| 4  | nova-scheduler   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T22:04:40.000000 | -               |
| 5  | nova-conductor   | overcloud-controller-0.localdomain | internal | enabled | up    | 2019-06-12T22:04:43.000000 | -               |
| 6  | nova-compute     | overcloud-compute-0.localdomain    | nova     | enabled | up    | 2019-06-12T22:04:36.000000 | -               |
| 7  | nova-compute     | overcloud-compute-1                | nova     | enabled | up    | 2019-06-12T22:04:39.000000 | -               |
+----+------------------+------------------------------------+----------+---------+-------+----------------------------+-----------------+

Actual results:
New nodes are set up with short names for the neutron and nova [DEFAULT]/host values.


Expected results:
The FQDN should be used, just as for nodes set up in the initial deployment.
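
A quick way to spot affected nodes is to look for registered hosts that carry no domain part. The following is a minimal sketch, assuming the host names have already been extracted from the `nova service-list` or `neutron agent-list` output; the helper name `find_short_name_hosts` is hypothetical and not part of any OpenStack tool.

```python
def find_short_name_hosts(hosts):
    """Return the host names that contain no domain part (no dot)."""
    return [h for h in hosts if "." not in h]


# Host values taken from the service listing in this report.
hosts = [
    "overcloud-controller-0.localdomain",
    "overcloud-compute-0.localdomain",
    "overcloud-compute-1",
]
print(find_short_name_hosts(hosts))
```

Any host returned by this check was registered under its short name and is a candidate for the re-registration problem described above.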

Comment 1 Matthew Booth 2019-06-14 14:43:22 UTC
*** Bug 1719732 has been marked as a duplicate of this bug. ***

Comment 2 Matt Flusche 2019-06-20 17:19:10 UTC
The issue seems to be here in tripleo/lib/facter/current_config_hosts.rb:

def get_nova_live_value
  Tempfile.open('get-nova-host') do |nova_stdin|
    File.open(nova_stdin, 'w') do |nova_cmd|
      nova_cmd.puts("import nova.conf\nprint nova.conf.CONF.host")
    end
    Facter::Core::Execution.execute("nova-manage shell python 2>/dev/null < #{nova_stdin.path} | sed -e 's/^[> ]*//'")
  end
end

When manually running this code with an empty [DEFAULT]/host value:

[root@overcloud-compute-0 ~]# grep ^host /etc/nova/nova.conf 
(nil)

[root@overcloud-compute-0 ~]# nova-manage shell python
Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its value may be silently ignored in the future.
Python 2.7.5 (default, Mar 26 2019, 22:13:06) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import nova.conf
>>> print nova.conf.CONF.host
overcloud-compute-0
>>> 
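
The short name printed above is consistent with the `host` option falling back to the machine's hostname when nothing is set in nova.conf. Below is a minimal sketch of that fallback, assuming the default comes from `socket.gethostname()`; `effective_host` is a hypothetical helper for illustration only, not nova code.

```python
import socket


def effective_host(configured_host, gethostname=socket.gethostname):
    """Mimic a config option that falls back to the system hostname.

    Assumption: an unset [DEFAULT]/host resolves to the system hostname,
    which is the short name when the node's hostname has no domain part.
    """
    return configured_host or gethostname()


# With host unset on overcloud-compute-0, the live value is the short name.
print(effective_host(None, gethostname=lambda: "overcloud-compute-0"))
# With host set explicitly, the configured FQDN wins.
print(effective_host("overcloud-compute-0.localdomain"))
```

Under this assumption the fact never sees an empty value on a fresh node, so the check for an empty current_nova_host cannot tell "never configured" apart from "configured with the short name".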


It seems that the easy fix would be to change the logic in tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as host_real if [DEFAULT]/host is empty.


  if $step >= 4 or ($step >= 3 and $sync_db) {

    if hiera('stack_action', undef) == 'UPDATE' {
      if empty($::current_nova_host) {
        # We fail instead of blindly changing that value as it can
        # break the overcloud.
        fail("We couldn't get the live value of the nova agent, please contact support.")
      } else {
        $host_real = $::current_nova_host
      }
    }    else {
      $host_real = hiera('nova::host')
    }
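
The fallback suggested in this comment can be sketched as follows. This only illustrates the proposed logic in Python and is not the actual puppet-tripleo change; the function `select_host_real` and its parameters are hypothetical, with `configured_host` standing for the [DEFAULT]/host value found in the config file and `hiera_host` for hiera('nova::host').

```python
def select_host_real(stack_action, configured_host, live_host, hiera_host):
    """Pick the value to write to [DEFAULT]/host.

    On UPDATE, keep the live value only when host was explicitly
    configured; an empty setting (a freshly scaled-out node) falls
    back to the hiera-provided value, as on an initial deployment.
    """
    if stack_action == "UPDATE" and configured_host:
        return live_host
    return hiera_host


# Scaled-out node: stack_action is UPDATE but host was never configured.
print(select_host_real("UPDATE", "", "overcloud-compute-1",
                       "overcloud-compute-1.localdomain"))
# Existing node that already runs with a short name: the value is preserved.
print(select_host_real("UPDATE", "overcloud-compute-0", "overcloud-compute-0",
                       "overcloud-compute-0.localdomain"))
```

The sketch leaves out the existing fail() branch for an empty live value; it only shows where the extra condition on the configured value would sit.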

Comment 3 Ollie Walsh 2019-06-24 09:09:47 UTC
(In reply to Matt Flusche from comment #2)
> The issue seems to be here in tripleo/lib/facter/current_config_hosts.rb:
> 
> def get_nova_live_value
>   Tempfile.open('get-nova-host') do |nova_stdin|
>     File.open(nova_stdin, 'w') do |nova_cmd|
>       nova_cmd.puts("import nova.conf\nprint nova.conf.CONF.host")
>     end
>     Facter::Core::Execution.execute("nova-manage shell python 2>/dev/null <
> #{nova_stdin.path} | sed -e 's/^[> ]*//'")
>   end
> end
> 
> When manually running this code with an empty [DEFAULT]/host value:
> 
> [root@overcloud-compute-0 ~]# grep ^host /etc/nova/nova.conf 
> (nil)
> 
> [root@overcloud-compute-0 ~]# nova-manage shell python
> Option "rpc_backend" from group "DEFAULT" is deprecated for removal.  Its
> value may be silently ignored in the future.
> Python 2.7.5 (default, Mar 26 2019, 22:13:06) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> (InteractiveConsole)
> >>> import nova.conf
> >>> print nova.conf.CONF.host
> overcloud-compute-0
> >>> 
> 
> 
> It seems that the easy fix would be to change the logic in
> tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as
> host_real if [DEFAULT]/host is empty.
> 
> 
>   if $step >= 4 or ($step >= 3 and $sync_db) {
> 
>     if hiera('stack_action', undef) == 'UPDATE' {

That ^^^ is the issue I think. It will be false for an initial deployment but true for a scale out. Chem WDYT?

>       if empty($::current_nova_host) {
>         # We fail instead of blindly changing that value as it can
>         # break the overcloud.
>         fail("We couldn't get the live value of the nova agent, please
> contact support.")
>       } else {
>         $host_real = $::current_nova_host
>       }
>     }    else {
>       $host_real = hiera('nova::host')
>     }

Comment 4 Sofer Athlan-Guyot 2019-06-24 10:36:15 UTC
Hi Oliver and Matt

>>     if hiera('stack_action', undef) == 'UPDATE' {

> That ^^^ is the issue I think. It will be false for an initial deployment but true for a scale out. Chem WDYT?

So, the problem here was that there was no way to differentiate between update/upgrade and scale out, which can lead
to this type of issue.

But ... the initial problem (the scale-out node gets a short name) may be caused by a misconfiguration of the undercloud. Basically,
what happens is that the parameters needed for cloud-init to set an FQDN during scale out may have been overwritten.

So, Matt, could you check that knowledge base article[1]? Basically, this manifests in cloud-init.log as a short
name being set on the scale-out node.  That should be enough to avoid the issue on scale-out nodes. As your reproducer suggests, this
is the first issue.

Then the complete solution is to calculate all host parameters in heat instead of depending on the fqdn calculation on the host.

> It seems that the easy fix would be to change the logic in tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as host_real if [DEFAULT]/host is empty.

I don't want to go down the rabbit hole again and add another conditional here.  The original problem was the osp9/10 upgrade,
where that value could be empty because osp9 didn't set that parameter explicitly; thus we had to keep it that way, with whatever the
code snippet you highlighted was returning.

Why we cannot change the host parameter lightly is detailed in that Bugzilla comment[2]. Basically, changing this value on a compute
node or networker that is already in use requires a non-trivial manual procedure.

The good news is that the manual procedure is now complete[3]; it just needs a little more time to be verified.  Then all the patches attached
to https://bugzilla.redhat.com/show_bug.cgi?id=1657692 will be merged and that issue will disappear altogether.

One of the patches removes all calculations from the puppet code/fact because, as it turned out, that approach was fine for the osp9/10 upgrade but in the end
caused too much trouble to get right for all cases. Another set of patches checks whether we are susceptible to the issue during
update and, if so, fails with a pointer to the KB article.  Finally, the last patch sets the host parameter to what is calculated inside the
template, shielding us from whatever the host believes its hostname is.

Sorry for the lengthy reply, I hope I was clear enough.

[1] https://access.redhat.com/solutions/2089051
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c21
[3] https://access.redhat.com/solutions/4066521

Comment 5 Matt Flusche 2019-06-24 14:44:45 UTC
(In reply to Sofer Athlan-Guyot from comment #4)
> Hi Oliver and Matt

Thanks for looking at this Oliver & Sofer!

> 
> >>     if hiera('stack_action', undef) == 'UPDATE' {
> 
> > That ^^^ is the issue I think. It will be false for an initial deployment but true for a scale out. Chem WDYT?
> 
> So, the problem here was that there was no way to differentiate between
> update/upgrade and scale out.  Which can lead
> to this type of issue.
> 
Correct, the logic here has issues as it doesn't consider scale-out situations correctly.

> But ... the initial problem (scale out node get short name) may be caused by
> a misconfiguration of the undercloud, basically
> what happen is that the proper parameters to get cloud init set a fqdn
> during scale out may have been overwritten.
> 
> So, Matt, could you check that knowledge base article[1], basically this
> manifest in the cloud-init.log by setting a short
> name to the scale out node.  This should be enough to not have the issue
> with scale out node. As your reproducer suggest, this
> is the first issue.

Correct, if dhcp_domain is set on the undercloud (nova.conf) then this is not an issue.  However, as far as I know, this setting is not the default, and it is neither validated nor documented outside of this KCS article.  There are many deployments with the default configuration that run into this issue.

The bug I described here will occur with the default configuration.

> 
> Then the complete solution is to calculate all host parameters in heat
> instead of depending on the fqdn calculation on the host.
> 

The fqdn calculation is correct; the issue is that it is not used during scale-up, due to the logic of the puppet code (if hiera('stack_action', undef) == 'UPDATE') and how the current_nova_host/current_neutron_host facts are created when [DEFAULT]/host is empty in nova.conf or neutron.conf during the initial config, as I described in comment #2.
 
> > It seems that the easy fix would be to change the logic in tripleo/manifests/profile/base/nova.pp and also use "hiera('nova::host')" as host_real if [DEFAULT]/host is empty.
> 
> I don't want to got into the rabbit hole again and try to add another
> conditional here.  The original problem was osp9/10 upgrade
> where that value could be empty as osp9 didn't set that parameter
> explicitly, thus we had to keep it that way with whatever the 
> code snippet you highlighted was returning.
> 
> Why we cannot change the host parameter lightly is detailed in that
> bugzilla's comment[2]. Basically changing this value on already used compute
> node or networker requires a non trivial manual procedure.
> 
> Good news is that the manual procedure is now complete[3] just need a little
> more time to verify it.  Then all the patches attached
> to https://bugzilla.redhat.com/show_bug.cgi?id=1657692 will be merged and
> that issue will disappear altogether. 
> 
> One of the patches is to remove all calculations from the puppet code/fact
> because as it turned out it was fine for osp9/10 upgrade but it the end
> caused too much trouble to get it right for all cases. Another set of
> patches is to check if we are susceptible to get the issue during
> update and fail pointing to the kb article.  Eventually the last patch set
> the host parameter to what is calculated inside the 
> template, shielding us of whatever the host believe its hostname is.
> 
Very good; this becomes a big issue with the 10->13 FFU, as the host parameter is changed, causing the many issues associated with that.

> Sorry for the lengthy reply, I hope I was clear enough.
> 
> [1] https://access.redhat.com/solutions/2089051
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c21
> [3] https://access.redhat.com/solutions/4066521

Comment 10 Rajesh Tailor 2019-07-30 13:30:57 UTC
*** Bug 1559366 has been marked as a duplicate of this bug. ***

Comment 11 Sofer Athlan-Guyot 2019-09-18 15:55:41 UTC
Marking as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1657692, as both bugs are about scaled-out nodes that got [DEFAULT]/host set to the short name, and both are fixed by the same set of packages.  See https://bugzilla.redhat.com/show_bug.cgi?id=1657692#c25 for more about the patches.

*** This bug has been marked as a duplicate of bug 1657692 ***

