Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1902230

Summary: server addresses and interface_list inconsistent after mac change
Product: Red Hat OpenStack Reporter: Maciej Relewicz <mrelewicz>
Component: openstack-heat Assignee: Harald Jensås <hjensas>
Status: CLOSED ERRATA QA Contact: David Rosenfeld <drosenfe>
Severity: high Docs Contact:
Priority: high    
Version: 16.1 (Train) CC: bfournie, dasmith, eglynn, hbrock, hjensas, jhakimra, jschluet, jslagle, kchamart, mburns, ramishra, rurena, sbaker, sbauza, sgordon, smooney, tmurray, vkoul, vromanso
Target Milestone: Upstream M1 Keywords: Reopened, Triaged
Target Release: 16.1 (Train on RHEL 8.2)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-heat-13.1.0-1.20220227033356.48b730a.el8ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
: 2059341 (view as bug list) Environment:
Last Closed: 2022-12-07 20:24:45 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2059341    
Attachments:
Description Flags
Python reproducer script
none
Heat reproducer template none

Description Maciej Relewicz 2020-11-27 12:53:29 UTC
Description of problem:
During overcloud deployment, the configuration for os-net-config is generated incorrectly. The problem usually occurs on one node in a role, while the remaining nodes in the role are unaffected. In the example below, two nodes, overcloudlek-cadb-0 and overcloudlek-cadb-1, share the same role (ContrailAnalyticsDatabase) and the same network configuration in YAML, yet their rendered os-net-config differs. The configuration on overcloudlek-cadb-0 is broken, and it was delivered broken by the undercloud.

on nodes:

[root@overcloudlek-cadb-0 ~]# cat /etc/os-net-config/config.json
{"network_config": [{"addresses": [{"ip_netmask": "192.168.213.62/None"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": ""}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "172.16.0.150/24"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "172.16.81.186/24"}], "mtu": 1500, "name": "nic3", "type": "interface"}]}

[root@overcloudlek-cadb-1 ~]# cat /etc/os-net-config/config.json
{"network_config": [{"addresses": [{"ip_netmask": "192.168.213.184/24"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": "192.168.213.1"}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "172.16.0.132/24"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "172.16.81.156/24"}], "mtu": 1500, "name": "nic3", "type": "interface"}]}
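As an editorial illustration (not from the original report), a minimal Python sketch of how the broken "/None" netmask on overcloudlek-cadb-0 can arise: when the subnet CIDR attribute resolves to None, plain string concatenation coerces it to the literal string "None". The helper name below is hypothetical.

```python
# Minimal sketch (illustrative, not Heat's actual code): when the CIDR
# attribute resolves to None, string concatenation yields the literal
# "/None" seen in config.json on the broken node.
def build_ip_netmask(ip, cidr):
    # Mirrors a list_join with '/' of [ip, cidr]: None is coerced to "None".
    return "{}/{}".format(ip, cidr)

print(build_ip_netmask("192.168.213.62", None))  # broken node: 192.168.213.62/None
print(build_ip_netmask("192.168.213.184", 24))   # healthy node: 192.168.213.184/24
```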

on undercloud:

(undercloud) [stack@undercloud ansible]$ diff /var/lib/mistral/overcloud/ContrailAnalyticsDatabase/overcloudlek-cadb-0/NetworkConfig /var/lib/mistral/overcloud/ContrailAnalyticsDatabase/overcloudlek-cadb-1/NetworkConfig
11c11
< # {"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/None"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": ""}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ internal_api_ip ~ '/' ~ internal_api_cidr }}"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ tenant_ip ~ '/' ~ tenant_cidr }}"}], "mtu": 1500, "name": "nic3", "type": "interface"}]} : the json serialized os-net-config config to apply
---
> # {"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/24"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": "192.168.213.1"}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ internal_api_ip ~ '/' ~ internal_api_cidr }}"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ tenant_ip ~ '/' ~ tenant_cidr }}"}], "mtu": 1500, "name": "nic3", "type": "interface"}]} : the json serialized os-net-config config to apply
67c67
< if [ -n '{"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/None"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": ""}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ internal_api_ip ~ '/' ~ internal_api_cidr }}"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ tenant_ip ~ '/' ~ tenant_cidr }}"}], "mtu": 1500, "name": "nic3", "type": "interface"}]}' ]; then
---
> if [ -n '{"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/24"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": "192.168.213.1"}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ internal_api_ip ~ '/' ~ internal_api_cidr }}"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ tenant_ip ~ '/' ~ tenant_cidr }}"}], "mtu": 1500, "name": "nic3", "type": "interface"}]}' ]; then
80c80
<     echo '{"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/None"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": ""}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ internal_api_ip ~ '/' ~ internal_api_cidr }}"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ tenant_ip ~ '/' ~ tenant_cidr }}"}], "mtu": 1500, "name": "nic3", "type": "interface"}]}' > /etc/os-net-config/config.json
---
>     echo '{"network_config": [{"addresses": [{"ip_netmask": "{{ ctlplane_ip }}/24"}], "dns_servers": ["10.10.11.50", "10.10.12.50", "10.10.11.60", "10.10.12.60"], "mtu": 1500, "name": "nic1", "routes": [{"default": true, "next_hop": "192.168.213.1"}], "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ internal_api_ip ~ '/' ~ internal_api_cidr }}"}], "device": "nic1", "mtu": 1500, "type": "vlan", "vlan_id": 226}, {"name": "nic2", "type": "interface", "use_dhcp": false}, {"addresses": [{"ip_netmask": "{{ tenant_ip ~ '/' ~ tenant_cidr }}"}], "mtu": 1500, "name": "nic3", "type": "interface"}]}' > /etc/os-net-config/config.json


Version-Release number of selected component (if applicable):

rhosp16.1

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:

os-net-config generated correctly

Additional info:

Comment 2 Steve Baker 2020-12-01 21:17:38 UTC
Could you please supply the following:
- the overcloud deploy command used
- the environment files used for deployment
- the generated config-download ansible playbook directory

Later on we may also need an sosreport.

Are you maybe using IP-from-pool templates?

Comment 4 Maciej Relewicz 2020-12-03 13:36:14 UTC
Hi,

We are not using ip-from-pool. The lab was redeployed, so I can't send you the config-download directory. The situation has occurred several times in various labs of ours; usually redeployment solved the problem. The lab configuration wasn't changed, so I can send templates. Do you need any specific templates? Deployment command:

```
openstack overcloud deploy --timeout 240 --stack overcloud \
  --libvirt-type kvm --templates /home/stack/tripleo-heat-templates \
  -r /home/stack/tripleo-heat-templates/environments/contrail/roles_data.yaml \
  -n /home/stack/tripleo-heat-templates/environments/contrail/network_data.yaml \
  -e /home/stack/tripleo-heat-templates/environments/overcloud_containers.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/contrail-docker-registry.yaml \
  -e /home/stack/tripleo-heat-templates/environments/docker-ha.yaml \
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/contrail-plugins.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/contrail-services.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/contrail-net.yaml \
  -e /home/stack/tripleo-heat-templates/environments/enable-tls.yaml \
  -e /home/stack/tripleo-heat-templates/environments/inject-trust-anchor-hiera.yaml \
  -e /home/stack/tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail-tls.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/disable-telemetry.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/deployment-artifacts.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/storage-environment.yaml \
  -e /home/stack/tripleo-heat-templates/environments/contrail/environment-extra.yaml \
```

Maciej

Comment 6 Alex Schultz 2020-12-14 16:46:31 UTC
Please provide a copy of the templates being used.

Comment 7 Maciej Relewicz 2020-12-22 11:50:37 UTC
Below is the template that was used to create the network config.

```
heat_template_version: queens

description: >
  Software Config to drive os-net-config to configure multiple interfaces.

parameters:
  ControlPlaneIp:
    default: '192.168.213.0/24'
    description: IP address/subnet on the ctlplane network
    type: string
  ControlPlaneSubnetCidr: # Override this via parameter_defaults
    default: '24'
    description: The subnet CIDR of the control plane network.
    type: string
  ControlPlaneDefaultRoute: # Override this via parameter_defaults
    default: '192.168.213.1'
    description: The default route of the control plane network.
    type: string
  ControlPlaneMtu:
    default: '1500'
    description: MTU of the Control Plane Network
    type: number
  ControlPlaneNetworkMtu:
    default: '1500'
    description: MTU of the Control Plane Network (rhosp13 comp)
    type: number
  ControlPlaneStaticRoutes:
    default: []
    description: >
      Routes for the ctlplane network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  InternalApiIpSubnet:
    default: '172.16.0.0/24'
    description: IP address/subnet on the InternalApi network
    type: string
  InternalApiNetworkVlanID:
    default: 226
    description: Vlan ID for the InternalApi network traffic.
    type: number
  InternalApiInterfaceDefaultRoute: # Not used by default in this template
    default: '172.16.0.1'
    description: The default route of the InternalApi  network.
    type: string
  InternalApiMtu:
    default: '1500'
    description: MTU of the InternalApi Network
    type: number
  InternalApiNetworkMtu:
    default: '1500'
    description: MTU of the InternalApi Network
    type: number
  InternalApiSupernet:
    default: ''
    description: Supernet on the InternalApi network
    type: string
  InternalApiInterfaceRoutes:
    default: []
    description: >
      Routes for the internal_api network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  ManagementIpSubnet:
    default: '192.168.1.0/24'
    description: IP address/subnet on the Management network
    type: string
  ManagementNetworkVlanID:
    default: 225
    description: Vlan ID for the Management network traffic.
    type: number
  ManagementInterfaceDefaultRoute: # Not used by default in this template
    default: '192.168.1.1'
    description: The default route of the Management  network.
    type: string
  ManagementMtu:
    default: '1500'
    description: MTU of the Management Network
    type: number
  ManagementNetworkMtu:
    default: '1500'
    description: MTU of the Management Network
    type: number
  ManagementSupernet:
    default: ''
    description: Supernet on the Management network
    type: string
  ManagementInterfaceRoutes:
    default: []
    description: >
      Routes for the internal_api network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  StorageIpSubnet:
    default: '172.16.1.0/24'
    description: IP address/subnet on the Storage network
    type: string
  StorageNetworkVlanID:
    default: 227
    description: Vlan ID for the Storage network traffic.
    type: number
  StorageInterfaceDefaultRoute: # Not used by default in this template
    default: '172.16.1.1'
    description: The default route of the Storage  network.
    type: string
  StorageMtu:
    default: '1500'
    description: MTU of the Storage Network
    type: number
  StorageNetworkMtu:
    default: '1500'
    description: MTU of the Storage Network
    type: number
  StorageSupernet:
    default: ''
    description: Supernet on the Storage network
    type: string
  StorageInterfaceRoutes:
    default: []
    description: >
      Routes for the internal_api network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  StorageMgmtIpSubnet:
    default: '172.16.3.0/24'
    description: IP address/subnet on the StorageMgmt network
    type: string
  StorageMgmtNetworkVlanID:
    default: 224
    description: Vlan ID for the StorageMgmt network traffic.
    type: number
  StorageMgmtInterfaceDefaultRoute: # Not used by default in this template
    default: '172.16.3.1'
    description: The default route of the StorageMgmt  network.
    type: string
  StorageMgmtMtu:
    default: '1500'
    description: MTU of the StorageMgmt Network
    type: number
  StorageMgmtNetworkMtu:
    default: '1500'
    description: MTU of the StorageMgmt Network
    type: number
  StorageMgmtSupernet:
    default: ''
    description: Supernet on the StorageMgmt network
    type: string
  StorageMgmtInterfaceRoutes:
    default: []
    description: >
      Routes for the internal_api network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  TenantIpSubnet:
    default: '172.16.81.0/24'
    description: IP address/subnet on the Tenant network
    type: string
  TenantNetworkVlanID:
    default: 228
    description: Vlan ID for the Tenant network traffic.
    type: number
  TenantInterfaceDefaultRoute: # Not used by default in this template
    default: '172.16.81.1'
    description: The default route of the Tenant  network.
    type: string
  TenantMtu:
    default: '1500'
    description: MTU of the Tenant Network
    type: number
  TenantNetworkMtu:
    default: '1500'
    description: MTU of the Tenant Network
    type: number
  TenantSupernet:
    default: ''
    description: Supernet on the Tenant network
    type: string
  TenantInterfaceRoutes:
    default: []
    description: >
      Routes for the internal_api network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  ExternalIpSubnet:
    default: '10.87.4.61/25'
    description: IP address/subnet on the External network
    type: string
  ExternalNetworkVlanID:
    default: 1008
    description: Vlan ID for the External network traffic.
    type: number
  ExternalInterfaceDefaultRoute: # Not used by default in this template
    default: '10.87.4.126'
    description: The default route of the External  network.
    type: string
  ExternalMtu:
    default: '1500'
    description: MTU of the External Network
    type: number
  ExternalNetworkMtu:
    default: '1500'
    description: MTU of the External Network
    type: number
  ExternalSupernet:
    default: ''
    description: Supernet on the External network
    type: string
  ExternalInterfaceRoutes:
    default: []
    description: >
      Routes for the internal_api network traffic.
      JSON route e.g. [{'destination':'10.0.0.0/16', 'nexthop':'10.0.0.1'}]
      Unless the default is changed, the parameter is automatically resolved
      from the subnet host_routes attribute.
    type: json
  DnsServers: # Override this via parameter_defaults
    default: ["172.29.131.50","172.29.143.50","172.29.143.60","172.29.139.60"]
    description: A list of DNS servers (2 max for some implementations) that will be added to resolv.conf.
    type: comma_delimited_list
  EC2MetadataIp: # Override this via parameter_defaults
    default: '192.168.213.1'
    description: The IP address of the EC2 metadata server.
    type: string

resources:
  OsNetConfigImpl:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script
      inputs:
        - name: disable_configure_safe_defaults
          default: true
      config:
        str_replace:
          template:
            get_file: ../../network/scripts/run-os-net-config.sh
          params:
            $network_config:
              network_config:
                - addresses:
                  - ip_netmask:
                      list_join:
                      - /
                      - - get_param: ControlPlaneIp
                        - get_param: ControlPlaneSubnetCidr
                  dns_servers:
                    get_param: DnsServers
                  mtu:
                    get_param: ControlPlaneMtu
                  name: nic1
                  routes:
                  - default: true
                    next_hop:
                      get_param: ControlPlaneDefaultRoute
                  type: interface
                  use_dhcp: false
                - addresses:
                  - ip_netmask:
                      get_param: InternalApiIpSubnet
                  device: nic1
                  mtu:
                    get_param: InternalApiMtu
                  type: vlan
                  vlan_id:
                    get_param: InternalApiNetworkVlanID
                - name: nic2
                  type: interface
                  use_dhcp: false
                - addresses:
                  - ip_netmask:
                      get_param: TenantIpSubnet
                  mtu:
                    get_param: TenantMtu
                  name: nic3
                  type: interface


outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value:
      get_resource: OsNetConfigImpl
```

Comment 8 Alex Schultz 2021-01-04 21:03:19 UTC
This is likely an issue with the yaql query that tries to figure out the ctlplane CIDR, or ControlPlaneSubnetCidr is set to None for some reason.

https://github.com/openstack/tripleo-heat-templates/blob/stable/train/puppet/role.role.j2.yaml#L415-L421

I don't suppose you could provide a tar of the plan from swift that hit this?

Comment 9 Maciej Relewicz 2021-01-08 09:28:55 UTC
I don't have the plan. It appears from time to time, but I think it's not environment-specific.

Comment 10 Alex Schultz 2021-01-08 14:09:48 UTC
Ok, the next time it happens, reopen this bug and please capture the plan. If you could provide all the templates used, that would help. I'm going to close this for now because we've not seen this in any of our testing.

Comment 11 Maciej Relewicz 2021-01-20 11:39:51 UTC
We hit the problem again. Different lab, different role (OpenStack controller), but the same situation: one node from the role got a missing netmask. Reported officially here: Case #02848223, files attached to the case.

Comment 13 Alex Schultz 2021-01-21 21:52:34 UTC
ControlPlaneSubnetCidr does not appear to be specified anywhere in the templates as a parameter, so it inherits the default; however, we have '' and '24' used as defaults in different places. My assumption is that the value used depends on which file gets loaded first: '' would result in None, while '24' would be correct. It should be noted that we provide '' as the default in all of our files, while the environments/contrail/* files have '24' as the defaults. The variable is defined in environments/contrail/contrail-net-single.yaml, but I don't believe that file is used. That said, we've actually dropped usage of this parameter going forward, so I'm not certain whether there's a different way to handle this. Perhaps Harald has some additional views on this.

Comment 14 Rabi Mishra 2021-01-22 04:43:07 UTC
So, if ControlPlaneSubnetCidr/ControlPlaneDefaultRoute have not been passed in parameter_defaults, they are fetched from the server attributes.


#cat contrailanalyticsdatabase-role.yaml
.....
  NetworkConfig:
    type: OS::TripleO::ContrailAnalyticsDatabase::Net::SoftwareConfig
    properties:
      ControlPlaneIp: "{{ ctlplane_ip }}"
      ControlPlaneSubnetCidr:
        if:
          - ctlplane_subnet_cidr_set
          - {get_param: ControlPlaneSubnetCidr}
          - yaql:
              expression: str("{0}".format($.data).split("/")[-1])
              data: {get_attr: [ContrailAnalyticsDatabase, addresses, ctlplane, 0, subnets, 0, cidr]} << here

      ControlPlaneDefaultRoute:
        if:
          - ctlplane_default_route_set
          - {get_param: ControlPlaneDefaultRoute}
          - {get_attr: [ContrailAnalyticsDatabase, addresses, ctlplane, 0, subnets, 0, gateway_ip]}  << here
.....

What you specify in the nic config templates (ex. contrail-nic-config-Contrail.yaml used for this role) as the default value for ControlPlaneSubnetCidr/ControlPlaneDefaultRoute parameters is irrelevant.

If the attributes come back as None after the server has been created (or at least nova tells us that), then you would see the inconsistencies you noticed. I still don't know why that should be the case; maybe something to do with the Contrail neutron plugin?

The easier way to work around this is to set ControlPlaneSubnetCidr and ControlPlaneDefaultRoute in contrail-net.yaml (as is done in contrail-single-net.yaml).
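As an editorial illustration of the fallback path quoted above, a Python analog of the yaql expression shows why a None cidr attribute becomes the string "None" rather than raising an error (the helper name is hypothetical):

```python
# Python analog of the yaql in the role template:
#   str("{0}".format($.data).split("/")[-1])
def cidr_prefix(cidr_attr):
    # format() turns None into the string "None"; split("/") then
    # returns that string unchanged as the last element.
    return "{0}".format(cidr_attr).split("/")[-1]

print(cidr_prefix("192.168.213.0/24"))  # "24"   - attribute resolved normally
print(cidr_prefix(None))                # "None" - attribute lookup failed
```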

Comment 15 Maciej Relewicz 2021-01-22 13:06:39 UTC
Ok, we can implement this WA. But please note that these templates work correctly with rhosp13; something changed that caused the problem.

Comment 16 Steve Baker 2021-01-26 21:20:23 UTC
This change is required due to the simplification work that has been done in the spine and leaf parameters.

Comment 17 Harald Jensås 2021-01-26 23:38:49 UTC
I had another look at this, I want to investigate this a bit further.

There seems to be a race, as the following lookups return None for one server in the role and the proper value for the other servers in the role:
  {get_attr: [ContrailAnalyticsDatabase, addresses, ctlplane, 0, subnets, 0, cidr]}
  {get_attr: [ContrailAnalyticsDatabase, addresses, ctlplane, 0, subnets, 0, gateway_ip]}

In the cidr case, the yaql returns "None" as a string.
In the gateway_ip case, the result of None is "" (an empty string in the template).

We can actually see that Heat fails to fetch the resource attributes here:
sosreport-undercloud-2021-01-19-rowsoys/var/log/containers/heat/heat-engine.log.1:2021-01-19 14:37:06.816 23 WARNING heat.engine.resources.openstack.nova.server [req-a2ff4a7d-03f8-4a13-8a6b-a41a667e5081 - admin - default default]  Failed to fetch resource attributes: Port None could not be found.

This logging is from here in the code: https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/openstack/nova/server.py#L1156

So it seems we are passing "None" to the neutron show_port call on L1152.


AFAICT, from the Nova logs, the "server.interface_list()" call at L1127 succeeded.

[root@hjensas nova]# egrep -R "2021-01-19 14:3" | grep os-interface
nova-api.log.1:2021-01-19 14:30:25.795 22 INFO nova.api.openstack.requestlog [req-5d93c974-110d-449d-a08e-baedcd197fb8 d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/8e0b3798-5ff7-42a4-8fdc-6329bca39080/os-interface" status: 200 len: 300 microversion: 2.79 time: 0.242941

nova-api.log.1:2021-01-19 14:37:03.717 21 INFO nova.api.openstack.requestlog [req-c6050784-e617-4135-bbd8-0bfab0a9a454 d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/87c7fd6e-1189-4995-a359-5204031d0b56/os-interface" status: 200 len: 302 microversion: 2.79 time: 0.160901
nova-api.log.1:2021-01-19 14:37:04.187 21 INFO nova.api.openstack.requestlog [req-33f92513-de1c-4549-b432-e4805048d084 d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/87c7fd6e-1189-4995-a359-5204031d0b56/os-interface" status: 200 len: 302 microversion: 2.79 time: 0.182955
nova-api.log.1:2021-01-19 14:37:05.776 27 INFO nova.api.openstack.requestlog [req-61dcfe5b-f591-458f-b0f9-8b681e51bd8f d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/301a47f5-2619-4f36-8c28-d454546a5ca1/os-interface" status: 200 len: 301 microversion: 2.79 time: 0.192470
nova-api.log.1:2021-01-19 14:37:06.192 24 INFO nova.api.openstack.requestlog [req-62922e60-4e4c-407b-889a-32a81e129d45 d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/1aac7f3b-e9e2-4d0b-8c4d-c0054f1a2c84/os-interface" status: 200 len: 301 microversion: 2.79 time: 0.189072
nova-api.log.1:2021-01-19 14:37:06.752 21 INFO nova.api.openstack.requestlog [req-c74bc9ef-f741-46e4-9760-26916bd35304 d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/1aac7f3b-e9e2-4d0b-8c4d-c0054f1a2c84/os-interface" status: 200 len: 301 microversion: 2.79 time: 0.146652
nova-api.log.1:2021-01-19 14:37:06.753 26 INFO nova.api.openstack.requestlog [req-b3532a9c-4dbf-4c95-bd26-0205b0d7e086 d9720f950cfb46ae83ccd19991d11659 61c79c312700420b999a25bd0dceee62 - default default] 192.168.213.1 "GET /v2.1/servers/301a47f5-2619-4f36-8c28-d454546a5ca1/os-interface" status: 200 len: 301 microversion: 2.79 time: 0.169607



I think we would have to add some debug logging in Heat to try to figure this out.


@Maciej, since this seems to be an intermittent problem that we are not able to reproduce internally, could you please add the debug logging shown below to the Heat engine running on your undercloud and try to reproduce? (Add each line starting with LOG.debug('BZ1902230 ...)

The steps to edit the file are:
(undercloud) [centos@undercloud ~]$ sudo su -
[root@undercloud ~]# podman mount heat_engine
/var/lib/containers/storage/overlay/9a5f2ccd8d7d79eb6ec688fca72bbe837d58d0b89775e6af5f6e8e8fc797228d/merged
[root@undercloud ~]# vim /var/lib/containers/storage/overlay/9a5f2ccd8d7d79eb6ec688fca72bbe837d58d0b89775e6af5f6e8e8fc797228d/merged/usr/lib/python3.6/site-packages/heat/engine/resources/openstack/nova/server.py                                                                                                         

  *** make the changes and save the file ***

[root@undercloud ~]# podman umount heat_engine
[root@undercloud ~]# systemctl restart tripleo_heat_engine

  NOTE: The long random string is unique to your deployment, so you can't simply copy and paste the commands above.


    def _add_attrs_for_address(self, server, extend_networks=True):
        """Adds port id, subnets and network attributes to addresses list.
        This method is used only for resolving attributes.
        :param server: The server resource
        :param extend_networks: When False the network is not extended, i.e
                                the net is returned without replacing name on
                                id.
        """
        LOG.debug('BZ1902230 - type server: %s', type(server))
        nets = copy.deepcopy(server.addresses) or {}
        LOG.debug('BZ1902230 - nets: %s', nets)
        ifaces = server.interface_list()
        LOG.debug('BZ1902230 - ifaces: %s', ifaces)
        ip_mac_mapping_on_port_id = dict(((iface.fixed_ips[0]['ip_address'],
                                           iface.mac_addr), iface.port_id)
                                         for iface in ifaces)
        LOG.debug('BZ1902230 - ip_mac_mapping_on_port_id: %s', ip_mac_mapping_on_port_id)
        for net_name in nets:
            for addr in nets[net_name]:
                addr['port'] = ip_mac_mapping_on_port_id.get(
                    (addr['addr'], addr['OS-EXT-IPS-MAC:mac_addr']))
                # _get_live_networks() uses this method to get reality_nets.
                # We don't need to get subnets and network in that case. Only
                # do the external calls if extend_networks is true, i.e called
                # from _resolve_attribute()
                if not extend_networks:
                    continue
                try:
                    port = self.client('neutron').show_port(
                        addr['port'])['port']
                except Exception as ex:
                    addr['subnets'], addr['network'] = None, None
                    LOG.warning("Failed to fetch resource attributes: %s", ex)
                    continue
                addr['subnets'] = self._get_subnets_attr(port['fixed_ips'])
                addr['network'] = self._get_network_attr(port['network_id'])

        if extend_networks:
            return self._extend_networks(nets)
        else:
            return nets

Comment 21 Maciej Relewicz 2021-02-01 07:07:36 UTC
Hi,

The environment was redeployed with the WA. Currently the problem doesn't exist. When we hit it again I will get back to you.

Comment 22 Harald Jensås 2021-02-03 18:57:41 UTC
(In reply to Maciej Relewicz from comment #21)
> Hi,
> 
> Environment was redeployed with WA. Currently problem doesnt exist. When we
> hit it agan I wll back to you.

Ok, I will do some internal testing to see if I can re-produce the issue.

Comment 23 Harald Jensås 2021-02-08 10:04:16 UTC
I was not able to reproduce this issue using a Heat stack; however, I succeeded in creating a reproducer for this issue using Python and just a snippet of Heat code. There is a race in Nova+Ironic+Neutron, as it only happens for a few instances when creating ~10 instances at the same time, with less than 20% occurrence on my test system. (I will attach the Heat-based reproducer as well as the Python reproducer; in theory the Heat template reproducer should trigger it as well, but it didn't on my test environment.)

The problem is that there is a MAC address mismatch between what is returned by the nova "server.addresses" call at L1136 [1] and the server.interface_list() call at L1137 [2].

Heat builds a dict [3], "ip_mac_mapping_on_port_id", keyed by tuples built from the result of the server.interface_list() call; the dict keys are (ip_address, mac_address). The dict is used to resolve the neutron port id. Then the IP address and MAC address from the "server.addresses" call at L1136 [1] are used to do a lookup in the "ip_mac_mapping_on_port_id" dict at L1143-L1144 [4]. This lookup fails because the MAC address (OS-EXT-IPS-MAC:mac_addr) from "server.addresses" does not always match the MAC address on the neutron port.


This issue is likely to happen only when Ironic is used, since Ironic updates the neutron port's MAC address to match the physical network interface.

 I.e.:
  1. Nova creates neutron port
  2. Neutron creates a port and auto-generates a MAC address
  3. Port info is passed from Nova to Ironic
  4. Ironic changes the MAC address of the neutron port to match physical hardware
  5. In some cases, Nova still returns the "original" auto-generated MAC address in "server.addresses"
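One possible mitigation, shown here purely as an illustration (not necessarily the fix that shipped in openstack-heat), is to fall back to matching on the IP address alone when the (ip, mac) key misses, since the IP stays stable even after Ironic rewrites the port's MAC:

```python
def resolve_port_with_fallback(ip_mac_mapping_on_port_id, ip_addr, mac_addr):
    """Resolve a neutron port id, tolerating a stale MAC from server.addresses.

    Illustration only: first try the exact (ip, mac) key, then fall back
    to matching on the IP address alone.
    """
    port_id = ip_mac_mapping_on_port_id.get((ip_addr, mac_addr))
    if port_id is not None:
        return port_id
    # Fallback: the IP is unchanged by Ironic's MAC update, so match on it.
    for (ip, _mac), pid in ip_mac_mapping_on_port_id.items():
        if ip == ip_addr:
            return pid
    return None

mapping = {('192.168.24.11', 'fa:16:3e:e3:83:ba'):
           'edd51ad8-bf58-456c-b134-773e41dcb64f'}
# A stale MAC from server.addresses still resolves via the IP fallback:
print(resolve_port_with_fallback(mapping, '192.168.24.11',
                                 'fa:16:3e:75:06:86'))
# → edd51ad8-bf58-456c-b134-773e41dcb64f
```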



Output of the reproducer
------------------------

(undercloud) [centos@undercloud reproducer]$ python3 reproducer.py
2021-02-08 09:26:29.246874 :: Started building Server: test-server-0
2021-02-08 09:26:30.305732 :: Started building Server: test-server-4
2021-02-08 09:26:30.346258 :: Started building Server: test-server-3
2021-02-08 09:26:29.214739 :: Started building Server: test-server-1
2021-02-08 09:26:32.525734 :: Started building Server: test-server-2
2021-02-08 09:26:31.730258 :: Started building Server: test-server-5
2021-02-08 09:26:34.174303 :: Started building Server: test-server-6
2021-02-08 09:26:35.317740 :: Started building Server: test-server-8
2021-02-08 09:26:37.816171 :: Started building Server: test-server-7
2021-02-08 09:26:38.352557 :: Started building Server: test-server-9

2021-02-08 09:33:36.471098 :: Server: test-server-1 :: >>>> OK <<<<
    addr['port'] == 43abfccc-3908-4aee-9d16-caf7216715c7
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.29', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:81:f9:dc', 'port': '43abfccc-3908-4aee-9d16-caf7216715c7'}]}
    ifaces == [<NetworkInterface: 43abfccc-3908-4aee-9d16-caf7216715c7>]
    ip_mac_mapping_on_port_id == {('192.168.24.29', 'fa:16:3e:81:f9:dc'): '43abfccc-3908-4aee-9d16-caf7216715c7'}

2021-02-08 09:33:56.018447 :: Server: test-server-5 :: >>>> REPRODUCED <<<<
!!!! addr['port'] == None
     nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.11', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:75:06:86', 'port': None}]}
     ifaces == [<NetworkInterface: edd51ad8-bf58-456c-b134-773e41dcb64f>]
     ip_mac_mapping_on_port_id == {('192.168.24.11', 'fa:16:3e:e3:83:ba'): 'edd51ad8-bf58-456c-b134-773e41dcb64f'}

2021-02-08 09:34:08.696165 :: Server: test-server-4 :: >>>> REPRODUCED <<<<
!!!! addr['port'] == None
     nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.13', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:3d:88:72', 'port': None}]}
     ifaces == [<NetworkInterface: 24dcbefb-6d73-4fb9-b553-12f4fd44e98e>]
     ip_mac_mapping_on_port_id == {('192.168.24.13', 'fa:16:3e:d9:8c:f0'): '24dcbefb-6d73-4fb9-b553-12f4fd44e98e'}

2021-02-08 09:34:10.301490 :: Server: test-server-3 :: >>>> OK <<<<
    addr['port'] == 13ea58c4-d149-4ea0-b1a4-36e96b378002
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.14', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:0f:25:91', 'port': '13ea58c4-d149-4ea0-b1a4-36e96b378002'}]}
    ifaces == [<NetworkInterface: 13ea58c4-d149-4ea0-b1a4-36e96b378002>]
    ip_mac_mapping_on_port_id == {('192.168.24.14', 'fa:16:3e:0f:25:91'): '13ea58c4-d149-4ea0-b1a4-36e96b378002'}

2021-02-08 09:34:34.959875 :: Server: test-server-9 :: >>>> OK <<<<
    addr['port'] == 9c2703be-b24a-4359-b690-62367427c665
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.24', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:73:eb:0d', 'port': '9c2703be-b24a-4359-b690-62367427c665'}]}
    ifaces == [<NetworkInterface: 9c2703be-b24a-4359-b690-62367427c665>]
    ip_mac_mapping_on_port_id == {('192.168.24.24', 'fa:16:3e:73:eb:0d'): '9c2703be-b24a-4359-b690-62367427c665'}

2021-02-08 09:34:36.400174 :: Server: test-server-0 :: >>>> OK <<<<
    addr['port'] == a42e342e-9b12-48ed-8917-b0c6e5931334
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.21', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:71:9e:25', 'port': 'a42e342e-9b12-48ed-8917-b0c6e5931334'}]}
    ifaces == [<NetworkInterface: a42e342e-9b12-48ed-8917-b0c6e5931334>]
    ip_mac_mapping_on_port_id == {('192.168.24.21', 'fa:16:3e:71:9e:25'): 'a42e342e-9b12-48ed-8917-b0c6e5931334'}

2021-02-08 09:34:39.679962 :: Server: test-server-2 :: >>>> OK <<<<
    addr['port'] == 55812703-9d06-4e34-8e2b-ed026c720217
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.20', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:c9:b4:f9', 'port': '55812703-9d06-4e34-8e2b-ed026c720217'}]}
    ifaces == [<NetworkInterface: 55812703-9d06-4e34-8e2b-ed026c720217>]
    ip_mac_mapping_on_port_id == {('192.168.24.20', 'fa:16:3e:c9:b4:f9'): '55812703-9d06-4e34-8e2b-ed026c720217'}

2021-02-08 09:34:46.754812 :: Server: test-server-6 :: >>>> OK <<<<
    addr['port'] == ed1281e8-3af6-4661-abd9-819e602eceb4
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.17', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:d7:39:e8', 'port': 'ed1281e8-3af6-4661-abd9-819e602eceb4'}]}
    ifaces == [<NetworkInterface: ed1281e8-3af6-4661-abd9-819e602eceb4>]
    ip_mac_mapping_on_port_id == {('192.168.24.17', 'fa:16:3e:d7:39:e8'): 'ed1281e8-3af6-4661-abd9-819e602eceb4'}

2021-02-08 09:34:47.008518 :: Server: test-server-8 :: >>>> OK <<<<
    addr['port'] == 79528a4b-1721-4725-a480-a04c41afc48e
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.15', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:2a:ed:ff', 'port': '79528a4b-1721-4725-a480-a04c41afc48e'}]}
    ifaces == [<NetworkInterface: 79528a4b-1721-4725-a480-a04c41afc48e>]
    ip_mac_mapping_on_port_id == {('192.168.24.15', 'fa:16:3e:2a:ed:ff'): '79528a4b-1721-4725-a480-a04c41afc48e'}

2021-02-08 09:34:47.108197 :: Server: test-server-7 :: >>>> OK <<<<
    addr['port'] == 99efe0ee-4321-40c9-953f-f898b67ea5bc
    nets == {'ctlplane': [{'version': 4, 'addr': '192.168.24.30', 'OS-EXT-IPS:type': 'fixed', 'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:1c:50:42', 'port': '99efe0ee-4321-40c9-953f-f898b67ea5bc'}]}
    ifaces == [<NetworkInterface: 99efe0ee-4321-40c9-953f-f898b67ea5bc>]
    ip_mac_mapping_on_port_id == {('192.168.24.30', 'fa:16:3e:1c:50:42'): '99efe0ee-4321-40c9-953f-f898b67ea5bc'}
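The OK/REPRODUCED verdict above reduces to checking whether every MAC reported by server.addresses also appears in server.interface_list(). A reconstruction of that check (not a verbatim copy of the attached script):

```python
def classify(nets, iface_macs):
    """Return 'OK' or 'REPRODUCED' from already-fetched server data.

    nets: the server.addresses dict (network name -> list of address dicts);
    iface_macs: set of MAC addresses gathered from server.interface_list().
    """
    for addrs in nets.values():
        for addr in addrs:
            if addr['OS-EXT-IPS-MAC:mac_addr'] not in iface_macs:
                return 'REPRODUCED'
    return 'OK'

# Data from test-server-5 above: stale MAC in server.addresses.
nets = {'ctlplane': [{'addr': '192.168.24.11',
                      'OS-EXT-IPS-MAC:mac_addr': 'fa:16:3e:75:06:86'}]}
print(classify(nets, {'fa:16:3e:e3:83:ba'}))  # → REPRODUCED
```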



[1] https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/openstack/nova/server.py#L1136
[2] https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/openstack/nova/server.py#L1137
[3] https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/openstack/nova/server.py#L1138-L1140
[4] https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/openstack/nova/server.py#L1143-L1144

Comment 24 Harald Jensås 2021-02-08 10:05:20 UTC
Created attachment 1755673 [details]
Python reproducer script

Comment 25 Harald Jensås 2021-02-08 10:08:32 UTC
Created attachment 1755675 [details]
Heat reproducer template

NOTE: I wasn't able to reproduce with this, but it should trigger the issue; it seems that in my lab the timings never add up to hit the race.

Comment 49 errata-xmlrpc 2022-12-07 20:24:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795