Description of problem: Today an overcloud deployment timed out after 2 hours because of a broken network template for the compute role (it contained a port for the storage management network). It would be nice to have a sanity check that validates the heat environment before deployment, so you don't have to wait 2 hours to learn about a problem that shows up in the early stages of the deployment.
Please provide the exact steps and failure to see if we can add checks.
Deploy the overcloud by passing -e ~/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml. /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml is the default one; network-environment.yaml includes this compute.yaml nic template [1]. The template contains a vlan interface with an IP address from StorageMgmtIpSubnet, but in network-isolation.yaml the compute role doesn't have a port for StorageMgmt. As a result, this error [2] shows up on the deployed compute nodes.
[1] http://pastebin.test.redhat.com/295175
[2] http://pastebin.test.redhat.com/295181
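The mismatch described above could be caught before deployment with something like the following. This is only a rough sketch, not an existing validation: it assumes the nic template and the resource_registry section have already been parsed into plain Python dicts, that a network's subnet parameter is named <Network>IpSubnet, that its port is registered as OS::TripleO::<Role>::Ports::<Network>Port, and that a missing entry or one mapped to noop.yaml means the role has no port on that network. The function names and sample data are hypothetical.

```python
def collect_params(node, found=None):
    """Recursively collect every parameter name referenced via get_param."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "get_param" and isinstance(value, str):
                found.add(value)
            else:
                collect_params(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_params(item, found)
    return found


def networks_without_port(nic_template, resource_registry, role):
    """Return networks whose IpSubnet is used by the nic template
    although the role has no port on them (assumed naming, see above)."""
    missing = []
    for param in sorted(collect_params(nic_template)):
        if not param.endswith("IpSubnet"):
            continue
        network = param[:-len("IpSubnet")]
        key = "OS::TripleO::%s::Ports::%sPort" % (role, network)
        target = resource_registry.get(key)
        # Assumption: absent or mapped to noop.yaml means "no port".
        if target is None or target.endswith("noop.yaml"):
            missing.append(network)
    return missing


# Hypothetical, already-parsed sample data mirroring the failure above.
nic_template = {
    "network_config": [
        {"type": "vlan",
         "addresses": [{"ip_netmask": {"get_param": "StorageIpSubnet"}}]},
        {"type": "vlan",
         "addresses": [{"ip_netmask": {"get_param": "StorageMgmtIpSubnet"}}]},
    ]
}
registry = {
    "OS::TripleO::Compute::Ports::StoragePort": "../network/ports/storage.yaml",
}

print(networks_without_port(nic_template, registry, "Compute"))
```

Any network reported here would be a candidate for failing the pre-deployment check with a message naming the role and the missing port.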
Another check we should cover: when creating OVS bridges, only one interface should be a member of the bridge if bonds are not used. Here's an example of a bad template which might lead to loops:

resources:
  OsNetConfigImpl:
    type: OS::Heat::StructuredConfig
    properties:
      group: os-apply-config
      config:
        os_net_config:
          network_config:
            - type: ovs_bridge
              name: br-storage
              use_dhcp: true
              members:
                - type: interface
                  name: eth0
                  use_dhcp: false
                - type: interface
                  name: eth1
                  use_dhcp: false
                - type: interface
                  name: eth2
                  # force the MAC address of the bridge to this interface
                  primary: true
              addresses:
                - ip_netmask: {get_param: StorageIpSubnet}
                - ip_netmask: {get_param: StorageMgmtIpSubnet}
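A check for this could look roughly like the sketch below. It is an illustration only: it assumes the network_config list has already been parsed into Python dicts, and that interfaces grouped under a bond member (for example type ovs_bond) are not direct members of the bridge and so are not counted. The function name and sample data are hypothetical.

```python
def bridges_with_multiple_interfaces(network_config):
    """Return the names of ovs_bridge entries that have more than one
    direct member of type 'interface'."""
    offenders = []
    for entry in network_config:
        if entry.get("type") != "ovs_bridge":
            continue
        # Only direct members are counted; interfaces nested inside a
        # bond member are deliberately ignored.
        direct = [m for m in entry.get("members", [])
                  if m.get("type") == "interface"]
        if len(direct) > 1:
            offenders.append(entry.get("name", "<unnamed>"))
    return offenders


# Parsed form of the bad bridge definition from this report.
bad_config = [
    {
        "type": "ovs_bridge",
        "name": "br-storage",
        "use_dhcp": True,
        "members": [
            {"type": "interface", "name": "eth0", "use_dhcp": False},
            {"type": "interface", "name": "eth1", "use_dhcp": False},
            {"type": "interface", "name": "eth2", "primary": True},
        ],
    }
]

print(bridges_with_multiple_interfaces(bad_config))
```

A non-empty result would mean the template should be rejected, or at least flagged with a warning, before the deployment starts.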
This should be fixed with validations.