Bug 1239130

Summary: [RFE] Heat environment sanity check
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: rhosp-director
Assignee: Hugh Brock <hbrock>
Status: CLOSED CURRENTRELEASE
QA Contact: Shai Revivo <srevivo>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.0 (Kilo)
CC: dmacpher, jcoufal, mburns, mcornea, rhel-osp-director-maint
Target Milestone: ---
Keywords: FutureFeature, Triaged
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
The director does not provide network validation before or during a deployment. This means a deployment with a bad network configuration can run for two hours with no output and can result in failure. A network validation script is currently in development and will be released in the future.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-30 07:58:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Marius Cornea 2015-07-03 16:47:37 UTC
Description of problem:

Today we hit an issue where the overcloud deployment timed out after 2 hours because of a broken network template for the compute role (it contained a port for the storage management network). It would be nice to have a sanity check that validates the Heat environment before deployment, so that you don't have to wait 2 hours to find a problem that shows up in the early stages of the deployment.

Comment 3 chris alfonso 2015-07-06 16:09:23 UTC
Please provide the exact steps and the resulting failure so we can see whether checks can be added.

Comment 5 Marius Cornea 2015-07-07 09:44:03 UTC
Deploy the overcloud by passing -e ~/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml.

/usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml is the default

network-environment.yaml includes this compute.yaml NIC template[1]. The template contains a VLAN interface with an IP address from StorageMgmtIpSubnet, but in network-isolation.yaml the compute role doesn't have a port for StorageMgmt.

As a result, this error[2] shows up on the deployed compute nodes.

[1] http://pastebin.test.redhat.com/295175
[2] http://pastebin.test.redhat.com/295181
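
A rough sketch of the kind of pre-deployment check this case suggests, not an existing director tool. It assumes the NIC template has already been parsed into plain Python dicts and lists, that subnet parameters follow the <Network>IpSubnet naming seen above, and that the caller supplies the set of networks the role has ports for. The function names and the sample role network set are hypothetical.

```python
def referenced_subnets(node):
    """Collect network names referenced as {get_param: <Network>IpSubnet}."""
    found = set()
    if isinstance(node, dict):
        param = node.get("get_param")
        if isinstance(param, str) and param.endswith("IpSubnet"):
            # Strip the suffix: "StorageMgmtIpSubnet" -> "StorageMgmt"
            found.add(param[: -len("IpSubnet")])
        for value in node.values():
            found |= referenced_subnets(value)
    elif isinstance(node, list):
        for item in node:
            found |= referenced_subnets(item)
    return found


def unbacked_subnets(nic_config, role_networks):
    """Return networks the NIC template uses but the role has no port for."""
    return sorted(referenced_subnets(nic_config) - set(role_networks))


# Illustrative data only: a compute NIC template with two VLAN interfaces.
compute_nic_config = [
    {"type": "vlan",
     "addresses": [{"ip_netmask": {"get_param": "StorageIpSubnet"}}]},
    {"type": "vlan",
     "addresses": [{"ip_netmask": {"get_param": "StorageMgmtIpSubnet"}}]},
]
# Assumed set of networks the compute role has ports for (hypothetical).
compute_role_networks = {"InternalApi", "Tenant", "Storage"}

print(unbacked_subnets(compute_nic_config, compute_role_networks))
```

With the sample data, the check reports StorageMgmt, which is the mismatch described in this comment. How the role's network set would be derived from network-isolation.yaml is left out of the sketch.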

Comment 6 Marius Cornea 2015-07-13 18:38:38 UTC
Another check we should cover:

When bonds are not used, an OVS bridge should contain only one interface.

Here's an example of a bad template that might lead to loops:

resources:
  OsNetConfigImpl:
    type: OS::Heat::StructuredConfig
    properties:
      group: os-apply-config
      config:
        os_net_config:
          network_config:
            -
              type: ovs_bridge
              name: br-storage
              use_dhcp: true
              members:
                -
                  type: interface
                  name: eth0
                  use_dhcp: false
                -
                  type: interface
                  name: eth1
                  use_dhcp: false
                -
                  type: interface
                  name: eth2
                  # force the MAC address of the bridge to this interface
                  primary: true
                  addresses:
                  -
                    ip_netmask: {get_param: StorageIpSubnet}
                  -
                    ip_netmask: {get_param: StorageMgmtIpSubnet}
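
A minimal sketch of how such a check could look, assuming the network_config list has been parsed into Python dicts. It flags any ovs_bridge whose direct members include more than one entry of type interface; interfaces nested inside a bond entry are not direct members and so are not counted. The function name is hypothetical, and the sample data mirrors the template above.

```python
def bridges_with_multiple_interfaces(network_config):
    """Return (bridge name, interface names) for each OVS bridge that has
    more than one plain interface as a direct member."""
    problems = []
    for entry in network_config:
        if entry.get("type") != "ovs_bridge":
            continue
        interfaces = [
            member.get("name")
            for member in entry.get("members", [])
            if member.get("type") == "interface"
        ]
        if len(interfaces) > 1:
            problems.append((entry.get("name"), interfaces))
    return problems


# Parsed form of the bad template's network_config (addresses omitted).
network_config = [
    {
        "type": "ovs_bridge",
        "name": "br-storage",
        "use_dhcp": True,
        "members": [
            {"type": "interface", "name": "eth0", "use_dhcp": False},
            {"type": "interface", "name": "eth1", "use_dhcp": False},
            {"type": "interface", "name": "eth2", "primary": True},
        ],
    }
]

for bridge, interfaces in bridges_with_multiple_interfaces(network_config):
    print(f"{bridge}: {len(interfaces)} interfaces without a bond: {interfaces}")
```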

Comment 10 Jaromir Coufal 2016-09-30 07:58:45 UTC
This should be fixed with validations.