Bug 1316730

Summary: os-net-config fails to bring up VLANs on a Linux Bond without a bridge present
Product: Red Hat OpenStack
Reporter: Dan Sneddon <dsneddon>
Component: os-net-config
Assignee: Dan Sneddon <dsneddon>
Status: CLOSED ERRATA
QA Contact: Ofer Blaut <oblaut>
Severity: high
Docs Contact:
Priority: unspecified
Version: 8.0 (Liberty)
CC: achernet, athomas, ddomingo, dsneddon, hbrock, jslagle, kbasil, mburns, rhel-osp-director-maint
Target Milestone: ga
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: os-net-config-0.2.2-1.el7ost
Doc Type: Bug Fix
Doc Text:
In previous releases, when VLAN interfaces were placed directly on a Linux kernel bond with no bridge, it was possible for the VLANs to start before the bond. When this occurred, the VLANs failed to start. With this release, the os-net-config utility now starts the physical network (namely, bridges first, then bonds and interfaces) before VLANs. This ensures that the VLANs have the interfaces necessary to start properly.
Last Closed: 2016-04-07 21:49:19 UTC
Type: Bug

Description Dan Sneddon 2016-03-10 22:46:56 UTC
Description of problem:
os-net-config ensures that bridges (and their member interfaces) are brought up before the VLANs are enabled. If there is no bridge, the ordering can be incorrect, and os-net-config might try to bring up a VLAN before the bond that carries it is up.
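
The intended ordering can be illustrated with a small Python sketch. This is not the os-net-config source and not the patch under review; the names START_RANK, label, and start_order are hypothetical, and naming a VLAN "vlan<id>" is an assumption made for the example. It only shows the idea of starting bridges first, then bonds and interfaces, then VLANs, regardless of the order in which the objects appear in the config.

```python
# Hypothetical sketch of the desired start order; not the os-net-config code.
# Lower rank starts first: bridges, then bonds and interfaces, then VLANs.
START_RANK = {
    "ovs_bridge": 0,
    "linux_bridge": 0,
    "linux_bond": 1,
    "interface": 1,
    "vlan": 2,
}


def label(obj):
    """Return a printable name; VLANs are assumed to be named vlan<id>."""
    if obj["type"] == "vlan":
        return "vlan%d" % obj["vlan_id"]
    return obj["name"]


def start_order(network_config):
    """Return object names in start order (stable within each rank)."""
    ordered = sorted(network_config, key=lambda obj: START_RANK[obj["type"]])
    return [label(obj) for obj in ordered]


# A VLAN is deliberately listed before its bond to show the reordering.
example_config = [
    {"type": "vlan", "device": "bond1", "vlan_id": 10},
    {"type": "interface", "name": "nic1"},
    {"type": "linux_bond", "name": "bond1"},
    {"type": "vlan", "device": "bond1", "vlan_id": 20},
]

print(start_order(example_config))
```

Because Python's sort is stable, objects of the same rank keep their config order, so only the relative position of VLANs versus the bond and interfaces changes.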

Version-Release number of selected component (if applicable):
os-net-config-0.1.6-1.el7ost

How reproducible:
Often, but less than 100%

Steps to Reproduce:
1. Create NIC configs with a Linux bond with member VLANs but no bridge
2. Deploy

Actual results:
One or more of the VLANs may be brought up before the bond. This can cause the VLAN to fail to come up, in which case the operator must manually enable the VLAN after deployment.

Expected results:
os-net-config should enable all of the VLANs during deployment, with no manual step required.

Additional info:
I have a patch for this up for review here: https://review.openstack.org/291420

Comment 7 Dan Sneddon 2016-04-06 20:56:50 UTC
Here is a controller.yaml that would trigger the bug:

### Begin controller.yaml #######
heat_template_version: 2015-04-30

description: >
  Software Config to drive os-net-config with 2 bonded nics on a Linux bond
  (no bridge) with VLANs attached for the controller role.

parameters:
  ControlPlaneIp:
    default: ''
    description: IP address/subnet on the ctlplane network
    type: string
  ExternalIpSubnet:
    default: ''
    description: IP address/subnet on the external network
    type: string
  InternalApiIpSubnet:
    default: ''
    description: IP address/subnet on the internal API network
    type: string
  StorageIpSubnet:
    default: ''
    description: IP address/subnet on the storage network
    type: string
  StorageMgmtIpSubnet:
    default: ''
    description: IP address/subnet on the storage mgmt network
    type: string
  TenantIpSubnet:
    default: ''
    description: IP address/subnet on the tenant network
    type: string
  ManagementIpSubnet: # Only populated when including environments/network-management.yaml
    default: ''
    description: IP address/subnet on the management network
    type: string
  BondInterfaceOvsOptions:
    default: 'bond_mode=active-backup'
    description: The ovs_options string for the bond interface. Set things like
                 lacp=active and/or bond_mode=balance-slb using this option.
    type: string
  ExternalNetworkVlanID:
    default: 10
    description: Vlan ID for the external network traffic.
    type: number
  InternalApiNetworkVlanID:
    default: 20
    description: Vlan ID for the internal_api network traffic.
    type: number
  StorageNetworkVlanID:
    default: 30
    description: Vlan ID for the storage network traffic.
    type: number
  StorageMgmtNetworkVlanID:
    default: 40
    description: Vlan ID for the storage mgmt network traffic.
    type: number
  TenantNetworkVlanID:
    default: 50
    description: Vlan ID for the tenant network traffic.
    type: number
  ManagementNetworkVlanID:
    default: 60
    description: Vlan ID for the management network traffic.
    type: number
  ExternalInterfaceDefaultRoute:
    default: '10.0.0.1'
    description: default route for the external network
    type: string
  ControlPlaneSubnetCidr: # Override this via parameter_defaults
    default: '24'
    description: The subnet CIDR of the control plane network.
    type: string
  DnsServers: # Override this via parameter_defaults
    default: []
    description: A list of DNS servers (2 max for some implementations) that will be added to resolv.conf.
    type: comma_delimited_list
  EC2MetadataIp: # Override this via parameter_defaults
    description: The IP address of the EC2 metadata server.
    type: string

resources:
  OsNetConfigImpl:
    type: OS::Heat::StructuredConfig
    properties:
      group: os-apply-config
      config:
        os_net_config:
          network_config:
            -
              type: interface
              name: nic1
              use_dhcp: false
              addresses:
                -
                  ip_netmask:
                    list_join:
                      - '/'
                      - - {get_param: ControlPlaneIp}
                        - {get_param: ControlPlaneSubnetCidr}
              routes:
                -
                  ip_netmask: 169.254.169.254/32
                  next_hop: {get_param: EC2MetadataIp}
            -
              type: linux_bond
              name: bond1
              bonding_options: {get_param: BondInterfaceOvsOptions}
              dns_servers: {get_param: DnsServers}
              members:
                -
                  type: interface
                  name: nic2
                  primary: true
                -
                  type: interface
                  name: nic3
            -
              type: vlan
              device: bond1
              vlan_id: {get_param: ExternalNetworkVlanID}
              addresses:
                -
                  ip_netmask: {get_param: ExternalIpSubnet}
              routes:
                -
                  default: true
                  next_hop: {get_param: ExternalInterfaceDefaultRoute}
            -
              type: vlan
              device: bond1
              vlan_id: {get_param: InternalApiNetworkVlanID}
              addresses:
                -
                  ip_netmask: {get_param: InternalApiIpSubnet}
            -
              type: vlan
              device: bond1
              vlan_id: {get_param: StorageNetworkVlanID}
              addresses:
                -
                  ip_netmask: {get_param: StorageIpSubnet}
            -
              type: vlan
              device: bond1
              vlan_id: {get_param: StorageMgmtNetworkVlanID}
              addresses:
                -
                  ip_netmask: {get_param: StorageMgmtIpSubnet}
            -
              type: vlan
              device: bond1
              vlan_id: {get_param: TenantNetworkVlanID}
              addresses:
                -
                  ip_netmask: {get_param: TenantIpSubnet}
            # Uncomment when including environments/network-management.yaml
            #-
            #  type: vlan
            #  device: bond1
            #  vlan_id: {get_param: ManagementNetworkVlanID}
            #  addresses:
            #    -
            #      ip_netmask: {get_param: ManagementIpSubnet}

outputs:
  OS::stack_id:
    description: The OsNetConfigImpl resource.
    value: {get_resource: OsNetConfigImpl}
##### End controller.yaml #####

Also, here is a config.yaml that can be fed to os-net-config to test. It is a YAML translation of the /etc/os-net-config/config.json from a Controller node with this configuration. When the bug was triggered, one of the VLANs would be brought up before the bond and the interfaces. To test, copy this file to an overcloud node, run "sudo os-net-config --noop --debug -c config.yaml", and look at the order in which the interfaces would be brought up (this is a dry run; nothing will be changed). If the bond and the bond slave interfaces are brought up before the VLANs, then the bug is fixed. Note that this must be tested on a controller with at least 3 NICs; otherwise the nic1, nic2, nic3 abstractions will fail.

#### Begin config.yaml #####
network_config: 
  - 
    routes: 
      - 
        ip_netmask: "169.254.169.254/32"
        next_hop: "192.0.2.1"
    use_dhcp: false
    type: "interface"
    name: "nic1"
    addresses: 
      - 
        ip_netmask: "192.0.2.10/24"
  - 
    type: "linux_bond"
    name: "bond1"
    members: 
      - 
        type: "interface"
        name: "nic2"
        primary: true
      - 
        type: "interface"
        name: "nic3"
    bonding_options: "mode=active-backup"
  - 
    device: "bond1"
    routes: 
      - 
        ip_netmask: "0.0.0.0/0"
        next_hop: "10.0.0.1"
    type: "vlan"
    addresses: 
      - 
        ip_netmask: "10.0.0.5/24"
    vlan_id: 10
  - 
    device: "bond1"
    type: "vlan"
    addresses: 
      - 
        ip_netmask: "172.16.2.7/24"
    vlan_id: 20
  - 
    device: "bond1"
    type: "vlan"
    addresses: 
      - 
        ip_netmask: "172.16.1.7/24"
    vlan_id: 30
  - 
    device: "bond1"
    type: "vlan"
    addresses: 
      - 
        ip_netmask: "172.16.3.5/24"
    vlan_id: 40
  - 
    device: "bond1"
    type: "vlan"
    addresses: 
      - 
        ip_netmask: "172.16.0.4/24"
    vlan_id: 50

#### End config.yaml #####
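
To make the manual check of the dry-run order less error-prone, something like the following Python sketch could be used. It does not parse os-net-config output (the log format is not assumed here); observed_order is a list the tester transcribes by hand from the --noop --debug run. The function name vlans_started_too_early is hypothetical, and the "vlan<id>" naming of VLAN interfaces is an assumption.

```python
# Hypothetical helper; observed_order is transcribed by hand from the dry run.
def vlans_started_too_early(network_config, observed_order):
    """Return VLAN names that appear before their parent device."""
    position = {name: i for i, name in enumerate(observed_order)}
    early = []
    for obj in network_config:
        if obj.get("type") != "vlan":
            continue
        vlan_name = "vlan%d" % obj["vlan_id"]  # assumed naming convention
        parent = obj["device"]
        if vlan_name not in position or parent not in position:
            continue  # not seen in the dry run, so nothing to compare
        if position[vlan_name] < position[parent]:
            early.append(vlan_name)
    return early


# Subset of the config.yaml above, written as Python data.
config = [
    {"type": "linux_bond", "name": "bond1"},
    {"type": "vlan", "device": "bond1", "vlan_id": 10},
    {"type": "vlan", "device": "bond1", "vlan_id": 20},
]

# Made-up orders for illustration only; not real os-net-config output.
fixed_order = ["nic1", "nic2", "nic3", "bond1", "vlan10", "vlan20"]
buggy_order = ["nic1", "vlan20", "nic2", "nic3", "bond1", "vlan10"]

print(vlans_started_too_early(config, fixed_order))
print(vlans_started_too_early(config, buggy_order))
```

An empty result means no VLAN was listed ahead of bond1. Note that on a real node the nic1, nic2, nic3 abstractions are mapped to actual device names, so the transcribed list would contain those names instead.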

Comment 9 errata-xmlrpc 2016-04-07 21:49:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0604.html