Bug 1390630 - [Networking] deployments with network isolation fail during controller config
Summary: [Networking] deployments with network isolation fail during controller config
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-net-config
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-01 14:48 UTC by John Fulton
Modified: 2016-11-02 00:12 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-02 00:12:46 UTC
Target Upstream Version:


Attachments (Terms of Use)
ceph-storage-nics.yaml (5.49 KB, text/plain)
2016-11-01 17:43 UTC, John Fulton

Description John Fulton 2016-11-01 14:48:36 UTC
I. Description of problem:

Deployments that use Heat templates with network isolation [1][2] fail, with the
following error seen in /var/log/messages on a controller: 

 [ERROR] Error running /var/lib/heat-config/heat-config-script/2a7eea5c-766b-4958-8fc9-0ac7fdc0394a.

The above file is a script to configure and test networking [3]. 

This problem arrived for me with the 2016-10-25.2 puddle. It did not exist
in the 2016-10-21.3 puddle and earlier. I have used the same switches and hardware
with these same templates [1][2] without a problem across OSP 7, 8, and 9. 

I have reproduced it with the 2016-10-25.2 and 2016-10-28.2 puddles with
(3 control + 3 ceph-storage + 1 compute) [4][5] and with
(3 control + 4 HCI compute/OSD) [6][7] on bare metal, and kschinck
reproduced a variation of it with a virtual deploy.

A follow-up comment will contain the attempted debug steps. 

II. Version-Release number of selected component (if applicable):
    Problem seen in 10.0-RHEL-7/2016-10-25.2 but not 10.0-RHEL-7/2016-10-21.3 

III. How reproducible:
     Deterministic

IV. Steps to Reproduce:
    Use `openstack overcloud deploy` with network isolation

V. Actual results:
   Deployment fails and overcloud nodes do not get IPs from the isolated network pools defined in the network isolation Heat templates. 

Expected results:
   Deployment succeeds and overcloud nodes get IPs from the isolated network pools defined in the network isolation Heat templates. 

VI. Additional info:

[1] https://github.com/RHsyseng/hci/blob/master/custom-templates/network.yaml
[2] https://github.com/RHsyseng/hci/tree/master/custom-templates/nic-configs
[3] Script which failed:
[root@overcloud-controller-0 ~]# head -20 /var/lib/heat-config/heat-config-script/2a7eea5c-766b-4958-8fc9-0ac7fdc0394a

#!/bin/bash
set -e

function ping_retry() {
  local IP_ADDR=$1
  local TIMES=${2:-'10'}
  local COUNT=0
  local PING_CMD=ping
  if [[ $IP_ADDR =~ ":" ]]; then
    PING_CMD=ping6
  fi
  until [ $COUNT -ge $TIMES ]; do
    if $PING_CMD -w 300 -c 1 $IP_ADDR &> /dev/null; then
      echo "Ping to $IP_ADDR succeeded."
      return 0
    fi
    echo "Ping to $IP_ADDR failed. Retrying..."
    COUNT=$(($COUNT + 1))
  done
  return 1
[root@overcloud-controller-0 ~]#
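
For context, ping_retry takes a target address and an optional retry count; the actual invocations in the generated script were cut off by the head -20 above. A purely hypothetical example of how such a function is called (addresses are illustrative only):

# Hypothetical usage sketch; the real targets are generated per isolated network
# by the TripleO templates and are not shown in the truncated output above.
ping_retry 192.168.2.1 10 || echo "Internal API peer unreachable"
ping_retry 172.16.1.1 10 || echo "Storage peer unreachable"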

[4] Deploy command
time openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e ~/custom-templates/custom-no-hci.yaml \
--control-flavor control \
--control-scale 3 \
--compute-flavor compute \
--compute-scale 1 \
--ceph-storage-flavor ceph-storage \
--ceph-storage-scale 3 \
--ntp-server 10.5.26.10 \
--neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant \
--neutron-network-type vlan \
--neutron-network-vlan-ranges tenant:4051:4060 \
--neutron-disable-tunneling 

[5] 
[stack@hci-director ~]$ cat ~/custom-templates/custom-no-hci.yaml
resource_registry:
  OS::TripleO::NodeUserData: /home/stack/custom-templates/firstboot/first-boot-template.yaml
  OS::TripleO::NodeExtraConfigPost: /home/stack/custom-templates/post-deploy-template.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/custom-templates/nic-configs/controller-nics.yaml
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/custom-templates/nic-configs/compute-nics.yaml  
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/custom-templates/nic-configs/ceph-storage-nics.yaml
  
parameter_defaults:
  # 1. NETWORK
  # Internal API used for private OpenStack Traffic
  InternalApiNetCidr: 192.168.2.0/24
  InternalApiAllocationPools: [{'start': '192.168.2.10', 'end': '192.168.2.200'}]
  InternalApiNetworkVlanID: 4049

  # Tenant Network Traffic - will be used for VXLAN over VLAN
  TenantNetCidr: 192.168.3.0/24
  TenantAllocationPools: [{'start': '192.168.3.10', 'end': '192.168.3.200'}]
  TenantNetworkVlanID: 4050

  # Public Storage Access - e.g. Nova/Glance <--> Ceph
  StorageNetCidr: 172.16.1.0/24
  StorageAllocationPools: [{'start': '172.16.1.10', 'end': '172.16.1.200'}]
  StorageNetworkVlanID: 4046

  # Private Storage Access - i.e. Ceph background cluster/replication
  StorageMgmtNetCidr: 172.16.2.0/24
  StorageMgmtAllocationPools: [{'start': '172.16.2.10', 'end': '172.16.2.200'}]
  StorageMgmtNetworkVlanID: 4047

  # External Networking Access - Public API Access
  ExternalNetCidr: 10.19.137.0/21
  # Leave room for floating IPs in the External allocation pool (if required)
  ExternalAllocationPools: [{'start': '10.19.139.37', 'end': '10.19.139.48'}]
  # Set to the router gateway on the external network
  ExternalInterfaceDefaultRoute: 10.19.143.254

  # Gateway router for the provisioning network (or Undercloud IP)
  ControlPlaneDefaultRoute: 192.168.1.1
  # The IP address of the EC2 metadata server. Generally the IP of the Undercloud
  EC2MetadataIp: 192.168.1.1
  # Define the DNS servers (maximum 2) for the overcloud nodes
  DnsServers: ["10.19.143.247","10.19.143.248"]

  # Set to "br-ex" if using floating IPs on native VLAN on bridge br-ex
  #NeutronExternalNetworkBridge: "''"
  NeutronExternalNetworkBridge: "br-ex"

  # 2. PID0 processes get more resources
  MysqlMaxConnections: 8192
  RabbitFDLimit: 65436
  ControllerExtraConfig:
    tripleo::loadbalancer::haproxy_default_maxconn: 8192

  # 3. CEPH
  CephStorageExtraConfig:
    ceph::profile::params::osd_journal_size: 5120
    ceph::profile::params::osd_pool_default_pg_num: 256
    ceph::profile::params::osd_pool_default_pgp_num: 256
    ceph::profile::params::osd_pool_default_size: 3
    ceph::profile::params::osd_pool_default_min_size: 2
    ceph::profile::params::osd_recovery_max_active: 3
    ceph::profile::params::osd_max_backfills: 1
    ceph::profile::params::osd_recovery_op_priority: 2
    ceph::profile::params::osds:
      '/dev/sda':
        journal: '/dev/sdm'
      '/dev/sdb':
        journal: '/dev/sdm'  
      '/dev/sdc':
        journal: '/dev/sdm'
      '/dev/sdd':
        journal: '/dev/sdm'
      '/dev/sde':
        journal: '/dev/sdn'
      '/dev/sdf':
        journal: '/dev/sdn'
      '/dev/sdg':
        journal: '/dev/sdn'
      '/dev/sdh':
        journal: '/dev/sdn'
      '/dev/sdi':
        journal: '/dev/sdo'
      '/dev/sdj':
        journal: '/dev/sdo'
      '/dev/sdk':
        journal: '/dev/sdo'
      '/dev/sdl':
        journal: '/dev/sdo'
[stack@hci-director ~]$ 

[6] https://github.com/RHsyseng/hci/blob/master/scripts/deploy.sh
[7] https://github.com/RHsyseng/hci/tree/master/custom-templates

Comment 1 John Fulton 2016-11-01 15:25:45 UTC
I ruled out that this was a network environment issue by
verifying that I could manually configure the NICs.

Below are my notes from finding and configuring them. 


Here is what I found from Director.  

[stack@hci-director ~]$ ./deploy-no-hci.sh 
...
2016-10-29 17:22:27Z [overcloud.AllNodesDeploySteps.ComputeDeployment_Step1]: CREATE_COMPLETE  state changed
2016-10-29 21:03:35Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED  CREATE aborted
2016-10-29 21:03:35Z [overcloud]: CREATE_FAILED  Create timed out
2016-10-29 21:03:36Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step1]: CREATE_FAILED  CREATE aborted
2016-10-29 21:03:36Z [overcloud.AllNodesDeploySteps]: CREATE_FAILED  Resource CREATE failed: Operation cancelled

 Stack overcloud CREATE_FAILED

Heat Stack create failed.

real    241m16.983s
user    0m11.979s
sys     0m0.495s
[stack@hci-director ~]$ 

This is the Heat software deployment that reported the failure, found as per shardy's blog: 

+---------------+---------------------------------------------------+
| Field         | Value                                             |
+---------------+---------------------------------------------------+
| id            | 15ab37ee-2cb4-4f17-be0d-a849b25357ea              |
| server_id     | ed459122-fdb7-4033-9867-15a97526534c              |
| config_id     | d9658595-519b-4f4d-830c-1254562e32a5              |
| creation_time | 2016-10-29T17:21:50Z                              |
| updated_time  |                                                   |
| status        | IN_PROGRESS                                       |
| status_reason | Deploy data available                             |
| input_values  | {u'step': 1, u'update_identifier': u'1477760595'} |
| action        | CREATE                                            |
+---------------+---------------------------------------------------+
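
For reference, a deployment resource like the one above can be located from the undercloud with something along these lines (a hedged sketch following shardy's blog; exact client syntax may vary by release):

source ~/stackrc
# List software deployments and look for any that are not COMPLETE
heat deployment-list | grep -v COMPLETE
# Show the suspect deployment (ID taken from the table above)
heat deployment-show 15ab37ee-2cb4-4f17-be0d-a849b25357ea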

It was controller0

[stack@hci-director stackhacks]$ openstack server list | grep ed459122-fdb7-4033-9867-15a97526534c
| ed459122-fdb7-4033-9867-15a97526534c | overcloud-controller-0  | ACTIVE | ctlplane=192.168.1.23 | overcloud-full |
[stack@hci-director stackhacks]$ 

I found it with connections only to the external network (10.19) and the
network that was used to provision it (192.168.1): 

[root@overcloud-controller-0 ~]# ip a | egrep "192|172|10." | grep -v qlen
    inet 192.168.1.23/24 brd 192.168.1.255 scope global dynamic p2p1
    inet 10.19.143.130/21 brd 10.19.143.255 scope global dynamic p2p2
[root@overcloud-controller-0 ~]#

It normally looks more like this:

[root@overcloud-controller-0 network-scripts]# ip a | egrep -B 2 '192|172' 
3: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.23/24 brd 192.168.1.255 scope global dynamic p2p1
--
13: em2.4047@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:09 brd ff:ff:ff:ff:ff:ff
    inet 172.16.2.1/24 brd 172.16.2.255 scope global em2.4047
--
16: em2.4046@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:09 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.1/24 brd 172.16.1.255 scope global em2.4046
--
22: em1.4049@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.1/24 brd 192.168.2.255 scope global em1.4049
--
23: em1.4050@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.1/24 brd 192.168.3.255 scope global em1.4050
[root@overcloud-controller-0 network-scripts]# 

The above VLANs are used like this:

em1:
- VLAN 4049 (hci-api) 192.168.2.0/24
- VLAN 4050 (hci-tenant) 192.168.3.0/24
- VLANs 4051-4060 (hci-tenant1-10) ad hoc range

em2:
- VLAN 4046    172.16.1.0/24	ceph public
- VLAN 4047    172.16.2.0/24	ceph private

How did os-net-config update my ifcfg scripts?

[root@overcloud-controller-0 network-scripts]# for nic in `ls ifcfg-{em*,p2*}`; do echo $nic; echo "-----"; cat $nic; echo -e "\n"; done 
ifcfg-em1
-----
# This file is autogenerated by os-net-config
DEVICE=em1
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ex
BOOTPROTO=none


ifcfg-em2
-----
DEVICE="em2"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

ifcfg-em3
-----
DEVICE="em3"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

ifcfg-em4
-----
DEVICE="em4"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

ifcfg-p2p1
-----
DEVICE="p2p1"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

ifcfg-p2p2
-----
DEVICE="p2p2"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

ifcfg-p2p3
-----
DEVICE="p2p3"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

ifcfg-p2p4
-----
DEVICE="p2p4"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"

[root@overcloud-controller-0 network-scripts]#

A link was detected on em2:

[root@overcloud-controller-0 network-scripts]# ethtool em2
Settings for em2:
	Supported ports: [ Backplane ]
	Supported link modes:   1000baseKX/Full 
	                        10000baseKR/Full 
	Supported pause frame use: Symmetric
	Supports auto-negotiation: Yes
	Advertised link modes:  1000baseKX/Full 
	                        10000baseKR/Full 
	Advertised pause frame use: No
	Advertised auto-negotiation: Yes
	Speed: 10000Mb/s
	Duplex: Full
	Port: None
	PHYAD: 0
	Transceiver: external
	Auto-negotiation: on
	Supports Wake-on: g
	Wake-on: g
	Current message level: 0x0000000f (15)
			       drv probe link timer
	Link detected: yes
[root@overcloud-controller-0 network-scripts]#

em2 didn't exhibit this problem, but when manually configuring em1 I saw
the following: 

[root@overcloud-controller-0 network-scripts]# ifup em1.4049
Error: Connection activation failed: Connection 'Vlan em1.4049' is not available on the device em1.4049 at this time.
[root@overcloud-controller-0 network-scripts]#

[root@overcloud-controller-0 network-scripts]# ifup em1.4050 
Error: Connection activation failed: Connection 'Vlan em1.4050' is not available on the device em1.4050 at this time.
[root@overcloud-controller-0 network-scripts]#

I had made the following changes by hand, but os-net-config used to generate
the same configuration:  

[stack@hci-director ifcfg-630-b16]$ cat ifcfg-em1.4049
DEVICE=em1.4049
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.2.1
PREFIX=24
NETWORK=192.168.2.0
VLAN=yes
[stack@hci-director ifcfg-630-b16]$ cat ifcfg-em1.4050 
DEVICE=em1.4050
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.3.1
PREFIX=24
NETWORK=192.168.3.0
VLAN=yes
[stack@hci-director ifcfg-630-b16]$ 

OVS was in the following status: 

[root@overcloud-controller-0 network-scripts]# ovs-vsctl show 
34b70c03-e483-4a9e-a64c-c1ae43348131
    Bridge br-ex
        Port "em1"
            Interface "em1"
        Port br-ex
            Interface br-ex
                type: internal
    ovs_version: "2.5.0"
[root@overcloud-controller-0 network-scripts]#

Made two changes (sketched below):

A. Disable NetworkManager
B. s/OVSBOOTPROTO=dhcp/OVSBOOTPROTO=none/ in ifcfg-br-ex
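
In shell terms, the two changes were roughly the following (a sketch of the manual workaround, not a permanent fix; the ifcfg path is the standard RHEL location):

# A. Stop (and disable) NetworkManager so the initscripts manage the VLAN interfaces
sudo systemctl stop NetworkManager
sudo systemctl disable NetworkManager
# B. Switch br-ex from DHCP to a static/none boot protocol in its ifcfg file
sudo sed -i 's/^OVSBOOTPROTO=dhcp/OVSBOOTPROTO=none/' /etc/sysconfig/network-scripts/ifcfg-br-ex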

I was then able to bring up em1.{4050,4049}: 

[root@overcloud-controller-0 network-scripts]# ifdown em1; ifdown br-ex; 
[root@overcloud-controller-0 network-scripts]# ifup em1; ifup br-ex; 

[root@overcloud-controller-0 network-scripts]# systemctl stop NetworkManager
[root@overcloud-controller-0 network-scripts]# ifup em1.4050
[root@overcloud-controller-0 network-scripts]# ip a s em1.4050
23: em1.4050@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.1/24 brd 192.168.3.255 scope global em1.4050
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5d07/64 scope link 
       valid_lft forever preferred_lft forever
[root@overcloud-controller-0 network-scripts]#

I manually configured a compute node similarly and they were able to
ping each other, ruling out a switch or network issue. It seems to be a
matter of how the network was configured on the host. 

Next steps taken:
- Made changes A and B on controller-1 and controller-2
- Restarted the deployment, hoping they would be configured
- controller-0 failed to make the next step because it couldn't reach the other controllers

Oct 31 15:31:56 overcloud-controller-0 os-collect-config: [2016-10-31 19:31:56,351] (heat-config) [INFO] {"deploy_stdout": "Matching apachectl 'Server version: Apache/2.4.6 (Red Hat Enterprise Linux)\nServer built:   Aug  3 2016 08:33:27'\n\u001b[mNotice: Scope(Class[Tripleo::Firewall::Post]): At this stage, all network traffic is blocked.\u001b[0m\n\u001b[mNotice: Compiled catalog for overcloud-controller-0.localdomain in environment production in 4.24 seconds\u001b[0m\n\u001b[mNotice: /Stage[setup]/Firewall::Linux::Redhat/File[/etc/sysconfig/iptables]/seluser: seluser changed 'system_u' to 'unconfined_u'\u001b[0m\n\u001b[mNotice: /File[/etc/sysconfig/iptables]/seltype: seltype changed 'etc_t' to 'system_conf_t'\u001b[0m\n\u001b[mNotice: /Stage[main]/Memcached/Service[memcached]/ensure: ensure changed 'stopped' to 'running'\u001b[0m\n\u001b[mNotice: /Stage[main]/Ntp::Config/File[/etc/ntp.conf]/content: content changed '{md5}4ba99b963af6b0cb0c8b114106b56947' to '{md5}ff273154a520ecc0e21c25db677de324'\u001b[0m\n\u001b[mNotice: /Stage[main]/Ntp::Service/Service[ntp]: Triggered 'refresh' from 1 events\u001b[0m\n\u001b[mNotice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Tripleo::Pacemaker::Resource_restart_flag[galera-master]/File[/var/lib/tripleo/pacemaker-restarts]/ensure: created\u001b[0m\n\u001b[mNotice: /File[/etc/localtime]/seltype: seltype changed 'locale_t' to 'etc_t'\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: overcloud-controller-0: error checking node availability: Unable to connect to overcloud-controller-0 ([Errno 113] No route to host)\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: overcloud-controller-1: error checking node availability: Unable to connect to overcloud-controller-1 ([Errno 113] No route to host)\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: error checking node availability: Unable to connect to overcloud-contr
Oct 31 15:31:56 overcloud-controller-0 os-collect-config: oller-2 ([Errno 113] No route to host)\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: nodes availability check failed, use --force to override. WARNING: This will destroy existing cluster on the nodes.\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Service/Service[pacemaker]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Service/Service[corosync]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\n\u001b[mNotice: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Exec[Creating cluster-wide property stonith-enabled]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\n\u001b[mNotice: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Config[haproxy]/Concat[/etc/haproxy/haproxy.cfg]/File[/etc/haproxy/haproxy.cfg]/content: content changed '{md5}1f337186b0e1ba5ee82760cb437fb810' to '{md5}9872c705c5b9bf619a08bcb86e1667e1'\u001b[0m\n\u001b[mNotice: /File[/etc/haproxy/haproxy.cfg]/seluser: seluser changed 'unconfined_u' to 'system_u'\u001b[0m\n\u001b[mNotice: /Stage[main]/Tripleo::Profile::Pacemaker::Haproxy/Tripleo::Pacemaker::Resource_restart_flag[haproxy-clone]/Exec[haproxy-clone resource restart flag]: Triggered 'refresh' from 1 events\u001b[0m\n\u001b[mNotice: /Stage[main]/Tripleo::Profile::Base::Haproxy/Exec[haproxy-reload]: Triggered 'refresh' from 1 events\u001b[0m\n\u001b[mNotice: /Firewall[998 log all]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\n\u001b[mNotice: /Firewall[999 drop all]: Dependency Exec[Create Cluster tripleo_cluster] has failures: true\u001b[0m\
Oct 31 15:31:56 overcloud-controller-0 os-collect-config: n\u001b[mNotice: Finished catalog run in 11.54 seconds\u001b[0m\n", "deploy_stderr": "exception: connect failed\n\u001b[1;31mWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)\u001b[0m\n\u001b[1;31mError: /sbin/pcs cluster setup --name tripleo_cluster overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 --token 10000 returned 1 instead of one of [0]\u001b[0m\n\u001b[1;31mError: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster setup --name tripleo_cluster overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 --token 10000 returned 1 instead of one of [0]\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Pacemaker::Service/Service[pacemaker]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Pacemaker::Service/Service[corosync]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Exec[Creating cluster-wide property stonith-enabled]: Skipping because of failed dependencies\u001b[
Oct 31 15:31:56 overcloud-controller-0 os-collect-config: 0m\n\u001b[1;31mWarning: /Firewall[998 log all]: Skipping because of failed dependencies\u001b[0m\n\u001b[1;31mWarning: /Firewall[999 drop all]: Skipping because of failed dependencies\u001b[0m\n", "deploy_status_code": 6}
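
The escaped deploy_stdout/deploy_stderr blobs above are easier to read through the client; a hedged sketch, run from the undercloud:

source ~/stackrc
# Prints the deploy_stderr/deploy_stdout of failed deployments in a readable form
openstack stack failures list overcloud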

Comment 2 Bob Fournier 2016-11-01 16:25:26 UTC
One thing I noticed in the deploy command is that the Neutron arguments used have been deprecated. Specifically, "--neutron-disable-tunneling" will no longer disable tunneling. The other Neutron command arguments should still work as expected, but they should be defined in Heat template files instead of being passed as deploy command arguments.

If you need to disable tunneling, set this in parameter_defaults:
NeutronTunnelTypes: ''

Comment 3 Bob Fournier 2016-11-01 17:15:18 UTC
John - can you please provide the following (after making change to disable tunneling as above) when the problem occurs:
1. ceph-storage-nics.yaml - it is referenced in custom-no-hci.yaml but not at https://github.com/RHsyseng/hci/tree/master/custom-templates/nic-configs
2. Full contents of /var/log/messages on the controller
3. Full output of 'ip a' on controllers
4. Output of "openstack stack failures list" on overcloud
5. Any files with errors in /var/log/heat/ on overcloud
6. Output of "nova list" on overcloud
7. Output of "openstack baremetal node list" on overcloud

Would like to see this deployed with a "clean" config, i.e. not with changes made to controllers.

Also can you describe the connectivity after the problem occurs?  Are you able to both ping and ssh to the controller nodes?  Are you able to ping one controller node from another (it seems like you cannot from notes above).

Thank you.
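
One way the items above might be gathered, sketched with example hostnames/IPs from this deployment (an assumption: items 4, 6, and 7 are run on the undercloud as the stack user, the rest on each controller via the heat-admin user):

# On the undercloud
source ~/stackrc
openstack stack failures list overcloud > stack-failures.txt
nova list > nova-list.txt
openstack baremetal node list > baremetal-nodes.txt
# From each controller (example: overcloud-controller-0 at 192.168.1.23)
ssh heat-admin@192.168.1.23 'sudo cat /var/log/messages' > controller-0-messages.txt
ssh heat-admin@192.168.1.23 'ip a' > controller-0-ip-a.txt
ssh heat-admin@192.168.1.23 'sudo grep -ril error /var/log/heat/ 2>/dev/null' > controller-0-heat-error-files.txt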

Comment 4 John Fulton 2016-11-01 17:43:16 UTC
Created attachment 1216214 [details]
ceph-storage-nics.yaml

Comment 5 Bob Fournier 2016-11-01 17:47:18 UTC
To expand on the previous comments, please do a "heat stack-delete overcloud" and wait for successful deletion before rerunning the deploy command. Thanks.

Comment 6 John Fulton 2016-11-01 18:30:54 UTC
(In reply to Bob Fournier from comment #3)
> John - can you please provide the following (after making change to disable
> tunneling as above) when the problem occurs:
> 1. ceph-storage-nics.yaml - it is referenced in custom-no-hci.yaml but not
> at https://github.com/RHsyseng/hci/tree/master/custom-templates/nic-configs

Attached. 

> 2. Full contents of /var/log/messages on the controller
> 3. Full output of 'ip a' on controllers
> 4. Output of "openstack stack failures list" on overcloud
> 5. Any files with errors in /var/log/heat/ on overcloud
> 6. Output of "nova list" on overcloud
> 7. Output of "openstack baremetal node list" on overcloud
> 
> Would like to see this deployed with a "clean" config, i.e. not with changes
> made to controllers.

OK, I've started a new deploy and will send you 2-7 when it is finished. 

[stack@hci-director ~]$ cat deploy-no-hci.sh 
source ~/stackrc
time openstack overcloud deploy --templates \
-e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e ~/custom-templates/custom-no-hci.yaml \
--control-flavor control \
--control-scale 3 \
--compute-flavor compute \
--compute-scale 1 \
--ceph-storage-flavor ceph-storage \
--ceph-storage-scale 3 \
--ntp-server 10.5.26.10 \
[stack@hci-director ~]$ head -20 custom-templates/custom-no-hci.yaml 
resource_registry:
  OS::TripleO::NodeUserData: /home/stack/custom-templates/first-boot-template.yaml
  OS::TripleO::NodeExtraConfigPost: /home/stack/custom-templates/post-deploy-template.yaml
  OS::TripleO::Controller::Net::SoftwareConfig: /home/stack/custom-templates/nic-configs/controller-nics.yaml
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/custom-templates/nic-configs/compute-nics.yaml  
  OS::TripleO::CephStorage::Net::SoftwareConfig: /home/stack/custom-templates/nic-configs/ceph-storage-nics.yaml
  
parameter_defaults:
  # 0. REPLACE DEPRECATED CLI OPTIONS WITH HEAT PARAMS
  # --neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant \
  NeutronBridgeMappings: 'datacentre:br-ex,tenant:br-tenant'
  # --neutron-network-type vlan \
  NeutronNetworkType: 'vlan'
  # --neutron-network-vlan-ranges tenant:4051:4060 \
  NeutronNetworkVlanRanges: 'tenant:4051:4060'
  # --neutron-disable-tunneling 
  NeutronTunnelType: ''

  # 1. NETWORK ISOLATION
  # Internal API used for private OpenStack Traffic
[stack@hci-director ~]$ 
[stack@hci-director ~]$ ./deploy-no-hci.sh 
Creating Swift container to store the plan
Creating plan from template files in: /tmp/tripleoclient-OyXtOq/tripleo-heat-templates
Started Mistral Workflow. Execution ID: 0c165210-c104-41f9-b137-738f191addca
...
 
> Also can you describe the connectivity after the problem occurs?  Are you
> able to both ping and ssh to the controller nodes?  Are you able to ping one
> controller node from another (it seems like you cannot from notes above).

All hosts (control, compute, ceph) can ping each other on their provisioning network (192.168.1.0/24) and I can SSH into any host from OSPd. 

Hosts are unable to reach each other on the other networks, however; i.e., nothing in /etc/hosts matching internalapi, storage, storagemgmt, or tenant returns pings (see the sketch below). 
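
A quick way to check this, sketched with the hostname patterns mentioned above (assumes the /etc/hosts naming used by network isolation; run on any overcloud node):

# Ping each /etc/hosts entry on the isolated networks once
for h in $(grep -E 'internalapi|storage|storagemgmt|tenant' /etc/hosts | awk '{print $2}'); do
  ping -c 1 -W 2 "$h" >/dev/null 2>&1 && echo "$h reachable" || echo "$h UNREACHABLE"
done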

[stack@hci-director ~]$ openstack server list
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
| ID                                   | Name                    | Status | Networks              | Image Name     |
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
| 8f7f0d17-a0e3-4fbc-8beb-3eb62714035d | overcloud-compute-0     | ACTIVE | ctlplane=192.168.1.34 | overcloud-full |
| 10af4f26-d6e5-45d6-acbf-764e03e12b30 | overcloud-controller-2  | ACTIVE | ctlplane=192.168.1.32 | overcloud-full |
| a094229f-f6d5-42ee-b2a5-5639c4b6c1ae | overcloud-cephstorage-2 | ACTIVE | ctlplane=192.168.1.30 | overcloud-full |
| ed459122-fdb7-4033-9867-15a97526534c | overcloud-controller-0  | ACTIVE | ctlplane=192.168.1.23 | overcloud-full |
| a356f889-3dfa-45a4-80e2-7c956123c442 | overcloud-cephstorage-0 | ACTIVE | ctlplane=192.168.1.21 | overcloud-full |
| f1bc0d58-3cd7-44a7-bedb-d2d43893cca5 | overcloud-cephstorage-1 | ACTIVE | ctlplane=192.168.1.26 | overcloud-full |
| fa644ae4-a4c5-4130-a347-672311d6a378 | overcloud-controller-1  | ACTIVE | ctlplane=192.168.1.25 | overcloud-full |
+--------------------------------------+-------------------------+--------+-----------------------+----------------+
[stack@hci-director ~]$ 

(In reply to Bob Fournier from comment #5)
> To expand on the previous comments, please do a "heat stack-delete
> overcloud" and wait for successful deletion before rerunning the deploy
> command. Thanks.

Yes, I ran the following: 

 heat stack-delete overcloud
 openstack overcloud plan delete overcloud

before starting the new deploy. 

Thanks Bob! 

  John

Comment 7 John Fulton 2016-11-01 18:52:32 UTC
(In reply to Bob Fournier from comment #3)
> John - can you please provide the following 
[snip]
> 3. Full output of 'ip a' on controllers

As per the following, the issue is no longer occurring. 

[stack@hci-director ~]$ ansible mons -b -m shell -a "ip a"
192.168.1.36 | SUCCESS | rc=0 >>
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5c:f1 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.36/24 brd 192.168.1.255 scope global p2p1
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5cf1/64 scope link 
       valid_lft forever preferred_lft forever
3: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 1c:40:24:99:5c:ed brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5ced/64 scope link 
       valid_lft forever preferred_lft forever
4: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 1c:40:24:99:5c:f2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cf2/64 scope link 
       valid_lft forever preferred_lft forever
5: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5c:ef brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cef/64 scope link 
       valid_lft forever preferred_lft forever
6: p2p3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5c:f3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cf3/64 scope link 
       valid_lft forever preferred_lft forever
7: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:65:b7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:65b7/64 scope link 
       valid_lft forever preferred_lft forever
8: p2p4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5c:f4 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cf4/64 scope link 
       valid_lft forever preferred_lft forever
9: em4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:65:b9 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:65b9/64 scope link 
       valid_lft forever preferred_lft forever
10: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether da:be:35:d3:94:85 brd ff:ff:ff:ff:ff:ff
11: br-tenant: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1c:40:24:99:5c:ed brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5ced/64 scope link 
       valid_lft forever preferred_lft forever
12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1c:40:24:99:5c:f2 brd ff:ff:ff:ff:ff:ff
    inet 10.19.139.45/21 brd 10.19.143.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1388:1e40:24ff:fe99:5cf2/64 scope global mngtmpaddr dynamic 
       valid_lft 2591936sec preferred_lft 604736sec
    inet6 fe80::1e40:24ff:fe99:5cf2/64 scope link 
       valid_lft forever preferred_lft forever
13: vlan4050: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether f2:09:87:0c:f4:fa brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.20/24 brd 192.168.3.255 scope global vlan4050
       valid_lft forever preferred_lft forever
    inet6 fe80::f009:87ff:fe0c:f4fa/64 scope link 
       valid_lft forever preferred_lft forever
14: vlan4047@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5c:ef brd ff:ff:ff:ff:ff:ff
    inet 172.16.2.12/24 brd 172.16.2.255 scope global vlan4047
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5cef/64 scope link 
       valid_lft forever preferred_lft forever
15: vlan4046@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5c:ef brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.12/24 brd 172.16.1.255 scope global vlan4046
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5cef/64 scope link 
       valid_lft forever preferred_lft forever
16: vlan4049@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5c:ed brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.21/24 brd 192.168.2.255 scope global vlan4049
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5ced/64 scope link 
       valid_lft forever preferred_lft forever

192.168.1.21 | SUCCESS | rc=0 >>
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 1c:40:24:99:5d:07 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d07/64 scope link 
       valid_lft forever preferred_lft forever
3: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:0b brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.21/24 brd 192.168.1.255 scope global p2p1
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5d0b/64 scope link 
       valid_lft forever preferred_lft forever
4: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:09 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d09/64 scope link 
       valid_lft forever preferred_lft forever
5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 1c:40:24:99:5d:0c brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d0c/64 scope link 
       valid_lft forever preferred_lft forever
6: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:67:25 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:6725/64 scope link 
       valid_lft forever preferred_lft forever
7: em4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:67:27 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:6727/64 scope link 
       valid_lft forever preferred_lft forever
8: p2p3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:0d brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d0d/64 scope link 
       valid_lft forever preferred_lft forever
9: p2p4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:0e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d0e/64 scope link 
       valid_lft forever preferred_lft forever
10: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether e2:86:f3:86:be:c7 brd ff:ff:ff:ff:ff:ff
11: br-tenant: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1c:40:24:99:5d:07 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d07/64 scope link 
       valid_lft forever preferred_lft forever
12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1c:40:24:99:5d:0c brd ff:ff:ff:ff:ff:ff
    inet 10.19.139.40/21 brd 10.19.143.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 2620:52:0:1388:1e40:24ff:fe99:5d0c/64 scope global mngtmpaddr dynamic 
       valid_lft 2591936sec preferred_lft 604736sec
    inet6 fe80::1e40:24ff:fe99:5d0c/64 scope link 
       valid_lft forever preferred_lft forever
13: vlan4050: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 4e:86:75:46:1c:11 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.15/24 brd 192.168.3.255 scope global vlan4050
       valid_lft forever preferred_lft forever
    inet6 fe80::4c86:75ff:fe46:1c11/64 scope link 
       valid_lft forever preferred_lft forever
14: vlan4047@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:09 brd ff:ff:ff:ff:ff:ff
    inet 172.16.2.16/24 brd 172.16.2.255 scope global vlan4047
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5d09/64 scope link 
       valid_lft forever preferred_lft forever
15: vlan4046@em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:09 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.15/24 brd 172.16.1.255 scope global vlan4046
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5d09/64 scope link 
       valid_lft forever preferred_lft forever
16: vlan4049@em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether 1c:40:24:99:5d:07 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.16/24 brd 192.168.2.255 scope global vlan4049
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5d07/64 scope link 
       valid_lft forever preferred_lft forever

192.168.1.23 | SUCCESS | rc=0 >>
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 1c:40:24:99:5c:fa brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cfa/64 scope link tentative 
       valid_lft forever preferred_lft forever
3: p2p1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5c:fe brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.23/24 brd 192.168.1.255 scope global p2p1
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5cfe/64 scope link 
       valid_lft forever preferred_lft forever
4: em2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5c:fc brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cfc/64 scope link 
       valid_lft forever preferred_lft forever
5: p2p2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP qlen 1000
    link/ether 1c:40:24:99:5c:ff brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cff/64 scope link 
       valid_lft forever preferred_lft forever
6: em3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:66:6e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:666e/64 scope link 
       valid_lft forever preferred_lft forever
7: p2p3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:00 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d00/64 scope link 
       valid_lft forever preferred_lft forever
8: em4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:66:70 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:6670/64 scope link 
       valid_lft forever preferred_lft forever
9: p2p4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 1c:40:24:99:5d:01 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5d01/64 scope link 
       valid_lft forever preferred_lft forever
10: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 5a:a3:39:19:0d:9a brd ff:ff:ff:ff:ff:ff
11: br-tenant: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1c:40:24:99:5c:fa brd ff:ff:ff:ff:ff:ff
    inet6 fe80::1e40:24ff:fe99:5cfa/64 scope link 
       valid_lft forever preferred_lft forever
12: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1c:40:24:99:5c:ff brd ff:ff:ff:ff:ff:ff
    inet 10.19.139.39/21 brd 10.19.143.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::1e40:24ff:fe99:5cff/64 scope link 
       valid_lft forever preferred_lft forever
13: vlan4050: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether da:7c:1f:d2:07:5b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::d87c:1fff:fed2:75b/64 scope link tentative 
       valid_lft forever preferred_lft forever

[stack@hci-director ~]$

Comment 8 John Fulton 2016-11-01 18:59:55 UTC
Bob,

When I removed the deprecated options from my openstack overcloud deploy command [1] and expressed them as parameter_defaults in Heat instead [2], the problem went away. 

Is there anything else you want done here (any point in getting you the other output you requested), or should I close this as WORKSFORME? 

  John

[1] 
 --neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant \
 --neutron-network-type vlan \
 --neutron-network-vlan-ranges tenant:4051:4060 \
 --neutron-disable-tunneling 

[2] 
parameter_defaults:
   # --neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant \
   NeutronBridgeMappings: 'datacentre:br-ex,tenant:br-tenant'
   # --neutron-network-type vlan \
   NeutronNetworkType: 'vlan'
   # --neutron-network-vlan-ranges tenant:4051:4060 \
   NeutronNetworkVlanRanges: 'tenant:4051:4060'
   # --neutron-disable-tunneling 
   NeutronTunnelType: ''

Comment 9 Dan Sneddon 2016-11-01 19:20:44 UTC
Glad to hear it worked this time, but the NIC templates look incorrect. I am looking at the controller NIC configuration here:

https://github.com/RHsyseng/hci/blob/master/custom-templates/nic-configs/controller-nics.yaml

I see that the Internal API network is listed as a VLAN with device: em1. This is incorrect, since em1 is bound to br-tenant. It should instead appear like this:

            -
              # VLAN tenant networking
              type: ovs_bridge
              name: br-tenant
              mtu: 1500
              use_dhcp: false
              members:
                -
                  type: interface
                  name: em1
                  mtu: 1500
                  use_dhcp: false
                  # force the MAC address of the bridge to this interface
                  primary: true
                -
                  type: vlan
                  mtu: 1500
                  vlan_id: {get_param: TenantNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: TenantIpSubnet}
                -
                  type: vlan
                  mtu: 1500
                  vlan_id: {get_param: InternalApiNetworkVlanID}
                  addresses:
                  -
                    ip_netmask: {get_param: InternalApiIpSubnet}

I see a similar mistake in the compute NIC config. It may work the way it is set up, but you will be running a completely untested configuration. I recommend using device: <iface> only when creating VLANs directly on an interface that is not a member of an OVS bridge.

Some other things to point out. Typically when there is more than one bridge on the controllers, we use NeutronExternalNetworkBridge: "''" (that's single quotes inside double quotes). This will result in Neutron setting up patches between br-int and any bridges where external networks are set up. This allows the flexibility to create additional external networks on VLANs of br-ex or br-tenant. It won't break anything to use "br-ex", as long as external networks are only placed on br-ex on the native VLAN, but I thought it deserved a mention.

If you are still having issues after making these changes, please include the contents of ifcfg-br-ex.

Also, if you can run "sudo grep os-net-config /var/log/messages" and attach the output in a text file, that would also be helpful.

Comment 10 Bob Fournier 2016-11-01 19:28:23 UTC
John - glad to hear that it is working now after making the disable-tunneling change
and deleting the stack. Yes, if you could close it as WORKSFORME that would be great; please also make the nic-config changes Dan recommended. Thanks.

Comment 11 John Fulton 2016-11-02 00:12:46 UTC
Thank you Dan and Bob,

I've switched all VLANs on em1 to being defined on the OVS bridge only [0], and will keep the switch from the legacy CLI options [1] to defining them in Heat instead [2]. Closing, works for me. 

  John

[0] https://github.com/RHsyseng/hci/commit/478cb9d0ad42fab3b03639ae9c713b8620b6f9c0

[1]
openstack overcloud deploy --templates \
... 
 --neutron-bridge-mappings datacentre:br-ex,tenant:br-tenant \
 --neutron-network-type vlan \
 --neutron-network-vlan-ranges tenant:4051:4060 \
 --neutron-disable-tunneling

[2]
parameter_defaults:
   NeutronBridgeMappings: 'datacentre:br-ex,tenant:br-tenant'
   NeutronNetworkType: 'vlan'
   NeutronNetworkVlanRanges: 'tenant:4051:4060'
   NeutronTunnelType: ''

