Bug 1753196

Summary: ACI OpenStack Neutron Server Timeout When Scaling Heat Stack
Product: Red Hat OpenStack
Reporter: Luigi Tamagnone <ltamagno>
Component: openstack-heat
Assignee: smooney
Status: CLOSED NOTABUG
QA Contact: nlevinki <nlevinki>
Severity: medium
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: aschultz, mburns, sbaker, shardy
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-09-20 11:27:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version: ---
Embargoed:

Description Luigi Tamagnone 2019-09-18 11:55:19 UTC
Description of problem:

When scaling up an existing Heat stack that already has many nodes (> 50) by more than 10 nodes at a time, some of the VMs get stuck in the BUILD (spawning) state and the Heat stack eventually times out.
This appears to be caused by a Neutron timeout in the Cisco ACI/APIC mechanism driver.
We opened a ticket with Cisco, but they referred us to Red Hat, stating that the issue lies in a Neutron/Nova WSGI call.

Version-Release number of selected component (if applicable):
openstack-heat-engine-10.0.3-5.el7ost.noarch
openstack-heat-common-10.0.3-5.el7ost.noarch
openstack-heat-api-10.0.3-5.el7ost.noarch
openstack-heat-api-cfn-10.0.3-5.el7ost.noarch
openstack-tripleo-heat-templates-8.3.1-54.el7ost.noarch
python-heat-agent-1.5.4-1.el7ost.noarch
python2-heatclient-1.14.1-1.el7ost.noarch
puppet-heat-12.4.1-0.20190214021237.a7ed720.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch

How reproducible:
Every time a RHOSP 13 Heat stack with more than 50 nodes is scaled up by more than 10 nodes at once.


Steps to Reproduce:
1. Scale up a RHOSP 13 Heat stack with more than 50 nodes by more than 10 nodes in a single update.
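
As a sketch of what such a scale-up looks like (the stack and parameter names here are hypothetical, not taken from this report; the affected deployment may use a different template layout):

```shell
# Illustrative only: assumes the stack sizes a nova-compute ResourceGroup
# via a "node_count" parameter. Raising the count by more than ~10 in one
# update (e.g. 50 -> 65) reportedly triggers the timeout.
openstack stack update my-overcloud-stack \
    --existing \
    --parameter node_count=65 \
    --wait
```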


Actual results:
Some of the VMs get stuck in the BUILD (spawning) state, and the Heat stack eventually times out.

Expected results:
The stack scales up properly.

Additional info:

The current workaround is to scale up the stack in multiple steps of fewer than 10 nodes each, which avoids the issue.

Another workaround that reduces the problem somewhat, but is not a proper fix:
1. Increase the timeout value in the [neutron] section of nova.conf on all hypervisors from 30 to 300.
2. Increase the HAProxy timeout value to 300s.
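
A rough sketch of that workaround (option names assumed from the standard Queens-era configuration; the HAProxy backend name is hypothetical and must match the actual deployment):

```ini
# /etc/nova/nova.conf on each hypervisor: raise the Neutron client
# timeout (a keystoneauth session option) from the 30s used here to 300s.
[neutron]
timeout = 300
```

```
# /etc/haproxy/haproxy.cfg on the controllers: raise the server-side
# timeout for the Neutron API backend ("neutron" is an assumed name).
backend neutron
    timeout server 300s
```

After editing, restart nova-compute on the hypervisors and reload HAProxy on the controllers for the changes to take effect.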