Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1613200

Summary: [Scale][HA] Unable to Spawn all the 500 VMs due to PortStatus not getting updated to ACTIVE for certain VMs
Product: Red Hat OpenStack
Component: opendaylight
Version: 13.0 (Queens)
Reporter: Sridhar Gaddam <sgaddam>
Assignee: Josh Hershberg <jhershbe>
QA Contact: Noam Manos <nmanos>
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: high
CC: aadam, jhershbe, mkolesni, mpeterso, nyechiel, smalleni
Target Milestone: z4
Target Release: 13.0 (Queens)
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard: HA
Last Closed: 2018-10-07 06:59:01 UTC
Type: Bug
Attachments:
- neutron-controller-0-iter1
- opendaylight-controller-0-iter1
- neutron-controller-1-iter1
- opendaylight-controller-1-iter1
- neutron-controller-2-iter1
- overcloud-controller-2-iter1

Description Sridhar Gaddam 2018-08-07 08:38:36 UTC
Description of problem:

Deployment with 3 Controllers and 45 computes.

While running the Browbeat network_nova_boot scenario (concurrency 10, times set to 500), only 156 of the 500 VMs were spawned successfully.
Because the setup had other known issues, such as ODL being killed due to OOM (addressed via other RHBZs) and the "ODL L2" agent status becoming flaky, we applied the following tweaks before triggering the tests.

1. Configured the inactivity_probe value to 180 seconds (for both manager and controller connections) on all nodes.
2. Set restconf_poll_interval=15 (seconds) in the /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/ml2_conf.ini file.
3. Enabled debug logging in Neutron (/var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf).
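For reference, the three tweaks above could be applied roughly as follows. This is a hedged sketch, not the exact commands used: the ovs-vsctl record names, the [ml2_odl] section name, and the use of crudini are assumptions, and OVSDB inactivity_probe values are in milliseconds.

```shell
# 1. Raise the inactivity probe to 180 s (180000 ms) on the controller
#    and manager connections; repeat on every node. Record names assumed.
ovs-vsctl set Controller br-int inactivity_probe=180000
for m in $(ovs-vsctl list Manager | awk '/^_uuid/ {print $3}'); do
    ovs-vsctl set Manager "$m" inactivity_probe=180000
done

# 2. Lower the networking-odl REST polling interval to 15 seconds.
crudini --set /var/lib/config-data/puppet-generated/neutron/etc/neutron/plugins/ml2/ml2_conf.ini \
    ml2_odl restconf_poll_interval 15

# 3. Enable debug logging in Neutron.
crudini --set /var/lib/config-data/puppet-generated/neutron/etc/neutron/neutron.conf \
    DEFAULT debug True
```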

With these tweaks, the ODL L2 agent was stable during the test run.
However, we still could not spawn all 500 VMs (concurrency 10).

Version-Release number of selected component (if applicable):
opendaylight-8.3.0-1.el7ost.noarch

How reproducible:
Easily reproducible

Steps to Reproduce:
1. Deploy OSP with ODL: 3 controllers and 45 computes.
2. Run the Browbeat network_nova_boot scenario with concurrency 10 and times set to 500 (i.e., spawn 500 VMs).
3. Once the Browbeat tests complete, check how many VMs were spawned.

Actual results:
During the test run, many VMs could not transition to the ACTIVE state and were subsequently deleted by Nova (after a 5-minute timeout). We suspect this is an issue with PortStatus not being updated to ACTIVE in a clustered setup.

Expected results:
All 500 VMs should spawn successfully.

Additional info:

Comment 1 Sridhar Gaddam 2018-08-07 08:44:43 UTC
Some additional notes:
During the test run, JAVA_HEAP was also raised to 8 GB to avoid OOM, and no OOM occurred during the entire run.

Comment 3 Sridhar Gaddam 2018-08-07 09:55:12 UTC
Created attachment 1473924 [details]
neutron-controller-0-iter1

Comment 4 Sridhar Gaddam 2018-08-07 09:56:18 UTC
Created attachment 1473925 [details]
opendaylight-controller-0-iter1

Comment 5 Sridhar Gaddam 2018-08-07 09:59:48 UTC
Created attachment 1473926 [details]
neutron-controller-1-iter1

Comment 6 Sridhar Gaddam 2018-08-07 10:01:34 UTC
Created attachment 1473927 [details]
opendaylight-controller-1-iter1

Comment 7 Sridhar Gaddam 2018-08-07 10:09:12 UTC
Created attachment 1473929 [details]
neutron-controller-2-iter1

Comment 8 Sridhar Gaddam 2018-08-07 10:10:36 UTC
Created attachment 1473930 [details]
overcloud-controller-2-iter1

Comment 10 Josh Hershberg 2018-08-09 04:18:41 UTC
I spent some time analyzing this and can say with confidence that there is no reason to assume a problem with the port-status update mechanism. Here's the breakdown of what I found.

* The neutron logs indicate 1913 ports in total.
* Of those, only 401 ports never transition to ACTIVE.
* Of those, only 260 are VM ports.
* Of those, 248 ports have no log line indicating that genius received the OpenFlow port-status event and created the InterfaceState. This usually means the VM port was never attached to the switch.
* Of the remainder, 6 ports have a gap of more than 5 minutes between when the neutron port shows up in the karaf log and when the smac and dmac flows are programmed (port status is set to ACTIVE immediately after that). Five minutes is how long nova waits for a VM's port to go active, so nova gives up waiting and sets the VM to ERROR state.
* The final 6 ports never get smac and dmac flows configured; why requires further research.

That covers all the failed ports: 248 + 6 + 6 = 260 VM ports.
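The triage above is simple set arithmetic over per-port facts pulled from the neutron and karaf logs. A hypothetical Python sketch of the bucketing; the port field names are invented stand-ins for the real log parsing:

```python
# Hypothetical triage sketch mirroring the breakdown above. Each "port"
# is a dict of facts the real analysis would extract from the logs;
# the field names are invented for illustration.

NOVA_PORT_TIMEOUT_S = 300  # nova waits 5 minutes for a port to go ACTIVE


def triage(ports):
    """Bucket failed VM ports the way the analysis above does."""
    never_active = [p for p in ports if not p["went_active"]]
    vm_ports = [p for p in never_active if p["is_vm"]]
    # No InterfaceState => genius never saw the OpenFlow port-status event.
    no_ifstate = [p for p in vm_ports if not p["genius_ifstate"]]
    rest = [p for p in vm_ports if p["genius_ifstate"]]
    # Flows programmed, but only after nova's 5-minute timeout.
    slow_flows = [p for p in rest
                  if p["flow_delay_s"] is not None
                  and p["flow_delay_s"] > NOVA_PORT_TIMEOUT_S]
    # Flows never programmed at all.
    no_flows = [p for p in rest if p["flow_delay_s"] is None]
    return {
        "never_active": len(never_active),
        "vm_never_active": len(vm_ports),
        "no_interface_state": len(no_ifstate),
        "flows_after_timeout": len(slow_flows),
        "flows_never": len(no_flows),
    }
```

On the logs described above, the three VM buckets (248, 6, 6) would sum to the 260 failed VM ports.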

Obviously, this requires more research to determine why all these failures are happening.

Comment 11 Josh Hershberg 2018-08-09 11:54:35 UTC
In all honesty, it seems the system was quite simply very, very overloaded: all queues were backed up, and it was generally hosed. I am not sure this is a bug; the load may just be too far beyond our performance capabilities.

Comment 12 Ariel Adam 2018-08-09 11:58:26 UTC
If OVS can support it, then ODL should as well.
Do we need to increase the memory or CPU cores?

Comment 13 Josh Hershberg 2018-10-07 06:59:01 UTC
After consultation with Sridhar, we agreed to close this one. It was initially opened on the suspicion that the failures were somehow related to port status. At this point it is clear that this is not a specific bug but rather cluster overload (e.g., ask timeouts). These issues are already being worked on in various forums.