Bug 2041606

Summary: [OSP16.2] Timeout when creating multiple instances at the same time (on OVN)
Product: Red Hat OpenStack Reporter: ggrimaux
Component: openstack-neutron Assignee: OSP Team <rhos-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Eran Kuris <ekuris>
Severity: high Docs Contact:
Priority: unspecified    
Version: 16.2 (Train) CC: bcafarel, chrisw, dhill, jlibosva, mlavalle, rurena, scohen
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-07-21 13:59:24 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description ggrimaux 2022-01-17 20:23:47 UTC
Description of problem:
The client is hitting many issues when spawning multiple instances at the same time.
At our request, the client is running the latest version of OSP16.2 (z1).
They also run the following hotfixes on top of it:
https://bugzilla.redhat.com/show_bug.cgi?id=1969349
https://bugzilla.redhat.com/show_bug.cgi?id=2015619

Yet there are still issues.

To understand where the client is coming from: they currently run a much heavier load on an older version of OpenStack with OVS (the new deployment uses OVN).

The controller nodes have 64 cores and 385GB of RAM, and run on physical hardware.

They deploy test instances with https://opendev.org/zuul/nodepool.
In their old cloud deployment they can deploy up to 6000 instances a day, and they can spawn 500 instances in a few minutes.
In their new OSP16.2.1 deployment with OVN, deployment fails even with a load of barely 100 instances.
The workload only spawns regular instances: no FIPs, and the default security group with only 6 rules (all ports open for IPv4/v6).
Their environment is fully integrated with Grafana so we can see the progress of things and behaviors.

Today I sat in on a deployment test of 250 instances.
They track both the number of libvirt instances running and the number of instances from the OpenStack point of view.
Before the test started there were around 160 instances. When the request for 250 instances was launched, the count reached around 260 within a few minutes from the OSP point of view, but the libvirt metrics peaked at around 190. Barely 30 instances were created before nova started complaining about 504 responses on some calls.
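To make the gap between the two metrics explicit, here is a minimal sketch of the arithmetic using the approximate figures quoted above. The function name and dictionary keys are illustrative only and are not part of the client's Grafana setup.

```python
def burst_progress(baseline: int, api_count: int, libvirt_count: int) -> dict:
    """Compare instances known to the OpenStack API with those running in libvirt."""
    accepted = api_count - baseline      # new instances from the OSP point of view
    running = libvirt_count - baseline   # new instances actually started by libvirt
    return {"accepted": accepted, "running": running, "stuck": accepted - running}


# Approximate values observed during the test (all figures are rounded).
print(burst_progress(baseline=160, api_count=260, libvirt_count=190))
# {'accepted': 100, 'running': 30, 'stuck': 70}
```

In other words, roughly 100 new instances were known to the API, but only about 30 of them reached a running libvirt domain.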

The client will apply the hotfix from the following BZ tomorrow (January 18) and try again to see whether it helps:
https://bugzilla.redhat.com/show_bug.cgi?id=2037332

They will also enable neutron logs and provide sosreport from the controller nodes.

I will get back to you once I have the logs and have extracted the current error.

The client is wondering whether we will be able to stabilize the environment in the next 2 weeks. If not, they are thinking of wiping the setup and reinstalling 16.2 with OVS, since OVS works fine for them on older releases.

They are currently installing 16.2 with OVS to see whether it runs fine on the latest release.

For the record, 16.2 with OVN works just fine for their other internal clients. No issues.
It only fails with this particular workload.

Version-Release number of selected component (if applicable):
OSP16.2.1

How reproducible:
100%

Steps to Reproduce:
1. Try to deploy 100+ instances with nodepool
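The client drives the load with nodepool, but the same kind of burst can be approximated for testing. The following is a minimal sketch that fires many create requests concurrently and tallies the failures. `launch_burst` and the `create_server` callable are hypothetical helpers, not part of nodepool or nova; the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def launch_burst(create_server: Callable[[str], object],
                 count: int,
                 max_workers: int = 50) -> Dict[str, int]:
    """Submit `count` create requests concurrently and count the outcomes."""
    results = {"ok": 0, "failed": 0}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(create_server, f"burst-{i:04d}")
                   for i in range(count)]
        for future in as_completed(futures):
            try:
                future.result()
                results["ok"] += 1
            except Exception:
                # Any error (for example an HTTP 504 from the API) counts as a failure.
                results["failed"] += 1
    return results
```

With openstacksdk, `create_server` could be a small wrapper around `conn.compute.create_server(name=..., image_id=..., flavor_id=..., networks=[...])`, where the image, flavor and network values depend on the environment and are left out here.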

Actual results:
Currently 504 in nova calls

Expected results:
Deployment performance as good as with OVS in the earlier release.

Additional info:
We will get sosreport with neutron in debug mode.

Comment 2 ggrimaux 2022-01-17 20:31:21 UTC
Forgot to mention that while nodepool is creating instances, we see the following processes on the controller nodes reaching 99-100% CPU and staying there continuously:
neutron-server: rpc worker
neutron-server: api worker

We even saw octavia reaching 99% for a little while.
Yet the load is not touching octavia at all.
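One way to capture this observation for the case notes is to filter process listings for busy neutron-server workers. The sketch below assumes input in the shape produced by `ps -eo pcpu,args` (CPU percentage, then the command line). The function name and the 90% threshold are illustrative assumptions. Note that `ps` reports CPU usage averaged over the process lifetime, so `top` remains the better tool for instantaneous readings.

```python
from typing import List, Tuple


def busy_workers(ps_output: str,
                 prefix: str = "neutron-server:",
                 threshold: float = 90.0) -> List[Tuple[float, str]]:
    """Return (cpu, command) pairs for matching processes at or above threshold."""
    hits = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue
        try:
            cpu = float(parts[0])
        except ValueError:
            continue  # skip the header line or anything that is not a number
        if parts[1].startswith(prefix) and cpu >= threshold:
            hits.append((cpu, parts[1]))
    # Highest CPU consumers first.
    return sorted(hits, reverse=True)
```

On a controller node the input could be collected with `subprocess.run(["ps", "-eo", "pcpu,args"], capture_output=True, text=True).stdout`.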

Comment 8 Miguel Lavalle 2022-07-21 13:59:24 UTC
The parameters that were tuned in the debugging session with BMW have now become the defaults in the product.