Description of problem:
The client has many issues spawning multiple instances at the same time. At our request, the client is running the latest version of OSP16.2 (so z1). They also run the following hotfixes on top of this:
https://bugzilla.redhat.com/show_bug.cgi?id=1969349
https://bugzilla.redhat.com/show_bug.cgi?id=2015619
Yet there are still issues.

To understand where the client is coming from: they currently run a much heavier load on an older version of OpenStack with OVS (versus OVN now). The controller nodes have 64 cores and 385GB of RAM, and run on physical hardware. They deploy test instances with https://opendev.org/zuul/nodepool. In their old cloud deployment they can deploy up to 6000 instances a day, and they can spawn 500 instances in a few minutes. In their new OSP16.2.1 with OVN, it fails even with a load of barely 100 instances.

The workload only spawns regular instances: no FIPs, and the default Security Group with only 6 rules (all ports open for IPv4/v6).

Their environment is fully integrated with Grafana, so we can see the progress and behavior of things. Today I sat in on a deployment test of 250 instances. They track both the number of libvirt instances running and the number of instances from the OpenStack point of view. Before starting, there were around 160 instances. When the call for 250 instances was launched, the count reached around 260 within a few minutes from the OSP point of view, but the libvirt metrics peaked at around 190. Barely 30 instances were created before nova started reporting 504 on some calls.

The client will include the hotfix from this BZ tomorrow (January 18) and try again to see whether it helps:
https://bugzilla.redhat.com/show_bug.cgi?id=2037332
They will also enable neutron logs and provide sosreports from the controller nodes. I will get back to you when I have the logs and can extract the current error.

The client is wondering if we will be able to stabilize the environment in the next 2 weeks. If not, they are thinking of wiping out the setup and reinstalling 16.2 with OVS, since it works fine with older releases. They are currently installing 16.2 with OVS to see if it runs fine with the latest release.

For the record, 16.2 with OVN works just fine for their other internal client. No issue. It is really only with this particular workload that it fails.

Version-Release number of selected component (if applicable):
OSP16.2.1

How reproducible:
100%

Steps to Reproduce:
1. Try to deploy 100+ instances with nodepool

Actual results:
Currently 504 in nova calls

Expected results:
Deployment working as well as it does with OVS in the earlier release.

Additional info:
We will get sosreports with neutron in debug mode.
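For reference, the Grafana numbers from the 250-instance test can be worked through as below. This is only a rough sketch: it assumes both counters (OpenStack and libvirt) started at roughly 160, and the function name `spawn_gap` is made up for illustration.

```python
def spawn_gap(baseline, openstack_count, libvirt_count):
    """Compare how many new instances OpenStack recorded with how many libvirt actually ran.

    Assumption: both counters started from the same baseline.
    """
    recorded = openstack_count - baseline   # new instances known to OpenStack
    running = libvirt_count - baseline      # new instances actually running in libvirt
    return {
        "recorded": recorded,
        "running": running,
        "missing": recorded - running,      # recorded by OpenStack but never started
    }


# Approximate values read from the test described above.
print(spawn_gap(baseline=160, openstack_count=260, libvirt_count=190))
# → {'recorded': 100, 'running': 30, 'missing': 70}
```

So roughly 100 instances were recorded on the OpenStack side, about 30 were actually running in libvirt, and about 70 never got that far.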
Forgot to mention that while nodepool is creating instances, we see the following processes on the controller nodes reaching 99-100% CPU and staying there non-stop:
neutron-server: rpc worker
neutron-server: api worker
We even saw octavia reaching 99% for a little while, even though the load does not touch octavia at all.
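If it helps to capture this outside of Grafana, here is a minimal sketch for listing busy neutron-server workers on a controller. It assumes a Linux host with the procps `ps` command; the helper names and the 90% threshold are illustrative only and not part of any product tooling. Note that `ps` reports CPU averaged over the process lifetime, so it is coarser than what `top` shows.

```python
import subprocess


def busy_neutron_workers(ps_output, threshold=90.0):
    """Return (cpu_percent, command) pairs for neutron-server processes at or above threshold.

    ps_output is the text produced by `ps -eo pcpu,args`.
    """
    busy = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue
        cpu_text, command = parts
        try:
            cpu = float(cpu_text)
        except ValueError:
            continue  # skip the header row
        if "neutron-server" in command and cpu >= threshold:
            busy.append((cpu, command))
    return busy


def sample_ps():
    """Run ps once and return its output (assumes procps ps is installed)."""
    result = subprocess.run(
        ["ps", "-eo", "pcpu,args"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `busy_neutron_workers(sample_ps())` on a controller node while nodepool is running should list the rpc and api workers mentioned above if they are pegged.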
The parameters that were tuned in the debugging session with BMW have now become the defaults in the product.