Bug 2219693

Summary: [OSP 16.2][neutron] slow API reply for tenant port listing is causing instance creation to fail
Product: Red Hat OpenStack
Component: openstack-neutron
Version: 16.2 (Train)
Status: NEW
Severity: high
Priority: high
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: Flavio Piccioni <fpiccion>
Assignee: Rodolfo Alonso <ralonsoh>
QA Contact: Eran Kuris <ekuris>
CC: astillma, chrisw, enothen, gregraka, ihrachys, jamsmith, ralonsoh, scohen

Description Flavio Piccioni 2023-07-04 19:52:47 UTC
Description of problem:
VM creation is failing. The problem seems related to nova timing out while trying to retrieve the tenant's ports:

2023-06-14 10:39:12.837 30 ERROR nova.api.openstack.wsgi keystoneauth1.exceptions.connection.ConnectTimeout: Request to http://$ip:9696/v2.0/ports?tenant_id=$uuid&fields=id timed out

Running "openstack port list --tenant $tenant" takes 1 minute and 50 seconds for roughly 900 ports.
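
To measure the same request independently of nova, a minimal Python sketch along these lines could time the call directly against the Neutron API. The endpoint, token and tenant ID are placeholders that the operator has to supply, the helper names are hypothetical, and the token is assumed to be passed in the usual X-Auth-Token header.

```python
import time
import urllib.parse
import urllib.request


def build_ports_url(endpoint, tenant_id, fields=("id",)):
    """Build the kind of URL seen in the nova log: /v2.0/ports?tenant_id=...&fields=id."""
    query = [("tenant_id", tenant_id)] + [("fields", f) for f in fields]
    return endpoint.rstrip("/") + "/v2.0/ports?" + urllib.parse.urlencode(query)


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start


def list_ports(url, token, timeout=300):
    """GET the port list; the timeout is generous so that a slow reply can still be measured."""
    request = urllib.request.Request(url, headers={"X-Auth-Token": token})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Usage (placeholders, not real values):
#   url = build_ports_url("http://<neutron-ip>:9696", "<tenant-uuid>")
#   body, elapsed = timed(list_ports, url, "<token>")
#   print(f"{len(body)} bytes in {elapsed:.1f}s")
```

Comparing the elapsed time with the timeout nova uses for its Neutron client shows directly whether the listing alone is enough to trigger the ConnectTimeout.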

Version-Release number of selected component (if applicable):
RHOSP 16.1.8 with OpFlex and Cisco plugins
RHOSP 16.2.5 with OpFlex and Cisco plugins


How reproducible:

Steps to Reproduce:
1. Find (or create) a tenant with a lot of ports from different subnets (>900 in this case)
2. Try to create a VM (openstack server create --flavor xxx --image xxx --network test-net test-vm)

Actual results:
Listing the tenant's ports takes about 2 minutes, causing nova's instance creation workflow to time out.

Expected results:
Listing the tenant's ports should be more efficient (see, for example, https://bugs.launchpad.net/neutron/+bug/2016704
or https://review.opendev.org/c/openstack/neutron/+/790691)

Comment 2 Ihar Hrachyshka 2023-07-05 14:04:30 UTC
To confirm, there's nothing specific about the ports in question used for the scenario, right? E.g. these are not trunk ports, or ports with a large number of additional SGs / address pairs... Anything particular about the ports in question? Just want to make sure we don't miss something specific about the ports.

===

Another question I have for the scenario is - why does nova list all (?) ports as part of a VM boot? Shouldn't it list just the ports that belong to the VM? Perhaps there's some nova backport missing to optimize it from compute side. Perhaps something to check on with Compute team.
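
For illustration, the difference between the two lookups can be sketched as two Neutron port queries: one filtered by tenant (the form seen in the nova log) and one filtered by device_id, which the Neutron ports API accepts as a filter. Whether nova can use the narrower form at this point of the boot workflow is exactly the open question; the endpoint and IDs below are placeholders and the helper name is hypothetical.

```python
import urllib.parse


def ports_query(endpoint, **filters):
    """Return a Neutron ports URL with the given filters, asking only for port IDs."""
    params = dict(filters, fields="id")
    return endpoint.rstrip("/") + "/v2.0/ports?" + urllib.parse.urlencode(params)


endpoint = "http://neutron.example:9696"  # placeholder endpoint, not a real host

# Broad lookup: every port of the tenant (about 900 in the reported case).
all_tenant_ports = ports_query(endpoint, tenant_id="TENANT_UUID")

# Narrow lookup: only the ports bound to a single instance.
instance_ports = ports_query(endpoint, device_id="SERVER_UUID")

print(all_tenant_ports)
print(instance_ports)
```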

Comment 3 Flavio Piccioni 2023-07-05 19:03:00 UTC
(In reply to Ihar Hrachyshka from comment #2)
> To confirm, there's nothing specific about the ports in question used for
> the scenario, right? E.g. these are not trunk ports, or ports with a large
> number of additional SGs / address pairs... Anything particular about the
> ports in question? Just want to make sure we don't miss something specific
> about the ports.
> 

From the customer's reply: "all these VMs have single connection and no trunks. We have very simple VMs. Regarding security groups, we do not have more than 2-3"


> 
> Another question I have for the scenario is - why does nova list all (?)
> ports as part of a VM boot? Shouldn't it list just the ports that belong to
> the VM? Perhaps there's some nova backport missing to optimize it from
> compute side. Perhaps something to check on with Compute team.

From a very quick query to DFG:compute: there is probably a historical reason, perhaps quotas, as enforcing port quotas

Comment 4 Rodolfo Alonso 2023-07-10 15:47:18 UTC
Hello Flavio:

Some questions about this issue:
1) To what subnets/networks do these 900 ports belong? Do these networks belong to other projects?
2) Does "tenant" (from the "port list" command) have admin privileges, or is it a regular user?
3) How many RBAC rules are there in the project? To what project do these RBAC entries belong? *NOTE* I'm not asking about the target project in the RBAC rule but about the project from which these RBAC entries were created. In particular, I'm wondering whether these rules belong to the same project as "tenant". The output of "select * from networkrbacs" will help.
4) Did the customer enable the "slow_query_log" in the database engine? If so, the output will be very helpful. It can be enabled/disabled at runtime with [1].

Regards.

[1] https://access.redhat.com/solutions/321003
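
The RBAC question in point 3 comes down to grouping the rows of networkrbacs by the project that created them. A minimal sketch follows, using an in-memory SQLite table as a stand-in for the Neutron database. The column names (id, object_id, project_id, target_tenant, action) are assumed to match the Train schema, the helper name is hypothetical, and the sample rows are invented purely for illustration.

```python
import sqlite3

# In-memory stand-in for the Neutron database (assumption: Train-era column names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE networkrbacs ("
    "id TEXT PRIMARY KEY, object_id TEXT, project_id TEXT, "
    "target_tenant TEXT, action TEXT)"
)

# Invented sample rows: (id, network, owning project, target project, action).
rows = [
    ("r1", "net-a", "proj-owner", "proj-tenant", "access_as_shared"),
    ("r2", "net-b", "proj-owner", "*", "access_as_shared"),
    ("r3", "net-c", "proj-tenant", "proj-other", "access_as_shared"),
]
conn.executemany("INSERT INTO networkrbacs VALUES (?, ?, ?, ?, ?)", rows)


def rbac_count_by_owner(connection):
    """Count RBAC entries per owning project (project_id), not per target project."""
    query = (
        "SELECT project_id, COUNT(*) FROM networkrbacs "
        "GROUP BY project_id ORDER BY project_id"
    )
    return dict(connection.execute(query).fetchall())


counts = rbac_count_by_owner(conn)
print(counts)
```

The same GROUP BY statement could be run against the real database to see quickly whether the entries were created from the project that runs the port listing or from other projects.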

Comment 9 Ihar Hrachyshka 2023-07-11 19:30:50 UTC
Reported https://bugzilla.redhat.com/show_bug.cgi?id=2222102 to track nova optimization to avoid fetching all ports on VM creation.

Comment 12 Ihar Hrachyshka 2023-07-12 15:52:09 UTC
Rodolfo, I don't think get_ports calls into ML2 drivers, does it? I believe I checked it in the code when I was looking at the 16.1 counterpart of this bz, and I couldn't find where an ML2 driver could be called by the ML2 plugin. Perhaps I'm missing something?

Comment 15 Rodolfo Alonso 2023-07-13 08:31:40 UTC
Hello Ihar:

I can't confirm that for the 'apic_aim' mechanism driver. Using ML2/OVS or ML2/OVN and testing in a similar environment (RBACs, users, networks, ports), the Neutron API response time is around 5-6 seconds. Because we don't provide support for this driver, and I don't know what extensions this mechanism driver is loading (they could affect the "_make_port_dict" method, which is most probably what is happening), I can neither triage nor analyse this issue.
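
The hypothesis above can be illustrated with a toy model (this is not Neutron code): if every port dictionary is passed through each hook registered by the loaded extensions, the work for one listing grows with ports x hooks, so a single expensive hook is paid once per port. All names below are hypothetical.

```python
def make_port_dict(port, extend_hooks):
    """Toy stand-in for building one port's API dict and letting extensions add to it."""
    result = {"id": port["id"]}
    for hook in extend_hooks:
        hook(result, port)
    return result


calls = {"count": 0}


def counting_hook(result, port):
    """Hypothetical extension hook; a real one might run extra queries per port."""
    calls["count"] += 1
    result.setdefault("extra", []).append(calls["count"])


ports = [{"id": f"port-{i}"} for i in range(900)]  # same order of magnitude as the report
hooks = [counting_hook] * 3                        # three loaded extensions, as an assumption

listing = [make_port_dict(p, hooks) for p in ports]
print(len(listing), calls["count"])
```

If one hook cost even 100 ms per port, 900 ports would already account for 90 seconds, which is the order of magnitude reported here; that figure is arithmetic on an assumed cost, not a measurement.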

Regards.