Description of problem:
VM creation is failing; the problem seems related to nova hitting a timeout while retrieving the tenant's ports:

2023-06-14 10:39:12.837 30 ERROR nova.api.openstack.wsgi keystoneauth1.exceptions.connection.ConnectTimeout: Request to http://$ip:9696/v2.0/ports?tenant_id=$uuid&fields=id timed out

Running "openstack port list --tenant $tenant" takes 1 minute and 50 seconds for roughly 900 ports.

Version-Release number of selected component (if applicable):
RHOSP 16.1.8 with opflex and Cisco's plugins
RHOSP 16.2.5 with opflex and Cisco's plugins

How reproducible:

Steps to Reproduce:
1. Find (or create) a tenant with a large number of ports from different subnets (>900 in this case)
2. Try to create a VM (openstack server create --flavor xxx --image xxx --network test-net test-vm)

Actual results:
Listing the tenant's ports takes about 2 minutes, causing nova's instance-creation workflow to time out.

Expected results:
Listing a tenant's ports should be more efficient (see for example https://bugs.launchpad.net/neutron/+bug/2016704 or https://review.opendev.org/c/openstack/neutron/+/790691).
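For reference, the failing request from the traceback can be reconstructed as follows. This is a minimal sketch only; the endpoint and tenant values are placeholders standing in for the sanitized $ip/$uuid from the log, and the helper name is hypothetical, not nova code:

```python
# Minimal sketch of the Neutron API call that times out in the traceback
# above. Endpoint and tenant_id are placeholder values, not real data.
from urllib.parse import urlencode

def neutron_port_query(endpoint, tenant_id, fields=("id",)):
    """Build the ports-listing URL that appears in the nova traceback."""
    params = [("tenant_id", tenant_id)] + [("fields", f) for f in fields]
    return f"{endpoint}/v2.0/ports?{urlencode(params)}"

print(neutron_port_query("http://203.0.113.10:9696", "abc123"))
# -> http://203.0.113.10:9696/v2.0/ports?tenant_id=abc123&fields=id
```

Note that the request already restricts the response to fields=id, so the slowness is not explained by the payload size alone.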
To confirm, there's nothing specific about the ports used in this scenario, right? E.g. these are not trunk ports, or ports with a large number of additional SGs / address pairs... Anything particular about the ports in question? Just want to make sure we don't miss something specific about them.

===

Another question I have for the scenario: why does nova list all (?) ports as part of a VM boot? Shouldn't it list just the ports that belong to the VM? Perhaps there's a nova backport missing to optimize this on the compute side. Something to check on with the Compute team.
(In reply to Ihar Hrachyshka from comment #2)
> To confirm, there's nothing specific about the ports in question used for
> the scenario. right? E.g. these are not trunk ports, or ports with a large
> number of additional SGs / address pairs... Anything particular about the
> ports in question? Just want to make sure we don't miss something specific
> about the ports.

From the customer's reply: "all these VMs have single connection and no trunks. We have very simple VMs. Regarding security groups, we do not have more than 2-3"

> Another question I have for the scenario is - why does nova list all (?)
> ports as part of a VM boot? Shouldn't it list just the ports that belong to
> the VM? Perhaps there's some nova backport missing to optimize it from
> compute side. Perhaps something to check on with Compute team.

From a very quick query to DFG:compute: there is probably a historical reason, perhaps quotas, i.e. enforcing port quotas.
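The quota explanation above would look roughly like the following. This is a hypothetical sketch of the idea only (the names here are illustrative, not nova's actual code): to enforce the Neutron port quota, nova first has to count all of the tenant's existing ports, which is why it lists them all at boot:

```python
# Hypothetical sketch of quota enforcement as a reason for listing all
# of a tenant's ports at VM boot. Not nova's actual implementation.
class OverQuotaError(Exception):
    pass

def check_port_quota(existing_port_ids, quota, requested=1):
    """Raise if creating `requested` more ports would exceed `quota`.

    `existing_port_ids` is the full list of the tenant's port IDs, i.e.
    the result of the expensive ports?tenant_id=...&fields=id call.
    """
    used = len(existing_port_ids)
    if used + requested > quota:
        raise OverQuotaError(f"port quota {quota} exceeded: {used} in use")
    return True
```

If this is indeed the reason, the cost of the quota check scales with the number of ports the tenant already has, matching the reported symptom.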
Hello Flavio:

Some questions about this issue:

1) These 900 ports, to what subnets/networks do they belong? Do these networks belong to other projects?
2) Does "tenant" (from the "port list" command) have admin powers, or is it a regular user?
3) How many RBAC rules are there in the project? To what project do these RBAC entries belong? *NOTE* I'm not asking about the target project of the RBAC rule, but from what project these RBAC entries were created. In particular, I'm wondering if these rules belong to the same project as "tenant". A "select * from networkrbacs" will help.
4) Did the customer enable the "slow_query_log" in the database engine? If so, the output will be very helpful. This can be enabled/disabled in real time with [1].

Regards.

[1] https://access.redhat.com/solutions/321003
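For point 4, the runtime toggle typically looks like the following on MariaDB/MySQL (a config sketch only; the log file path is an example, see the linked solution [1] for the supported procedure):

```sql
-- Example only: enable the slow query log at runtime (MariaDB/MySQL).
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- log queries slower than 1 second
SET GLOBAL slow_query_log_file = '/var/lib/mysql/mariadb-slow.log';
```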
Reported https://bugzilla.redhat.com/show_bug.cgi?id=2222102 to track nova optimization to avoid fetching all ports on VM creation.
Rodolfo, I don't think get_ports calls into the ml2 drivers, does it? I believe I checked the code when I was looking at the 16.1 counterpart of this bz, and I couldn't find where an ml2 driver could be called by the ml2 plugin. Perhaps I'm missing something?
Hello Ihar:

I can't confirm that for the 'apic_aim' mechanism driver. Actually, using ML2/OVS or ML2/OVN and testing in a similar environment (RBACs, users, networks, ports), the Neutron API response time is around 5-6 seconds.

Because we don't provide support for this driver and I don't know what extensions this mechanism driver is loading (which could affect the "_make_port_dict" method, and most probably this is what is happening), I can't triage or analyse this issue.

Regards.
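To illustrate why loaded extensions matter here, the sketch below approximates the pattern where every extension gets to post-process each port dict as it is built, so any per-port work an extension does is multiplied across all ~900 ports. Names are illustrative only, not neutron's actual internals:

```python
# Illustrative sketch: per-port extension hooks, loosely modeled on the
# idea that extensions can amend each port dict as it is serialized.
# Hook names and structure are hypothetical, not neutron code.
EXTENSION_HOOKS = []

def register_hook(func):
    """Register an extension hook that runs once per port."""
    EXTENSION_HOOKS.append(func)
    return func

@register_hook
def add_binding_info(port):
    # A cheap example hook; a slow one (e.g. an extra lookup per port)
    # would add its cost 900 times over for this tenant.
    port["binding:host_id"] = port.get("host", "")
    return port

def make_port_dict(port):
    result = {"id": port["id"], "tenant_id": port["tenant_id"]}
    for hook in EXTENSION_HOOKS:  # every loaded extension runs per port
        hook(result)
    return result

ports = [{"id": str(i), "tenant_id": "t1"} for i in range(900)]
dicts = [make_port_dict(p) for p in ports]
```

Under this model, a mechanism driver that loads extensions with expensive per-port hooks would explain the gap between ~5-6 seconds (ML2/OVS, ML2/OVN) and ~2 minutes (apic_aim) for the same port count.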