Description of problem:
As part of booting a VM, nova calls the neutron API to get security groups by tenant_id. Intermittently, this call takes more than 30 seconds, which forces nova to reset the connection and causes the VM boot to fail.

Errors in nova:
2021-03-09 13:20:07.794 7 ERROR nova.compute.manager [req-04ebb6cb-2ca9-42e4-b180-2b342968ff6e 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] Instance failed network setup after 1 attempt(s): keystoneauth1.exceptions.connection.ConnectTimeout: Request to http://172.17.1.107:9696/v2.0/security-groups?tenant_id=dda154eb250e40168ed38c219cba7819 timed out
2021-03-09 13:20:07.796 7 ERROR nova.compute.manager [req-04ebb6cb-2ca9-42e4-b180-2b342968ff6e 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] [instance: d936728f-0933-48d9-94e2-dc9a91366bb1] Instance failed to spawn: keystoneauth1.exceptions.connection.ConnectTimeout: Request to http://172.17.1.107:9696/v2.0/security-groups?tenant_id=dda154eb250e40168ed38c219cba7819 timed out

Errors in neutron:
2021-03-09 13:20:07.390 25 INFO neutron.wsgi [req-95b8def9-d36d-421e-9bd9-13f54506fcfe 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 591, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 537, in write
    wfile.writelines(towrite)
  File "/usr/lib64/python3.6/socket.py", line 604, in write
    return self._sock.send(b)
  File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 397, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 384, in _send_loop
    return send_method(data, *args)
ConnectionResetError: [Errno 104] Connection reset by peer
2021-03-09 13:20:07.391 25 INFO neutron.wsgi [req-95b8def9-d36d-421e-9bd9-13f54506fcfe 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] 172.17.1.48 "GET /v2.0/security-groups?tenant_id=dda154eb250e40168ed38c219cba7819 HTTP/1.1" status: 200 len: 0 time: 45.1947863

I found this during security group log scale testing, but I am also able to reproduce it on a normal setup.

Neutron resources present when we start seeing these errors:

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "mysql -e 'select COUNT(*) from ovs_neutron.securitygroupportbindings'";
+----------+
| COUNT(*) |
+----------+
|     8310 |
+----------+

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "mysql -e 'select COUNT(*) from ovs_neutron.securitygroups'";
+----------+
| COUNT(*) |
+----------+
|     1790 |
+----------+

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "mysql -e 'select COUNT(*) from ovs_neutron.securitygrouprules'";
+----------+
| COUNT(*) |
+----------+
|    11445 |
+----------+

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "date -u && mysql -e 'select COUNT(*) from ovs_neutron.securitygrouprules'";
Thu Mar 11 10:30:46 UTC 2021
+----------+
| COUNT(*) |
+----------+
|    11477 |
+----------+

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "date -u && mysql -e 'select COUNT(*) from ovs_neutron.ports'";
Thu Mar 11 10:34:09 UTC 2021
+----------+
| COUNT(*) |
+----------+
|     3908 |
+----------+

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "date -u && mysql -e 'select COUNT(*) from ovs_neutron.networks'";
Thu Mar 11 10:34:21 UTC 2021
+----------+
| COUNT(*) |
+----------+
|      366 |
+----------+

Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210216.n.1

How reproducible:
1. Run rally scenario https://github.com/cloud-bulldozer/browbeat/blob/master/rally/rally-plugins/netcreate-boot/netcreate_nova_boot_fip_ping_sec_groups.py#L31 with the configuration
   Num_vms: 4
   num_sg:2
   num_sgr:10
   concurrency:16
   times: 500

Flavio has done some analysis:
https://docs.google.com/document/d/1h3JgXQgAOKI64WTXtiDqP47JEONJMNHlLxZbFN2-wUs/edit?ts=60494053#heading=h.innw5ujbmlza

Rally uses one user and one project in this testing, so all the neutron resources are created under a single tenant.

(overcloud) [stack@undercloud ~]$ openstack project list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 3acb775d30b24db5b195cd13788e6e33 | c_rally_8c9a5195_zTrSbWRr |
| 4251cad0bb224f2b8d8f9b89c148c785 | admin                     |
| 5a862093635e4d4c83e834fe1e3379bc | service                   |
+----------------------------------+---------------------------+

(overcloud) [stack@undercloud ~]$ openstack user list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 0d0700a464074354b57da6d3e44a2da0 | admin                     |
| 8521bd5c7b194027b00ae7acd7623bb4 | cinder                    |
| e6c48a55e72945f6b579a10f73b3ce5c | cinderv2                  |
| 24ea8943aabf4b9fa8c8c57172c1e268 | cinderv3                  |
| 3fd92baf9e2643e6af514023f2eb488e | glance                    |
| 869f89fa9e7f49888b72d5bae2ae99a2 | heat                      |
| 49297ac7d5fa4fa78938d9425bc33fb6 | heat_stack_domain_admin   |
| 0e974bf8f05c412fa2c97fc203060c6f | heat-cfn                  |
| fac9913c2cac4e3580a6fa13097db635 | neutron                   |
| 85c01593a35642a39a7d662d2e71e98d | nova                      |
| 8453b9b3e4ad42cea9433cdada50691f | placement                 |
| 1e3a5702c9c04d99bc1fc31cabe4c10e | swift                     |
| f3304db57f394f8c98e5116117245228 | c_rally_8c9a5195_faAB98kr |
+----------------------------------+---------------------------+

Neutron logs:
http://rdu-storage01.scalelab.redhat.com/anilvenkata/20210309-114754.tar.gz

Rally logs:
http://rdu-storage01.scalelab.redhat.com/anilvenkata/20210309-114754-rally.tar.gz
Before we invest too much energy in this, it may be important to know what customers actually need. It is possible that the number of security group rules matters more than the number of security groups. @atrag may know.

Should we need to address this BZ, the most viable way is to change the requester (nova in this case) to use pagination and leverage what is already implemented in the neutron API. That basically means using the "limit" and "marker" parameters:

    @db_api.retry_if_session_inactive()
    def get_security_groups(self, context, filters=None, fields=None,
                            sorts=None, limit=None, marker=None,
                            page_reverse=False, default_sg=False)

https://github.com/openstack/neutron/blob/98db3b9c80a6309f7b91624c5eeea159f0b57081/neutron/db/securitygroups_db.py#L161-L164
https://docs.openstack.org/api-ref/network/v2/index.html#security-groups-security-groups

More info on pagination is available here:
https://docs.openstack.org/api-ref/network/v2/#pagination
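To make the pagination idea concrete, here is a minimal sketch of how a requester could walk the security group list page by page with "limit" and "marker" instead of asking for everything in one call. This is not nova's actual code: iter_security_groups, make_fake_fetch, and the fetch_page callable are hypothetical names used for illustration. fetch_page stands in for whatever performs GET /v2.0/security-groups with the given query parameters, and the fake below only imitates the marker semantics (return the items that follow the item whose ID equals the marker).

```python
from typing import Callable, Dict, Iterator, List, Optional

Page = List[Dict[str, str]]


def iter_security_groups(
    fetch_page: Callable[[Dict[str, object]], Page],
    tenant_id: str,
    page_size: int = 100,
) -> Iterator[Dict[str, str]]:
    """Yield security groups one page at a time using limit/marker."""
    marker: Optional[str] = None
    while True:
        params: Dict[str, object] = {"tenant_id": tenant_id, "limit": page_size}
        if marker is not None:
            # The marker is the ID of the last item of the previous page.
            params["marker"] = marker
        page = fetch_page(params)
        if not page:
            # Stop on an empty page only: the server may cap the page size,
            # so a short page does not prove that the listing is finished.
            return
        yield from page
        marker = page[-1]["id"]


def make_fake_fetch(groups: Page) -> Callable[[Dict[str, object]], Page]:
    """Hypothetical in-memory stand-in for the HTTP call, for demonstration."""
    def fetch(params: Dict[str, object]) -> Page:
        start = 0
        marker = params.get("marker")
        if marker is not None:
            ids = [g["id"] for g in groups]
            start = ids.index(marker) + 1
        return groups[start:start + int(params["limit"])]
    return fetch


if __name__ == "__main__":
    fake_groups = [{"id": "sg-%03d" % i} for i in range(7)]
    fetch = make_fake_fetch(fake_groups)
    tenant = "dda154eb250e40168ed38c219cba7819"
    for sg in iter_security_groups(fetch, tenant, page_size=3):
        print(sg["id"])
```

Each request then returns a bounded number of rows, so no single call has to serialize every security group of the tenant. The cost is one extra round trip at the end to observe the empty page.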
Slawek, as this is a very important patch for scale (helpful for many customers), is it possible to backport it to OSP13?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3762