Bug 1937705 - GET API on SG takes more than 30 seconds at scale
Summary: GET API on SG takes more than 30 seconds at scale
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z7
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Slawek Kaplonski
QA Contact: Alex Katz
URL:
Whiteboard:
Depends On: 1962032
Blocks: 1960566
Reported: 2021-03-11 11:35 UTC by anil venkata
Modified: 2021-12-09 20:18 UTC (History)

Fixed In Version: openstack-nova-20.4.1-1.20210510133324.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1960566 1962032 (view as bug list)
Environment:
Last Closed: 2021-12-09 20:18:11 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 783275 | 0 | None | MERGED | [neutron] Get only ID and name of the SGs from Neutron | 2022-05-05 12:08:47 UTC
Red Hat Issue Tracker | OSP-339 | 0 | None | None | None | 2021-11-18 11:29:50 UTC
Red Hat Product Errata | RHBA-2021:3762 | 0 | None | None | None | 2021-12-09 20:18:44 UTC

Description anil venkata 2021-03-11 11:35:35 UTC
Description of problem:

As part of booting a VM, nova calls the neutron API to get security groups by tenant_id. Intermittently this call takes more than 30 seconds, which forces nova to reset the connection and causes the VM boot to fail.

Errors in nova:

2021-03-09 13:20:07.794 7 ERROR nova.compute.manager [req-04ebb6cb-2ca9-42e4-b180-2b342968ff6e 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] Instance failed network setup after 1 attempt(s): keystoneauth1.exceptions.connection.ConnectTimeout: Request to http://172.17.1.107:9696/v2.0/security-groups?tenant_id=dda154eb250e40168ed38c219cba7819 timed out

2021-03-09 13:20:07.796 7 ERROR nova.compute.manager [req-04ebb6cb-2ca9-42e4-b180-2b342968ff6e 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] [instance: d936728f-0933-48d9-94e2-dc9a91366bb1] Instance failed to spawn: keystoneauth1.exceptions.connection.ConnectTimeout: Request to http://172.17.1.107:9696/v2.0/security-groups?tenant_id=dda154eb250e40168ed38c219cba7819 timed out


Errors in neutron:

2021-03-09 13:20:07.390 25 INFO neutron.wsgi [req-95b8def9-d36d-421e-9bd9-13f54506fcfe 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 591, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python3.6/site-packages/eventlet/wsgi.py", line 537, in write
    wfile.writelines(towrite)
  File "/usr/lib64/python3.6/socket.py", line 604, in write
    return self._sock.send(b)
  File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 397, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python3.6/site-packages/eventlet/greenio/base.py", line 384, in _send_loop
    return send_method(data, *args)
ConnectionResetError: [Errno 104] Connection reset by peer

2021-03-09 13:20:07.391 25 INFO neutron.wsgi [req-95b8def9-d36d-421e-9bd9-13f54506fcfe 51068b87c80b4859b15d4c01dba26c50 dda154eb250e40168ed38c219cba7819 - default default] 172.17.1.48 "GET /v2.0/security-groups?tenant_id=dda154eb250e40168ed38c219cba7819 HTTP/1.1" status: 200  len: 0 time: 45.1947863

I identified this during security group log scale testing, but I am also able to reproduce it on a normal setup.

Counts of neutron resources at the point where we start seeing these errors:

[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "mysql -e 'select COUNT(*) from ovs_neutron.securitygroupportbindings'";
+----------+
| COUNT(*) |
+----------+
|     8310 |
+----------+
[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "mysql -e 'select COUNT(*) from ovs_neutron.securitygroups'";
+----------+
| COUNT(*) |
+----------+
|     1790 |
+----------+
[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "mysql -e 'select COUNT(*) from ovs_neutron.securitygrouprules'";
+----------+
| COUNT(*) |
+----------+
|    11445 |
+----------+
[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "date -u && mysql -e 'select COUNT(*) from ovs_neutron.securitygrouprules'";
Thu Mar 11 10:30:46 UTC 2021
+----------+
| COUNT(*) |
+----------+
|    11477 |
+----------+
[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "date -u && mysql -e 'select COUNT(*) from ovs_neutron.ports'";
Thu Mar 11 10:34:09 UTC 2021
+----------+
| COUNT(*) |
+----------+
|     3908 |
+----------+
[root@controller-0 ~]# podman exec -it -u root galera-bundle-podman-0 sh -c "date -u && mysql -e 'select COUNT(*) from ovs_neutron.networks'";
Thu Mar 11 10:34:21 UTC 2021
+----------+
| COUNT(*) |
+----------+
|      366 |
+----------+



Version-Release number of selected component (if applicable):
RHOS-16.1-RHEL-8-20210216.n.1

How reproducible:
1. Run the rally scenario https://github.com/cloud-bulldozer/browbeat/blob/master/rally/rally-plugins/netcreate-boot/netcreate_nova_boot_fip_ping_sec_groups.py#L31
with the configuration Num_vms: 4, num_sg: 2, num_sgr: 10, concurrency: 16, times: 500

Flavio has done some analysis: https://docs.google.com/document/d/1h3JgXQgAOKI64WTXtiDqP47JEONJMNHlLxZbFN2-wUs/edit?ts=60494053#heading=h.innw5ujbmlza

Rally uses one user and one project in this testing, so all the neutron resources are created under one tenant.
(overcloud) [stack@undercloud ~]$ openstack project list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 3acb775d30b24db5b195cd13788e6e33 | c_rally_8c9a5195_zTrSbWRr |
| 4251cad0bb224f2b8d8f9b89c148c785 | admin                     |
| 5a862093635e4d4c83e834fe1e3379bc | service                   |
+----------------------------------+---------------------------+
(overcloud) [stack@undercloud ~]$ openstack user list
+----------------------------------+---------------------------+
| ID                               | Name                      |
+----------------------------------+---------------------------+
| 0d0700a464074354b57da6d3e44a2da0 | admin                     |
| 8521bd5c7b194027b00ae7acd7623bb4 | cinder                    |
| e6c48a55e72945f6b579a10f73b3ce5c | cinderv2                  |
| 24ea8943aabf4b9fa8c8c57172c1e268 | cinderv3                  |
| 3fd92baf9e2643e6af514023f2eb488e | glance                    |
| 869f89fa9e7f49888b72d5bae2ae99a2 | heat                      |
| 49297ac7d5fa4fa78938d9425bc33fb6 | heat_stack_domain_admin   |
| 0e974bf8f05c412fa2c97fc203060c6f | heat-cfn                  |
| fac9913c2cac4e3580a6fa13097db635 | neutron                   |
| 85c01593a35642a39a7d662d2e71e98d | nova                      |
| 8453b9b3e4ad42cea9433cdada50691f | placement                 |
| 1e3a5702c9c04d99bc1fc31cabe4c10e | swift                     |
| f3304db57f394f8c98e5116117245228 | c_rally_8c9a5195_faAB98kr |
+----------------------------------+---------------------------+

Neutron logs http://rdu-storage01.scalelab.redhat.com/anilvenkata/20210309-114754.tar.gz
Rally logs http://rdu-storage01.scalelab.redhat.com/anilvenkata/20210309-114754-rally.tar.gz

Comment 1 ffernand 2021-03-12 11:38:30 UTC
Before we invest too much energy in this, it would help to know what customers actually require.
The number of security group rules may matter more than the number of security groups. @atrag may know.

If we do need to address this BZ, the most viable approach is to change the requester (nova in this case) to use pagination,
which is already implemented in the neutron API. That essentially means using the "limit" and "marker"
parameters.

    @db_api.retry_if_session_inactive()
    def get_security_groups(self, context, filters=None, fields=None,
                            sorts=None, limit=None,
                            marker=None, page_reverse=False, default_sg=False):

    https://github.com/openstack/neutron/blob/98db3b9c80a6309f7b91624c5eeea159f0b57081/neutron/db/securitygroups_db.py#L161-L164
    https://docs.openstack.org/api-ref/network/v2/index.html#security-groups-security-groups

More info on pagination is available here:

    https://docs.openstack.org/api-ref/network/v2/#pagination
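
A minimal sketch of what "limit"/"marker" pagination could look like on the requester side. This is only an illustration of the idea in this comment, not nova's code and not the change that was eventually merged. The names iter_security_groups, fetch_page and make_fake_fetch are hypothetical; fetch_page stands in for whatever issues GET /v2.0/security-groups with the given query parameters, and the sketch assumes pagination is enabled on the neutron server.

```python
from typing import Callable, Dict, Iterator, List, Optional

Page = List[Dict[str, str]]


def iter_security_groups(
    fetch_page: Callable[[Dict[str, str]], Page],
    tenant_id: str,
    page_size: int = 100,
) -> Iterator[Dict[str, str]]:
    """Yield security groups one page at a time using limit/marker.

    fetch_page is a hypothetical callable that performs the GET request
    with the given query parameters and returns the list of groups.
    """
    marker: Optional[str] = None
    while True:
        params = {"tenant_id": tenant_id, "limit": str(page_size)}
        if marker is not None:
            # marker is the ID of the last item of the previous page
            params["marker"] = marker
        page = fetch_page(params)
        yield from page
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch
            return
        marker = page[-1]["id"]


def make_fake_fetch(groups: Page) -> Callable[[Dict[str, str]], Page]:
    """In-memory stand-in for the neutron endpoint (assumption, for demo only)."""
    def fetch(params: Dict[str, str]) -> Page:
        limit = int(params["limit"])
        start = 0
        if "marker" in params:
            ids = [g["id"] for g in groups]
            start = ids.index(params["marker"]) + 1
        return groups[start:start + limit]
    return fetch


if __name__ == "__main__":
    fake_groups = [{"id": f"sg-{i}", "name": f"group-{i}"} for i in range(7)]
    fetch = make_fake_fetch(fake_groups)
    for sg in iter_security_groups(fetch, "dda154eb250e40168ed38c219cba7819", page_size=3):
        print(sg["id"], sg["name"])
```

Each request then returns at most page_size groups, so no single call has to serialize every security group of the tenant. Total work is not reduced, but it is split into smaller responses that are less likely to exceed the requester's timeout.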

Comment 7 anil venkata 2021-05-14 08:30:27 UTC
Slawek, since this is a very important patch for scale (helpful for many customers), is it possible to backport it to OSP13?

Comment 30 errata-xmlrpc 2021-12-09 20:18:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.7 (Train) bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3762

