Bug 1571499 - Cannot Scale past 75 VMs- scale regression
Summary: Cannot Scale past 75 VMs- scale regression
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: nova-maint
QA Contact: nova-maint
URL:
Whiteboard: scale_lab
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2018-04-25 02:20 UTC by Sai Sindhur Malleni
Modified: 2019-09-09 13:50 UTC (History)
28 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-31 14:24:08 UTC
Target Upstream Version:


Attachments (Terms of Use)
overcloud-controller-0 (5.94 MB, application/x-gzip)
2018-04-25 02:23 UTC, Sai Sindhur Malleni
no flags Details
overcloud-controller-1 (6.00 MB, application/x-gzip)
2018-04-25 02:23 UTC, Sai Sindhur Malleni
no flags Details
overcloud-controller-2 (5.87 MB, application/x-gzip)
2018-04-25 02:24 UTC, Sai Sindhur Malleni
no flags Details
overcloud-novacompute-0 (46.89 KB, application/x-gzip)
2018-04-25 02:26 UTC, Sai Sindhur Malleni
no flags Details
overcloud-novacompute-1 (48.58 KB, application/x-gzip)
2018-04-25 02:27 UTC, Sai Sindhur Malleni
no flags Details
overcloud-novacompute-2 (50.55 KB, application/x-gzip)
2018-04-25 02:28 UTC, Sai Sindhur Malleni
no flags Details
overcloud-novacompute-3 (56.68 KB, application/x-gzip)
2018-04-25 02:28 UTC, Sai Sindhur Malleni
no flags Details
overcloud-novacompute-4 (48.26 KB, application/x-gzip)
2018-04-25 02:29 UTC, Sai Sindhur Malleni
no flags Details
overcloud-novacompute-5 (56.51 KB, application/x-gzip)
2018-04-25 02:29 UTC, Sai Sindhur Malleni
no flags Details

Description Sai Sindhur Malleni 2018-04-25 02:20:53 UTC
Description of problem:

In an OpenStack setup with 3 controllers and 6 compute nodes where each compute node has 128G of memory and 56 cores, we are trying to scale test by spawning instances with an m1.xtiny flavor (1 cpu, 64 MB RAM and 1G disk. 

we are using a browbeat rally scenario to spawn 300 VMs at a concurrency on 10 with each VM on its own network (https://github.com/openstack/browbeat/blob/master/rally/rally-plugins/netcreate-boot/netcreate_nova_boot.py).


We see that once we hit around 75 VMs, nova boots start failing with several reasons as below (hypervisor capacity is not a problem)

'message': u'Build of instance 7db8f60d-1570-4be4-a63b-e52b285e9085 aborted: Request to http://172.16.0.18:9696/v2.0/ports timed out', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance\n    bdms=block_device_mapping)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in _
_exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2025, in _build_and_run_instance\n    instance=instance)\n  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__\n    
self.gen.throw(type, value, traceback)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2265, in _build_resources\n    reason=six.text_type(exc))\n', u'created': u'2018-04-25T01:13:53Z'} |
===============================================================================
'message': u'Build of instance 8dd19606-fc38-417c-b88c-262f0a1a72fb aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python2.7/site
-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance\n    bdms=block_device_mapping)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __ex
it__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2005, in _build_and_run_instance\n    block_device_mapping) as resources:\n  File "/usr/lib64/python2.7/contextlib.py", line 17, in 
__enter__\n    return self.gen.next()\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2175, in _build_resources\n    reason=msg)\n', u'created': u'2018-04-25T01:36:58Z'} |
===============================================================================
Timed out waiting for a reply to message ID f56a8117ac97469d996695343e19dada', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2132, in _build_and_run_instance\n    instance.save(expected_task_state=task_states.SPAWNING)\n  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper\n    ctxt, self, fn.__name__, args, kwargs)\n  File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 245, in object_action\n    objmethod=objmethod, args=args, kwargs=kwargs)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call\n    retry=self.retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send\n    timeout=timeout, retry=retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send\n    retry=retry)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send\n    result = self._waiter.wait(msg_id, timeout)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait\n    message = self.waiters.get(msg_id, timeout=timeout)\n  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get\n    \'to message ID %s\' % msg_id)\n', u'created': u'2018-04-25T01:21:44
===============================================================================
Build of instance 527b2a09-d228-4b25-82ef-6fd415c989f3 aborted: Failure prepping block device.', u'code': 500, u'details': u'  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance\n    filter_properties, request_spec)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance\n    bdms=block_device_mapping)\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__\n    self.force_reraise()\n  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise\n    six.reraise(self.type_, self.value, self.tb)\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2005, in _build_and_run_instance\n    block_device_mapping) as resources:\n  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__\n    return self.gen.next()\n  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2222, in _build_resources\n    reason=msg)\n', u'created': u'2018-04-25T01:26:46Z'}



Version-Release number of selected component (if applicable):
OSP 13
Container image tag 2018-04-10.2

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP with 3 controllers and 6 compute nodes
2. Run Browbeat+Rally scale tests to create networks and VMs
3.

Actual results:
Cannot scale past 75 VMs

Expected results:
In previous releases we could scale to atleast a few hundred VMs on the same setup

Additional info:
This is a performance and scale regression. I have also noticed this happen irrespective of the neutron backend (ml2/ovs, ml2/ovn and ml2/odl)

Comment 1 Sai Sindhur Malleni 2018-04-25 02:23:16 UTC
Created attachment 1426323 [details]
overcloud-controller-0

Comment 2 Sai Sindhur Malleni 2018-04-25 02:23:58 UTC
Created attachment 1426324 [details]
overcloud-controller-1

Comment 3 Sai Sindhur Malleni 2018-04-25 02:24:37 UTC
Created attachment 1426325 [details]
overcloud-controller-2

Comment 4 Sai Sindhur Malleni 2018-04-25 02:26:20 UTC
Created attachment 1426326 [details]
overcloud-novacompute-0

Comment 5 Sai Sindhur Malleni 2018-04-25 02:27:41 UTC
Created attachment 1426327 [details]
overcloud-novacompute-1

Comment 6 Sai Sindhur Malleni 2018-04-25 02:28:08 UTC
Created attachment 1426328 [details]
overcloud-novacompute-2

Comment 7 Sai Sindhur Malleni 2018-04-25 02:28:42 UTC
Created attachment 1426329 [details]
overcloud-novacompute-3

Comment 8 Sai Sindhur Malleni 2018-04-25 02:29:06 UTC
Created attachment 1426330 [details]
overcloud-novacompute-4

Comment 9 Sai Sindhur Malleni 2018-04-25 02:29:35 UTC
Created attachment 1426331 [details]
overcloud-novacompute-5

Comment 10 Joe Talerico 2018-04-25 11:06:44 UTC
I am seeing very similar results. Couple of things to point out:

- Timeouts waiting replies from neutron/ports
- Timeouts on message replies
- DB Deadlocks

Given this seems related to overall slowdowns, I wondering if the oslo.rootwrap issue that hit us with Neutron containerization is impacting us here (across multiple services)? 

Containers are deploying with : https://gist.github.com/jtaleric/ddd7b996a253f95163e5867fd15774d3

Which, oslo.rootwrap will try to close all FDs -- slowing down services that use oslo.rootwrap. 

I am trying to patch my osp13 env to test this theory out...

Comment 12 Janki 2018-04-25 14:04:48 UTC
I have observed false timeouts for subnet creations. Subnet actually gets created but the reply says ERROR creating subnet. TIMED OUT.

Comment 13 Sai Sindhur Malleni 2018-04-25 18:49:32 UTC
Please find sosreports here:
http://elk.browbeatproject.org:9090/~smalleni/sosreports/

The conrtoller sosreports are:
sosreport-SaiMalleni-20180425124105.tar.xz
sosreport-SaiMalleni-20180425124134.tar.xz
sosreport-SaiMalleni-20180425124202.tar.xz

Comment 14 Sai Sindhur Malleni 2018-04-25 20:21:24 UTC
The attachments to the BZ only contain the openstack service logs, like nova, neutron etc (running in debug). The sosreports linked to in th eprevious comments are from the same environment.

Comment 15 Sai Sindhur Malleni 2018-04-26 03:30:14 UTC
I was able to reproduce this again on freshly deployed cloud. However I used a slightly different scenario this time. Earlier, I was hitting a 75 VM limit in the following scenario:
Create a network
Create subnet
Launch a VM on that subnet

So for every VM we wwere creating a new subnet.

This time I launched all VMs on the same subnet. I got upto 112 VMs after which I got stuck.

The most commons errors can be summed up as:

1. MessagingTimeout: Timed out waiting for a reply to message ID in neutron dhcp-agent, l3-agent and openvswitch agent logs along with nova-conductor logs.

2. We are also seeing DBDeadlock errors in neutron server, nova-conductor and ceilometer agent-notification logs.

3. ConnectTimeout in nova-api logs to neutron
ConnectTimeout: Request to http://172.16.0.13:9696/v2.0/ports?device_id=b39cd902-8f24-40ca-b2d8-4fbf43e8dffa timed out


In neutorn-server logs
 DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%(heartbeat_timestamp)s WHERE agents.id = %(agents_id)s'] [parameters: {'heartbeat_timestamp': datetime.datetime(2018, 4, 26, 3, 0, 15, 923947), 'agents_id': u'229e8791-963e-450f-887b-17c72bd4263e'}] (Background on this error at: http://sqlalche.me/e/2j85)
2018-04-26 03:01:06.934 47 ERROR oslo_db.api
==============================================================================

In nova-conductor logs
DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE services SET updated_at=%(updated_at)s, report_count=%(report_count)s WHERE services.id = %(services_id)s'] [parameters: {'services_id': 71, 'updated_at': datetime.datetime(2018, 4, 26, 3, 18, 5, 974982), 'report_count': 1704}] (Background on this error at: http://sqlalche.me/e/2j85)

============================================================================
In ceilometer agent-notification logs
DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'INSERT INTO event (message_id, `generated`, raw, event_type_id) VALUES (%(message_id)s, %(generated)s, %(raw)s, %(event_type_id)s)'] [parameters: {'raw': '{}', 'generated': Decimal('1524712674.345072'), 'event_type_id': 416, 'message_id': u'95a7ae0f-9572-492a-b0b2-21e30d520b7c'}] (Background on this error at: http://sqlalche.me/e/2j85)

Comment 16 Sai Sindhur Malleni 2018-04-26 17:58:38 UTC
I came to a hard stop at 106 VMs even with the neutron ulimit patch. 
https://review.openstack.org/#/c/559631/1

Comment 17 melanie witt 2018-04-26 23:43:58 UTC
(In reply to Sai Sindhur Malleni from comment #15)
> In nova-conductor logs
> DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded;
> try restarting transaction') [SQL: u'UPDATE services SET
> updated_at=%(updated_at)s, report_count=%(report_count)s WHERE services.id =
> %(services_id)s'] [parameters: {'services_id': 71, 'updated_at':
> datetime.datetime(2018, 4, 26, 3, 18, 5, 974982), 'report_count': 1704}]
> (Background on this error at: http://sqlalche.me/e/2j85)

For the nova related part, this query is the service report_count update, which is the "heartbeat" the various nova services use. The query [0] already has a @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True) decorator on it and I don't find any other queries that could be competing with it. I suspect that this failure is a side effect of the overall RPC timeouts that are being encountered in the environment, for example, having heartbeats from various services backlogged in the RPC queue and several being tried at once as a result.

In order to alleviate load on the message queue from nova service heartbeats, I would recommend increasing the [DEFAULT]report_interval and [DEFAULT]service_down_time in the nova.conf files. The defaults are report_interval=10 (seconds) and service_down_time=60 (seconds) and note that service_down_time must be greater than report_interval.

Overall, it looks like neutron is overloaded and failing to reply in time "ConnectTimeout: Request to http://172.16.0.13:9696/v2.0/ports?device_id=b39cd902-8f24-40ca-b2d8-4fbf43e8dffa timed out", causing messaging timeouts, which have a domino effect to other services.

[0] https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/db/sqlalchemy/api.py;h=73ca7ea36854ea13b245de5b5d0063c460ae4cd9;hb=307382f58d38778b480d2d030e427759a44c204b#l566

Comment 19 Yatin Karel 2018-04-30 07:30:08 UTC
(In reply to Sai Sindhur Malleni from comment #16)
> I came to a hard stop at 106 VMs even with the neutron ulimit patch. 
> https://review.openstack.org/#/c/559631/1
So that's better 106 as compared to 75.

@sai, are you not using OSP 13.0 latest puddle for this scale testing?
If not can you also try https://review.openstack.org/#/c/560991/ which fixes ulimit for nova-compute and cinder-volume?

Comment 20 Yatin Karel 2018-04-30 08:07:04 UTC
(In reply to Yatin Karel from comment #19)
> (In reply to Sai Sindhur Malleni from comment #16)
> > I came to a hard stop at 106 VMs even with the neutron ulimit patch. 
> > https://review.openstack.org/#/c/559631/1
> So that's better 106 as compared to 75.
> 
> @sai, are you not using OSP 13.0 latest puddle for this scale testing?
> If not can you also try https://review.openstack.org/#/c/560991/ which fixes
> ulimit for nova-compute and cinder-volume?

I checked attached logs from a compute node(compute-2) but can't find impact of large ulimit(following concurrency.processutils operations in logs:- Running cmd (subprocess): ......).
I think https://review.openstack.org/#/c/560991/ would improve the current situation a lot.

I have seen a good difference with this patch for subprocess operations.
Can i also see sosreport for any compute node(compute-2 or any) to see how bad operations are when doing scale testing with large ulimit.

Comment 21 Sai Sindhur Malleni 2018-04-30 13:38:51 UTC
I'm using puddle 2018-04-24.1

I tried patching just nova services annd neutron with the ulimit (but missed out on cinder-volme). However, the results were not any better. I stopped around 93 VMs.

Comment 23 Sai Sindhur Malleni 2018-05-02 16:07:44 UTC
Dan, 
the disk on one of the controllers does seem to be at a 100%
https://snapshot.raintank.io/dashboard/snapshot/AwGS5d0wVb7zic3H230xC6sAbhMnWlTf

Comment 24 Joe Talerico 2018-05-02 16:41:39 UTC
I am seeing similar disk utilization. Swift is quite busy using the disk (More than likely due to Telem).

https://snapshot.raintank.io/dashboard/snapshot/kUjMjnP2e1ZZ67qS7u5zq3JoU9N8uIRi

I am going to disable telem and re-test.

Comment 25 Sai Sindhur Malleni 2018-05-02 17:54:45 UTC
Using iotop to look at processes waiting on I/O looks like swift-object-server is #1
Total DISK READ :      24.81 K/s | Total DISK WRITE :     306.03 K/s
 104003 be/4 42445	 8.27 K/s    0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
 104002 be/4 42445	 0.00 B/s    0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
 104000 be/4 42445	 0.00 B/s    0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
 103988 be/4 42445	 0.00 B/s    0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
 103982 be/4 42445	 0.00 B/s    0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
 103966 be/4 42445	 0.00 B/s    0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
103944 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf

Comment 26 Sai Sindhur Malleni 2018-05-02 17:55:32 UTC
The above output is when the disk is at 100 util.

Comment 27 Sai Sindhur Malleni 2018-05-02 18:08:57 UTC
Also, it is worth noting that disk util goes to 100% only on one of the controllers and remains at 100%. The other two controllers have disk utilizatio n less than 5%. Even if swift is the suspect, is there anything that could be causing this only on one controller?

Comment 28 Joe Talerico 2018-05-02 18:41:10 UTC
Reviewing my system during some of this testing we can see Swift is quite busy, and so is the system disk it is using (shared resource across all controller services). 

Disabling Telemetry resulted in the exact same test scenarios succeeding :

- http://c09-h23-r630.rdu.openstack.engineering.redhat.com:9001/20180502-182813/rally/

To Compare (with Telemetry enabled) :

- http://c09-h23-r630.rdu.openstack.engineering.redhat.com:9001/telem/20180501-170346/rally/

The only change I made to my deploy command was :
  -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml

Comment 35 Andrew Beekhof 2018-05-08 12:24:14 UTC
Adding Mike to see if there is anything we can contribute regarding the deadlocks.

Discussing on the PIDONE call tonight, we also felt that it would be good to verify if the test _still_ passes on OSP12 (ie. with RHEL 7.5 instead of 7.4).  That would provide an important datapoint.

Comment 42 Sai Sindhur Malleni 2018-05-17 16:28:39 UTC
Sorry I have been caught up with OSP13 Director Scale work. Did not retest with OSP12 and RHEL7.5


Note You need to log in before you can comment on or make changes to this bug.