Description of problem:
In an OpenStack setup with 3 controllers and 6 compute nodes, where each compute node has 128G of memory and 56 cores, we are trying to scale test by spawning instances with an m1.xtiny flavor (1 CPU, 64 MB RAM and 1G disk). We are using a Browbeat Rally scenario to spawn 300 VMs at a concurrency of 10, with each VM on its own network (https://github.com/openstack/browbeat/blob/master/rally/rally-plugins/netcreate-boot/netcreate_nova_boot.py). We see that once we hit around 75 VMs, nova boots start failing for several reasons, as below (hypervisor capacity is not a problem):

'message': u'Build of instance 7db8f60d-1570-4be4-a63b-e52b285e9085 aborted: Request to http://172.16.0.18:9696/v2.0/ports timed out', u'code': 500, u'details':
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance
    filter_properties, request_spec)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance
    bdms=block_device_mapping)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2025, in _build_and_run_instance
    instance=instance)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2265, in _build_resources
    reason=six.text_type(exc))
u'created': u'2018-04-25T01:13:53Z'}

===============================================================================

'message': u'Build of instance 8dd19606-fc38-417c-b88c-262f0a1a72fb aborted: Failed to allocate the network(s), not rescheduling.', u'code': 500, u'details':
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance
    filter_properties, request_spec)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance
    bdms=block_device_mapping)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2005, in _build_and_run_instance
    block_device_mapping) as resources:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2175, in _build_resources
    reason=msg)
u'created': u'2018-04-25T01:36:58Z'}

===============================================================================

Timed out waiting for a reply to message ID f56a8117ac97469d996695343e19dada', u'code': 500, u'details':
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance
    filter_properties, request_spec)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2132, in _build_and_run_instance
    instance.save(expected_task_state=task_states.SPAWNING)
  File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
    ctxt, self, fn.__name__, args, kwargs)
  File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 245, in object_action
    objmethod=objmethod, args=args, kwargs=kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 174, in call
    retry=self.retry)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 131, in _send
    timeout=timeout, retry=retry)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in send
    retry=retry)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 548, in _send
    result = self._waiter.wait(msg_id, timeout)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 440, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 328, in get
    'to message ID %s' % msg_id)
u'created': u'2018-04-25T01:21:44

===============================================================================

Build of instance 527b2a09-d228-4b25-82ef-6fd415c989f3 aborted: Failure prepping block device.', u'code': 500, u'details':
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 1840, in _do_build_and_run_instance
    filter_properties, request_spec)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2053, in _build_and_run_instance
    bdms=block_device_mapping)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2005, in _build_and_run_instance
    block_device_mapping) as resources:
  File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2222, in _build_resources
    reason=msg)
u'created': u'2018-04-25T01:26:46Z'}

Version-Release number of selected component (if applicable):
OSP 13
Container image tag 2018-04-10.2

How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP with 3 controllers and 6 compute nodes
2. Run Browbeat+Rally scale tests to create networks and VMs
3.
Actual results:
Cannot scale past 75 VMs

Expected results:
In previous releases we could scale to at least a few hundred VMs on the same setup

Additional info:
This is a performance and scale regression. I have also noticed this happen irrespective of the neutron backend (ml2/ovs, ml2/ovn and ml2/odl).
Created attachment 1426323 [details] overcloud-controller-0
Created attachment 1426324 [details] overcloud-controller-1
Created attachment 1426325 [details] overcloud-controller-2
Created attachment 1426326 [details] overcloud-novacompute-0
Created attachment 1426327 [details] overcloud-novacompute-1
Created attachment 1426328 [details] overcloud-novacompute-2
Created attachment 1426329 [details] overcloud-novacompute-3
Created attachment 1426330 [details] overcloud-novacompute-4
Created attachment 1426331 [details] overcloud-novacompute-5
I am seeing very similar results. A couple of things to point out:
- Timeouts waiting for replies from neutron/ports
- Timeouts on message replies
- DB deadlocks

Given this seems related to overall slowdowns, I am wondering if the oslo.rootwrap issue that hit us with Neutron containerization is impacting us here (across multiple services). Containers are deploying with: https://gist.github.com/jtaleric/ddd7b996a253f95163e5867fd15774d3 With that, oslo.rootwrap will try to close all FDs -- slowing down services that use oslo.rootwrap. I am trying to patch my OSP 13 env to test this theory out...
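To illustrate why a huge FD ulimit slows down anything that sweeps and closes descriptors before exec'ing a child (the pattern at issue in the rootwrap theory above), here is a minimal self-contained sketch. This is not oslo.rootwrap's actual code; it just shows that the cost of the close loop grows linearly with the nofile limit, because every attempt is a syscall even when the descriptor was never open:

```python
import os
import time

def sweep_close(start, stop):
    """Attempt to close every fd in [start, stop).

    This mirrors the close-all-fds loop run before exec'ing a child
    process. Closing an fd that was never opened just raises EBADF,
    but each attempt still costs a syscall, so the loop's cost scales
    with the configured nofile ulimit, not with the fds actually open.
    """
    attempts = 0
    for fd in range(start, stop):
        try:
            os.close(fd)
        except OSError:
            pass  # EBADF: fd was not open; the syscall was still paid for
        attempts += 1
    return attempts

# With nofile=1024 the sweep is ~1k syscalls; with nofile=1048576 it is ~1M.
small = sweep_close(1024, 2048)          # the cheap, traditional case
t0 = time.time()
big = sweep_close(1024, 1048576)         # the pathological large-ulimit case
elapsed = time.time() - t0
```

Run per subprocess spawn, that million-syscall sweep is what turns a normally cheap fork/exec into a visible per-command delay.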
I have observed false timeouts for subnet creation: the subnet actually gets created, but the reply says "ERROR creating subnet. TIMED OUT."
Please find sosreports here: http://elk.browbeatproject.org:9090/~smalleni/sosreports/

The controller sosreports are:
sosreport-SaiMalleni-20180425124105.tar.xz
sosreport-SaiMalleni-20180425124134.tar.xz
sosreport-SaiMalleni-20180425124202.tar.xz
The attachments to the BZ only contain the OpenStack service logs (nova, neutron, etc., running in debug). The sosreports linked in the previous comment are from the same environment.
I was able to reproduce this again on a freshly deployed cloud, though with a slightly different scenario this time. Earlier, I was hitting a 75 VM limit with the following scenario:

Create a network
Create a subnet
Launch a VM on that subnet

So for every VM we were creating a new subnet. This time I launched all VMs on the same subnet. I got up to 112 VMs, after which I got stuck. The most common errors can be summed up as:

1. MessagingTimeout: Timed out waiting for a reply to message ID -- in neutron dhcp-agent, l3-agent and openvswitch agent logs, along with nova-conductor logs.
2. DBDeadlock errors in neutron-server, nova-conductor and ceilometer agent-notification logs.
3. ConnectTimeout in nova-api logs to neutron:
ConnectTimeout: Request to http://172.16.0.13:9696/v2.0/ports?device_id=b39cd902-8f24-40ca-b2d8-4fbf43e8dffa timed out

In neutron-server logs:
DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%(heartbeat_timestamp)s WHERE agents.id = %(agents_id)s'] [parameters: {'heartbeat_timestamp': datetime.datetime(2018, 4, 26, 3, 0, 15, 923947), 'agents_id': u'229e8791-963e-450f-887b-17c72bd4263e'}] (Background on this error at: http://sqlalche.me/e/2j85)
2018-04-26 03:01:06.934 47 ERROR oslo_db.api

==============================================================================

In nova-conductor logs:
DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE services SET updated_at=%(updated_at)s, report_count=%(report_count)s WHERE services.id = %(services_id)s'] [parameters: {'services_id': 71, 'updated_at': datetime.datetime(2018, 4, 26, 3, 18, 5, 974982), 'report_count': 1704}] (Background on this error at: http://sqlalche.me/e/2j85)

============================================================================

In ceilometer agent-notification logs:
DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'INSERT INTO event (message_id, `generated`, raw, event_type_id) VALUES (%(message_id)s, %(generated)s, %(raw)s, %(event_type_id)s)'] [parameters: {'raw': '{}', 'generated': Decimal('1524712674.345072'), 'event_type_id': 416, 'message_id': u'95a7ae0f-9572-492a-b0b2-21e30d520b7c'}] (Background on this error at: http://sqlalche.me/e/2j85)
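The "Lock wait timeout exceeded; try restarting transaction" errors above belong to the class of failure that is normally handled by simply re-running the transaction with backoff. As a rough illustration of that retry-on-deadlock pattern (this is a self-contained sketch, not oslo.db's actual implementation; DBDeadlock here is a stand-in exception class):

```python
import functools
import random
import time

class DBDeadlock(Exception):
    """Stand-in for the deadlock/lock-wait-timeout error a DB driver raises."""

def retry_on_deadlock(max_retries=5, base_delay=0.01):
    """Decorator that re-runs a DB operation when it hits a deadlock.

    Sketch of the pattern: catch only the deadlock error, back off with
    jitter so competing writers don't collide again, and re-raise once
    the retry budget is exhausted.
    """
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries:
                        raise  # budget exhausted; surface the error
                    # jittered exponential backoff before retrying
                    time.sleep(base_delay * (2 ** attempt) * random.random())
        return wrapper
    return deco

# Demo: fail twice with a deadlock, then succeed on the third attempt.
calls = {"n": 0}

@retry_on_deadlock(max_retries=5, base_delay=0.001)
def update_heartbeat():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DBDeadlock("Lock wait timeout exceeded; try restarting transaction")
    return "updated"

result = update_heartbeat()
```

Note that retrying only helps when contention is transient; if the queue of competing heartbeat/INSERT writers keeps growing (as in the overload described here), retries just add more load.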
I came to a hard stop at 106 VMs even with the neutron ulimit patch. https://review.openstack.org/#/c/559631/1
(In reply to Sai Sindhur Malleni from comment #15)
> In nova-conductor logs
> DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded;
> try restarting transaction') [SQL: u'UPDATE services SET
> updated_at=%(updated_at)s, report_count=%(report_count)s WHERE services.id =
> %(services_id)s'] [parameters: {'services_id': 71, 'updated_at':
> datetime.datetime(2018, 4, 26, 3, 18, 5, 974982), 'report_count': 1704}]
> (Background on this error at: http://sqlalche.me/e/2j85)

For the nova-related part, this query is the service report_count update, which is the "heartbeat" the various nova services use. The query [0] already has a @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True) decorator on it, and I don't find any other queries that could be competing with it. I suspect this failure is a side effect of the overall RPC timeouts being encountered in the environment, for example, heartbeats from various services getting backlogged in the RPC queue and several being tried at once as a result.

To alleviate load on the message queue from nova service heartbeats, I would recommend increasing [DEFAULT]report_interval and [DEFAULT]service_down_time in the nova.conf files. The defaults are report_interval=10 (seconds) and service_down_time=60 (seconds); note that service_down_time must be greater than report_interval.

Overall, it looks like neutron is overloaded and failing to reply in time ("ConnectTimeout: Request to http://172.16.0.13:9696/v2.0/ports?device_id=b39cd902-8f24-40ca-b2d8-4fbf43e8dffa timed out"), causing messaging timeouts, which have a domino effect on other services.

[0] https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/db/sqlalchemy/api.py;h=73ca7ea36854ea13b245de5b5d0063c460ae4cd9;hb=307382f58d38778b480d2d030e427759a44c204b#l566
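To make that recommendation concrete, a nova.conf fragment along these lines would reduce heartbeat traffic (the values shown are illustrative, not tuned for this environment; the defaults are 10 and 60 seconds, and service_down_time must stay comfortably larger than report_interval so a single missed beat doesn't mark the service down):

```ini
# nova.conf (all nodes) -- illustrative values, not a tested recommendation
[DEFAULT]
# How often each nova service writes its heartbeat (default: 10)
report_interval = 30
# How long before a service with no heartbeat is considered down (default: 60)
service_down_time = 180
```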
(In reply to Sai Sindhur Malleni from comment #16)
> I came to a hard stop at 106 VMs even with the neutron ulimit patch.
> https://review.openstack.org/#/c/559631/1

So that's better: 106 as compared to 75.

@sai, are you using the latest OSP 13.0 puddle for this scale testing? If not, can you also try https://review.openstack.org/#/c/560991/, which fixes the ulimit for nova-compute and cinder-volume?
(In reply to Yatin Karel from comment #19)
> (In reply to Sai Sindhur Malleni from comment #16)
> > I came to a hard stop at 106 VMs even with the neutron ulimit patch.
> > https://review.openstack.org/#/c/559631/1
> So that's better 106 as compared to 75.
>
> @sai, are you not using OSP 13.0 latest puddle for this scale testing?
> If not can you also try https://review.openstack.org/#/c/560991/ which fixes
> ulimit for nova-compute and cinder-volume?

I checked the attached logs from a compute node (compute-2) but can't find the impact of the large ulimit (following the concurrency.processutils operations in the logs: "Running cmd (subprocess): ......"). I think https://review.openstack.org/#/c/560991/ would improve the current situation a lot; I have seen a good difference with this patch for subprocess operations. Can I also see a sosreport from a compute node (compute-2 or any other) to see how bad these operations are when doing scale testing with the large ulimit?
I'm using puddle 2018-04-24.1. I tried patching just the nova services and neutron with the ulimit fix (but missed cinder-volume). However, the results were not any better: I stopped at around 93 VMs.
Dan, the disk on one of the controllers does seem to be at 100%: https://snapshot.raintank.io/dashboard/snapshot/AwGS5d0wVb7zic3H230xC6sAbhMnWlTf
I am seeing similar disk utilization. Swift is quite busy using the disk (More than likely due to Telem). https://snapshot.raintank.io/dashboard/snapshot/kUjMjnP2e1ZZ67qS7u5zq3JoU9N8uIRi I am going to disable telem and re-test.
Using iotop to look at processes waiting on I/O, it looks like swift-object-server is #1:

Total DISK READ : 24.81 K/s | Total DISK WRITE : 306.03 K/s
104003 be/4 42445  8.27 K/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
104002 be/4 42445  0.00 B/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
104000 be/4 42445  0.00 B/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
103988 be/4 42445  0.00 B/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
103982 be/4 42445  0.00 B/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
103966 be/4 42445  0.00 B/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
103944 be/4 42445  0.00 B/s  0.00 B/s  0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf
The above output is from when the disk is at 100% utilization.
Also, it is worth noting that disk utilization goes to 100% on only one of the controllers and stays there; the other two controllers have disk utilization of less than 5%. Even if swift is the suspect, is there anything that could be causing this on only one controller?
Reviewing my system during some of this testing, we can see Swift is quite busy, and so is the system disk it is using (a shared resource across all controller services).

Disabling Telemetry resulted in the exact same test scenarios succeeding:
- http://c09-h23-r630.rdu.openstack.engineering.redhat.com:9001/20180502-182813/rally/

To compare (with Telemetry enabled):
- http://c09-h23-r630.rdu.openstack.engineering.redhat.com:9001/telem/20180501-170346/rally/

The only change I made to my deploy command was:
-e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml
Adding Mike to see if there is anything we can contribute regarding the deadlocks. Discussing on the PIDONE call tonight, we also felt that it would be good to verify if the test _still_ passes on OSP12 (ie. with RHEL 7.5 instead of 7.4). That would provide an important datapoint.
Sorry I have been caught up with OSP13 Director Scale work. Did not retest with OSP12 and RHEL7.5