Bug 1571499
| Summary: | Cannot Scale past 75 VMs- scale regression | ||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sai Sindhur Malleni <smalleni> | ||||||||||||||||||||
| Component: | openstack-nova | Assignee: | OSP DFG:Compute <osp-dfg-compute> | ||||||||||||||||||||
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | OSP DFG:Compute <osp-dfg-compute> | ||||||||||||||||||||
| Severity: | unspecified | Docs Contact: | |||||||||||||||||||||
| Priority: | unspecified | ||||||||||||||||||||||
| Version: | 13.0 (Queens) | CC: | abeekhof, abond, berrange, bhaley, chjones, dasmith, eglynn, jchhatba, jcoufal, jdanjou, jhakimra, jtaleric, kchamart, lyarwood, markmc, mbayer, mbooth, michele, mwitt, psuriset, racedoro, sbauza, sferdjao, sgordon, smalleni, srevivo, vromanso, ykarel | ||||||||||||||||||||
| Target Milestone: | --- | Keywords: | ZStream | ||||||||||||||||||||
| Target Release: | --- | ||||||||||||||||||||||
| Hardware: | Unspecified | ||||||||||||||||||||||
| OS: | Unspecified | ||||||||||||||||||||||
| Whiteboard: | scale_lab | ||||||||||||||||||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||||||||||||||||||
| Doc Text: | Story Points: | --- | |||||||||||||||||||||
| Clone Of: | Environment: | ||||||||||||||||||||||
| Last Closed: | 2018-07-31 14:24:08 UTC | Type: | Bug | ||||||||||||||||||||
| Regression: | --- | Mount Type: | --- | ||||||||||||||||||||
| Documentation: | --- | CRM: | |||||||||||||||||||||
| Verified Versions: | Category: | --- | |||||||||||||||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||||||
| Embargoed: | |||||||||||||||||||||||
| Attachments: |
|
||||||||||||||||||||||
|
Description
Sai Sindhur Malleni
2018-04-25 02:20:53 UTC
Created attachment 1426323 [details]
overcloud-controller-0
Created attachment 1426324 [details]
overcloud-controller-1
Created attachment 1426325 [details]
overcloud-controller-2
Created attachment 1426326 [details]
overcloud-novacompute-0
Created attachment 1426327 [details]
overcloud-novacompute-1
Created attachment 1426328 [details]
overcloud-novacompute-2
Created attachment 1426329 [details]
overcloud-novacompute-3
Created attachment 1426330 [details]
overcloud-novacompute-4
Created attachment 1426331 [details]
overcloud-novacompute-5
I am seeing very similar results. Couple of things to point out: - Timeouts waiting replies from neutron/ports - Timeouts on message replies - DB Deadlocks Given this seems related to overall slowdowns, I wondering if the oslo.rootwrap issue that hit us with Neutron containerization is impacting us here (across multiple services)? Containers are deploying with : https://gist.github.com/jtaleric/ddd7b996a253f95163e5867fd15774d3 Which, oslo.rootwrap will try to close all FDs -- slowing down services that use oslo.rootwrap. I am trying to patch my osp13 env to test this theory out... I have observed false timeouts for subnet creations. Subnet actually gets created but the reply says ERROR creating subnet. TIMED OUT. Please find sosreports here: http://elk.browbeatproject.org:9090/~smalleni/sosreports/ The conrtoller sosreports are: sosreport-SaiMalleni-20180425124105.tar.xz sosreport-SaiMalleni-20180425124134.tar.xz sosreport-SaiMalleni-20180425124202.tar.xz The attachments to the BZ only contain the openstack service logs, like nova, neutron etc (running in debug). The sosreports linked to in th eprevious comments are from the same environment. I was able to reproduce this again on freshly deployed cloud. However I used a slightly different scenario this time. Earlier, I was hitting a 75 VM limit in the following scenario: Create a network Create subnet Launch a VM on that subnet So for every VM we wwere creating a new subnet. This time I launched all VMs on the same subnet. I got upto 112 VMs after which I got stuck. The most commons errors can be summed up as: 1. MessagingTimeout: Timed out waiting for a reply to message ID in neutron dhcp-agent, l3-agent and openvswitch agent logs along with nova-conductor logs. 2. We are also seeing DBDeadlock errors in neutron server, nova-conductor and ceilometer agent-notification logs. 3. ConnectTimeout in nova-api logs to neutron ConnectTimeout: Request to http://172.16.0.13:9696/v2.0/ports?device_id=b39cd902-8f24-40ca-b2d8-4fbf43e8dffa timed out In neutorn-server logs DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE agents SET heartbeat_timestamp=%(heartbeat_timestamp)s WHERE agents.id = %(agents_id)s'] [parameters: {'heartbeat_timestamp': datetime.datetime(2018, 4, 26, 3, 0, 15, 923947), 'agents_id': u'229e8791-963e-450f-887b-17c72bd4263e'}] (Background on this error at: http://sqlalche.me/e/2j85) 2018-04-26 03:01:06.934 47 ERROR oslo_db.api ============================================================================== In nova-conductor logs DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'UPDATE services SET updated_at=%(updated_at)s, report_count=%(report_count)s WHERE services.id = %(services_id)s'] [parameters: {'services_id': 71, 'updated_at': datetime.datetime(2018, 4, 26, 3, 18, 5, 974982), 'report_count': 1704}] (Background on this error at: http://sqlalche.me/e/2j85) ============================================================================ In ceilometer agent-notification logs DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction') [SQL: u'INSERT INTO event (message_id, `generated`, raw, event_type_id) VALUES (%(message_id)s, %(generated)s, %(raw)s, %(event_type_id)s)'] [parameters: {'raw': '{}', 'generated': Decimal('1524712674.345072'), 'event_type_id': 416, 'message_id': u'95a7ae0f-9572-492a-b0b2-21e30d520b7c'}] (Background on this error at: http://sqlalche.me/e/2j85) I came to a hard stop at 106 VMs even with the neutron ulimit patch. https://review.openstack.org/#/c/559631/1 (In reply to Sai Sindhur Malleni from comment #15) > In nova-conductor logs > DBDeadlock: (pymysql.err.InternalError) (1205, u'Lock wait timeout exceeded; > try restarting transaction') [SQL: u'UPDATE services SET > updated_at=%(updated_at)s, report_count=%(report_count)s WHERE services.id = > %(services_id)s'] [parameters: {'services_id': 71, 'updated_at': > datetime.datetime(2018, 4, 26, 3, 18, 5, 974982), 'report_count': 1704}] > (Background on this error at: http://sqlalche.me/e/2j85) For the nova related part, this query is the service report_count update, which is the "heartbeat" the various nova services use. The query [0] already has a @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True) decorator on it and I don't find any other queries that could be competing with it. I suspect that this failure is a side effect of the overall RPC timeouts that are being encountered in the environment, for example, having heartbeats from various services backlogged in the RPC queue and several being tried at once as a result. In order to alleviate load on the message queue from nova service heartbeats, I would recommend increasing the [DEFAULT]report_interval and [DEFAULT]service_down_time in the nova.conf files. The defaults are report_interval=10 (seconds) and service_down_time=60 (seconds) and note that service_down_time must be greater than report_interval. Overall, it looks like neutron is overloaded and failing to reply in time "ConnectTimeout: Request to http://172.16.0.13:9696/v2.0/ports?device_id=b39cd902-8f24-40ca-b2d8-4fbf43e8dffa timed out", causing messaging timeouts, which have a domino effect to other services. [0] https://code.engineering.redhat.com/gerrit/gitweb?p=nova.git;a=blob;f=nova/db/sqlalchemy/api.py;h=73ca7ea36854ea13b245de5b5d0063c460ae4cd9;hb=307382f58d38778b480d2d030e427759a44c204b#l566 (In reply to Sai Sindhur Malleni from comment #16) > I came to a hard stop at 106 VMs even with the neutron ulimit patch. > https://review.openstack.org/#/c/559631/1 So that's better 106 as compared to 75. @sai, are you not using OSP 13.0 latest puddle for this scale testing? If not can you also try https://review.openstack.org/#/c/560991/ which fixes ulimit for nova-compute and cinder-volume? (In reply to Yatin Karel from comment #19) > (In reply to Sai Sindhur Malleni from comment #16) > > I came to a hard stop at 106 VMs even with the neutron ulimit patch. > > https://review.openstack.org/#/c/559631/1 > So that's better 106 as compared to 75. > > @sai, are you not using OSP 13.0 latest puddle for this scale testing? > If not can you also try https://review.openstack.org/#/c/560991/ which fixes > ulimit for nova-compute and cinder-volume? I checked attached logs from a compute node(compute-2) but can't find impact of large ulimit(following concurrency.processutils operations in logs:- Running cmd (subprocess): ......). I think https://review.openstack.org/#/c/560991/ would improve the current situation a lot. I have seen a good difference with this patch for subprocess operations. Can i also see sosreport for any compute node(compute-2 or any) to see how bad operations are when doing scale testing with large ulimit. I'm using puddle 2018-04-24.1 I tried patching just nova services annd neutron with the ulimit (but missed out on cinder-volme). However, the results were not any better. I stopped around 93 VMs. Dan, the disk on one of the controllers does seem to be at a 100% https://snapshot.raintank.io/dashboard/snapshot/AwGS5d0wVb7zic3H230xC6sAbhMnWlTf I am seeing similar disk utilization. Swift is quite busy using the disk (More than likely due to Telem). https://snapshot.raintank.io/dashboard/snapshot/kUjMjnP2e1ZZ67qS7u5zq3JoU9N8uIRi I am going to disable telem and re-test. Using iotop to look at processes waiting on I/O looks like swift-object-server is #1 Total DISK READ : 24.81 K/s | Total DISK WRITE : 306.03 K/s 104003 be/4 42445 8.27 K/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf 104002 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf 104000 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf 103988 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf 103982 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf 103966 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf 103944 be/4 42445 0.00 B/s 0.00 B/s 0.00 % 99.99 % python2 /usr/bin/swift-object-server /etc/swift/object-server.conf The above output is when the disk is at 100 util. Also, it is worth noting that disk util goes to 100% only on one of the controllers and remains at 100%. The other two controllers have disk utilizatio n less than 5%. Even if swift is the suspect, is there anything that could be causing this only on one controller? Reviewing my system during some of this testing we can see Swift is quite busy, and so is the system disk it is using (shared resource across all controller services). Disabling Telemetry resulted in the exact same test scenarios succeeding : - http://c09-h23-r630.rdu.openstack.engineering.redhat.com:9001/20180502-182813/rally/ To Compare (with Telemetry enabled) : - http://c09-h23-r630.rdu.openstack.engineering.redhat.com:9001/telem/20180501-170346/rally/ The only change I made to my deploy command was : -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml Adding Mike to see if there is anything we can contribute regarding the deadlocks. Discussing on the PIDONE call tonight, we also felt that it would be good to verify if the test _still_ passes on OSP12 (ie. with RHEL 7.5 instead of 7.4). That would provide an important datapoint. Sorry I have been caught up with OSP13 Director Scale work. Did not retest with OSP12 and RHEL7.5 |