Description of problem:

The problem was detected while running REST API automated tests on a virt environment (1 undercloud, 3 controllers, 2 computes). It is hard to call these "performance" tests, because only 2 concurrent threads were loaded. It is a regression compared to RHOS 10, which I loaded with 5 concurrent threads and the test passed successfully. (More details here - http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-05-virt-rhos-10-restapi-perf-test-5x50/)

Errors started to be raised after 10-15 min of test running. As a result, services such as neutron-server, cinder-scheduler, nova-scheduler, nova-conductor and others lost their connection to the MySQL server and REST API calls failed. OpenStack is not functional. Only a restart of the controller nodes brings the application back to life. See the attached message files for details.

The errors look like:

Jul 6 22:05:48 controller-1 neutron-server: ERROR [sqlalchemy.pool.QueuePool] [12842] Exception during reset or similar
Jul 6 22:05:48 controller-1 neutron-server: Traceback (most recent call last):
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 636, in _finalize_fairy
Jul 6 22:05:48 controller-1 neutron-server:     fairy._reset(pool)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 776, in _reset
Jul 6 22:05:48 controller-1 neutron-server:     pool._dialect.do_rollback(self)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib64/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2526, in do_rollback
Jul 6 22:05:48 controller-1 neutron-server:     dbapi_connection.rollback()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 778, in rollback
Jul 6 22:05:48 controller-1 neutron-server:     self._read_ok_packet()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 752, in _read_ok_packet
Jul 6 22:05:48 controller-1 neutron-server:     pkt = self._read_packet()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 991, in _read_packet
Jul 6 22:05:48 controller-1 neutron-server:     packet.check_error()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
Jul 6 22:05:48 controller-1 neutron-server:     err.raise_mysql_exception(self._data)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
Jul 6 22:05:48 controller-1 neutron-server:     raise errorclass(errno, errval)
Jul 6 22:05:48 controller-1 neutron-server: InternalError: (1047, u'WSREP has not yet prepared node for application use')
Jul 6 22:05:48 controller-1 neutron-server: ERROR [oslo_db.sqlalchemy.exc_filters] [12842] DBAPIError exception wrapped from (pymysql.err.InternalError) (1047, u'WSREP has not yet prepared node for application use') [SQL: u'SELECT 1']
Jul 6 22:05:48 controller-1 neutron-server: Traceback (most recent call last):
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
Jul 6 22:05:48 controller-1 neutron-server:     context)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
Jul 6 22:05:48 controller-1 neutron-server:     cursor.execute(statement, parameters)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 166, in execute
Jul 6 22:05:48 controller-1 neutron-server:     result = self._query(query)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 322, in _query
Jul 6 22:05:48 controller-1 neutron-server:     conn.query(q)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 841, in query
Jul 6 22:05:48 controller-1 neutron-server:     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1029, in _read_query_result
Jul 6 22:05:48 controller-1 neutron-server:     result.read()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1312, in read
Jul 6 22:05:48 controller-1 neutron-server:     first_packet = self.connection._read_packet()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 991, in _read_packet
Jul 6 22:05:48 controller-1 neutron-server:     packet.check_error()
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 393, in check_error
Jul 6 22:05:48 controller-1 neutron-server:     err.raise_mysql_exception(self._data)
Jul 6 22:05:48 controller-1 neutron-server:   File "/usr/lib/python2.7/site-packages/pymysql/err.py", line 107, in raise_mysql_exception
Jul 6 22:05:48 controller-1 neutron-server:     raise errorclass(errno, errval)
Jul 6 22:05:48 controller-1 neutron-server: InternalError: (1047, u'WSREP has not yet prepared node for application use')

Version-Release number of selected component (if applicable):
rhos-release 11 -p 2017-06-30.3
[root@controller-1 tmp]# python -c "import sqlalchemy; print sqlalchemy.__version__"
1.0.11

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Test result: http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-06-virt-rhos-11-restapi-perf-test-2x50/
Pool statistics: http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-06-virt-rhos-11-restapi-perf-test-2x50/stat/stat.html

Expected results:

Additional info:
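For background on how this error surfaces: oslo.db wraps DBAPI exceptions and decides which ones mean the pooled connection is no longer usable. A minimal, hypothetical sketch of that classification idea follows; the function name and the exact code set are illustrative, not the actual oslo.db implementation:

```python
# Hypothetical sketch: classify MySQL error numbers the way a
# disconnect filter might, so the pool can invalidate connections to a
# Galera node that is no longer serving queries.
DISCONNECT_CODES = {
    1047,  # WSREP has not yet prepared node for application use
    2006,  # MySQL server has gone away
    2013,  # Lost connection to MySQL server during query
}

def is_disconnect(errno):
    """Return True if the error should invalidate the pooled connection."""
    return errno in DISCONNECT_CODES

print(is_disconnect(1047))  # node not ready: drop and reconnect
print(is_disconnect(1062))  # duplicate key: an ordinary query error
```

The point is that error 1047 is a property of the Galera node's state, not of the query, so retrying on the same connection cannot succeed until the node rejoins the primary component.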
Created attachment 1295358 [details] message log file from controller 0
Created attachment 1295360 [details] message log file from controller 1
Created attachment 1295361 [details] message log file from controller 2
Created attachment 1295362 [details] slow query controller 0
Created attachment 1295363 [details] slow query controller 1
Created attachment 1295364 [details] slow query controller 2
Have you tried the innodb_flush_log_at_trx_commit flag as mentioned? Additionally, if these are VM-based overcloud nodes, that is not appropriate for a stress test. sosreports would be very helpful.
(In reply to Michael Bayer from comment #7)
> Have you tried the innodb_flush_log_at_trx_commit flag as mentioned and

Not yet. I'd like to provide all the required information first. After that, I planned to rerun the test as you suggested.

> additionally if these are VM based overcloud nodes, that is not appropriate
> for a stress test.

It's not a stress test - 2 concurrent threads. Nothing! Why does RHOS 10 work?

> sosreports will be very helpful.

Preparing.

Regards,
Yuri
There is a setting called innodb_file_per_table that increases file handles and write activity by a large margin. This flag is now turned on in OSP11 (would want to confirm you see this move from off to on from 10 to 11). In particular it can be taxing on a qemu VM that does not specify disk cache mode = 'unsafe'.
Created attachment 1295381 [details]
sosreport controller 0 part 1

I split the original sosreport archive using the command:
# split -b 20000000 sosreport-controller-0.localdomain-20170707182823.tar.xz sosreport-controller-0

To join the parts, use:
# cat sosreport-controller-0* > new.tar.xz
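For what it's worth, the split/reassemble round-trip can be sanity-checked on a throwaway file before uploading; the filenames and sizes below are illustrative, not the actual sosreport archive:

```shell
# Sanity-check that split + cat reproduces the original file byte-for-byte.
set -e
dd if=/dev/urandom of=orig.bin bs=1k count=64 2>/dev/null   # throwaway 64K file
split -b 20000 orig.bin orig.bin.part-                      # same scheme, smaller chunks
cat orig.bin.part-* > rejoined.bin                          # glob order = suffix order
cmp orig.bin rejoined.bin && echo "archives match"
```

This works because split's default suffixes (aa, ab, ac, ...) sort lexically, so the shell glob concatenates the parts in the right order.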
Created attachment 1295384 [details] sosreport controller 0 part 2
Created attachment 1295386 [details] sosreport controller 0 part 3
Created attachment 1295388 [details] sosreport controller 1 part 1
Created attachment 1295389 [details] sosreport controller 1 part 2
Created attachment 1295390 [details] sosreport controller 1 part 3
Created attachment 1295391 [details] sosreport controller 2 part 1
Created attachment 1295392 [details] sosreport controller 2 part 2
Created attachment 1295393 [details] sosreport controller 2 part 3
(In reply to Michael Bayer from comment #9)
> There is a setting called innodb_file_per_table that increases file handles
> and write activity by a large margin. This flag is now turned on in OSP11
> (would want to confirm you see this move from off to on from 10 to 11). In
> particular it can be taxing on a qemu VM that does not specify disk cache
> mode = 'unsafe'.

I found that setting in /etc/my.cnf.d/galera.cnf:
innodb_file_per_table = ON

How do you suggest to change it?

Thanks,
Yuri
This can be done in /etc/my.cnf.d/galera.cnf:

1. turn off performance schema on all three nodes:
https://mariadb.com/resources/blog/starting-mysql-low-memory-virtual-machines

2. set innodb_flush_log_at_trx_commit=2

We can see all three controllers are very low on memory; they only have 8G to start with and are all down to around 100M:

cat sosreport-controller-?/free
              total        used        free      shared  buff/cache   available
Mem:        8010516     7653780      145040       40032      211696       38732
Swap:             0           0           0
              total        used        free      shared  buff/cache   available
Mem:        8010516     7566804      196388       55436      247324      101084
Swap:             0           0           0
              total        used        free      shared  buff/cache   available
Mem:        8010516     7570244      105232       39968      335040       61480
Swap:             0           0           0

controller-0 is also churning on load:

[classic@photon2 sos]$ cat sosreport-controller-0/uptime
18:28:39 up 5:44, 1 user, load average: 10.66, 5.29, 3.48
[classic@photon2 sos]$ cat sosreport-controller-1/uptime
18:27:18 up 5:43, 1 user, load average: 2.12, 2.11, 2.12
[classic@photon2 sos]$ cat sosreport-controller-2/uptime
18:28:54 up 5:43, 1 user, load average: 2.18, 2.62, 2.48

Looking at lsof stats, we can see that ceilometer and httpd are the biggest owners of file handles, and gnocchi, which I'm not sure is part of OSP10, is using a lot too.
The mysqld daemon, which will use more handles as a result of the innodb_file_per_table setting, is still not the biggest user of handles:

[classic@photon2 sos]$ awk '{ print $1 }' sosreport-controller-0/lsof | sort | uniq -c | sort -gr | head -10
  37289 ceilomete
  26045 httpd
   6502 swift-obj
   5793 gnocchi-m
   5024 mysqld
   3648 mongod
   2224 libvirtd
   2196 /usr/bin/
   1612 aodh-eval
   1322 nova-api
[classic@photon2 sos]$ awk '{ print $1 }' sosreport-controller-1/lsof | sort | uniq -c | sort -gr | head -10
awk: fatal: cannot open file `sosreport-controller-1/lsof' for reading (No such file or directory)
[classic@photon2 sos]$ awk '{ print $1 }' sosreport-controller-2/lsof | sort | uniq -c | sort -gr | head -10
  36861 ceilomete
  22646 httpd
   6501 swift-obj
   5793 gnocchi-m
   3224 mongod
   2224 libvirtd
   2196 /usr/bin/
   1626 aodh-eval
   1464 nova-api
   1260 heat-api-

I'm starting to look at the nature of the "out of memory" error. It seems like it did not actually kill the server, and it hit two times, each after controller 0 joined the cluster. I want to see what the nature of this join was; that will be in a subsequent comment.

Overall, I think that after stuffing this many extremely file- and memory-hungry applications onto a tiny 8G VM, it should not be surprising when the server runs out of memory. Ideally galera and the others would all be spread out on individual servers.
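For reference, a sketch of what the two changes suggested above might look like in /etc/my.cnf.d/galera.cnf; the option names are standard MariaDB settings, but verify them against your MariaDB version before applying:

```ini
[mysqld]
# Reduce memory footprint on small VMs (per the MariaDB low-memory guide)
performance_schema = OFF

# Flush the redo log to disk once per second instead of at every commit.
# Cuts fsync load significantly, at the cost of up to ~1s of transactions
# on a crash - a reasonable trade-off for a test environment.
innodb_flush_log_at_trx_commit = 2
```

After editing, the galera resource would need to be restarted (via Pacemaker on these controllers) for the settings to take effect.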
(In reply to Yuri Obshansky from comment #19)
> (In reply to Michael Bayer from comment #9)
> > There is a setting called innodb_file_per_table that increases file handles
> > and write activity by a large margin. This flag is now turned on in OSP11
> > (would want to confirm you see this move from off to on from 10 to 11). In
> > particular it can be taxing on a qemu VM that does not specify disk cache
> > mode = 'unsafe'.
>
> I found that setting in /etc/my.cnf.d/galera.cnf:
> innodb_file_per_table = ON
>
> How do you suggest to change it?

Don't change that. I'd look into getting ceilometer / gnocchi (IIUC gnocchi is strictly used by ceilometer) off these controllers, as they seem to be the biggest users of resources (though a "top" display, for some reason not part of the sosreport, would tell a lot more).
The galera node on controller 0 became partitioned due to a network partition at 170707 13:27:31, moved into NON-PRIMARY, was shut down by Pacemaker, restarted, and rejoined the cluster. The out-of-memory error occurred after it received an SST sync on rejoining the cluster, but it became synced nevertheless by 13:27:57. Controller 0 again seemed to fall behind and was stopped at 14:04:09 and again restarted; there was an out-of-memory error at 14:07:16, and it was synced in the cluster again at 14:07:16. I haven't dug too closely into these splits, but all three servers were at 95%+ memory utilization, which I think is the main issue.
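The state transitions described above map onto Galera's wsrep status variables, which is also what decides whether a client sees error 1047. A minimal sketch of that condition; the sample values are fabricated for illustration, as if read via `SHOW STATUS LIKE 'wsrep_%'`:

```python
# Sketch: decide whether a Galera node is usable, from wsrep status
# variables. A node serves application traffic only when it is part of
# the Primary component and Synced; otherwise clients get error 1047
# ("WSREP has not yet prepared node for application use").
def node_ready(status):
    return (status.get("wsrep_ready") == "ON"
            and status.get("wsrep_cluster_status") == "Primary"
            and status.get("wsrep_local_state_comment") == "Synced")

# Illustrative sample: a partitioned node, as controller 0 was at 13:27:31
partitioned = {
    "wsrep_ready": "OFF",
    "wsrep_cluster_status": "non-Primary",
    "wsrep_local_state_comment": "Initialized",
}
print(node_ready(partitioned))  # the node cannot serve queries yet
```

In practice the same check is done on the node itself with `mysql -e "SHOW STATUS LIKE 'wsrep_%'"` while the cluster is recovering.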
Hi Michael,

You did a very detailed analysis of the sosreport. That's awesome, but there is still no answer why this issue didn't happen on RHOS 10. I deployed RHOS 10 and reran the test over the weekend using the same hardware. The test completed successfully with a load of 5 (five) concurrent threads.

Here is the test report:
http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-09-virt-rhos-10-restapi-perf-test-5x20/

The default RHOS 10 has the ceilometer and gnocchi services running on the controllers, and they didn't impact the test result.

[root@controller-0 tmp]# ps -ef | grep ceilometer
ceilome+    1892       1  0 Jul09 ?  00:04:01 /usr/bin/python2 /usr/bin/ceilometer-polling --polling-namespaces central --logfile /var/log/ceilometer/central.log
ceilome+    1924       1  0 Jul09 ?  00:03:49 /usr/bin/python2 /usr/bin/ceilometer-collector --logfile /var/log/ceilometer/collector.log
ceilome+    1941       1  0 Jul09 ?  00:03:52 /usr/bin/python2 /usr/bin/ceilometer-agent-notification --logfile /var/log/ceilometer/agent-notification.log
ceilome+    2533    1892  0 Jul09 ?  00:05:40 ceilometer-polling - AgentManager(0)
ceilome+    2550    1924  7 Jul09 ?  01:34:42 ceilometer-collector - CollectorService(0)
ceilome+  791538    1941  3 02:18 ?  00:23:06 ceilometer-agent-notification - NotificationService(0)

[root@controller-0 tmp]# ps -ef | grep gnocchi
gnocchi     1921       1  0 Jul09 ?  00:04:21 /usr/bin/python2 /usr/bin/gnocchi-metricd --logfile /var/log/gnocchi/metricd.log
gnocchi     1944       1  0 Jul09 ?  00:02:51 /usr/bin/python2 /usr/bin/gnocchi-statsd --logfile /var/log/gnocchi/statsd.log
gnocchi     2583    1921  0 Jul09 ?  00:01:19 /usr/bin/python2 /usr/bin/gnocchi-metricd --logfile /var/log/gnocchi/metricd.log
gnocchi     2596    1921  0 Jul09 ?  00:03:17 gnocchi-metricd - scheduler(0)
gnocchi     2597    1921  0 Jul09 ?  00:05:58 gnocchi-metricd - processing(0)
gnocchi     2598    1921  0 Jul09 ?  00:06:10 gnocchi-metricd - processing(1)
gnocchi     2599    1921  0 Jul09 ?  00:01:24 gnocchi-metricd - reporting(0)
gnocchi     2600    1921  0 Jul09 ?  00:01:23 gnocchi-metricd - janitor(0)
gnocchi   107735    4056  0 03:35 ?  00:00:50 /usr/sbin/httpd -DFOREGROUND
gnocchi   107736    4056  0 03:35 ?  00:00:49 /usr/sbin/httpd -DFOREGROUND

I created sosreports for RHOS 10. It would be useful if you could analyze them as you already did for RHOS 11.
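As an aside, the per-process file-handle tally done earlier with awk over the sosreport lsof output can be reproduced in a few lines of Python; the sample lsof lines below are fabricated for illustration:

```python
# Sketch: tally open file handles per command from lsof-style output,
# equivalent to: awk '{ print $1 }' lsof | sort | uniq -c | sort -gr
from collections import Counter

sample_lsof = """\
ceilomete 1892 ceilometer  cwd DIR 253,0 4096 /
ceilomete 1892 ceilometer  4u  REG 253,0 1024 /var/log/ceilometer/central.log
httpd     4056 root        cwd DIR 253,0 4096 /
httpd     4056 root        2u  REG 253,0 2048 /var/log/httpd/error_log
httpd     4057 apache      4u  REG 253,0 2048 /var/log/httpd/access_log
mysqld    2001 mysql       5u  REG 253,0 8192 /var/lib/mysql/ibdata1
"""

counts = Counter(line.split()[0] for line in sample_lsof.splitlines())
for cmd, n in counts.most_common():
    print(n, cmd)
```

Running the same loop over the real `sosreport-controller-*/lsof` files would reproduce the rankings shown earlier in the thread.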
Created attachment 1295835 [details] rhos10 sosreport controller 0 part 1
Created attachment 1295836 [details] rhos10 sosreport controller 0 part 2
Created attachment 1295838 [details] rhos10 sosreport controller 0 part 3
Created attachment 1295839 [details] rhos10 sosreport controller 0 part 4
Created attachment 1295841 [details] rhos10 sosreport controller 0 part 5
Created attachment 1295842 [details] rhos10 sosreport controller 0 part 6
I have an idea. Let's rerun the tests without ceilometer / gnocchi and see what the result is.
Yeah, you just mid-aired me with the same comment. So let's see what happens with the OSP 11 test.
Hi Mike,

You were right. Gnocchi and Ceilometer impacted the performance and functionality of OpenStack 11. There is still an open question: why is their impact so dramatic in RHOS 11 compared to RHOS 10?

I deployed RHOS 10 and 11 with telemetry disabled (no Gnocchi and Ceilometer) and reran the tests. Both tests passed successfully. RHOS 11 even supports a load of 5 concurrent threads, like RHOS 10.

Comparison test results here:
- RHOS 11 vs RHOS 10
http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-10-compare_restapi-perf-test-rhos-11-vs-10/
- RHOS 11 vs RHOS 9
http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-11-compare_restapi-perf-test-rhos-11-vs-9/
- RHOS 10 vs RHOS 9
http://yobshans.rdu.openstack.engineering.redhat.com/rhos-jmeter/result/2017-07-11-compare_restapi-perf-test-rhos-10-vs-9/

A performance degradation still exists in RHOS 11, but without MySQL crashes and other functionality issues. This bug could probably be closed, because I think bugs related to the Telemetry problems already exist.

Regards,
Yuri
Well, every new version has more bells, whistles, flags, and things, so unless you attend to performance equally on every release, it will otherwise decrease. But overall, the "run 18 services on one machine" model needs to change at this point.
While there's an issue with resource usage for this particular test, SQLAlchemy's connection pool is not part of it; resource starvation on the host was making it impossible to reliably maintain database connections from a network / MySQL server perspective.