Description of problem:

While running the create-network-list scenario with Rally (times: 1000, concurrency: 32; originally seen with concurrency 64), I am observing errors in /var/log/messages and /var/log/neutron/server.log.

/var/log/neutron/server.log errors (100% reproducible, not intermittent):
https://gist.github.com/jtaleric/1494e85bd25dd68e7673b67af5b9ae6a#file-gistfile2-txt

/var/log/messages errors (these do not always show up - intermittent):
https://gist.github.com/jtaleric/1494e85bd25dd68e7673b67af5b9ae6a#file-gistfile3-txt

The network is not being saturated. Disk I/O is a concern; Swift + Gnocchi is deployed (our current default with TripleO/Director):
https://snapshot.raintank.io/dashboard/snapshot/aNLcSHrzS7cxY90570Lkum21EiEuHjmw?orgId=2

Version-Release number of selected component (if applicable):

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep sql
python-sqlparse-0.1.18-5.el7ost.noarch
puppet-mysql-3.11.1-0.20170627232431.96f86dc.el7ost.noarch
python-sqlalchemy-1.0.11-2.el7ost.x86_64
openstack-gnocchi-indexer-sqlalchemy-4.0.0-0.20170612072714.a852437.el7ost.noarch
postgresql-libs-9.2.21-1.el7.x86_64
collectd-mysql-5.7.1-2.el7.x86_64
python-sqlalchemy-utils-0.31.3-2.el7ost.noarch
sqlite-3.7.17-8.el7.x86_64
python-sqlite3dbm-0.1.4-7.el7ost.noarch

[root@overcloud-controller-0 heat-admin]# rpm -qa mariadb
mariadb-5.5.56-2.el7.x86_64

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep neutron
python-neutron-lib-1.7.0-0.20170529134801.0ee4f4a.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutronclient-6.3.0-0.20170601203754.ba535c6.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-l2gw-agent-10.1.0-0.20170619104111.4bb9806.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
puppet-neutron-11.2.0-0.20170626053011.862f130.el7ost.noarch
openstack-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run the Rally net-create-list scenario with times: 1000, concurrency: 64 (observed with concurrency 32 as well); a task-file sketch follows under Additional info.

Actual results:

+---------------------------------------------------------------------------------------------------------------------------+
|                                                    Response Times (sec)                                                    |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                 | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_network | 0.938     | 1.365        | 35.663       | 57.445       | 116.966   | 10.857    | 98.0%   | 1000  |
| neutron.list_networks  | 0.363     | 12.254       | 33.998       | 43.099       | 102.298   | 15.539    | 99.9%   | 981   |
| total                  | 2.026     | 16.883       | 61.21        | 88.523       | 154.774   | 26.397    | 98.0%   | 1000  |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

Expected results:
100% success

Additional info:
Running a curl on galera during the test, galera never reported any issues; it constantly reported "Galera cluster node is synced."
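A minimal Rally task sketch matching the steps above -- assuming the stock NeutronNetworks.create_and_list_networks scenario; the file name, user context, and unlimited-network quota are illustrative, not taken from the original run:

{
    "NeutronNetworks.create_and_list_networks": [
        {
            "runner": {
                "type": "constant",
                "times": 1000,
                "concurrency": 32
            },
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1},
                "quotas": {"neutron": {"network": -1}}
            }
        }
    ]
}

Run with, e.g., `rally task start net-create-list.json` (hypothetical file name).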
I had Joe make sure the DB connectivity / state never wavered.

Joe, can you post up all three /var/log/mysqld.log files?
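For context, the "curl on galera" check from the description is presumably the clustercheck endpoint TripleO wires into HAProxy; a sketch, assuming the default port (9200):

# On each controller -- clustercheck returns HTTP 200 with this body
# when the local node is a synced Galera member:
$ curl http://localhost:9200
Galera cluster node is synced.

# The same state can also be read directly from MySQL:
$ mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"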
In fact, just link to sosreports somewhere so I can see all the neutron / nova / etc. logs, installed versions, and so on. Did you do the galera innodb_flush_log_at_trx_commit change also?
(In reply to Michael Bayer from comment #1)
> I had Joe make sure the DB connectivity / state never wavered.
>
> Joe, can you post up all three /var/log/mysqld.log files?

Sure -- in the sos.

(In reply to Michael Bayer from comment #2)
> In fact, just link to sosreports somewhere so I can see all the neutron /
> nova / etc. logs, installed versions, and so on. Did you do the galera
> innodb_flush_log_at_trx_commit change also?

http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-0.perf.lab.eng.rdu.redhat.com-20170714113317.tar.xz
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-1.perf.lab.eng.rdu.redhat.com-20170714113739.tar.xz
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-2.perf.lab.eng.rdu.redhat.com-20170714113754.tar.xz

Yup, I added that line to my galera.conf and ran `pcs resource restart galera`.
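For anyone reproducing: the change in question relaxes InnoDB's flush-on-commit behavior. A sketch of the edit -- the value of 2 (flush the log once per second rather than on every commit, leaning on Galera replication for durability) is the commonly recommended value for this kind of benchmarking, but both it and the exact config file path are assumptions here:

# /etc/my.cnf.d/galera.cnf, under [mysqld] -- path and value are assumptions:
innodb_flush_log_at_trx_commit = 2

# Then restart the pacemaker-managed resource so the change takes effect:
pcs resource restart galera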
The upstream issue / Gerrit review for the "can't reconnect" error part of this (which I don't think is the cause) is at https://bugs.launchpad.net/oslo.db/+bug/1704474
Joe, did you try running with telemetry / ceilometer etc. disabled? That's been pretty much the "go-to" resolution the past couple of weeks. I'm still going to look over these sosreports.
(In reply to Michael Bayer from comment #5)
> Joe, did you try running with telemetry / ceilometer etc. disabled? That's
> been pretty much the "go-to" resolution the past couple of weeks. I'm
> still going to look over these sosreports.

I just built a fresh cloud w/o telemetry. I'll give it a go.
(In reply to Joe Talerico from comment #8)
> (In reply to Michael Bayer from comment #5)
> > Joe, did you try running with telemetry / ceilometer etc. disabled? That's
> > been pretty much the "go-to" resolution the past couple of weeks. I'm
> > still going to look over these sosreports.
>
> I just built a fresh cloud w/o telemetry. I'll give it a go.

How'd it go?
Closing this out, as I was not able to re-create this bug with the latest OSP12 Passed Phase 1 (10/05/17.7):
http://kibana.scalelab.redhat.com/goto/909c61bdb16e01b76b283a1a788fcb43

The one error in ^ was due to the messages below; we did not see the errors mentioned above. I did, however, see this again:

2017-10-17 18:31:50.556 126234 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 408: RequestTimeout: Request Timeout (HTTP 408)
2017-10-17 18:31:50.556 126234 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>408 Request Time-out</h1>
Your browser didn't send a complete request in time.
</body></html>
: RequestTimeout: Request Timeout (HTTP 408)
2017-10-17 18:31:50.556 126234 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server: ServiceError: Failed to fetch token data from identity server
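That "408 Request Time-out" HTML body is HAProxy's stock client-timeout response rather than anything keystone generated, so if this recurs, the HAProxy timeouts in front of keystone are worth a look. A diagnostic sketch (path assumes the classic layout; it may differ if HAProxy is containerized):

# On a controller, check the client/server timeouts HAProxy applies:
$ grep -E 'timeout +(client|server)' /etc/haproxy/haproxy.cfg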