Description of problem:

While running the create-network-list scenario with Rally (times: 1000, concurrency: 32; originally seen with concurrency 64), I am observing errors in /var/log/messages and /var/log/neutron/server.log.

/var/log/neutron/server.log errors (100% reproducible, not intermittent):
https://gist.github.com/jtaleric/1494e85bd25dd68e7673b67af5b9ae6a#file-gistfile2-txt

/var/log/messages errors (these do not always show up - intermittent):
https://gist.github.com/jtaleric/1494e85bd25dd68e7673b67af5b9ae6a#file-gistfile3-txt

The network is not being saturated. Disk I/O is a concern; Swift + Gnocchi is deployed (our current default with TripleO/Director):
https://snapshot.raintank.io/dashboard/snapshot/aNLcSHrzS7cxY90570Lkum21EiEuHjmw?orgId=2

Version-Release number of selected component (if applicable):

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep sql
python-sqlparse-0.1.18-5.el7ost.noarch
puppet-mysql-3.11.1-0.20170627232431.96f86dc.el7ost.noarch
python-sqlalchemy-1.0.11-2.el7ost.x86_64
openstack-gnocchi-indexer-sqlalchemy-4.0.0-0.20170612072714.a852437.el7ost.noarch
postgresql-libs-9.2.21-1.el7.x86_64
collectd-mysql-5.7.1-2.el7.x86_64
python-sqlalchemy-utils-0.31.3-2.el7ost.noarch
sqlite-3.7.17-8.el7.x86_64
python-sqlite3dbm-0.1.4-7.el7ost.noarch

[root@overcloud-controller-0 heat-admin]# rpm -qa mariadb
mariadb-5.5.56-2.el7.x86_64

[root@overcloud-controller-0 heat-admin]# rpm -qa | grep neutron
python-neutron-lib-1.7.0-0.20170529134801.0ee4f4a.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutronclient-6.3.0-0.20170601203754.ba535c6.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-l2gw-agent-10.1.0-0.20170619104111.4bb9806.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
puppet-neutron-11.2.0-0.20170626053011.862f130.el7ost.noarch
openstack-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Run the Rally net-create-list scenario with times: 1000, concurrency: 64 (observed with concurrency 32 as well); a task-file sketch follows under Additional info.

Actual results:

+---------------------------------------------------------------------------------------------------------------------------+
|                                                    Response Times (sec)                                                    |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                 | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_network | 0.938     | 1.365        | 35.663       | 57.445       | 116.966   | 10.857    | 98.0%   | 1000  |
| neutron.list_networks  | 0.363     | 12.254       | 33.998       | 43.099       | 102.298   | 15.539    | 99.9%   | 981   |
| total                  | 2.026     | 16.883       | 61.21        | 88.523       | 154.774   | 26.397    | 98.0%   | 1000  |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+

Expected results:
100% success

Additional info:
Running a curl on galera during the test, galera never reported any issues; it constantly reported "Galera cluster node is synced."
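A minimal Rally task sketch matching the steps above -- assuming the stock NeutronNetworks.create_and_list_networks scenario; the file name, user context, and unlimited-network quota are illustrative, not taken from the original run:

{
    "NeutronNetworks.create_and_list_networks": [
        {
            "runner": {
                "type": "constant",
                "times": 1000,
                "concurrency": 32
            },
            "context": {
                "users": {"tenants": 1, "users_per_tenant": 1},
                "quotas": {"neutron": {"network": -1}}
            }
        }
    ]
}

Run with, e.g., `rally task start net-create-list.json` (hypothetical file name).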
I had Joe make sure the DB connectivity / state never wavered.

Joe, can you post up all three /var/log/mysqld.log files?
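For context, the "curl on galera" check from the description is presumably the clustercheck endpoint TripleO wires into HAProxy; a sketch, assuming the default port (9200):

# On each controller -- clustercheck returns HTTP 200 with this body
# when the local node is a synced Galera member:
$ curl http://localhost:9200
Galera cluster node is synced.

# The same state can also be read directly from MySQL:
$ mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"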
In fact, just link to sosreports somewhere so I can see all the neutron / nova / etc. logs, installed versions, and so on. Did you do the galera innodb_flush_log_at_trx_commit change also?
(In reply to Michael Bayer from comment #1)
> I had Joe make sure the DB connectivity / state never wavered.
>
> Joe, can you post up all three /var/log/mysqld.log files?

Sure -- in the sos.

(In reply to Michael Bayer from comment #2)
> In fact, just link to sosreports somewhere so I can see all the neutron /
> nova / etc. logs, installed versions, and so on. Did you do the galera
> innodb_flush_log_at_trx_commit change also?

http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-0.perf.lab.eng.rdu.redhat.com-20170714113317.tar.xz
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-1.perf.lab.eng.rdu.redhat.com-20170714113739.tar.xz
http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-2.perf.lab.eng.rdu.redhat.com-20170714113754.tar.xz

Yup, I added that line to my galera.conf and ran `pcs resource restart galera`.
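For anyone reproducing: the change in question relaxes InnoDB's flush-on-commit behavior. A sketch of the edit -- the value of 2 (flush the log once per second rather than on every commit, leaning on Galera replication for durability) is the commonly recommended value for this kind of benchmarking, but both it and the exact config file path are assumptions here:

# /etc/my.cnf.d/galera.cnf, under [mysqld] -- path and value are assumptions:
innodb_flush_log_at_trx_commit = 2

# Then restart the pacemaker-managed resource so the change takes effect:
pcs resource restart galera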
The upstream issue / Gerrit review for the "can't reconnect" error part of this (which I don't think is the cause) is at https://bugs.launchpad.net/oslo.db/+bug/1704474
Joe, did you try running with telemetry / ceilometer etc. disabled? That's been pretty much the "go-to" resolution the past couple of weeks. I'm still going to look over these sosreports.
(In reply to Michael Bayer from comment #5)
> Joe, did you try running with telemetry / ceilometer etc. disabled? That's
> been pretty much the "go-to" resolution the past couple of weeks. I'm
> still going to look over these sosreports.

I just built a fresh cloud w/o telemetry. I'll give it a go.
(In reply to Joe Talerico from comment #8)
> (In reply to Michael Bayer from comment #5)
> > Joe, did you try running with telemetry / ceilometer etc. disabled? That's
> > been pretty much the "go-to" resolution the past couple of weeks. I'm
> > still going to look over these sosreports.
>
> I just built a fresh cloud w/o telemetry. I'll give it a go.

How'd it go?
Closing this out, as I was not able to re-create this bug with the latest OSP12 Passed Phase 1 (10/05/17.7):
http://kibana.scalelab.redhat.com/goto/909c61bdb16e01b76b283a1a788fcb43

The one error in ^ was due to the messages below; we did not see the errors mentioned above. I did, however, see this again:

2017-10-17 18:31:50.556 126234 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 408: RequestTimeout: Request Timeout (HTTP 408)
2017-10-17 18:31:50.556 126234 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>408 Request Time-out</h1>
Your browser didn't send a complete request in time.
</body></html>
: RequestTimeout: Request Timeout (HTTP 408)
2017-10-17 18:31:50.556 126234 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server: ServiceError: Failed to fetch token data from identity server
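That "408 Request Time-out" HTML body is HAProxy's stock client-timeout response rather than anything keystone generated, so if this recurs, the HAProxy timeouts in front of keystone are worth a look. A diagnostic sketch (path assumes the classic layout; it may differ if HAProxy is containerized):

# On a controller, check the client/server timeouts HAProxy applies:
$ grep -E 'timeout +(client|server)' /etc/haproxy/haproxy.cfg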