Bug 1470853 - Database error during rally scenario
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-neutron
Version: 12.0 (Pike)
Hardware: All
OS: All
Priority: medium
Severity: medium
Assigned To: Joe Talerico
QA Contact: Toni Freger
Reported: 2017-07-13 16:04 EDT by Joe Talerico
Modified: 2017-10-17 14:55 EDT (History)
CC: 6 users

Last Closed: 2017-10-17 14:55:03 EDT
Type: Bug


Attachments: None
Description Joe Talerico 2017-07-13 16:04:34 EDT
Description of problem:
While running the create-network-list scenario with Rally (times: 1000, concurrency: 32; originally seen with concurrency 64), I am observing errors in /var/log/messages and /var/log/neutron/server.log.

/var/log/neutron/server.log (100% reproducible, not intermittent)
https://gist.github.com/jtaleric/1494e85bd25dd68e7673b67af5b9ae6a#file-gistfile2-txt

/var/log/messages (These do not always show up - intermittent)
https://gist.github.com/jtaleric/1494e85bd25dd68e7673b67af5b9ae6a#file-gistfile3-txt

The network is not being saturated.

Disk I/O is a concern; Swift+Gnocchi is deployed (our current default with TripleO/Director): https://snapshot.raintank.io/dashboard/snapshot/aNLcSHrzS7cxY90570Lkum21EiEuHjmw?orgId=2


Version-Release number of selected component (if applicable):
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep sql
python-sqlparse-0.1.18-5.el7ost.noarch
puppet-mysql-3.11.1-0.20170627232431.96f86dc.el7ost.noarch
python-sqlalchemy-1.0.11-2.el7ost.x86_64
openstack-gnocchi-indexer-sqlalchemy-4.0.0-0.20170612072714.a852437.el7ost.noarch
postgresql-libs-9.2.21-1.el7.x86_64
collectd-mysql-5.7.1-2.el7.x86_64
python-sqlalchemy-utils-0.31.3-2.el7ost.noarch
sqlite-3.7.17-8.el7.x86_64
python-sqlite3dbm-0.1.4-7.el7ost.noarch
[root@overcloud-controller-0 heat-admin]# rpm -qa mariadb
mariadb-5.5.56-2.el7.x86_64
[root@overcloud-controller-0 heat-admin]# rpm -qa | grep neutron
python-neutron-lib-1.7.0-0.20170529134801.0ee4f4a.el7ost.noarch
python-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-metering-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutronclient-6.3.0-0.20170601203754.ba535c6.el7ost.noarch
openstack-neutron-common-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-lbaas-11.0.0-0.20170627061233.2c054e0.el7ost.noarch
openstack-neutron-l2gw-agent-10.1.0-0.20170619104111.4bb9806.el7ost.noarch
openstack-neutron-sriov-nic-agent-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
python-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-ml2-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-linuxbridge-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
puppet-neutron-11.2.0-0.20170626053011.862f130.el7ost.noarch
openstack-neutron-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch
openstack-neutron-openvswitch-11.0.0-0.20170628060509.9a72bfe.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Rally net-create-list times:1000 concurrency: 64 (observed with 32 too)
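For anyone reproducing this, a Rally task file matching these parameters would look roughly like the following. This is a sketch: it assumes the standard NeutronNetworks.create_and_list_networks scenario from rally-openstack, and the context values (tenants, quotas) are illustrative and should be adapted to the deployment under test.

```yaml
---
NeutronNetworks.create_and_list_networks:
  - runner:
      type: "constant"
      times: 1000        # total iterations, as in this report
      concurrency: 64    # also reproduced with 32
    context:
      users:
        tenants: 1
        users_per_tenant: 1
      quotas:
        neutron:
          network: -1    # unlimited networks, so quota isn't the bottleneck
```

Run with something like `rally task start net-create-list.yaml` against the overcloud.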

Actual results:
+---------------------------------------------------------------------------------------------------------------------------+
|                                                   Response Times (sec)                                                    |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| Action                 | Min (sec) | Median (sec) | 90%ile (sec) | 95%ile (sec) | Max (sec) | Avg (sec) | Success | Count |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+
| neutron.create_network | 0.938     | 1.365        | 35.663       | 57.445       | 116.966   | 10.857    | 98.0%   | 1000  |
| neutron.list_networks  | 0.363     | 12.254       | 33.998       | 43.099       | 102.298   | 15.539    | 99.9%   | 981   |
| total                  | 2.026     | 16.883       | 61.21        | 88.523       | 154.774   | 26.397    | 98.0%   | 1000  |
+------------------------+-----------+--------------+--------------+--------------+-----------+-----------+---------+-------+


Expected results:
100% success

Additional info:
Running curl against Galera during the test, it never reported any issues; it constantly reported "Galera cluster node is synced."
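The sync state that the curl check reports can also be confirmed directly on each node via the mysql client. A sketch; the wsrep variable names are the standard Galera status variables, but verify the expected values against your own cluster size:

```sql
-- Run on each controller through the mysql client:
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- expect "Synced"
SHOW STATUS LIKE 'wsrep_cluster_size';         -- expect 3 on a three-node cluster
SHOW STATUS LIKE 'wsrep_ready';                -- expect "ON"
```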
Comment 1 Michael Bayer 2017-07-13 16:06:32 EDT
I had Joe make sure the DB connectivity / state never wavered.

Joe can you post up all three /var/log/mysqld.log files ?
Comment 2 Michael Bayer 2017-07-13 16:08:03 EDT
In fact, just link to sosreports somewhere so I can see all neutron / nova / etc. logs, installed versions, etc. Did you also make the galera innodb_flush_log_at_trx_commit change?
Comment 3 Joe Talerico 2017-07-14 08:34:48 EDT
(In reply to Michael Bayer from comment #1)
> I had Joe make sure the DB connectivity / state never wavered.
> 
> Joe can you post up all three /var/log/mysqld.log files ?

Sure -- in the sos.

(In reply to Michael Bayer from comment #2)
> in fact just link to sosreports somewhere so I can see all neutron / nova /
> etc. logs, installed versions, etc., did you do the galera
> innodb_flush_log_at_trx_commit change also ?

http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-0.perf.lab.eng.rdu.redhat.com-20170714113317.tar.xz

http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-1.perf.lab.eng.rdu.redhat.com-20170714113739.tar.xz

http://perf1.perf.lab.eng.bos.redhat.com/jtaleric/OpenStack/sos/sosreport-overcloud-controller-2.perf.lab.eng.rdu.redhat.com-20170714113754.tar.xz

Yup, I added that line to my galera.conf and ran `pcs resource restart galera`
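For completeness, the change referenced here is typically a one-line setting in the MariaDB/Galera config. A sketch; the exact file path varies by deployment (e.g. /etc/my.cnf.d/galera.cnf), and the value used for benchmarking is commonly 2 (or 0):

```ini
[mysqld]
# Flush the InnoDB log to disk roughly once per second instead of at every
# commit; trades up to a second of durability for much lower fsync pressure
# under write-heavy load.
innodb_flush_log_at_trx_commit = 2
```

After editing, restarting the resource (`pcs resource restart galera`) applies the change cluster-wide, as described above.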
Comment 4 Michael Bayer 2017-07-17 10:00:00 EDT
upstream issue / gerrit for the "can't reconnect" error part of this (which I don't think is the cause) is at https://bugs.launchpad.net/oslo.db/+bug/1704474
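The "can't reconnect" behaviour tracked in that oslo.db bug is the classic invalidated-connection problem: once the server drops a connection, the client must discard it and retry rather than reuse it. The general recovery pattern (which oslo.db implements internally) looks roughly like this stdlib-only sketch; DBDisconnected and with_reconnect are illustrative names, not oslo.db APIs:

```python
import time


class DBDisconnected(Exception):
    """Stand-in for the driver's 'server has gone away' error."""


def with_reconnect(fn, retries=3, backoff=0.5):
    """Retry a DB call after a dropped connection, with exponential backoff."""
    def wrapper(*args, **kwargs):
        delay = backoff
        for attempt in range(retries + 1):
            try:
                return fn(*args, **kwargs)
            except DBDisconnected:
                if attempt == retries:
                    raise  # retries exhausted; surface the error
                time.sleep(delay)  # give the server time to come back
                delay *= 2
    return wrapper
```

In real code the except clause would also invalidate the pooled connection before retrying, which is the part the linked oslo.db fix concerns.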
Comment 5 Michael Bayer 2017-07-17 10:24:46 EDT
joe did you try running w/ telemetry / ceilometer etc. disabled?   that's been pretty much the "go-to" resolution the past couple of weeks.   I'm still going to look over these sosreports.
Comment 8 Joe Talerico 2017-07-17 10:39:56 EDT
(In reply to Michael Bayer from comment #5)
> joe did you try running w/ telemetry / ceilometer etc. disabled?   that's
> been pretty much the "go-to" resolution the past couple of weeks.   I'm
> still going to look over these sosreports.

I just built a fresh cloud w/o telemetry. I'll give it a go.
Comment 9 Assaf Muller 2017-08-25 11:52:09 EDT
(In reply to Joe Talerico from comment #8)
> (In reply to Michael Bayer from comment #5)
> > joe did you try running w/ telemetry / ceilometer etc. disabled?   that's
> > been pretty much the "go-to" resolution the past couple of weeks.   I'm
> > still going to look over these sosreports.
> 
> I just built a fresh cloud w/o telemetry. I'll give it a go.

How'd it go?
Comment 13 Joe Talerico 2017-10-17 14:54:46 EDT
Closing this out, as I was not able to recreate this bug with the latest OSP12 Passed Phase 1 (10/05/17.7).

http://kibana.scalelab.redhat.com/goto/909c61bdb16e01b76b283a1a788fcb43

The one error in ^ was due to the message below; however, we did not see the errors mentioned above.

I did however see this again: 
2017-10-17 18:31:50.556 126234 ERROR keystonemiddleware.auth_token [-] Bad response code while validating token: 408: RequestTimeout: Request Timeout (HTTP 408)
2017-10-17 18:31:50.556 126234 WARNING keystonemiddleware.auth_token [-] Identity response: <html><body><h1>408 Request Time-out</h1>
Your browser didn't send a complete request in time.
</body></html>
: RequestTimeout: Request Timeout (HTTP 408)
2017-10-17 18:31:50.556 126234 CRITICAL keystonemiddleware.auth_token [-] Unable to validate token: Failed to fetch token data from identity server: ServiceError: Failed to fetch token data from identity server
