Bug 1144181 - oslo.db fails to keep trying to connect to the database on startup
Summary: oslo.db fails to keep trying to connect to the database on startup
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: python-oslo-db
Version: rawhide
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Alan Pevec (Fedora)
QA Contact: Ami Jeain
URL:
Whiteboard:
Duplicates: 1049519 1140897
Depends On:
Blocks:
 
Reported: 2014-09-18 22:17 UTC by Nir Magnezi
Modified: 2019-09-10 14:11 UTC
CC: 13 users

Fixed In Version: python-oslo-db-1.0.2-2.fc22
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-10-27 23:44:36 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
packstack answer file (5.48 KB, text/plain), 2014-09-18 22:17 UTC, Nir Magnezi
iptables -L ( during issue ) (4.86 KB, text/plain), 2014-09-30 13:39 UTC, Thomas Chacko
iptables -L ( after rabbitmqserver & Neutron server restart and working ) (4.86 KB, text/plain), 2014-09-30 13:40 UTC, Thomas Chacko
Neutron Server log during this bug reproduction and workaround (132.44 KB, text/plain), 2014-09-30 13:41 UTC, Thomas Chacko
Mariadb MYSQL log_last 1000 (70.47 KB, text/plain), 2014-10-01 03:36 UTC, Thomas Chacko
RabbitMQ server log_last 1000 lines (39.17 KB, text/plain), 2014-10-01 03:36 UTC, Thomas Chacko


Links
Launchpad 1376211
OpenStack gerrit 125347: MERGED, "Ensure create_engine() retries the initial connection test" (last updated 2021-01-08 23:07:43 UTC)

Description Nir Magnezi 2014-09-18 22:17:57 UTC
Created attachment 939050 [details]
packstack answer file

Description of problem:
=======================
Post installation, Neutron fails to operate because it cannot connect to the database.

Version-Release number of selected component (if applicable):
=============================================================
openstack-packstack-2014.2-0.2.dev1266.g63d9c50.el7.centos.noarch
openstack-packstack-puppet-2014.2-0.2.dev1266.g63d9c50.el7.centos.noarch

How reproducible:
=================
Tested once.

Steps to Reproduce:
===================
Used the attached answer file to deploy a Juno RDO setup (controller, networker, 2 computes).

Actual results:
===============
Neutron logs contain tracebacks.

OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on '192.168.161.235' (111)") None None

For details see the attached server.log

Expected results:
=================
Neutron should be functional.


Additional Information:
=======================
snipped from neutron.conf:

[database]
connection = mysql://neutron:123456.161.235/neutron
max_retries = 10
retry_interval = 10
idle_timeout = 3600

Comment 4 Ihar Hrachyshka 2014-09-19 08:27:20 UTC
Nir, is it possible to connect to that mysql server via e.g. telnet?

Comment 5 Nir Magnezi 2014-09-20 22:16:26 UTC
(In reply to Ihar Hrachyshka from comment #4)
> Nir, is it possible to connect to that mysql server via e.g. telnet?

Yes. Tried locally: mysql -u neutron -p -h 10.35.161.235 neutron
and it seems to work fine (both neutron server and db reside on the same node).

Comment 6 Ihar Hrachyshka 2014-09-29 11:31:06 UTC
So something prevents Neutron from communicating with MySQL through that port (while other processes can communicate with it just fine). Are there any AVC errors in audit.log? What happens if you disable SELinux?

Comment 7 Ihar Hrachyshka 2014-09-30 12:40:06 UTC
OK, so here is what we may need to debug the issue:

- iptables -L output before neutron-server fails to connect to mysql server;
- iptables -L output at the moment when the console mysql client succeeds in connecting to the mysql server.

Also, please specify whether SELinux is enabled (if so, please attach audit.log), and whether FirewallD and NetworkManager are running.

Comment 8 Ihar Hrachyshka 2014-09-30 12:40:22 UTC
*** Bug 1049519 has been marked as a duplicate of this bug. ***

Comment 9 Ihar Hrachyshka 2014-09-30 12:41:19 UTC
Thomas, please provide the details requested in the comment above from your failed setup in bug 1049519. Thanks.

Comment 10 Thomas Chacko 2014-09-30 13:39:05 UTC
Created attachment 942717 [details]
iptables -L ( during issue )

Comment 11 Thomas Chacko 2014-09-30 13:40:05 UTC
Created attachment 942718 [details]
iptables -L ( after rabbitmqserver  & Neutron server restart and working )

Comment 12 Thomas Chacko 2014-09-30 13:41:00 UTC
Created attachment 942719 [details]
Neutron Server log during this bug reproduction and workaround

Comment 13 Thomas Chacko 2014-09-30 13:45:37 UTC
The Neutron server cannot be restarted when this SQL connect error is logged. The Neutron server logs indicate an issue with the AMQP (RabbitMQ) connection. So I first restarted rabbitmq-server and then started the neutron-server service; now it works fine. It appears there is an interplay between neutron-server and rabbitmq-server. SELinux is in permissive mode. Other logs are attached above.

Comment 14 Ihar Hrachyshka 2014-09-30 18:14:48 UTC
Thomas, do you mean that after you tried to restart Neutron, it didn't manage to come up due to AMQP issue, so you restarted RabbitMQ and MySQL and then Neutron, which helped to get rid of the failures?

The way RabbitMQ connection is broken looks very weird:
"Failed to consume message from queue: Socket closed"

I wonder whether there are any relevant logs for mysql and rabbitmq services?

Comment 15 Nir Magnezi 2014-09-30 19:04:38 UTC
(In reply to Ihar Hrachyshka from comment #14)
> Thomas, do you mean that after you tried to restart Neutron, it didn't
> manage to come up due to AMQP issue, so you restarted RabbitMQ and MySQL and
> then Neutron, which helped to get rid of the failures?
> 
> The way RabbitMQ connection is broken looks very weird:
> "Failed to consume message from queue: Socket closed"
> 
> I wonder whether there are any relevant logs for mysql and rabbitmq services?

I managed to reproduce this with SElinux set to permissive.

Comment 16 Thomas Chacko 2014-10-01 03:36:07 UTC
Created attachment 942907 [details]
Mariadb MYSQL log_last 1000

Comment 17 Thomas Chacko 2014-10-01 03:36:33 UTC
Created attachment 942908 [details]
RabbitMQ server log_last 1000 lines

Comment 18 Thomas Chacko 2014-10-01 03:38:48 UTC
(In reply to Ihar Hrachyshka from comment #14)
> Thomas, do you mean that after you tried to restart Neutron, it didn't
> manage to come up due to AMQP issue, so you restarted RabbitMQ and MySQL and
> then Neutron, which helped to get rid of the failures?
> 
> The way RabbitMQ connection is broken looks very weird:
> "Failed to consume message from queue: Socket closed"

Yes. Neutron Server can only be restarted successfully by first restarting RabbitMQ server. The MYSQL restart was not required.

 
> I wonder whether there are any relevant logs for mysql and rabbitmq services?
I attach the last 1000 lines from these logs here,

Comment 19 Thomas Chacko 2014-10-01 03:57:22 UTC
Additional comments pertaining to the PackStack Juno install, which may or may not be relevant for this bug: during this install, I had manually updated the RabbitMQ config file, because the one Packstack creates caused an issue with starting the nova-compute service. Refer: https://ask.openstack.org/en/question/48329/openstack-juno-using-rdo-fails-installation-amqp-server-closed-the-connection/

Comment 20 Nir Magnezi 2014-10-01 08:39:34 UTC
@ihar, I just noticed the exact same issue with Cinder.

Comment 21 Ihar Hrachyshka 2014-10-01 09:34:01 UTC
OK, so the following is based on an interactive session on one of the machines kindly provided by Nir Magnezi...

The failure occurs after a restart. Neutron starts and fails to connect to the MariaDB server. After the failure, it does not retry the connection to the SQL server; it just exits.

Oct 01 11:24:48 nmagnezi-os-cont1.scl.lab.tlv.redhat.com systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE

Now let's see what the status of MariaDB is:

Oct 01 11:24:13 nmagnezi-os-cont1.scl.lab.tlv.redhat.com systemd[1]: Starting MariaDB database server...
Oct 01 11:25:20 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:20 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Oct 01 11:25:21 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:21 mysqld_safe Starting mysqld daemon with databases from...mysql
Oct 01 11:25:22 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:22 mysqld_safe WSREP: Running position recovery with --lo....pid'
Oct 01 11:25:29 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:29 mysqld_safe WSREP: Recovered position 00000000-0000-00...00:-1
Oct 01 11:25:35 nmagnezi-os-cont1.scl.lab.tlv.redhat.com systemd[1]: Started MariaDB database server.

So Neutron crashed while MariaDB was still starting up, before MariaDB reported that startup had succeeded.

This is because the Neutron service does not wait for MariaDB to start (nor does it wait for the AMQP service). The only dependencies specified in the neutron-server unit file are: "After=syslog.target network.target"

There are multiple questions to be answered:
- why has it taken so long for MariaDB to start (quoting Alan Pevec: "initial startup probably takes longer as it runs prepare-db-dir ?");
- why Neutron does not wait for SQL and AMQP services to start before proceeding with its own startup;
- why Neutron crashes instead of retrying (this question probably belongs to oslo.db).

Comment 22 Ihar Hrachyshka 2014-10-01 09:35:02 UTC
Grabbing the bug and moving it to the openstack-neutron component, since the issue is independent of which deployment tool was used to deploy OpenStack.

Comment 23 Ihar Hrachyshka 2014-10-01 09:35:38 UTC
That said, the solution should be generic enough to be applied to all OpenStack services (see the references to Cinder crashing the same way).

Comment 24 Ihar Hrachyshka 2014-10-01 10:23:52 UTC
OK, I've read a bit more of the code, and I now seem to understand how it failed.

So oslo.db HAS a mechanism for DB retries on connect failure (see retry_interval, max_retries and other options in neutron.conf). The problem is that it is applied in create_engine() when calling _test_connection(), while the crash occurred before _test_connection() was called, specifically when the engine was __init__-ialized. It seems that oslo.db 0.4.0 introduced a bug which resulted in oslo.db invoking SQL statements on the engine (causing engine.connect() to be called) before the connection is tested (specifically, "SHOW VARIABLES LIKE 'sql_mode'" was executed for MariaDB).
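
To make the mechanism above concrete, here is a minimal sketch of the retry-on-connect pattern that the max_retries/retry_interval options control. This is illustrative only, with placeholder names and a placeholder connection URL; it is not oslo.db's actual implementation:

'''
import time

import sqlalchemy
from sqlalchemy import exc

def wait_for_db(engine, max_retries=10, retry_interval=10):
    # Keep probing the database until a connection succeeds or the retry
    # budget is exhausted (max_retries = -1 would mean "retry forever").
    attempt = 0
    while True:
        try:
            conn = engine.connect()
            conn.close()
            return
        except exc.OperationalError:
            attempt += 1
            if max_retries != -1 and attempt >= max_retries:
                raise
            time.sleep(retry_interval)

# placeholder URL; the real one comes from the [database] connection option
engine = sqlalchemy.create_engine('mysql://user:password@host/dbname')
wait_for_db(engine)
'''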

Looking at the oslo.db master branch, I see that the code was reworked there and no longer seems to be affected by the issue (the statement is postponed until engine.connect() is called inside _test_connection() instead of being executed right away). So my hypothesis is that oslo.db>=1.0.0 should solve the problem. The corresponding version bump was already merged in the upstream openstack/requirements project [1] and will be propagated to all projects that use oslo.db before the final Juno release.

[1]: https://review.openstack.org/#/c/122408/

Comment 25 Ihar Hrachyshka 2014-10-01 10:38:47 UTC
The simplest way to reproduce the failure is:

1. Stop MariaDB.
2. Execute the following Python script:

'''
import oslo.db.sqlalchemy.session

# connection string reproduced verbatim from neutron.conf above
url = 'mysql://neutron:123456.161.235/neutron'

# instantiating the facade triggers the initial connection attempt and the crash
engine = oslo.db.sqlalchemy.session.EngineFacade(url)
'''

It will fail with the same traceback.

Comment 26 Ihar Hrachyshka 2014-10-01 10:50:23 UTC
Reported to upstream.

Comment 27 Ihar Hrachyshka 2014-10-01 11:13:04 UTC
In the meantime, the workaround is to wait for mariadb to finish its startup (check systemctl status mariadb) and then restart neutron and other services affected by the bug (systemctl restart neutron-server|cinder).
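
For anyone scripting the workaround, here is a small illustrative helper (not part of the fix; host and port are placeholders) that waits until MariaDB accepts TCP connections before the dependent services are restarted:

'''
import socket
import time

def wait_for_port(host, port, timeout=120.0, interval=2.0):
    # Poll the TCP port until it accepts connections or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            sock = socket.create_connection((host, port), timeout=interval)
            sock.close()
            return
        except socket.error:
            time.sleep(interval)
    raise RuntimeError('%s:%d did not come up in time' % (host, port))

# placeholder address; use the MariaDB host from neutron.conf
wait_for_port('192.168.161.235', 3306)
'''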

Comment 28 Ihar Hrachyshka 2014-10-01 11:50:39 UTC
Thomas, please check the latest comments. Does it look the same as your issue? Does the workaround work for you?

Comment 29 wes hayutin 2014-10-01 11:54:56 UTC
*** Bug 1140897 has been marked as a duplicate of this bug. ***

Comment 30 Ihar Hrachyshka 2014-10-01 12:07:50 UTC
I talked to Roman Podolyaka from the upstream oslo.db team. It seems the bug is present in the oslo-incubator db code too, so the fix should probably be backported to Icehouse and Havana, passing through the tedious oslo-incubator backport process. Oh my!..

Comment 31 Thomas Chacko 2014-10-01 13:22:17 UTC
(In reply to Ihar Hrachyshka from comment #28)
> Thomas, please check the latest comments. Does it look the same as your
> issue? Does the workaround work for you?
Yes Ihar, the MySQL and Neutron issue is the same: neutron-server fails without retrying. The above workaround works.

However, in my case there was also an issue with the RabbitMQ AMQP server not starting when the control node hosting it was rebooted. In that case I noticed that the Neutron server kept retrying until the AMQP server was restarted manually. So unlike with MySQL, Neutron DOES wait for the AMQP service to come up and does not crash.

------------- server.log -------------
2014-10-01 18:41:23.467 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-10-01 18:41:24.468 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 192.168.0.20:5672
2014-10-01 18:41:24.481 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.0.20:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 13 seconds.
2014-10-01 18:41:37.490 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...

//<--- manually started the RabbitMQ service here //

2014-10-01 18:41:38.493 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 192.168.0.20:5672
2014-10-01 18:41:38.518 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.0.20:5672

Comment 32 Ihar Hrachyshka 2014-10-01 13:53:30 UTC
Ok, great. So Thomas's issue with MySQL is the same, and the AMQP broker issue seems to be unrelated to Neutron, but rather something with the broker itself. Do you have rabbitmq logs to check?

Comment 33 Alan Pevec 2014-10-01 22:27:35 UTC
Please create a new bz for the AMQP broker issue; let's keep this one for oslo.db.

Comment 34 Ihar Hrachyshka 2014-10-02 08:54:27 UTC
Alan, also note that the bug is present in incubator db code too, meaning it's present in RHOS4 and RHOSP5. The sad thing is also that we'll probably need to clone the bug to each of the projects that use that code in those releases, to track the fix. :|

Comment 35 Ihar Hrachyshka 2014-10-07 21:13:21 UTC
FYI, the fix was merged upstream. It is still not clear whether we need to backport it to Icehouse/Havana. If so, we need to backport it to the oslo-incubator db codebase first.

