Created attachment 939050 [details]
packstack answer file

Description of problem:
=======================
Post installation, Neutron fails to operate because it cannot connect to the database.

Version-Release number of selected component (if applicable):
=============================================================
openstack-packstack-2014.2-0.2.dev1266.g63d9c50.el7.centos.noarch
openstack-packstack-puppet-2014.2-0.2.dev1266.g63d9c50.el7.centos.noarch

How reproducible:
=================
Tested once.

Steps to Reproduce:
===================
Used the attached answer file to deploy a Juno RDO setup (controller, networker, 2 computes).

Actual results:
===============
Neutron logs contain tracebacks:

OperationalError: (OperationalError) (2003, "Can't connect to MySQL server on '192.168.161.235' (111)") None None

For details see the attached server.log.

Expected results:
=================
Neutron should be functional.

Additional Information:
=======================
Snippet from neutron.conf:

[database]
connection = mysql://neutron:123456@192.168.161.235/neutron
max_retries = 10
retry_interval = 10
idle_timeout = 3600
Nir, is it possible to connect to that mysql server via e.g. telnet?
(In reply to Ihar Hrachyshka from comment #4)
> Nir, is it possible to connect to that mysql server via e.g. telnet?

Yes. Tried locally:

mysql -u neutron -p -h 10.35.161.235 neutron

and it seems to work fine (both the neutron server and the db reside on the same node).
So something prevents Neutron from communicating with MySQL through the port (while other processes can communicate with it fine). Are there any AVC errors in audit.log? What if you disable SELinux?
OK, so here is what we may need to debug the issue:
- iptables -L output before neutron-server fails to connect to the mysql server;
- iptables -L output at the moment when the console mysql client succeeds in connecting to the mysql server.

Also, please specify whether SELinux is enabled (if so, please attach audit.log), and whether FirewallD and NetworkManager are running.
*** Bug 1049519 has been marked as a duplicate of this bug. ***
Thomas, please provide details as per comment above from your failed setup from bug 1049519. Thanks.
Created attachment 942717 [details] iptables -L ( during issue )
Created attachment 942718 [details] iptables -L ( after rabbitmqserver & Neutron server restart and working )
Created attachment 942719 [details] Neutron Server log during this bug reproduction and workaround
The neutron-server service cannot be restarted while this SQL connection error is being logged. The neutron server log also indicates an issue with the AMQP (RabbitMQ) connection. So I first restarted rabbitmq-server and then started the neutron-server service; now it works fine. So it appears there is an interplay between neutron-server and rabbitmq-server. SELinux is in permissive mode. Other logs are attached above.
Thomas, do you mean that after you tried to restart Neutron, it didn't manage to come up due to AMQP issue, so you restarted RabbitMQ and MySQL and then Neutron, which helped to get rid of the failures?

The way RabbitMQ connection is broken looks very weird:
"Failed to consume message from queue: Socket closed"

I wonder whether there are any relevant logs for mysql and rabbitmq services?
(In reply to Ihar Hrachyshka from comment #14)
> Thomas, do you mean that after you tried to restart Neutron, it didn't
> manage to come up due to AMQP issue, so you restarted RabbitMQ and MySQL and
> then Neutron, which helped to get rid of the failures?
>
> The way RabbitMQ connection is broken looks very weird:
> "Failed to consume message from queue: Socket closed"
>
> I wonder whether there are any relevant logs for mysql and rabbitmq services?

I managed to reproduce this with SELinux set to permissive.
Created attachment 942907 [details] Mariadb MYSQL log_last 1000
Created attachment 942908 [details] RabbitMQ server log_last 1000 lines
(In reply to Ihar Hrachyshka from comment #14)
> Thomas, do you mean that after you tried to restart Neutron, it didn't
> manage to come up due to AMQP issue, so you restarted RabbitMQ and MySQL and
> then Neutron, which helped to get rid of the failures?
>
> The way RabbitMQ connection is broken looks very weird:
> "Failed to consume message from queue: Socket closed"

Yes. The Neutron server can only be restarted successfully by first restarting the RabbitMQ server. The MySQL restart was not required.

> I wonder whether there are any relevant logs for mysql and rabbitmq services?

I attached the last 1000 lines from these logs here.
Additional comments pertaining to the PackStack Juno install; may or may not be relevant for this bug. During this install, I had manually updated the RabbitMQ config file, because the one which Packstack creates caused an issue with the nova-compute service start. Refer to: https://ask.openstack.org/en/question/48329/openstack-juno-using-rdo-fails-installation-amqp-server-closed-the-connection/
@ihar, I just noticed the exact same issue with cinder
OK, so the following is based on an interactive session on one of the machines kindly provided by Nir Magnezi.

The failure occurs after a restart. Neutron starts and fails to connect to the MariaDB server. After the failure, it does not attempt to retry the connection to the SQL server but just exits:

Oct 01 11:24:48 nmagnezi-os-cont1.scl.lab.tlv.redhat.com systemd[1]: neutron-server.service: main process exited, code=exited, status=1/FAILURE

Now let's see what the status of MariaDB was:

Oct 01 11:24:13 nmagnezi-os-cont1.scl.lab.tlv.redhat.com systemd[1]: Starting MariaDB database server...
Oct 01 11:25:20 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:20 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Oct 01 11:25:21 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:21 mysqld_safe Starting mysqld daemon with databases from...mysql
Oct 01 11:25:22 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:22 mysqld_safe WSREP: Running position recovery with --lo....pid'
Oct 01 11:25:29 nmagnezi-os-cont1.scl.lab.tlv.redhat.com mysqld_safe[1237]: 141001 11:25:29 mysqld_safe WSREP: Recovered position 00000000-0000-00...00:-1
Oct 01 11:25:35 nmagnezi-os-cont1.scl.lab.tlv.redhat.com systemd[1]: Started MariaDB database server.

So Neutron crashed while MariaDB was still starting up, before it reported that startup succeeded. This is because the Neutron service does not wait for MariaDB to start (nor does it wait for the AMQP service). The only dependencies specified in the neutron-server unit file are: "After=syslog.target network.target"

There are multiple questions to be answered:
- why it took so long for MariaDB to start (quoting Alan Pevec: "initial startup probably takes longer as it runs prepare-db-dir ?");
- why Neutron does not wait for the SQL and AMQP services to start before proceeding with its own startup;
- why Neutron crashes instead of retrying (this question probably belongs to oslo.db).
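For what it's worth, the missing unit-file ordering could in principle be patched locally with a systemd drop-in. This is only a sketch of a stopgap, not a proposed fix (unit names are the usual RDO ones, and note it only orders startup; it does not make Neutron retry on its own):

```ini
# Hypothetical drop-in: /etc/systemd/system/neutron-server.service.d/ordering.conf
# Start neutron-server only after the database and AMQP broker units
# have reported successful startup.
[Unit]
After=mariadb.service rabbitmq-server.service
Wants=mariadb.service rabbitmq-server.service
```

After adding it, `systemctl daemon-reload` would be needed for the drop-in to take effect.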
Grabbing the bug and moving it to the openstack-neutron component, since the issue seems to be independent of which deployment tool was used to deploy OpenStack.
That said, the solution should be generic enough to be applied to all OpenStack services (see the references to Cinder crashing the same way).
OK, I've read a bit more code, and I think I now understand how it failed.

oslo.db DOES have a mechanism for db retries on connect failure (see retry_interval, max_retries and other options in neutron.conf). The problem is that it's applied in create_engine() when calling _test_connection(), while the crash occurred before _test_connection() was called — specifically, while the engine was being __init__-ialized. It seems a bug was introduced in oslo.db 0.4.0 that resulted in oslo.db invoking SQL statements on the engine (which caused engine.connect() to be called) before the connection was tested (specifically, "SHOW VARIABLES LIKE 'sql_mode'" was issued for MariaDB).

Looking at the oslo.db master branch now, the code was reworked there and no longer seems to be affected by the issue (the statement is postponed until engine.connect() is called inside _test_connection(), instead of being issued right away). So my hypothesis is that oslo.db>=1.0.0 should solve the problem. The appropriate version bump was already merged in the upstream openstack/requirements project [1] and is going to be propagated into all projects that use oslo.db before the final Juno release.

[1]: https://review.openstack.org/#/c/122408/
The simplest way to reproduce the failure is:

1. Stop MariaDB.
2. Execute the following Python script:

'''
import oslo.db.sqlalchemy.session

url = 'mysql://neutron:123456@192.168.161.235/neutron'
engine = oslo.db.sqlalchemy.session.EngineFacade(url)
'''

It will fail with the same traceback.
Reported to upstream.
In the meantime, the workaround is to wait for MariaDB to finish its startup (check systemctl status mariadb) and then restart neutron-server and the other services affected by the bug (systemctl restart neutron-server|cinder).
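The workaround can be scripted as a simple polling loop. A sketch, with the restart commands left as comments since the exact unit names depend on the deployment:

```shell
# wait_for_active polls the given status-check command until it succeeds.
wait_for_active() {
    until "$@"; do
        sleep 2
    done
}

# Typical usage on an RDO node (unit names assumed):
#   wait_for_active systemctl is-active --quiet mariadb
#   systemctl restart neutron-server
```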
Thomas, please check the latest comments. Does it look the same as your issue? Does the workaround work for you?
*** Bug 1140897 has been marked as a duplicate of this bug. ***
Talked to Roman Podolyaka from the upstream oslo.db team. It seems the bug is present in the oslo-incubator db code too, so the fix should probably be backported to Icehouse and Havana, going through the boring oslo-incubator backport process. Oh my!..
(In reply to Ihar Hrachyshka from comment #28)
> Thomas, please check the latest comments. Does it look the same as your
> issue? Does the workaround work for you?

Yes Ihar, the MySQL and Neutron issue is present and neutron-server fails without retrying. The above workaround works.

However, in my case there was also an issue with the RabbitMQ AMQP server not starting when the control node hosting it was rebooted. In that case, though, I noticed the Neutron server kept retrying until the AMQP server was restarted manually. So unlike with MySQL, Neutron DOES wait for the AMQP service to come up and does not crash.

------------- server.log -------------
2014-10-01 18:41:23.467 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
2014-10-01 18:41:24.468 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 192.168.0.20:5672
2014-10-01 18:41:24.481 7137 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.0.20:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 13 seconds.
2014-10-01 18:41:37.490 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds...
// <--- manually started the RabbitMQ service here -->
2014-10-01 18:41:38.493 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 192.168.0.20:5672
2014-10-01 18:41:38.518 7137 INFO oslo.messaging._drivers.impl_rabbit [-] Connected to AMQP server on 192.168.0.20:5672
OK, great. So Thomas's issue with MySQL is the same, and the AMQP broker issue seems to be unrelated to Neutron but rather something with the broker itself. Do you have logs for rabbitmq to check?
Please create new bz for the AMQP broker issue, let's keep this one for oslo.db
Alan, also note that the bug is present in incubator db code too, meaning it's present in RHOS4 and RHOSP5. The sad thing is also that we'll probably need to clone the bug to each of the projects that use that code in those releases, to track the fix. :|
FYI the fix was merged in u/s. Still it's not clear whether we need to backport it to Icehouse/Havana. If so, we need to backport it to oslo-incubator db codebase first.