Description of problem:
If the database is unavailable when neutron-server starts, Neutron retries the connection a pre-configured number of times, as specified by max_retries. max_retries can be set to -1 to retry indefinitely, but the indefinite retry never gets a chance to run because the systemd unit's start operation times out and systemd stops the service.

This is an issue in an HA environment after a complete cluster restart: Neutron can only start once the Galera DB cluster has formed, and that ordering can only be enforced when using Pacemaker. In other architectures it cannot be guaranteed, and neutron-server has to be started manually after bootstrapping the Galera cluster.

This behaviour can be fixed by setting Restart=on-failure in the systemd unit file.

Version-Release number of selected component (if applicable):
openstack-neutron-2014.2.1-6.el7ost.noarch

How reproducible:
Always

Steps to Reproduce:
1. Set up an OpenStack environment and set MariaDB startup to disabled.
2. Set max_retries=-1 in /etc/neutron/neutron.conf
3. Try to start neutron-server

Actual results:
After a number of retries, neutron-server stops and never tries to start again.

Expected results:
neutron-server retries until the database connection is reestablished.

Additional info:
Some information gathered from my test environment:

[root@hacontroller1 ~]# systemctl status neutron-server
neutron-server.service - OpenStack Neutron Server
   Loaded: loaded (/usr/lib/systemd/system/neutron-server.service; enabled)
   Active: failed (Result: timeout) since lun 2015-02-02 11:13:25 CET; 4s ago
 Main PID: 1495
   CGroup: /system.slice/neutron-server.service

feb 02 11:13:25 hacontroller1.example.com systemd[1]: neutron-server.service operation timed out. Terminating.
feb 02 11:13:25 hacontroller1.example.com systemd[1]: Failed to start OpenStack Neutron Server.
feb 02 11:13:25 hacontroller1.example.com systemd[1]: Unit neutron-server.service entered failed state.
[root@hacontroller1 ~]# grep retries /etc/neutron/neutron.conf
# Maximum amount of retries to generate a unique MAC address
# mac_generation_retries = 16
# How long to backoff for between retries when connecting to
# Maximum number of RabbitMQ connection retries. Default is 0
#rabbit_max_retries=0
max_retries = -1
# max_retries = 10

[root@hacontroller1 neutron]# tail -n 25 /var/log/neutron/server.log
2015-02-02 11:12:11.719 1495 INFO neutron.manager [-] Loading core plugin: neutron.plugins.ml2.plugin.Ml2Plugin
2015-02-02 11:12:14.076 1495 INFO neutron.plugins.ml2.managers [-] Configured type driver names: ['local', 'gre', 'flat', 'vxlan', 'vlan']
2015-02-02 11:12:14.099 1495 INFO neutron.plugins.ml2.drivers.type_flat [-] Arbitrary flat physical_network names allowed
2015-02-02 11:12:14.128 1495 INFO neutron.plugins.ml2.drivers.type_vlan [-] Network VLAN ranges: {}
2015-02-02 11:12:14.188 1495 INFO neutron.plugins.ml2.drivers.type_local [-] ML2 LocalTypeDriver initialization complete
2015-02-02 11:12:14.326 1495 INFO neutron.plugins.ml2.managers [-] Loaded type driver names: ['flat', 'vlan', 'local', 'gre', 'vxlan']
2015-02-02 11:12:14.327 1495 INFO neutron.plugins.ml2.managers [-] Registered types: ['flat', 'vlan', 'local', 'gre', 'vxlan']
2015-02-02 11:12:14.328 1495 INFO neutron.plugins.ml2.managers [-] Tenant network_types: ['vxlan']
2015-02-02 11:12:14.328 1495 INFO neutron.plugins.ml2.managers [-] Configured extension driver names: []
2015-02-02 11:12:14.330 1495 INFO neutron.plugins.ml2.managers [-] Loaded extension driver names: []
2015-02-02 11:12:14.330 1495 INFO neutron.plugins.ml2.managers [-] Registered extension drivers: []
2015-02-02 11:12:14.331 1495 INFO neutron.plugins.ml2.managers [-] Configured mechanism driver names: ['openvswitch']
2015-02-02 11:12:14.455 1495 INFO neutron.plugins.ml2.managers [-] Loaded mechanism driver names: ['openvswitch']
2015-02-02 11:12:14.455 1495 INFO neutron.plugins.ml2.managers [-] Registered mechanism drivers: ['openvswitch']
2015-02-02 11:12:14.638 1495 INFO neutron.plugins.ml2.managers [-] Initializing driver for type 'flat'
2015-02-02 11:12:14.638 1495 INFO neutron.plugins.ml2.drivers.type_flat [-] ML2 FlatTypeDriver initialization complete
2015-02-02 11:12:14.639 1495 INFO neutron.plugins.ml2.managers [-] Initializing driver for type 'vlan'
2015-02-02 11:12:14.773 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -1 attempts left.
2015-02-02 11:12:24.788 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -2 attempts left.
2015-02-02 11:12:34.801 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -3 attempts left.
2015-02-02 11:12:44.817 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -4 attempts left.
2015-02-02 11:12:54.830 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -5 attempts left.
2015-02-02 11:13:04.842 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -6 attempts left.
2015-02-02 11:13:14.854 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -7 attempts left.
2015-02-02 11:13:24.867 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -8 attempts left.
*** Bug 1188199 has been marked as a duplicate of this bug. ***
I am seeing the same issue on OSP6, with the same config as listed above.
Same issue in OSP5 RHELOSP50-1 2014.1.1-4. An easy fix would be to add a wait on mariadb.service to the systemd startup script.

VM
(In reply to Veaceslav Mindru from comment #5)
> Same issue in OSP5 RHELOSP50-1 2014.1.1-4. Easy fix would be to add wait on
> mariadb.service into systemctl startup script.
>
>
> VM

This would only work if neutron-server and MariaDB are running on the same system.

I have worked around this issue by manually creating a file named /etc/systemd/system/neutron-server.service.d/restart.conf (need to create the directory as well) with the following contents:

[Service]
Restart=on-failure
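
For anyone wanting to script the workaround above, here is a minimal sketch. The function name write_restart_dropin is made up for illustration; the path and file contents are the ones given in the comment. It takes the drop-in directory as an argument so it can be tried against a scratch directory first.

```shell
# Hypothetical helper: create the drop-in directory and restart.conf.
# $1 is the drop-in directory, e.g.
#   /etc/systemd/system/neutron-server.service.d
write_restart_dropin() {
    unit_dir="$1"
    # The directory does not exist by default, so create it first.
    mkdir -p "$unit_dir"
    # Same two lines as in the workaround above.
    printf '[Service]\nRestart=on-failure\n' > "$unit_dir/restart.conf"
}

# Usage on a real system (as root); daemon-reload makes systemd pick up
# the new drop-in:
#   write_restart_dropin /etc/systemd/system/neutron-server.service.d
#   systemctl daemon-reload
```

Note that this only covers creating the file. The service still has to fail at least once (for example on the start timeout) before Restart=on-failure has any effect.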
> /etc/systemd/system/neutron-server.service.d/restart.conf with the following contents:
>
> [Service]
> Restart=on-failure

This will keep restarting the service after each timeout, so it won't help if the initial connection takes longer than the default systemd start timeout. Setting TimeoutStartSec=0 removes this timeout. I proposed that for the Nova services in https://review.gerrithub.io/232273 for rpm-master and merged it in the RDO Kilo build.
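
As a sketch only, the two settings discussed so far could be combined in a single drop-in. The file name below is an assumption (any .conf file in the neutron-server.service.d directory would do), and this is not necessarily what the packaged unit file ends up containing.

```ini
# Hypothetical file: /etc/systemd/system/neutron-server.service.d/db-wait.conf
[Service]
# Restart the service if it exits with a failure or is killed on timeout.
Restart=on-failure
# Disable the start timeout so max_retries=-1 can keep retrying.
TimeoutStartSec=0
```

With TimeoutStartSec=0 the start job can stay in the "activating" state for as long as the database is down, so anything ordered after neutron-server waits as well.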
BTW Restart=on-failure or always is fine and should be deployed in all services.
Instead of a Restart=on-failure, would it make more sense to have a Requires or Wants for the mariadb.service? Since, you know, neutron-server will in no way work without MariaDB being there?
(In reply to Scott McBrien from comment #9)
> Instead of a Restart=on-failure, would it make more sense to have a Requires
> or Wants for the mariadb.service? Since, you know, neutron-server will in
> no way work without MariaDB being there?

This does not work when MariaDB is running on a different node, because systemd cannot know if the database on a remote node is running.
Trying to summarize to see if I got this right:

Setting neutron.conf:max_retries=-1 works correctly from Neutron's point of view, as it retries indefinitely. However, when running neutron-server under systemd, the 'start' operation times out and systemd closes the neutron-server process. The solution you're proposing is to modify the systemd unit file so that it'll restart neutron-server automatically via Restart=on-failure?
(In reply to Assaf Muller from comment #11)
> Trying to summarize to see if I got this right:
>
> Setting neutron.conf:max_retries=-1 works correctly from Neutron's point of
> view, as it retries indefinitely. However, when running neutron-server under
> systemd, the 'start' operation times out and systemd closes the
> neutron-server process. The solution you're proposing is to modify the
> systemd unit file so that it'll restart neutron-server automatically via
> Restart=on-failure?

That's correct.
We should probably sync whatever approach we take here with the general strategy for service decoupling. I am OK with on-failure if that's what we do in the unit files of other OpenStack services.
Upstream patch: https://review.rdoproject.org/r/#/c/1388/