1188198 – neutron-server fails to start due to systemd timeout if DB is unavailable

Summary: neutron-server fails to start due to systemd timeout if DB is unavailable

Keywords:
Status:	CLOSED NEXTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-neutron
Sub Component:
Version:	6.0 (Juno)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	8.0 (Liberty)
Assignee:	Jakub Libosvar
QA Contact:	Ofer Blaut
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1188199
Depends On:
Blocks:	1273456

Reported:	2015-02-02 10:27 UTC by Javier Peña
Modified:	2016-12-21 14:48 UTC
CC List:	16 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1346781
Environment:
Last Closed:	2016-12-21 14:48:43 UTC
Target Upstream Version:
Embargoed:

Description Javier Peña 2015-02-02 10:27:06 UTC

Description of problem: If the database is unavailable when neutron-server starts, Neutron retries the connection a pre-configured number of times, as specified by max_retries.

max_retries can be set to -1 to retry indefinitely, but this never takes effect because the systemd unit's start operation times out and systemd stops the service.

This is an issue in an HA environment after a complete cluster restart: the Galera DB cluster must be formed before Neutron starts, and that ordering can only be enforced when using Pacemaker. In other architectures it cannot be guaranteed, so neutron-server has to be started manually after bootstrapping the Galera cluster.

This behaviour can be fixed by setting Restart=on-failure in the systemd unit file.

Version-Release number of selected component (if applicable):
openstack-neutron-2014.2.1-6.el7ost.noarch

How reproducible: always


Steps to Reproduce:
1. Set up an OpenStack environment and disable MariaDB startup.
2. Set max_retries=-1 in /etc/neutron/neutron.conf
3. Try to start neutron-server

Actual results:
After a number of retries, neutron-server will stop and never try to start again.

Expected results:
neutron-server retries until the database connection is re-established.

Additional info:
Some information gathered from my test environment:

[root@hacontroller1 ~]# systemctl status neutron-server
neutron-server.service - OpenStack Neutron Server
   Loaded: loaded (/usr/lib/systemd/system/neutron-server.service; enabled)
   Active: failed (Result: timeout) since lun 2015-02-02 11:13:25 CET; 4s ago
 Main PID: 1495
   CGroup: /system.slice/neutron-server.service

feb 02 11:13:25 hacontroller1.example.com systemd[1]: neutron-server.service operation timed out. Terminating.
feb 02 11:13:25 hacontroller1.example.com systemd[1]: Failed to start OpenStack Neutron Server.
feb 02 11:13:25 hacontroller1.example.com systemd[1]: Unit neutron-server.service entered failed state.

[root@hacontroller1 ~]# grep retries /etc/neutron/neutron.conf 
# Maximum amount of retries to generate a unique MAC address
# mac_generation_retries = 16
# How long to backoff for between retries when connecting to
# Maximum number of RabbitMQ connection retries. Default is 0
#rabbit_max_retries=0
max_retries = -1
# max_retries = 10


[root@hacontroller1 neutron]# tail -n 25 /var/log/neutron/server.log 
2015-02-02 11:12:11.719 1495 INFO neutron.manager [-] Loading core plugin: neutron.plugins.ml2.plugin.Ml2Plugin
2015-02-02 11:12:14.076 1495 INFO neutron.plugins.ml2.managers [-] Configured type driver names: ['local', 'gre', 'flat', 'vxlan', 'vlan']
2015-02-02 11:12:14.099 1495 INFO neutron.plugins.ml2.drivers.type_flat [-] Arbitrary flat physical_network names allowed
2015-02-02 11:12:14.128 1495 INFO neutron.plugins.ml2.drivers.type_vlan [-] Network VLAN ranges: {}
2015-02-02 11:12:14.188 1495 INFO neutron.plugins.ml2.drivers.type_local [-] ML2 LocalTypeDriver initialization complete
2015-02-02 11:12:14.326 1495 INFO neutron.plugins.ml2.managers [-] Loaded type driver names: ['flat', 'vlan', 'local', 'gre', 'vxlan']
2015-02-02 11:12:14.327 1495 INFO neutron.plugins.ml2.managers [-] Registered types: ['flat', 'vlan', 'local', 'gre', 'vxlan']
2015-02-02 11:12:14.328 1495 INFO neutron.plugins.ml2.managers [-] Tenant network_types: ['vxlan']
2015-02-02 11:12:14.328 1495 INFO neutron.plugins.ml2.managers [-] Configured extension driver names: []
2015-02-02 11:12:14.330 1495 INFO neutron.plugins.ml2.managers [-] Loaded extension driver names: []
2015-02-02 11:12:14.330 1495 INFO neutron.plugins.ml2.managers [-] Registered extension drivers: []
2015-02-02 11:12:14.331 1495 INFO neutron.plugins.ml2.managers [-] Configured mechanism driver names: ['openvswitch']
2015-02-02 11:12:14.455 1495 INFO neutron.plugins.ml2.managers [-] Loaded mechanism driver names: ['openvswitch']
2015-02-02 11:12:14.455 1495 INFO neutron.plugins.ml2.managers [-] Registered mechanism drivers: ['openvswitch']
2015-02-02 11:12:14.638 1495 INFO neutron.plugins.ml2.managers [-] Initializing driver for type 'flat'
2015-02-02 11:12:14.638 1495 INFO neutron.plugins.ml2.drivers.type_flat [-] ML2 FlatTypeDriver initialization complete
2015-02-02 11:12:14.639 1495 INFO neutron.plugins.ml2.managers [-] Initializing driver for type 'vlan'
2015-02-02 11:12:14.773 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -1 attempts left.
2015-02-02 11:12:24.788 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -2 attempts left.
2015-02-02 11:12:34.801 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -3 attempts left.
2015-02-02 11:12:44.817 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -4 attempts left.
2015-02-02 11:12:54.830 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -5 attempts left.
2015-02-02 11:13:04.842 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -6 attempts left.
2015-02-02 11:13:14.854 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -7 attempts left.
2015-02-02 11:13:24.867 1495 WARNING oslo.db.sqlalchemy.session [-] SQL connection failed. -8 attempts left.

Comment 3 Javier Peña 2015-02-02 10:29:32 UTC

*** Bug 1188199 has been marked as a duplicate of this bug. ***

Comment 4 Russ Builta 2015-03-09 13:00:52 UTC

I am seeing the same issue on OSP6, with the same config as listed above.

Comment 5 Veaceslav Mindru 2015-03-27 18:24:41 UTC

Same issue in OSP5 (RHELOSP50-1, 2014.1.1-4). An easy fix would be to add a wait on mariadb.service to the systemd startup script.


VM

Comment 6 Javier Peña 2015-04-07 08:22:42 UTC

(In reply to Veaceslav Mindru from comment #5)
> Same issue in OSP5 RHELOSP50-1 2014.1.1-4. Easy fix would be to add wait on 
> mariadb.service into systemctl startup script.
> 
> 
> VM

This would only work if neutron-server and MariaDB are running on the same system. I have worked around this issue by manually creating a file named /etc/systemd/system/neutron-server.service.d/restart.conf (the directory needs to be created as well) with the following contents:

[Service]
Restart=on-failure
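
For reference, the workaround above could be scripted roughly as follows. This is only a sketch: the helper name write_restart_dropin and its optional directory argument are made up for illustration, and only the path and the file contents come from this comment. It has to run as root when targeting /etc, and systemd needs to reload its configuration afterwards.

```shell
# Sketch only: write_restart_dropin is a hypothetical helper, not part of any package.
# It writes the Restart=on-failure drop-in described in this comment.
# $1 (optional, illustrative): drop-in directory; defaults to the real location.
write_restart_dropin() {
    dropin_dir="${1:-/etc/systemd/system/neutron-server.service.d}"
    # The .d directory does not exist by default, so create it first.
    mkdir -p "$dropin_dir" || return 1
    printf '%s\n' '[Service]' 'Restart=on-failure' > "$dropin_dir/restart.conf"
}

# Typical use, as root:
#   write_restart_dropin && systemctl daemon-reload
```

Passing a scratch directory as the first argument lets you inspect the generated file before touching the real unit configuration.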

Comment 7 Alan Pevec 2015-05-05 20:28:15 UTC

> /etc/systemd/system/neutron-server.service.d/restart.conf with the following contents:
> 
> [Service]
> Restart=on-failure

This will keep restarting the service after each timeout, so it won't help if the initial connection takes longer than the default systemd timeout.
Setting TimeoutStartSec=0 removes this timeout. I proposed that for Nova services in https://review.gerrithub.io/232273 for rpm-master and merged it into the RDO Kilo build.

Comment 8 Alan Pevec 2015-05-05 20:30:10 UTC

BTW, Restart=on-failure (or Restart=always) is fine and should be deployed in all services.
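
Putting comments 6 to 8 together, a single drop-in could carry both settings. A rough sketch, assuming a hypothetical file name startup.conf and a hypothetical helper write_startup_dropin (neither appears in this bug); whether the packaged unit file ended up with exactly these values is not confirmed here.

```shell
# Sketch only: write_startup_dropin and startup.conf are illustrative names.
# $1 (optional, illustrative): drop-in directory; defaults to the real location.
write_startup_dropin() {
    dropin_dir="${1:-/etc/systemd/system/neutron-server.service.d}"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/startup.conf" <<'EOF'
[Service]
# Disable the start timeout so max_retries = -1 can keep retrying.
TimeoutStartSec=0
# Restart the service if it still exits with a failure.
Restart=on-failure
EOF
}

# Typical use, as root:
#   write_startup_dropin && systemctl daemon-reload
#   systemctl show neutron-server.service -p TimeoutStartUSec -p Restart
```

The systemctl show call at the end is one way to check that systemd picked up the override after the reload.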

Comment 9 Scott McBrien 2015-12-01 01:00:55 UTC

Instead of a Restart=on-failure, would it make more sense to have a Requires or Wants for the mariadb.service?  Since, you know, neutron-server will in no way work without MariaDB being there?

Comment 10 Javier Peña 2015-12-01 08:57:41 UTC

(In reply to Scott McBrien from comment #9)
> Instead of a Restart=on-failure, would it make more sense to have a Requires
> or Wants for the mariadb.service?  Since, you know, neutron-server will in
> no way work without MariaDB being there?

This does not work when MariaDB is running on a different node, because systemd cannot know if the database on a remote node is running.

Comment 11 Assaf Muller 2015-12-16 17:42:58 UTC

Trying to summarize to see if I got this right:

Setting neutron.conf:max_retries=-1 works correctly from Neutron's point of view, as it retries indefinitely. However, when running neutron-server under systemd, the 'start' operation times out and systemd closes the neutron-server process. The solution you're proposing is to modify the systemd unit file so that it'll restart neutron-server automatically via Restart=on-failure?

Comment 12 Javier Peña 2015-12-17 08:52:24 UTC

(In reply to Assaf Muller from comment #11)
> Trying to summarize to see if I got this right:
> 
> Setting neutron.conf:max_retries=-1 works correctly from Neutron's point of
> view, as it retries indefinitely. However, when running neutron-server under
> systemd, the 'start' operation times out and systemd closes the
> neutron-server process. The solution you're proposing is to modify the
> systemd unit file so that it'll restart neutron-server automatically via
> Restart=on-failure?

That's correct.

Comment 15 Ihar Hrachyshka 2016-06-13 07:15:52 UTC

We should probably sync whatever approach we take here with the general strategy for service decoupling. I am OK with on-failure if that's what we do for other unit files for OpenStack services.

Comment 16 Jakub Libosvar 2016-06-14 12:51:10 UTC

Upstream patch: https://review.rdoproject.org/r/#/c/1388/
