1165115 – Wrong bootup order for openstack services


Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	RDO
Classification:	Community
Component:	distribution
Sub Component:
Version:	Icehouse
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	Juno
Assignee:	Perry Myers
QA Contact:	Ami Jeain
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2014-11-18 11:26 UTC by Andres Toomsalu
Modified:	2015-03-19 02:12 UTC
CC List:	3 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-03-19 02:12:30 UTC
Embargoed:
Dependent Products:

Description Andres Toomsalu 2014-11-18 11:26:45 UTC

Description of problem:
In the RDO Icehouse RPM packages, all OpenStack service init scripts carry the header "chkconfig:   - 98 02". Because every service shares the same start priority (98), they are started in alphabetical order. This causes problems at least for the Neutron services, because some of them (neutron-l3-agent, and perhaps also neutron-openvswitch-agent) require keystone and neutron-server to be up already.

Version-Release number of selected component (if applicable): rdo-release-icehouse-4


How reproducible:
Always

Steps to Reproduce:
1. Install RDO Icehouse from the yum repo (not using packstack)
2. Reboot the controller host (all CTRL services are co-located on this host)

Actual results:
No L3 router namespaces exist: "ip netns | grep qrouter" returns nothing.

Expected results:
neutron L3 routers are in working order


Additional info:

I think the correct boot order for the OpenStack Neutron services (including their dependencies) should be:
qpid
openstack-keystone
neutron-server
neutron-ovs-cleanup
neutron-openvswitch-agent
neutron-l3-agent
neutron-dhcp-agent
neutron-metadata-agent

Alphabetical (and current init) order is:
neutron-dhcp-agent
neutron-l3-agent
neutron-metadata-agent
neutron-openvswitch-agent
neutron-ovs-cleanup
neutron-server
openstack-keystone
qpid
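
To illustrate why a shared start priority falls back to name order: rc runs the S* links of a runlevel directory in glob order, so links that all begin with S98 end up sorted by the service name that follows. The sketch below imitates that with a plain sort. rc_start_order is a hypothetical helper written for this note, and "LC_ALL=C sort" is only an approximation of the shell's glob ordering.

```shell
#!/bin/sh
# Sketch: imitate how SysV rc orders start links named S<priority><service>.
# Nothing under /etc is read or modified.

# usage: rc_start_order PRIORITY:SERVICE...
# Prints the service names in the order their S links would sort.
rc_start_order() {
    for entry in "$@"; do
        prio=${entry%%:*}
        name=${entry#*:}
        printf 'S%s%s\n' "$prio" "$name"
    done | LC_ALL=C sort | sed 's/^S[0-9][0-9]//'
}

echo "All services at priority 98:"
rc_start_order 98:qpid 98:openstack-keystone 98:neutron-server \
    98:neutron-ovs-cleanup 98:neutron-openvswitch-agent \
    98:neutron-l3-agent 98:neutron-dhcp-agent 98:neutron-metadata-agent

echo "Distinct priorities for keystone and neutron-server (example values):"
rc_start_order 85:openstack-keystone 86:neutron-server \
    98:neutron-l3-agent 98:neutron-dhcp-agent
```

With identical priorities the output follows the names alone; giving keystone and neutron-server lower numbers moves them ahead of the agents.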

Comment 1 Lars Kellogg-Stedman 2014-11-18 14:57:47 UTC

I don't see this as a problem with the init scripts.  

OpenStack services should be designed to handle intermittent outages in other services. If keystone or neutron-server isn't available at the time the l3 agent starts up, the agent should reconnect successfully when the services become available.

In other words, the order of startup shouldn't matter.

So I don't disagree that there is a problem here, but it sounds to me like the bug should be filed against neutron.

Are you able to reproduce this explicitly, other than by rebooting?  For example, if you shut down neutron and clean up all the network namespaces, can you reproduce this problem by starting services independently?  Having an explicit reproducer will make this much easier to resolve.
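
A rough sketch of what such an explicit reproducer could look like, assuming the SysV service names used in this report and that leftover router namespaces can be removed with "ip netns delete". The run, list_qrouter_ns and reproduce helpers are hypothetical. By default the script only prints the commands (DRY_RUN=1); set DRY_RUN=0 as root on a test controller to execute them.

```shell
#!/bin/sh
# Sketch: start the l3 agent before neutron-server, outside of a reboot.
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Print or execute a command, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# List existing qrouter namespaces (read-only; prints nothing if there are none).
list_qrouter_ns() {
    ip netns 2>/dev/null | awk '/^qrouter-/ { print $1 }'
}

reproduce() {
    # Stop the agent and the server so nothing recreates namespaces meanwhile.
    run service neutron-l3-agent stop
    run service neutron-server stop

    # Remove leftover router namespaces (assumption: they are no longer in use).
    for ns in $(list_qrouter_ns); do
        run ip netns delete "$ns"
    done

    # Start the agent first, mimicking the alphabetical boot order.
    run service neutron-l3-agent start
    run service neutron-server start
}

reproduce
```

After a real run, "ip netns | grep qrouter" shows whether and when the router namespaces come back.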

Comment 2 Andres Toomsalu 2014-11-18 20:42:38 UTC

I did some more tests and here are the findings:

* since in Icehouse neutron-l3-agent no longer cleans up qrouter namespaces on agent restart, this issue is only reproducible by doing a full reboot of the ctrl node
* during ctrl host boot-up the neutron agents are started before neutron-server, and neutron-server is started before keystone (alphabetically); this produces a keystone authorization error, which is the reason for the delay in building the l3 qrouter namespaces
* actually you were right - neutron agents do reconnect to neutron-server after some time
* it seems that neutron-server also retries its keystone requests after some time
* recovery of the neutron ctrl services takes around 1 minute on average, measured from the start of the neutron l2/l3 agents
* so I must admit that we always happened to test within this 1 minute window and noticed the failure, and we didn't check again later :)
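
For anyone who wants to measure this recovery window themselves, a minimal polling sketch follows. wait_for and has_qrouter are hypothetical helpers, not part of Neutron; the check is the same "ip netns | grep qrouter" test used in the description, and the elapsed time is only counted in whole seconds of sleep.

```shell
#!/bin/sh
# Sketch: poll a check once per second and report how long it took to pass.

# usage: wait_for LIMIT_SECONDS COMMAND [ARGS...]
# Prints the elapsed whole seconds on success; returns 1 if the limit is hit.
wait_for() {
    limit=$1
    shift
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$limit" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "$elapsed"
}

# Succeeds once at least one qrouter namespace exists.
has_qrouter() {
    ip netns | grep -q qrouter
}

# Example (as root on the controller, right after the agents have started):
#   secs=$(wait_for 180 has_qrouter) && echo "qrouter namespaces back after ${secs}s"
```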

In conclusion:
* the agent/service reconnection logic seems to be right; it just takes time for the reconnect/recovery to happen
* in our case adding 1 minute to the network outage during a ctrl host reboot is a serious issue :)
* we have co-located ctrl services on the same (clustered or non-clustered) host(s); in that case boot order matters, and fixing it would be a nice optimization that shortens the network outage during host reboot by at least 1 minute and avoids unnecessary error messages in the logs

We can always manually adjust the chkconfig order in our setups, yet it would be nice if it could be solved upstream.

For reference - our manual fix/optimization was:
--- SNIP ---
#!/bin/sh

SVC_NEUTRON="neutron-server neutron-ovs-cleanup neutron-openvswitch-agent neutron-l3-agent neutron-dhcp-agent neutron-metadata-agent"
SVC_CINDER="openstack-cinder-api openstack-cinder-scheduler openstack-cinder-volume openstack-cinder-backup"
SVC_NOVA="openstack-nova-api openstack-nova-cert openstack-nova-scheduler openstack-nova-conductor openstack-nova-novncproxy openstack-nova-consoleauth"
SVC_GLANCE="openstack-glance-api openstack-glance-registry"
SVCLIST="openstack-keystone $SVC_GLANCE $SVC_CINDER $SVC_NEUTRON $SVC_NOVA"

cd /etc/init.d || exit 1

# Disable each service before editing its chkconfig header.
for i in $SVCLIST; do
    echo "Removing: $i"
    chkconfig "$i" off
done

echo "Fixing service bootup order..."
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 85 02' openstack-keystone
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 86 02' neutron-server
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 87 02' neutron-ovs-cleanup
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 88 02' neutron-openvswitch-agent
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 89 02' neutron-l3-agent
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 89 02' neutron-dhcp-agent
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 89 02' neutron-metadata-agent
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 89 02' neutron-lbaas-agent
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 90 02' openstack-cinder-api
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 90 02' openstack-cinder-scheduler
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 90 02' openstack-cinder-volume
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 90 02' openstack-cinder-backup
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 92 02' openstack-nova-api
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 92 02' openstack-nova-cert
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 92 02' openstack-nova-scheduler
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 92 02' openstack-nova-conductor
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 92 02' openstack-nova-novncproxy
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 92 02' openstack-nova-consoleauth
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 94 02' openstack-glance-api
sed -i -e '/^# chkconfig: /c\# chkconfig:   - 94 02' openstack-glance-registry

# Re-enable the services so they are registered with the edited priorities.
for i in $SVCLIST; do
    echo "Adding: $i"
    chkconfig "$i" on
done
--- SNIP ---
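
The repeated sed lines above could also be expressed as one small helper. This is only a sketch: set_start_priority is a hypothetical name, it relies on GNU sed's -i option like the script above, and INIT_DIR is an added variable so the function can be tried on a scratch copy instead of /etc/init.d. Unlike the "c\" form, it changes only the start priority and leaves the runlevel field and the stop priority as they are.

```shell
#!/bin/sh
# Sketch: rewrite the start priority in the "# chkconfig:" header of one
# init script. Requires GNU sed for -i.

INIT_DIR=${INIT_DIR:-/etc/init.d}

# usage: set_start_priority SERVICE START_PRIORITY
set_start_priority() {
    script="$INIT_DIR/$1"
    if [ ! -f "$script" ]; then
        # Not every host has every service installed; skip quietly.
        echo "skipping $1: no init script in $INIT_DIR" >&2
        return 0
    fi
    # Header layout: "# chkconfig: <runlevels> <start> <stop>".
    # Keep everything up to the start priority and replace that number.
    sed -i -e "s/^\(# chkconfig:[[:space:]]*[^[:space:]][^[:space:]]*[[:space:]][[:space:]]*\)[0-9][0-9]*/\1$2/" "$script"
}

# Example, using priorities from the script above:
#   set_start_priority openstack-keystone 85
#   set_start_priority neutron-server 86
```

As in the script above, the services still have to be switched off and on again with chkconfig afterwards.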

Comment 3 Lars Kellogg-Stedman 2015-03-19 02:12:30 UTC

Hi.  Sorry for the delay in getting back to you.

Because (a) the reconnect logic seems to work as intended and (b) everything more recent than RHEL6 (RHEL/CentOS 7, Fedora, etc.) is using systemd, which has a much more robust mechanism for specifying service dependencies, I am inclined to leave this as is.

That said, you are welcome to submit patches that apply your boot order changes, and if they look sane it is likely they would be accepted by the package maintainers.
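
For the systemd case, ordering can be declared with an After= line in a drop-in file rather than with numeric priorities. The sketch below writes such a drop-in. write_order_dropin, the file name 10-boot-order.conf and the UNIT_DIR variable are hypothetical, and the unit names in the example are assumptions about how the services are named on a systemd-based install, not something taken from this bug.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in that orders one unit after others.
# After= only affects ordering; it does not pull the listed units in.

UNIT_DIR=${UNIT_DIR:-/etc/systemd/system}

# usage: write_order_dropin UNIT DEPENDENCY...
write_order_dropin() {
    unit=$1
    shift
    dir="$UNIT_DIR/$unit.d"
    mkdir -p "$dir" || return 1
    {
        echo "[Unit]"
        echo "After=$*"
    } > "$dir/10-boot-order.conf"
}

# Example (as root, followed by "systemctl daemon-reload"):
#   write_order_dropin neutron-l3-agent.service \
#       openstack-keystone.service neutron-server.service
```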
