Description of problem:
In the RDO Icehouse RPM packages, ALL OpenStack service init scripts have "chkconfig: - 98 02". Because they all share the same start priority, they are started in alphabetical order. This causes problems at least for the Neutron services, as some of them (like neutron-l3-agent and perhaps also neutron-openvswitch-agent) require the keystone and neutron-server services to be up already.

Version-Release number of selected component (if applicable):
rdo-release-icehouse-4

How reproducible:
Always

Steps to Reproduce:
1. Install RDO Icehouse from the yum repo (not using packstack)
2. (Re)boot the controller host (all CTRL services are co-located on this host)

Actual results:
No L3 router namespaces exist (check with: ip netns | grep qrouter)

Expected results:
Neutron L3 routers are in working order

Additional info:
I think the correct boot order for the OpenStack Neutron services (including their dependencies) should be:
qpid
openstack-keystone
neutron-server
neutron-ovs-cleanup
neutron-openvswitch-agent
neutron-l3-agent
neutron-dhcp-agent
neutron-metadata-agent

The alphabetical (and current init) order is:
neutron-dhcp-agent
neutron-l3-agent
neutron-metadata-agent
neutron-openvswitch-agent
neutron-ovs-cleanup
neutron-server
openstack-keystone
qpid
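To make the ordering claim above easy to check on a given host, here is a minimal sketch that reads the "# chkconfig:" header of each init script and prints the services in the order SysV init would start them (start priority first, then name). The function name boot_order is made up for this example, and it assumes the header has the usual "<runlevels> <start> <stop>" layout.

```shell
#!/bin/sh
# Print "<start-priority> <service>" for each init script in a directory,
# in the order SysV init would start them: by priority, then by name.
# boot_order is a hypothetical helper name, not part of chkconfig.
boot_order() {
    dir=${1:-/etc/init.d}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Header format: "# chkconfig: <runlevels> <start> <stop>"
        prio=$(sed -n 's/^# chkconfig:[[:space:]]*[^[:space:]]\{1,\}[[:space:]]\{1,\}\([0-9]\{1,\}\)[[:space:]].*/\1/p' "$f" | head -n 1)
        if [ -n "$prio" ]; then
            echo "$prio $(basename "$f")"
        fi
    done | LC_ALL=C sort -k1,1n -k2,2
}
```

Running `boot_order /etc/init.d | grep -E 'neutron|keystone|qpid'` on the controller should show every service at priority 98, listed alphabetically.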
I don't see this as a problem with the init scripts. OpenStack services should be designed to handle intermittent outages in other services. If keystone or neutron-server isn't available at the time the l3 agent starts up, the agent should reconnect successfully when those services become available. In other words, the order of startup shouldn't matter.

So I don't disagree that there is a problem here, but it sounds to me like the bug should be filed against neutron.

Are you able to reproduce this explicitly, other than by rebooting? For example, if you shut down neutron and clean up all the network namespaces, can you reproduce this problem by starting the services independently? Having an explicit reproducer will make this much easier to resolve.
I did some more tests and here are the findings:

* In Icehouse, neutron-l3-agent no longer cleans up the qrouter namespaces on agent restart, so this issue is only reproducible by doing a full reboot of the ctrl node.
* During ctrl host boot-up, the neutron agents are started before neutron-server, and neutron-server is started before the keystone service (alphabetically). This produces an authorization error against keystone, and that is the reason for the delay in building the l3 qrouter namespaces.
* Actually, you were right: the neutron agents do reconnect to neutron-server after some time.
* It seems that neutron-server also issues its keystone requests again after some time.
* Recovery of the neutron ctrl services seems to take around 1 minute on average (time measured after the neutron l2/l3 agents start at boot).
* So I must admit that we always managed to test within this 1 minute window and noticed the failure, and we didn't spot that it recovered later :)

In conclusion:

* The agent/service reconnection logic seems to be right; it just takes time for the reconnect/recovery to happen.
* In our case, adding 1 minute to the network outage during a ctrl host reboot is a serious issue :)
* In our case, the ctrl services are co-located on the same (clustered or non-clustered) host(s). In that setup boot order matters, and fixing it would be a nice optimization: it would shorten the network outage during a host reboot by at least 1 minute and avoid generating unnecessary error messages in the logs.

We can always manually adjust the chkconfig order in our setups, yet it would be nice if it could be solved somehow upstream.
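For anyone who wants to measure the recovery window described above on their own host, here is a rough sketch of a polling helper. The names wait_for and has_qrouter are invented for this example; the check itself is the same "ip netns | grep qrouter" test from the original report.

```shell
#!/bin/sh
# Poll a command once per second until it succeeds or the timeout (seconds)
# expires. On success, print the number of seconds waited.
# wait_for is a hypothetical helper name used only in this sketch.
wait_for() {
    timeout=$1
    shift
    start=$(date +%s)
    until "$@"; do
        now=$(date +%s)
        if [ $((now - start)) -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
    done
    echo $(( $(date +%s) - start ))
}

# Succeeds once at least one L3 router namespace exists.
has_qrouter() {
    ip netns | grep -q qrouter
}
```

Started right after boot, `wait_for 300 has_qrouter` prints how many seconds passed before the first qrouter namespace appeared, or returns non-zero if none showed up within 5 minutes.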
For reference - our manual fix/optimization was:

--- SNIP ---
#!/bin/sh

SVC_NEUTRON="neutron-server neutron-ovs-cleanup neutron-openvswitch-agent neutron-l3-agent neutron-dhcp-agent neutron-metadata-agent"
SVC_CINDER="openstack-cinder-api openstack-cinder-scheduler openstack-cinder-volume openstack-cinder-backup"
SVC_NOVA="openstack-nova-api openstack-nova-cert openstack-nova-scheduler openstack-nova-conductor openstack-nova-novncproxy openstack-nova-consoleauth"
SVC_GLANCE="openstack-glance-api openstack-glance-registry"
SVCLIST="openstack-keystone $SVC_GLANCE $SVC_CINDER $SVC_NEUTRON $SVC_NOVA"

cd /etc/init.d

for i in $SVCLIST; do
    echo "Removing: $i "
    chkconfig "$i" off
done

echo "Fixing service bootup order..."
sed -i -e '/^# chkconfig: /c\# chkconfig: - 85 02' openstack-keystone
sed -i -e '/^# chkconfig: /c\# chkconfig: - 86 02' neutron-server
sed -i -e '/^# chkconfig: /c\# chkconfig: - 87 02' neutron-ovs-cleanup
sed -i -e '/^# chkconfig: /c\# chkconfig: - 88 02' neutron-openvswitch-agent
sed -i -e '/^# chkconfig: /c\# chkconfig: - 89 02' neutron-l3-agent
sed -i -e '/^# chkconfig: /c\# chkconfig: - 89 02' neutron-dhcp-agent
sed -i -e '/^# chkconfig: /c\# chkconfig: - 89 02' neutron-metadata-agent
sed -i -e '/^# chkconfig: /c\# chkconfig: - 89 02' neutron-lbaas-agent
sed -i -e '/^# chkconfig: /c\# chkconfig: - 90 02' openstack-cinder-api
sed -i -e '/^# chkconfig: /c\# chkconfig: - 90 02' openstack-cinder-scheduler
sed -i -e '/^# chkconfig: /c\# chkconfig: - 90 02' openstack-cinder-volume
sed -i -e '/^# chkconfig: /c\# chkconfig: - 90 02' openstack-cinder-backup
sed -i -e '/^# chkconfig: /c\# chkconfig: - 92 02' openstack-nova-api
sed -i -e '/^# chkconfig: /c\# chkconfig: - 92 02' openstack-nova-cert
sed -i -e '/^# chkconfig: /c\# chkconfig: - 92 02' openstack-nova-scheduler
sed -i -e '/^# chkconfig: /c\# chkconfig: - 92 02' openstack-nova-conductor
sed -i -e '/^# chkconfig: /c\# chkconfig: - 92 02' openstack-nova-novncproxy
sed -i -e '/^# chkconfig: /c\# chkconfig: - 92 02' openstack-nova-consoleauth
sed -i -e '/^# chkconfig: /c\# chkconfig: - 94 02' openstack-glance-api
sed -i -e '/^# chkconfig: /c\# chkconfig: - 94 02' openstack-glance-registry

for i in $SVCLIST; do
    echo "Adding: $i"
    chkconfig "$i" on
done
--- SNIP ---
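A possible variation on the sed lines above, offered only as a sketch: instead of replacing the whole header with a fixed "- NN 02", rewrite just the start priority and leave the runlevels and stop priority as the package shipped them. The helper name set_start_priority is invented here, and like the script above it relies on GNU sed's -i option.

```shell
#!/bin/sh
# Rewrite only the start priority in an init script's "# chkconfig:" header,
# keeping the runlevels and stop priority unchanged.
# set_start_priority is a hypothetical helper; requires GNU sed for -i.
set_start_priority() {
    script=$1
    prio=$2
    # Reject anything that is not a plain non-negative integer.
    case $prio in
        ''|*[!0-9]*) echo "invalid priority: $prio" >&2; return 1 ;;
    esac
    sed -i -e "s/^\(# chkconfig:[[:space:]]*[^[:space:]]\{1,\}[[:space:]]\{1,\}\)[0-9]\{1,\}/\1$prio/" "$script"
}
```

For example, `set_start_priority /etc/init.d/openstack-keystone 85` would turn "# chkconfig: - 98 02" into "# chkconfig: - 85 02". As in the script above, the service still has to be toggled with chkconfig off/on around the edit so the rc symlinks are regenerated.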
Hi. Sorry for the delay in getting back to you. Because (a) the reconnect logic seems to work as intended and (b) everything more recent than RHEL6 (RHEL/CentOS 7, Fedora, etc.) uses systemd, which has a much more robust mechanism for specifying service dependencies, I am inclined to leave this as is.

That said, you are welcome to submit patches that apply your boot order changes, and if they look sane it is likely they would be accepted by the package maintainers.
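To illustrate what the systemd mechanism mentioned above can look like, here is a small sketch that writes a drop-in file with After= and Wants= for one unit. The helper name write_ordering_dropin, the drop-in file name, and the unit names in the usage note are assumptions made for this example; the packaged units may already declare their own ordering.

```shell
#!/bin/sh
# Write a systemd drop-in that orders <unit> after the given dependencies.
# usage: write_ordering_dropin <root> <unit> <dep>...
# <root> is a filesystem prefix ("" for the live system, or a scratch
# directory for testing). Helper and file names are hypothetical.
write_ordering_dropin() {
    root=$1
    unit=$2
    shift 2
    dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" || return 1
    {
        echo "[Unit]"
        echo "After=$*"   # start only after these units have been started
        echo "Wants=$*"   # pull these units in when this one is started
    } > "$dir/10-boot-order.conf"
}
```

Assuming the units are named as below, `write_ordering_dropin "" neutron-l3-agent.service openstack-keystone.service neutron-server.service` followed by `systemctl daemon-reload` would make the l3 agent start after keystone and neutron-server.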