openstack-neutron-2013.2-10.el6ost.noarch

This was spotted while debugging bug #1036514: 'service neutron-server start' returns [OK] even if the service fails to start because files are missing. Init scripts should never return [OK] in those cases; they should fail.
I have tested that running the neutron-server command manually with the same arguments as in the script exits nonzero when it fails to load. When running via either the init scripts (RHEL) or systemctl (Fedora), the script seems to return immediately with exit code 0, and then running the 'status' command shows that the process failed. This is not a neutron-specific issue; Nova produces exactly the same behavior. I'm guessing this has something to do with the interaction between the daemonizing of the services and eventlet?

I'm not sure how immediate the results from tail -f are, but doing something like:

tail -f /var/log/neutron/server.log &

and then:

service neutron-server restart; echo "**** Returned $? ****"

displays the echoed output before any log entries are displayed on the screen. I'll keep looking, but given that this behavior seems common throughout the OpenStack projects, this probably isn't a "neutron" component issue and the severity probably shouldn't be "urgent".
I spoke to Alan yesterday, and indeed this problem affects almost all, if not all, OpenStack services. I first noticed it in neutron, which is why I filed it there. The issue is urgent and possibly a blocker for all components. An init script that returns 0 while the service is not running is:
1) misleading
2) a complication for debugging
3) a generator of GSS customer cases of the form "It starts but it doesn't work"
Bugs like this also prevent system monitoring and management tools from operating correctly.
We should use this bug to fix up all of the init scripts if we can. Reassigning to apevec
Reported for Swift in bug 1020480, and Pete has proposed a simple fix upstream: https://review.openstack.org/#/c/58069/1/bin/swift-account-server,unified

Swift is of course different, but it looks like a similar fix could be deployed in all OpenStack services.
> I'm guessing this has something to do with the interaction between the
> daemonizing of the services and eventlet?

It's actually the missing daemonization in the OpenStack services; we're faking it with & in the initscripts, e.g. nova:

daemon --user nova --pidfile $pidfile "$exec --logfile $logfile &>/dev/null & echo \$! > $pidfile"

The proper fix would be to move daemonization into the service startup code so that it returns only when initialization is done.
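To illustrate why moving daemonization into the service fixes the exit code, here is a minimal double-fork-style sketch (not OpenStack code; the initialize callback and the pipe-based handshake are illustrative): the parent only exits after the child reports whether initialization succeeded, so the initscript finally sees a meaningful status.

```python
import os
import sys

def daemonize(initialize):
    """Fork; the parent exits 0/1 only after the child reports its init result."""
    r, w = os.pipe()
    if os.fork() > 0:
        # parent: this exit status is what the initscript's shell observes
        os.close(w)
        ok = os.read(r, 1) == b'0'
        sys.exit(0 if ok else 1)
    # child: detach from the controlling session, then initialize
    os.close(r)
    os.setsid()
    try:
        initialize()          # read config files, bind sockets, ...
    except Exception:
        os.write(w, b'1')     # report failure to the waiting parent, then die
        os._exit(1)
    os.write(w, b'0')         # report success; the service loop would start here
    os.close(w)
```

Contrast this with the "& echo $! > $pidfile" trick above, which reports success the instant the shell has forked, before Python has even parsed its config.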
... or we add a startup wrapper which waits for a signal over a socket that the service sends when ready. Such an "onready" signal is currently deployed in Keystone [1] and used for the systemd service Type=notify [2].

[1] https://github.com/openstack/keystone/blob/master/bin/keystone-all#L80
[2] http://www.freedesktop.org/software/systemd/man/sd_notify.html
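For reference, the sending side of that "onready" signal is tiny. This is a sketch of the sd_notify protocol from [2] (the function name is mine, not the Keystone code): write READY=1 as a datagram to the socket named in NOTIFY_SOCKET, where a leading '@' denotes a Linux abstract-namespace socket.

```python
import os
import socket

def sd_notify_ready():
    """Send READY=1 to the socket in $NOTIFY_SOCKET; return True if sent."""
    addr = os.environ.get('NOTIFY_SOCKET')
    if not addr:
        return False  # nobody is listening; run normally
    if addr.startswith('@'):
        # '@' is the conventional escape for the '\0' prefix of
        # Linux abstract-namespace socket addresses
        addr = '\0' + addr[1:]
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.sendto(b'READY=1', addr)
    finally:
        sock.close()
    return True
```

The service calls this once initialization is complete; systemd (Type=notify) or a sysv wrapper listening on the socket only then reports the start as successful.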
The only "simple" solution I can find is to have the daemons write their own pidfile at the end of initialization, and have the init script check for it with a loop/timeout. There are several conditions that need to be taken into account in the init script:

- pidfile is not created within the timeout:
  - daemon is not running -> FAILED
  - daemon is running -> kill it -> FAILED
- pidfile is created within the timeout; loop twice more for the sake of it:
  - daemon is not running -> FAILED
  - daemon is running -> OK

On stop or exit, the pidfile should be removed, even though daemon() checks for it. Every other solution I can think of involves more fragile code. At least this approach is certain to find (or not find) the pidfile after init, and takes a consistent approach to the problem.

This is on the assumption that the pidfile is written once the service has read all its config files and is ready to reply to API requests. As above, an error because "keystone talks to mysql only when asked" and mysql is not running should be considered a runtime error, not something init should care about.
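The decision table above can be sketched as follows; this is Python for illustration (the real thing would live in a shell initscript), and wait_for_startup/pid_alive are hypothetical names:

```python
import os
import time

def pid_alive(pid):
    """Check whether a process with this pid exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
        return True
    except OSError:
        return False

def wait_for_startup(pidfile, daemon_pid, timeout=30, interval=0.5):
    """Return 'OK' or 'FAILED' per the decision table in the comment above."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(pidfile):
            # pidfile appeared in time: loop twice more to confirm
            # the daemon actually stayed up
            for _ in range(2):
                if not pid_alive(daemon_pid):
                    return 'FAILED'
                time.sleep(interval)
            return 'OK'
        time.sleep(interval)
    # timeout: a still-running daemon never finished init, so kill it
    if pid_alive(daemon_pid):
        os.kill(daemon_pid, 15)  # SIGTERM
    return 'FAILED'
```

The key property is that every branch ends in an explicit OK or FAILED, so the initscript's exit code always reflects what actually happened.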
> this is on the assumption that the pidfile is written once the service has
> read all its config files and it's ready to reply to API requests.

This needs to be added at the same place as [1] in comment 11. Currently, as can be seen from the Nova initscript example in comment 10, $pidfile is created by the initscript.
David, please review Keystone implementation which I plan to replicate to other services: http://pkgs.fedoraproject.org/cgit/openstack-keystone.git/commit/?h=el6-havana&id=de327d38ca6cfbbcbf4560ea3fa73daec60e9132 In my testing it successfully replaces previous wait_until_keystone_available() workaround in the initscript.
I wasn't aware that systemd.py used abstract namespace sockets, but the way you use '@' as an escape for '\0' looks fine.

I would use True and False rather than 1 and 0 for the unset_env argument when calling _sd_notify. 1 and 0 would make sense if this had to run on ancient Python, but OpenStack really, really needs at least 2.6. Similarly, I think "if msg.find('READY=1') == -1:" is kind of archaic and hard to read; "if 'READY=1' not in msg:" would be better.

Maybe there needs to be a try/except around the sock.bind() call in onready that returns an error value, since the bind might fail in some cases?

The packaging looks sane, but I'm also not sure about the best place for daemon_notify.sh to live. Putting it in the same place as the similar files, like you did, seems fine. I fear the TEMP comments will last forever.
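To make the two review suggestions concrete, here is a hypothetical receiving-side sketch (this is my illustration, not the code under review): the bind is guarded so failure becomes a distinct error code instead of a traceback, and the readiness check uses the "not in" idiom.

```python
import socket

def onready(notify_socket_path, timeout=30):
    """Wait for READY=1 on a unix datagram socket.

    Returns 0 = service ready, 1 = not ready / timed out, 2 = bind failed.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.bind(notify_socket_path)
    except socket.error:
        sock.close()
        return 2  # could not create the notification socket
    try:
        msg = sock.recv(512)
    except socket.timeout:
        return 1
    finally:
        sock.close()
    if b'READY=1' not in msg:  # idiomatic form of msg.find('READY=1') == -1
        return 1
    return 0
```

A caller (e.g. an initscript wrapper) can then map 0 to [OK] and anything else to [FAILED].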
This worked for Nova with ProcessLauncher and multiple workers: https://github.com/redhat-openstack/nova/commit/e17e59739181610ae4916dfe1dca35d83b86366a
Please be aware that this applies also to mongodb init script.
Xavier found an issue with sending the readiness notification from inside run_service: it returns a premature OK in the case of multiple service endpoints (the enabled_apis config parameter) in one systemd/sysv service/initscript, like openstack-nova-api, which by default provides the osapi, ec2 and metadata services. I'll move the service readiness notification to nova/cmd/api.py, after all enabled apis are launched.
> I'll move the service readiness notification to nova/cmd/api.py after all
> enabled apis are launched.

An even better suggestion by Xavier [1] covers all cases without modifying the startup cmds.

[1] https://review.openstack.org/87309
I noticed $prog-startup.log (in daemon_notify.sh) is getting python logging output, which we don't want since we're setting log_file in the dist.confs. This is due to the use_stderr default of True in the Oslo log module:

https://github.com/openstack/oslo-incubator/blob/master/openstack/common/log.py#L136

I'll push use_stderr = false in all dist.confs; the default config sets a logfile, so output on stderr is not expected.
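The dist.conf change itself is a one-liner; a sketch of what the followup patches presumably add (section placement assumed, per the usual Oslo [DEFAULT] options):

```ini
[DEFAULT]
# keep python logging off stderr (and out of $prog-startup.log);
# log_file is already set in this file
use_stderr = false
```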
Created attachment 886948 [details] Followup patch for neutron-dist.conf
Created attachment 886960 [details] Fix change of default logfilename api.log -> nova-api.log
Created attachment 886961 [details] Followup patch for nova-dist.conf
Created attachment 886962 [details] Followup patch for cinder-dist.conf
Created attachment 886976 [details] corresponding packaging patch for rhos4 openstack-glance
Created attachment 886977 [details] Followup patch for glance*dist.conf 1/2
Created attachment 886978 [details] Followup patch for glance*dist.conf 2/2
Comment on attachment 886976 [details] corresponding packaging patch for rhos4 openstack-glance Should the daemon-notify.sh script itself be in this patch?
Created attachment 887135 [details] corresponding packaging patch for rhos4 openstack-glance
Created attachment 889951 [details] patch for rhos4 openstack-cinder
*** Bug 1091691 has been marked as a duplicate of this bug. ***
*** Bug 1093741 has been marked as a duplicate of this bug. ***
*** Bug 1094250 has been marked as a duplicate of this bug. ***
Let's see what happens after removing the sqlalchemy library.

$ mv /usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg /usr/lib64/python2.6/site-packages/orig-SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg
$ service openstack-nova-network start; echo $?
Starting openstack-nova-network:                           [  OK  ]
0
$ service openstack-nova-network status; echo $?
openstack-nova-network dead but pid file exists
1

It also indicates success even after /usr/bin/python is renamed.

openstack-nova-network-2013.2.3-7.el6ost
With the python binary removed, the following python services report a successful restart:

for ser in openstack-nova-{api,cert,object,compute,network,scheduler,conductor} \
  openstack-glance-{api,registry,scrubber} openstack-keystone \
  neutron-{server,dhcp-agent,l3-agent,metadata-agent,lbaas-agent,openvswitch-agent,metering-agent} \
  openstack-swift-{proxy,account,container,object,object-expirer} \
  openstack-cinder-{api,backup,volume,scheduler} \
  openstack-ceilometer-{api,central,compute,collector,alarm-notifier,alarm-evaluator} \
  openstack-heat-{api,api-cfn,api-cloudwatch,engine}; do
    service $ser restart &>/dev/null && echo zero: $ser
done

zero: openstack-nova-cert
zero: openstack-nova-compute
zero: openstack-nova-network
zero: openstack-nova-scheduler
zero: openstack-nova-conductor
zero: openstack-glance-scrubber
zero: neutron-dhcp-agent
zero: neutron-l3-agent
zero: neutron-metadata-agent
zero: neutron-lbaas-agent
zero: neutron-openvswitch-agent
zero: neutron-metering-agent
zero: openstack-swift-proxy
zero: openstack-swift-account
zero: openstack-swift-container
zero: openstack-swift-object
zero: openstack-swift-object-expirer
zero: openstack-cinder-volume
zero: openstack-cinder-scheduler
zero: openstack-ceilometer-api
zero: openstack-ceilometer-central
zero: openstack-ceilometer-compute
zero: openstack-ceilometer-collector
zero: openstack-ceilometer-alarm-notifier
zero: openstack-ceilometer-alarm-evaluator
zero: openstack-heat-api
zero: openstack-heat-api-cfn
zero: openstack-heat-api-cloudwatch
zero: openstack-heat-engine

# openstack-status
== Nova services ==
openstack-nova-api: inactive
openstack-nova-cert: dead
openstack-nova-compute: dead
openstack-nova-network: dead (disabled on boot)
openstack-nova-scheduler: dead
openstack-nova-conductor: dead
== Glance services ==
openstack-glance-api: inactive
openstack-glance-registry: inactive
== Keystone service ==
openstack-keystone: inactive
== Horizon service ==
openstack-dashboard: active
== neutron services ==
neutron-server: inactive
neutron-dhcp-agent: dead
neutron-l3-agent: dead
neutron-metadata-agent: dead
neutron-lbaas-agent: dead
neutron-openvswitch-agent: dead
== Swift services ==
openstack-swift-proxy: dead
openstack-swift-account: dead
openstack-swift-container: dead
openstack-swift-object: dead
== Cinder services ==
openstack-cinder-api: inactive
openstack-cinder-scheduler: dead
openstack-cinder-volume: dead
== Ceilometer services ==
openstack-ceilometer-api: dead
openstack-ceilometer-central: dead
openstack-ceilometer-compute: dead
openstack-ceilometer-collector: dead
openstack-ceilometer-alarm-notifier: dead
openstack-ceilometer-alarm-evaluator: dead
== Heat services ==
openstack-heat-api: dead
openstack-heat-api-cfn: dead
openstack-heat-api-cloudwatch: dead
openstack-heat-engine: dead
== Support services ==
(In reply to Attila Fazekas from comment #68)
> $ service openstack-nova-network start; echo $?

Only the openstack-nova-api initscript has been modified to use the daemon_notify wrapper.
Now that the summary reflects the content of the bug, I'm setting it to VERIFIED, as afazekas' comment proved that the named (fixed) services did not return zero.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0577.html