1036515 – Fix some OpenStack init scripts to return non-zero if service cannot start

Summary: Fix some OpenStack init scripts to return non-zero if service cannot start

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	distribution
Sub Component:
Version:	4.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	z4
Target Release:	4.0
Assignee:	Alan Pevec
QA Contact:	Attila Darazs
Docs Contact:
URL:
Whiteboard:
Duplicates (3):	1091691 1093741 1094250 (view as bug list)
Depends On:
Blocks:	1040649 RHEL-OSP_Neutron_HA 1098570 1122569

Reported:	2013-12-02 08:15 UTC by Fabio Massimo Di Nitto
Modified:	2019-09-10 14:09 UTC (History)
CC List:	23 users (show)
Fixed In Version:	openstack-nova-2013.2.3-7.el6ost openstack-neutron-2013.2.3-7.el6ost openstack-glance-2013.2.3-3.el6ost openstack-cinder-2013.2.3-2.el6ost
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1066408 1094862 1098570 1122569 (view as bug list)
Environment:
Last Closed:	2014-05-29 19:57:01 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
corresponding packaging patch for rhos4 openstack-nova (4.05 KB, patch) 2014-04-14 12:04 UTC, Alan Pevec	xqueralt: review+
corresponding packaging patch for rhos4 openstack-neutron (4.30 KB, patch) 2014-04-14 21:30 UTC, Alan Pevec	no flags
Followup patch for neutron-dist.conf (832 bytes, patch) 2014-04-16 17:18 UTC, Alan Pevec	no flags
Fix change of default logfilename api.log -> nova-api.log (1.28 KB, patch) 2014-04-16 17:32 UTC, Alan Pevec	no flags
Followup patch for nova-dist.conf (800 bytes, patch) 2014-04-16 17:35 UTC, Alan Pevec	no flags
Followup patch for cinder-dist.conf (811 bytes, patch) 2014-04-16 17:38 UTC, Alan Pevec	no flags
corresponding packaging patch for rhos4 openstack-glance (3.98 KB, patch) 2014-04-16 20:23 UTC, Alan Pevec	no flags
Followup patch for glance*dist.conf 2/2 (1.17 KB, patch) 2014-04-16 20:27 UTC, Alan Pevec	pbrady: review+
corresponding packaging patch for rhos4 openstack-glance (5.48 KB, patch) 2014-04-17 10:53 UTC, Alan Pevec	pbrady: review+

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2014:0577	0	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform 4 Bug Fix and Enhancement Advisory	2014-05-29 23:55:40 UTC

Description Fabio Massimo Di Nitto 2013-12-02 08:15:36 UTC

openstack-neutron-2013.2-10.el6ost.noarch

This was spotted while debugging #1036514

service neutron-server start will return [OK] even if it fails to start because files are missing.

Init scripts should never return [OK] in those cases; they should fail.

Comment 1 Terry Wilson 2013-12-04 03:38:14 UTC

I have tested that running the neutron-server command manually, with the same arguments as in the script, exits nonzero when it fails to load. When run via either init scripts (RHEL) or systemctl (Fedora), the script seems to return immediately with exit code 0, and then the 'status' command shows that the process failed. This is not a neutron-specific issue, as Nova behaves in exactly the same way.

I'm guessing this has something to do with the interaction between the daemonizing of the services and eventlet? I'm not sure how immediate the results from tail -f are, but doing something like:

  tail -f /var/log/neutron/server.log &

and then

  service neutron-server restart; echo "**** Returned $? ****"

displays the echoed output before any log entries are displayed to the screen. I'll keep looking, but given that this behavior seems common throughout the OpenStack projects, this probably isn't a "neutron" component issue and the severity probably shouldn't be "urgent".

Comment 2 Fabio Massimo Di Nitto 2013-12-04 04:23:12 UTC

I spoke to Alan yesterday, and indeed this problem affects most if not all OpenStack services.

I first noticed it in neutron, so that's where I filed it.

The issue is urgent and possibly a blocker for all components. An init script that returns 0 when the service is not running:

1) is misleading
2) complicates debugging
3) will generate GSS customer cases for "It starts but it doesn't work"

Comment 3 Lon Hohberger 2013-12-04 04:51:20 UTC

Bugs like this also prevent system monitoring and management tools from operating correctly.

Comment 4 Perry Myers 2013-12-04 14:37:55 UTC

We should use this bug to fix up all of the init scripts if we can.  Reassigning to apevec

Comment 5 Alan Pevec 2013-12-05 13:21:26 UTC

Reported for Swift in bug 1020480 and Pete has proposed a simple fix upstream:
 https://review.openstack.org/#/c/58069/1/bin/swift-account-server,unified

Swift is of course different, but it looks like a similar fix could be deployed in all OpenStack services.

Comment 10 Alan Pevec 2013-12-10 12:05:44 UTC

> I'm guessing this has something to do with the interaction between the
> daemonizing of the services and eventlet?

Actually, daemonizing is missing in the OpenStack services; we're faking it with & in the initscripts, e.g. nova:

    daemon --user nova --pidfile $pidfile "$exec --logfile $logfile &>/dev/null & echo \$! > $pidfile"

The proper fix would be to move daemonizing into the service startup code, so that it returns only once initialization is done.
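
Why the backgrounded command always looks successful can be shown with a minimal sketch. It assumes a POSIX /bin/sh, and "(exit 3)" is a hypothetical stand-in for a service that fails during startup; it is not the actual nova initscript.

```python
import subprocess


def start_status(command):
    """Return the exit status a caller (such as an init script) would see."""
    return subprocess.call(command, shell=True)


# "(exit 3)" stands in for a service that fails while starting up.
foreground = start_status("(exit 3)")

# The same failing command, backgrounded the way the initscript does it.
# The shell reports the status of the final "echo", not of the service.
backgrounded = start_status("(exit 3) & echo $! > /dev/null")

print("foreground:", foreground, "backgrounded:", backgrounded)
```

Once the command is put in the background, its exit status is no longer available to the caller, so the init script has no failure to report.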

Comment 11 Alan Pevec 2013-12-10 12:32:22 UTC

... or we add a startup wrapper which waits for a signal that the service sends over a socket when it is ready. Such an "onready" signal is currently deployed in Keystone[1] and used for the systemd service Type=notify [2]


[1] https://github.com/openstack/keystone/blob/master/bin/keystone-all#L80

[2] http://www.freedesktop.org/software/systemd/man/sd_notify.html
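
A rough sketch of such a startup wrapper follows. It is not the Keystone code or the daemon_notify.sh script from the patches: the function name start_and_wait, the socket location, the timeout and the two stand-in child commands are all assumptions made for illustration. Only the NOTIFY_SOCKET variable and the READY=1 message follow the sd_notify convention referenced in [2].

```python
import os
import socket
import subprocess
import sys
import tempfile
import time


def start_and_wait(argv, timeout=10.0):
    """Start argv; return 0 only if it reports READY=1 before the timeout."""
    sock_dir = tempfile.mkdtemp()
    sock_path = os.path.join(sock_dir, "notify.sock")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.bind(sock_path)
    sock.settimeout(0.1)
    # The service finds the socket through NOTIFY_SOCKET, as with sd_notify.
    child = subprocess.Popen(argv, env=dict(os.environ, NOTIFY_SOCKET=sock_path))
    deadline = time.time() + timeout
    try:
        while time.time() < deadline:
            try:
                if b"READY=1" in sock.recv(4096):
                    return 0  # initialized; the service keeps running
            except socket.timeout:
                pass  # nothing received yet
            if child.poll() is not None:
                return 1  # exited before reporting readiness
        child.kill()  # still not ready at the deadline
        child.wait()
        return 1
    finally:
        sock.close()
        os.unlink(sock_path)
        os.rmdir(sock_dir)


# Stand-ins for real services (hypothetical, for demonstration only).
READY_CHILD = [sys.executable, "-c",
               "import os, socket, time\n"
               "s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)\n"
               "s.sendto(b'READY=1', os.environ['NOTIFY_SOCKET'])\n"
               "time.sleep(1)\n"]
FAILING_CHILD = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    print(start_and_wait(READY_CHILD), start_and_wait(FAILING_CHILD))
```

The wrapper's return value can then be used directly as the init script's exit status, since it is nonzero whenever the service exits or stays silent during startup.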

Comment 12 Fabio Massimo Di Nitto 2013-12-10 13:36:42 UTC

The only "simple" solution I can find is to have the daemons write their own pidfile at the end of initialization and have the init script check for it with a loop/timeout.

There are several conditions that need to be taken into account from the init script:

- pidfile is not created within timeout:
  - daemon is not running -> FAILED
  - daemon is running -> kill it -> FAILED
- pidfile is created within timeout, loop twice more for the sake of it:
  - daemon is not running -> FAILED
  - daemon is running -> OK

On stop or exit, the pidfile should be removed even if daemon() checks for it.

Every other solution I can think of involves more fragile code. At least this way the init script is certain to either find the pidfile or not after init, which gives a consistent approach to the problem.

This is on the assumption that the pidfile is written once the service has read all its config files and it's ready to reply to API requests. As above, an error such as "keystone talks to mysql only when asked" while mysql is not running should be considered a runtime error and not something init should care about.
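
The conditions listed in this comment can be sketched as follows. The function names, the launched_pid parameter (the pid the init script got when it started the daemon), the default timeout and the polling interval are hypothetical; the sketch also assumes the daemon writes its own pidfile at the end of initialization, which the services did not do at the time.

```python
import os
import signal
import time


def pid_alive(pid):
    """True if a process with this pid exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_pid(pidfile):
    """Return the pid stored in pidfile, or None if it is missing or invalid."""
    try:
        with open(pidfile) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def check_startup(pidfile, launched_pid, timeout=10.0, interval=0.2,
                  extra_checks=2):
    """Return "OK" or "FAILED" following the decision list above."""
    deadline = time.time() + timeout
    pid = read_pid(pidfile)
    while pid is None and time.time() < deadline:
        time.sleep(interval)
        pid = read_pid(pidfile)
    if pid is None:
        # No pidfile within the timeout: kill the daemon if it still runs.
        if pid_alive(launched_pid):
            os.kill(launched_pid, signal.SIGTERM)
        return "FAILED"
    # Pidfile appeared: loop a couple more times to catch an early death.
    for _ in range(extra_checks):
        time.sleep(interval)
        if not pid_alive(pid):
            return "FAILED"
    return "OK"
```

An init script would map "OK" to exit status 0 and "FAILED" to a nonzero status.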

Comment 13 Alan Pevec 2013-12-10 15:22:39 UTC

> this is on the assumption that pidfile is written once service has read all
> its config file and it's ready to reply to API requests.

This needs to be added at the same place as [1] in comment 11.
Currently, as can be seen from Nova initscript example in comment 10, $pidfile is created by the initscript.

Comment 21 Alan Pevec 2014-01-10 03:11:46 UTC

David, please review the Keystone implementation, which I plan to replicate to other services: http://pkgs.fedoraproject.org/cgit/openstack-keystone.git/commit/?h=el6-havana&id=de327d38ca6cfbbcbf4560ea3fa73daec60e9132

In my testing it successfully replaces the previous wait_until_keystone_available() workaround in the initscript.

Comment 22 David Ripton 2014-01-10 18:07:07 UTC

I wasn't aware that systemd.py used abstract namespace sockets, but the way you use '@' as an escape for '\0' looks fine.

I would use True and False rather than 1 and 0 for the unset_env argument when calling _sd_notify.  Using 1 and 0 would make sense if this had to run on ancient Python, but OpenStack really, really needs at least 2.6.

Similarly, I think "if msg.find('READY=1') == -1:" is kind of archaic and hard to read.  "if 'READY=1' not in msg:" would be better.

Maybe there needs to be try/except/return error value around the sock.bind() call in onready, since the bind might be able to fail in some cases?

The packaging looks sane but I'm also not sure about the best place for daemon_notify.sh to live.  Putting it in the same place as the similar files, like you did, seems fine.  I fear the TEMP comments will last forever.
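
The review points about the '@' escape, the membership test and the bind() error handling could look roughly like this. It is an illustrative sketch and not the reviewed systemd.py or onready code; the helper names notify_address, is_ready and bind_notify_socket are invented for the example.

```python
import socket


def notify_address(name):
    """Translate a NOTIFY_SOCKET-style name into a bindable address.

    A leading '@' is the conventional escape for the NUL byte that marks
    a Linux abstract namespace socket.
    """
    if name.startswith("@"):
        return "\0" + name[1:]
    return name


def is_ready(msg):
    """Membership test instead of comparing msg.find(...) with -1.

    Note that msg.find('READY=1') == -1 is the negative case, which
    corresponds to: 'READY=1' not in msg.
    """
    return "READY=1" in msg


def bind_notify_socket(name):
    """Return a bound datagram socket, or None if bind() fails."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.bind(notify_address(name))
    except OSError:  # spelled socket.error on Python 2
        sock.close()
        return None
    return sock
```

Returning None lets the caller turn a failed bind into an error status instead of an unhandled traceback.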

Comment 24 Alan Pevec 2014-01-24 23:45:49 UTC

This worked for Nova with ProcessLauncher and multiple workers:
 https://github.com/redhat-openstack/nova/commit/e17e59739181610ae4916dfe1dca35d83b86366a

Comment 25 Fabio Massimo Di Nitto 2014-02-11 08:18:40 UTC

Please be aware that this also applies to the mongodb init script.

Comment 32 Alan Pevec 2014-04-14 15:37:56 UTC

Xavier found an issue with sending the readiness notification from inside run_service:
it returns a premature OK when one systemd/sysv service/initscript provides multiple service endpoints (enabled_apis config parameter), like openstack-nova-api, which by default provides the osapi, ec2 and metadata services.

I'll move service readiness notification to nova/cmd/api.py after all enabled apis are launched.
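
The ordering problem can be sketched as below. This is not the nova/cmd/api.py code: start_api_services, launch and notify_ready are hypothetical names, and launch is assumed to raise an exception when an endpoint cannot start.

```python
def start_api_services(enabled_apis, launch, notify_ready):
    """Launch every enabled API, then send a single readiness notification."""
    for api in enabled_apis:
        launch(api)  # assumed to raise if this endpoint cannot start
    # Reached only after all endpoints are up, so the wrapper waiting
    # for the notification cannot report a premature OK.
    notify_ready()


if __name__ == "__main__":
    log = []
    start_api_services(["osapi", "ec2", "metadata"], log.append,
                       lambda: log.append("READY=1"))
    print(log)
```

Sending the notification from inside the per-endpoint startup would instead report readiness as soon as the first endpoint is up.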

Comment 34 Alan Pevec 2014-04-14 16:55:01 UTC

> I'll move service readiness notification to nova/cmd/api.py after all
> enabled apis are launched.

An even better suggestion by Xavier[1] covers all cases without modifying startup cmds.

[1] https://review.openstack.org/87309

Comment 47 Alan Pevec 2014-04-16 16:42:30 UTC

I noticed $prog-startup.log (in daemon_notify.sh) is getting python logging output which we don't want since we're setting log_file in dist.confs.

This is due to use_stderr default=True in Oslo log module:
 https://github.com/openstack/oslo-incubator/blob/master/openstack/common/log.py#L136 

I'll push use_stderr = false in all dist.confs; the default config sets logfile, so output on stderr is not expected.

Comment 48 Alan Pevec 2014-04-16 17:18:08 UTC

Created attachment 886948 [details]
Followup patch for neutron-dist.conf

Comment 49 Alan Pevec 2014-04-16 17:32:22 UTC

Created attachment 886960 [details]
Fix change of default logfilename api.log -> nova-api.log

Comment 50 Alan Pevec 2014-04-16 17:35:44 UTC

Created attachment 886961 [details]
Followup patch for nova-dist.conf

Comment 51 Alan Pevec 2014-04-16 17:38:26 UTC

Created attachment 886962 [details]
Followup patch for cinder-dist.conf

Comment 53 Alan Pevec 2014-04-16 20:23:21 UTC

Created attachment 886976 [details]
corresponding packaging patch for rhos4 openstack-glance

Comment 54 Alan Pevec 2014-04-16 20:26:24 UTC

Created attachment 886977 [details]
Followup patch for glance*dist.conf 1/2

Comment 55 Alan Pevec 2014-04-16 20:27:03 UTC

Created attachment 886978 [details]
Followup patch for glance*dist.conf 2/2

Comment 56 Pádraig Brady 2014-04-17 01:10:03 UTC

Comment on attachment 886976 [details]
corresponding packaging patch for rhos4 openstack-glance

Should the daemon-notify.sh script itself be in this patch?

Comment 57 Alan Pevec 2014-04-17 10:53:12 UTC

Created attachment 887135 [details]
corresponding packaging patch for rhos4 openstack-glance

Comment 60 Alan Pevec 2014-04-26 00:58:57 UTC

Created attachment 889951 [details]
patch for rhos4 openstack-cinder

Comment 63 Flavio Percoco 2014-04-28 15:55:28 UTC

*** Bug 1091691 has been marked as a duplicate of this bug. ***

Comment 64 Alan Pevec 2014-05-02 16:09:57 UTC

*** Bug 1093741 has been marked as a duplicate of this bug. ***

Comment 65 Alan Pevec 2014-05-05 11:05:42 UTC

*** Bug 1094250 has been marked as a duplicate of this bug. ***

Comment 68 Attila Fazekas 2014-05-15 14:38:52 UTC

Let's see what happens after removing the sqlalchemy library.

$ mv /usr/lib64/python2.6/site-packages/SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg /usr/lib64/python2.6/site-packages/orig-SQLAlchemy-0.7.8-py2.6-linux-x86_64.egg

$ service openstack-nova-network start; echo $?
Starting openstack-nova-network:                           [  OK  ]
0
$ service openstack-nova-network status; echo $?
openstack-nova-network dead but pid file exists
1

It also indicates success even after /usr/bin/python is renamed.

openstack-nova-network-2013.2.3-7.el6ost
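
The same comparison of "start" and "status" results can be automated for a list of services. The sketch below is an assumption-laden illustration: the audit function and its run parameter are invented here, and the default runner simply calls the "service" command the way the transcript above does.

```python
import subprocess


def audit(services, run=subprocess.call):
    """Return services whose 'start' exits 0 although 'status' does not."""
    misreporting = []
    for name in services:
        started = run(["service", name, "start"]) == 0
        running = run(["service", name, "status"]) == 0
        if started and not running:
            misreporting.append(name)
    return misreporting
```

The run parameter makes it possible to try the logic with a fake command runner before pointing it at real init scripts.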

Comment 69 Attila Fazekas 2014-05-16 08:16:08 UTC

With the python binary removed, the following Python services report a successful restart:

for ser in openstack-nova-{api,cert,object,compute,network,scheduler,conductor} openstack-glance-{api,registry,scrubber} openstack-keystone neutron-{server,dhcp-agent,l3-agent,metadata-agent,lbaas-agent,openvswitch-agent,metering-agent} openstack-swift-{proxy,account,container,object,object-expirer} openstack-cinder-{api,backup,volume,scheduler} openstack-ceilometer-{api,central,compute,collector,alarm-notifier,alarm-evaluator} openstack-heat-{api,api-cfn,api-cloudwatch,engine};do service $ser restart &>/dev/null && echo zero: $ser ; done
zero: openstack-nova-cert
zero: openstack-nova-compute
zero: openstack-nova-network
zero: openstack-nova-scheduler
zero: openstack-nova-conductor
zero: openstack-glance-scrubber
zero: neutron-dhcp-agent
zero: neutron-l3-agent
zero: neutron-metadata-agent
zero: neutron-lbaas-agent
zero: neutron-openvswitch-agent
zero: neutron-metering-agent
zero: openstack-swift-proxy
zero: openstack-swift-account
zero: openstack-swift-container
zero: openstack-swift-object
zero: openstack-swift-object-expirer
zero: openstack-cinder-volume
zero: openstack-cinder-scheduler
zero: openstack-ceilometer-api
zero: openstack-ceilometer-central
zero: openstack-ceilometer-compute
zero: openstack-ceilometer-collector
zero: openstack-ceilometer-alarm-notifier
zero: openstack-ceilometer-alarm-evaluator
zero: openstack-heat-api
zero: openstack-heat-api-cfn
zero: openstack-heat-api-cloudwatch
zero: openstack-heat-engine
# openstack-status 
== Nova services ==
openstack-nova-api:                     inactive
openstack-nova-cert:                    dead
openstack-nova-compute:                 dead
openstack-nova-network:                 dead      (disabled on boot)
openstack-nova-scheduler:               dead
openstack-nova-conductor:               dead
== Glance services ==
openstack-glance-api:                   inactive
openstack-glance-registry:              inactive
== Keystone service ==
openstack-keystone:                     inactive
== Horizon service ==
openstack-dashboard:                    active
== neutron services ==
neutron-server:                         inactive
neutron-dhcp-agent:                     dead
neutron-l3-agent:                       dead
neutron-metadata-agent:                 dead
neutron-lbaas-agent:                    dead
neutron-openvswitch-agent:              dead
== Swift services ==
openstack-swift-proxy:                  dead
openstack-swift-account:                dead
openstack-swift-container:              dead
openstack-swift-object:                 dead
== Cinder services ==
openstack-cinder-api:                   inactive
openstack-cinder-scheduler:             dead
openstack-cinder-volume:                dead
== Ceilometer services ==
openstack-ceilometer-api:               dead
openstack-ceilometer-central:           dead
openstack-ceilometer-compute:           dead
openstack-ceilometer-collector:         dead
openstack-ceilometer-alarm-notifier:    dead
openstack-ceilometer-alarm-evaluator:   dead
== Heat services ==
openstack-heat-api:                     dead
openstack-heat-api-cfn:                 dead
openstack-heat-api-cloudwatch:          dead
openstack-heat-engine:                  dead
== Support services ==

Comment 70 Alan Pevec 2014-05-16 14:26:13 UTC

(In reply to Attila Fazekas from comment #68)
> $ service openstack-nova-network start; echo $?

Only the openstack-nova-api initscript has been modified to use the daemon_notify wrapper.

Comment 75 Attila Darazs 2014-05-23 13:49:46 UTC

Now that the summary reflects the content of the bug, I set it to VERIFIED as afazekas' comment proved that the named services did not return zero.

Comment 77 errata-xmlrpc 2014-05-29 19:57:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0577.html


afazekas
ajeain
apevec
chrisw
dprince
dripton
eharney
jhenner
jpeeler
jpokorny
lpeer
majopela
markmc
oblaut
ohochman
pbrady
sdake
shardy
sreichar
vpopovic
wfoster
xqueralt
yeylon