Description of problem:
There is an upstream bug tracker for "systemctl stop ceph.target doesn't stop ceph". The workaround is to run 'systemctl stop ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target'.

Version-Release number of selected component (if applicable): 2.0

How reproducible: Always
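A quick repro/workaround sketch on an affected node (output abbreviated; illustrative only):

    $ sudo systemctl stop ceph.target    # returns, but the daemons keep running
    $ ps -eaf | grep ceph-osd            # ceph-osd processes are still up
    # workaround: stop the per-daemon sub-targets directly
    $ sudo systemctl stop ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target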
I don't understand. Is <target> the cluster name? Where's the documentation that indicates this is supposed to work?
Sorry Sam, this got auto-assigned to you; I couldn't find the right component and selected 'unclassified'. Boris is actually looking into this. It is a systemd service issue and I have linked an upstream tracker.
I've talked with some systemd people and, generally, the systemd unit files looked fine. The possible explanations for this behaviour include:

1) The sub-targets (ceph-{mon,osd,mds,radosgw}.target) were not enabled in systemd, so ceph.target did not propagate the signal to restart the daemons. You can try running 'systemctl enable ceph-{mon,osd}.target' to see if it helps. If this is the case then we should enable the sub-targets with ceph-deploy. This is probably not the case, though, as a clean install seems to work fine for me.

2) This happened after an upgrade from an older version of ceph where there were no sub-targets. If this is the case, we should tune our post(upgrade) scripts to handle the upgrade path better. You can try running 'systemctl reenable ceph.target' to see if it helps in this case -- this should clean up the symlink mess.
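For context, here is a minimal sketch of how the propagation is expected to work via the unit files (directives assumed from the usual upstream layout; the exact file contents may differ per version):

    # ceph-osd@.service (relevant directives only; assumed)
    [Unit]
    PartOf=ceph-osd.target      # stop/restart of ceph-osd.target is propagated to each instance
    [Install]
    WantedBy=ceph-osd.target    # 'systemctl enable ceph-osd@N' links the instance under the sub-target

    # ceph-osd.target (relevant directives only; assumed)
    [Unit]
    PartOf=ceph.target          # stop/restart of ceph.target is propagated to the sub-target
    [Install]
    WantedBy=ceph.target        # 'systemctl enable ceph-osd.target' links it under ceph.target

If the sub-target was never enabled, the WantedBy symlink is missing and ceph.target has nothing to propagate to.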
Boris, this is right after a ceph-deploy install (not an upgrade case). I have tried the command you suggested and it didn't work. You said it works for you -- how did you install?

    [ubuntu@magna094 ~]$ sudo systemctl enable ceph-osd.target
    Failed to execute operation: No such file or directory

    [ubuntu@magna094 ~]$ ps -eaf | grep ceph
    root      6182     1  0 00:48 ?      00:00:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      6186  6182  0 00:48 ?      00:00:05 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      7436     1  0 00:48 ?      00:00:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      7439  7436  0 00:48 ?      00:00:05 /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    ubuntu    7876  7780  0 01:36 pts/0  00:00:00 grep --color=auto ceph
I've been able to reproduce this on my machines with the upstream jewel repo. It looks like it is the first case: the in-the-middle targets are neither enabled nor started, so ceph.target won't propagate the stop/start calls to the underlying ceph-mon/osd services. The solution here will be to make ceph-deploy enable and start the 'in-the-middle' ceph-osd/mon targets (see the sketch below). I'll try to create an upstream PR for this ~soon.
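In the meantime, a minimal sketch of the manual fix on an affected node (default 'ceph' cluster assumed; target names per the unit files above):

    # enable the in-the-middle targets so they are linked under ceph.target
    $ sudo systemctl enable ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target
    # start them so the running daemons are bound into the dependency tree
    $ sudo systemctl start ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target
    # after this, stop/start should propagate as expected
    $ sudo systemctl stop ceph.target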
Just a heads-up that ceph-deploy will ship in RHCS 2, but it's "unsupported", and we expect the main installation process to be handled by https://github.com/ceph/ceph-installer which calls https://github.com/ceph/ceph-ansible
Our upgrades have to go through this path, and having restart working is important for those cases. We should also look into whether it's an issue with the init script itself, because I remember it didn't work even with ceph-ansible some time back when I was looking into a different issue.
*** Bug 1319871 has been marked as a duplicate of this bug. ***
This should be fixed by https://github.com/ceph/ceph/pull/8714 which is now in master.
https://github.com/ceph/ceph/pull/8801 was merged to jewel and will be in v10.2.1.
This is still not fixed; I see the same behavior on the master branch as well, on CentOS. The only thing that works reliably now is "systemctl start":

1) "stop" on nodes with mon/osd roles will stop the mon and probably a single osd.
2) "stop" on nodes with just the osd role returns immediately, and the only way to stop the daemons is via the individual services.

Also, after stopping and starting an individual service on this role, I see the osd service number gets incremented, which looks like another bug. Note the output below: osd.0 and osd.1 are now osd.3 and osd.4.

    ● ceph-disk                  loaded failed failed   Ceph disk activation: /dev/sdb1
    ● ceph-disk                  loaded failed failed   Ceph disk activation: /dev/sdb2
    ● ceph-osd                   loaded failed failed   Ceph object storage daemon
    ● ceph-osd                   loaded failed failed   Ceph object storage daemon
      ceph-osd                   loaded active running  Ceph object storage daemon
      ceph-osd                   loaded active running  Ceph object storage daemon
      system-ceph\x2ddisk.slice  loaded active active   system-ceph\x2ddisk.slice
      system-ceph\x2dosd.slice   loaded active active   system-ceph\x2dosd.slice
      ceph-osd.target
Yeah, I should have warned about this. This is probably because your systemd folder is populated with the mess from previous installs. Removing the old broken packages won't help here, as they do not do a proper clean-up upon removal and leave the systemd symlink mess behind. The new packages should be able to do a better job upon removal. Please try removing the new packages and then re-installing them to see if it fixes the problem for you. Even that might not guarantee a fix, though; a clean install (of the entire machine) should. We might want to document this and guide the user (customer) with some 'what-to-do' steps if they get their machine into this state. btw: This should not be such a big deal for RHCEPH, as the in-the-middle targets that caused all the systemd symlink mess were not used in 1.2.x.
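For reference, a sketch of where the stale enablement symlinks typically live (default RHEL 7 paths; illustrative only, not an official clean-up procedure):

    # list the ceph-related symlinks that 'systemctl enable' created
    $ ls -l /etc/systemd/system/ceph.target.wants/ \
            /etc/systemd/system/ceph-*.target.wants/ 2>/dev/null
    # after removing stale links, reload the unit database
    $ sudo systemctl daemon-reload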
So I just tried this since I had a fresh reimage on the nodes, and I really don't see this issue; it's actually working as expected. As you said, the old service files not being properly cleaned up would have created the issue. Since we go from build to build, I believe a few others might hit this as well, but I don't think customers will actually hit it, as they haven't got any old 10.2 builds yet. I think it is more useful to fix bug 1326740, as it is consistently seen and would cause anyone to believe disk activation failed, but Ian has already pushed it out to 2.1 and I will comment on that bug there.
Branto, actually, would you let me know which files to clean up? In our lab, since the nodes go through various builds, I am seeing that these stray files are creating a few issues. I believe, as you said, a simple document providing steps to clean up any stray files would be better. Thanks
It is a bit tricky, but a re-install of the latest packages (all of them) should fix it.
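A sketch of what that looks like on RHEL/CentOS (the glob is an assumption; reinstall whatever ceph packages are actually present):

    # re-running the package scriptlets re-creates the unit files and symlinks
    $ sudo yum reinstall 'ceph*'
    $ sudo systemctl daemon-reload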
Moving back to ON_QA per comments 15+. We can document this later/in another bz if desired.
Unable to reproduce the issue; moving to the VERIFIED state. Verification steps: http://pastebin.test.redhat.com/378848
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1755.html