Description of problem:
There is an upstream bug tracker for "systemctl stop ceph.target doesn't stop ceph". The workaround is to run 'systemctl stop ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target'.

Version-Release number of selected component (if applicable): 2.0

How reproducible: Always
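A quick repro/workaround sketch on an affected node (output abbreviated; illustrative only):

    $ sudo systemctl stop ceph.target    # returns, but the daemons keep running
    $ ps -eaf | grep ceph-osd            # ceph-osd processes are still up
    # workaround: stop the per-daemon sub-targets directly
    $ sudo systemctl stop ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target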
I don't understand. Is <target> the cluster name? Where's the documentation that indicates this is supposed to work?
Sorry Sam, this got auto-assigned to you; I couldn't find the right component and selected 'unclassified'. Boris is actually looking into this. It is a systemd service issue and I have linked an upstream tracker.
I've talked with some systemd people and, generally, the systemd unit files looked fine. The possible explanations for this behaviour include:

1) The sub-targets (ceph-{mon,osd,mds,radosgw}.target) were not enabled in systemd, so ceph.target did not propagate the signal to restart the daemons. You can try running 'systemctl enable ceph-{mon,osd}.target' to see if it helps. If this is the case then we should enable the sub-targets with ceph-deploy. This is probably not the case, though, as a clean install seems to work fine for me.

2) This happened after an upgrade from an older version of ceph where there were no sub-targets. If this is the case, we should tune our post(upgrade) scripts to handle the upgrade path better. You can try running 'systemctl reenable ceph.target' to see if it helps in this case -- this should clean up the symlink mess.
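For context, here is a minimal sketch of how the propagation is expected to work via the unit files (directives assumed from the usual upstream layout; the exact file contents may differ per version):

    # ceph-osd@.service (relevant directives only; assumed)
    [Unit]
    PartOf=ceph-osd.target      # stop/restart of ceph-osd.target is propagated to each instance
    [Install]
    WantedBy=ceph-osd.target    # 'systemctl enable ceph-osd@N' links the instance under the sub-target

    # ceph-osd.target (relevant directives only; assumed)
    [Unit]
    PartOf=ceph.target          # stop/restart of ceph.target is propagated to the sub-target
    [Install]
    WantedBy=ceph.target        # 'systemctl enable ceph-osd.target' links it under ceph.target

If the sub-target was never enabled, the WantedBy symlink is missing and ceph.target has nothing to propagate to.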
Boris, this is right after a ceph-deploy install (not an upgrade case). I have tried the command you suggested and it didn't work. You said it works for you -- how did you install?

    [ubuntu@magna094 ~]$ sudo systemctl enable ceph-osd.target
    Failed to execute operation: No such file or directory

    [ubuntu@magna094 ~]$ ps -eaf | grep ceph
    root      6182     1  0 00:48 ?      00:00:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      6186  6182  0 00:48 ?      00:00:05 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      7436     1  0 00:48 ?      00:00:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    root      7439  7436  0 00:48 ?      00:00:05 /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf --cluster ceph -f
    ubuntu    7876  7780  0 01:36 pts/0  00:00:00 grep --color=auto ceph
I've been able to reproduce this on my machines with the upstream jewel repo. It looks like it is the first case: the in-the-middle targets are neither enabled nor started, so ceph.target won't propagate the stop/start calls to the underlying ceph-mon/osd services. The solution here will be to make ceph-deploy enable and start the 'in-the-middle' ceph-osd/mon targets (see the sketch below). I'll try to create an upstream PR for this ~soon.
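In the meantime, a minimal sketch of the manual fix on an affected node (default 'ceph' cluster assumed; target names per the unit files above):

    # enable the in-the-middle targets so they are linked under ceph.target
    $ sudo systemctl enable ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target
    # start them so the running daemons are bound into the dependency tree
    $ sudo systemctl start ceph-mon.target ceph-osd.target ceph-mds.target ceph-radosgw.target
    # after this, stop/start should propagate as expected
    $ sudo systemctl stop ceph.target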
Just a heads-up that ceph-deploy will ship in RHCS 2, but it's "unsupported", and we expect the main installation process to be handled by https://github.com/ceph/ceph-installer which calls https://github.com/ceph/ceph-ansible
Our upgrades have to go through this path, and having restart working is important for those cases. We should also look into whether it's an issue with the init script itself, because I remember it didn't work even with ceph-ansible some time back when I was looking into a different issue.
*** Bug 1319871 has been marked as a duplicate of this bug. ***
This should be fixed by https://github.com/ceph/ceph/pull/8714 which is now in master.
https://github.com/ceph/ceph/pull/8801 was merged to jewel and will be in v10.2.1.
This is still not fixed; I see the same behavior on the master branch as well, on CentOS. The only thing that works reliably now is "systemctl start":

1) "stop" on nodes with mon/osd roles will stop the mon and probably a single osd.
2) "stop" on nodes with just the osd role returns immediately, and the only way to stop the daemons is via the individual services.

Also, after stopping and starting an individual service on this role, I see the osd service number gets incremented, which looks like another bug. Note the output below: osd.0 and osd.1 are now osd.3 and osd.4.

    ● ceph-disk                  loaded failed failed   Ceph disk activation: /dev/sdb1
    ● ceph-disk                  loaded failed failed   Ceph disk activation: /dev/sdb2
    ● ceph-osd                   loaded failed failed   Ceph object storage daemon
    ● ceph-osd                   loaded failed failed   Ceph object storage daemon
      ceph-osd                   loaded active running  Ceph object storage daemon
      ceph-osd                   loaded active running  Ceph object storage daemon
      system-ceph\x2ddisk.slice  loaded active active   system-ceph\x2ddisk.slice
      system-ceph\x2dosd.slice   loaded active active   system-ceph\x2dosd.slice
      ceph-osd.target
Yeah, I should have warned about this. This is probably because your systemd folder is populated with the mess from previous installs. Removing the old broken packages won't help here, as they do not do a proper clean-up upon removal and leave the systemd symlink mess behind. The new packages should be able to do a better job upon removal. Please try removing the new packages and then re-installing them to see if it fixes the problem for you. Even that might not guarantee a fix, though; a clean install (of the entire machine) should. We might want to document this and guide the user (customer) with some 'what-to-do' steps if they get their machine into this state. btw: This should not be such a big deal for RHCEPH, as the in-the-middle targets that caused all the systemd symlink mess were not used in 1.2.x.
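For reference, a sketch of where the stale enablement symlinks typically live (default RHEL 7 paths; illustrative only, not an official clean-up procedure):

    # list the ceph-related symlinks that 'systemctl enable' created
    $ ls -l /etc/systemd/system/ceph.target.wants/ \
            /etc/systemd/system/ceph-*.target.wants/ 2>/dev/null
    # after removing stale links, reload the unit database
    $ sudo systemctl daemon-reload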
So I just tried this since I had a fresh reimage on the nodes, and I really don't see this issue; it's actually working as expected. As you said, the old service files not being properly cleaned up would have created the issue. Since we go from build to build, I believe a few others might hit this as well, but I don't think customers will actually hit it, as they haven't got any old 10.2 builds yet. I think it is more useful to fix bug 1326740, as it is consistently seen and would cause anyone to believe disk activation failed, but Ian has already pushed it out to 2.1 and I will comment on that bug there.
Branto, actually, would you let me know which files to clean up? In our lab, since the nodes go through various builds, I am seeing that these stray files are creating a few issues. I believe, as you said, a simple document providing steps to clean up any stray files would be better. Thanks
It is a bit tricky, but a re-install of the latest packages (all of them) should fix it.
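A sketch of what that looks like on RHEL/CentOS (the glob is an assumption; reinstall whatever ceph packages are actually present):

    # re-running the package scriptlets re-creates the unit files and symlinks
    $ sudo yum reinstall 'ceph*'
    $ sudo systemctl daemon-reload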
Moving back to ON_QA per comments 15+. We can document this later/in another bz if desired.
Unable to reproduce the issue; moving to the VERIFIED state. Verification steps: http://pastebin.test.redhat.com/378848
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-1755.html