Bug 2104936 - [cee/sd][ceph-volume] In RHCS 5.1 when using the custom cluster name ceph-osd failed to start
Summary: [cee/sd][ceph-volume] In RHCS 5.1 when using the custom cluster name ceph-osd failed to start
Keywords:
Status: CLOSED DUPLICATE of bug 2058038
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Volume
Version: 5.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.2
Assignee: Guillaume Abrioux
QA Contact: Ameena Suhani S H
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-07-07 14:05 UTC by Prasanth M V
Modified: 2022-07-08 06:53 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-07-08 06:53:36 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph pull 46552 0 None Merged pacific: backport of cephadm: fix osd adoption with custom cluster name 2022-07-07 21:09:05 UTC
Github ceph ceph pull 47018 0 None Merged pacific: ceph-volume: allow listing devices by OSD ID 2022-07-08 04:31:05 UTC
Red Hat Issue Tracker RHCEPH-4681 0 None None None 2022-07-07 14:06:27 UTC

Description Prasanth M V 2022-07-07 14:05:35 UTC
Description of problem:

- The customer was upgrading from RHCS 4.3 to RHCS 5.1. The upgrade was successful except for running "cephadm-adopt.yml".

- The "cephadm-adopt.yml" playbook run failed at "TASK [adopt osd daemon]".
- "TASK [adopt osd daemon]" failed because the ceph-osd services failed to start.

- From the Ansible log of the first "cephadm-adopt.yml" run:
~~~
  Non-zero exit code 1 from systemctl start ceph-1b148142-f71b-4e49-9e99-1b9c506655aa
  systemctl: stderr Job for ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service failed because the control process exited with error code.     <<<
  systemctl: stderr See "systemctl status ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service" and "journalctl -xe" for details.
  Traceback (most recent call last):
    File "/sbin/cephadm", line 8826, in <module>
      main()
    File "/sbin/cephadm", line 8814, in main
      r = ctx.func(ctx)
    File "/sbin/cephadm", line 1941, in _default_image
      return func(ctx)
    File "/sbin/cephadm", line 5533, in command_adopt
      command_adopt_ceph(ctx, daemon_type, daemon_id, fsid)
    File "/sbin/cephadm", line 5738, in command_adopt_ceph
      osd_fsid=osd_fsid)
    File "/sbin/cephadm", line 3134, in deploy_daemon_units
      call_throws(ctx, ['systemctl', 'start', unit_name])
    File "/sbin/cephadm", line 1619, in call_throws
      raise RuntimeError('Failed command: %s' % ' '.join(command))
  RuntimeError: Failed command: systemctl start ceph-1b148142-f71b-4e49-9e99-1b9c506655aa                            <<<
~~~
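
The RuntimeError in this traceback comes from cephadm's call_throws() helper, which aborts whenever the "systemctl start" of the new ceph-<fsid> unit exits non-zero, so the playbook task fails even though the on-disk adoption already completed. A minimal sketch of that behaviour, reconstructed from the traceback above rather than from the actual cephadm source (the ctx argument and logging are omitted):
~~~
# Rough sketch of the call_throws() behaviour seen in the traceback;
# the real cephadm implementation differs in details (ctx, logging, timeouts).
import subprocess

def call_throws(command):
    # Run the command and capture its output, as cephadm does.
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    if result.returncode != 0:
        # This is the RuntimeError that bubbles up through
        # deploy_daemon_units() and fails "TASK [adopt osd daemon]".
        raise RuntimeError('Failed command: %s' % ' '.join(command))
    return result.stdout, result.stderr

# call_throws(['systemctl', 'start', 'ceph-1b148142-f71b-4e49-9e99-1b9c506655aa'])
~~~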

- When "TASK [adopt osd daemon]" was executed the last time, the message was as below (the customer has executed cephadm-adopt.yml several times):
~~~  
stdout: osd.101 is already adopted
~~~
- This is the same for all the other OSDs on the node. Based on this, I believe the ceph-osd daemons have already been adopted by cephadm, but they fail to start via systemctl.

- All the OSDs on that node are in the down state, and the systemd status of each ceph-osd daemon is failed (a quick way to confirm this is sketched below).
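
The failed state can be confirmed directly on the node; a minimal sketch, where the unit name is taken from the journal excerpt below and the helper itself is hypothetical (it only wraps "systemctl is-active"):
~~~
# Hypothetical helper to confirm the state of the adopted OSD unit.
import subprocess

def unit_state(unit):
    # "systemctl is-active" prints the unit state (active/inactive/failed).
    result = subprocess.run(['systemctl', 'is-active', unit],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    return result.stdout.strip()

# Prints "failed" for the unit shown in the journal below.
print(unit_state('ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service'))
~~~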


Traceback log of ceph-osd:
~~~
Jul 04 15:18:58 node systemd[1]: Starting Ceph osd.142 for 1b148142-f71b-4e49-9e99-1b9c506655aa...
Jul 04 15:19:00 node bash[6583]: Traceback (most recent call last):
Jul 04 15:19:00 node bash[6583]:   File "/usr/sbin/ceph-volume", line 11, in <module>
Jul 04 15:19:00 node bash[6583]:     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 40, in __init__
Jul 04 15:19:00 node bash[6583]:     self.main(self.argv)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
Jul 04 15:19:00 node bash[6583]:     return f(*a, **kw)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 152, in main
Jul 04 15:19:00 node bash[6583]:     terminal.dispatch(self.mapper, subcommand_args)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
Jul 04 15:19:00 node bash[6583]:     instance.main()
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/main.py", line 46, in main
Jul 04 15:19:00 node bash[6583]:     terminal.dispatch(self.mapper, self.argv)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
Jul 04 15:19:00 node bash[6583]:     instance.main()
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 377, in main
Jul 04 15:19:00 node bash[6583]:     self.activate(args)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 16, in is_root
Jul 04 15:19:00 node bash[6583]:     return func(*a, **kw)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 301, in activate
Jul 04 15:19:00 node bash[6583]:     activate_bluestore(lvs, args.no_systemd)
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/devices/lvm/activate.py", line 157, in activate_bluestore
Jul 04 15:19:00 node bash[6583]:     configuration.load()
Jul 04 15:19:00 node bash[6583]:   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py", line 51, in load
Jul 04 15:19:00 node bash[6583]:     raise exceptions.ConfigurationError(abspath=abspath)
Jul 04 15:19:00 node bash[6583]: ceph_volume.exceptions.ConfigurationError: Unable to load expected Ceph config at: /etc/ceph/cephdev.conf       <<<<
Jul 04 15:19:02 node systemd[1]: ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service: Control process exited, code=exited status=1
Jul 04 15:19:03 node systemd[1]: ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service: Failed with result 'exit-code'.
Jul 04 15:19:03 node systemd[1]: Failed to start Ceph osd.142 for 1b148142-f71b-4e49-9e99-1b9c506655aa.
Jul 04 15:19:13 node systemd[1]: ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service: Service RestartSec=10s expired, scheduling restart.
Jul 04 15:19:13 node systemd[1]: ceph-1b148142-f71b-4e49-9e99-1b9c506655aa.service: Scheduled restart job, restart counter is at 1.
Jul 04 15:19:13 node systemd[1]: Stopped Ceph osd.142 for 1b148142-f71b-4e49-9e99-1b9c506655aa.
~~~
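
The ConfigurationError above points at the root cause: ceph-volume derives the expected config path from the cluster name, so with the custom cluster name "cephdev" it looks for /etc/ceph/cephdev.conf, which is not present in the adopted cephadm environment (only ceph.conf is provided). A minimal sketch of that path resolution, assuming the /etc/ceph/<cluster>.conf convention; the exact ceph-volume internals may differ:
~~~
# Simplified sketch of the failing path resolution; not the actual
# ceph-volume source, which builds the path from the configured cluster name.
import os

class ConfigurationError(Exception):
    def __init__(self, abspath):
        super().__init__('Unable to load expected Ceph config at: %s' % abspath)

def load_ceph_conf(cluster='ceph'):
    # With the default cluster name this resolves to /etc/ceph/ceph.conf;
    # with the custom name "cephdev" it becomes /etc/ceph/cephdev.conf,
    # which is missing after adoption, so OSD activation fails.
    abspath = '/etc/ceph/%s.conf' % cluster
    if not os.path.isfile(abspath):
        raise ConfigurationError(abspath)
    return abspath

# load_ceph_conf('cephdev')
# -> ConfigurationError: Unable to load expected Ceph config at: /etc/ceph/cephdev.conf
~~~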


Version-Release number of selected component (if applicable):

Red Hat Ceph Storage 5.1z2 - 5.1.2   ceph version 16.2.7-126.el8cp

Comment 3 Guillaume Abrioux 2022-07-08 06:53:36 UTC

*** This bug has been marked as a duplicate of bug 2058038 ***

