Bug 2013215 - Issue deploying a Ceph cluster with cephadm in a disconnected environment
Summary: Issue deploying a Ceph cluster with cephadm in a disconnected environment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 5.2
Assignee: Adam King
QA Contact: Manisha Saini
Docs Contact: Karen Norteman
URL:
Whiteboard:
Duplicates: 2038414
Depends On:
Blocks: 2038414
 
Reported: 2021-10-12 12:01 UTC by egoirand
Modified: 2022-08-09 17:37 UTC
CC: 5 users

Fixed In Version: ceph-16.2.8-2.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Duplicates: 2038414
Environment:
Last Closed: 2022-08-09 17:36:41 UTC
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1935044 1 low CLOSED [cephadm] node-exporter not trying to pull custom image and in unknown state 2023-02-06 18:10:00 UTC
Red Hat Issue Tracker RHCEPH-2019 0 None None None 2021-10-12 12:04:23 UTC
Red Hat Product Errata RHSA-2022:5997 0 None None None 2022-08-09 17:37:11 UTC

Description egoirand 2021-10-12 12:01:26 UTC
Description of problem:

Once cephadm bootstrap has been performed in a disconnected environment, cephadm fails to create a local OSD (ceph orch daemon add osd ceph1:/dev/sdc): it tries to pull an external container image (docker.io/ceph/daemon-base:latest-pacific-devel) instead of the image provided by the local registry.

Version-Release number of selected component (if applicable):

ceph version 16.2.0-117.el8cp (0e34bb74700060ebfaa22d99b7d2cdc037b28a57) pacific (stable)

How reproducible:

Always.

Steps to Reproduce:

1. environment set up 

Install RHEL 8.4 (we used a VM) with a minimal install and update it to the latest packages:
Linux ceph1 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Tue Sep 7 07:07:31 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Create a local RPM repository for the following RPM channels (a sketch of one way to build such a mirror follows the list):
=> Ansible 2.9 for Red Hat Enterprise Linux 8 x86_64 (RPMs)
=> Red Hat Ceph Storage Tools 5 for RHEL 8 x86_64 (RPMs)
=> Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)
=> Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)
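
For reference, one way to build such a mirror on a connected host is sketched below; the repo ID and paths are illustrative assumptions, not the exact ones used in this setup:

# Assumed repo ID and paths -- adjust to the actual channel names and web root
dnf reposync --repoid=rhceph-5-tools-for-rhel-8-x86_64-rpms \
    --download-path=/srv/mirror --download-metadata
# Serve /srv/mirror over HTTP and point a .repo file on ceph1 at it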

Create a local registry and add the following containers (I used skopeo sync to synchronise these containers between registry.redhat.io and the local registry; an example command follows the list):
=> rhceph/rhceph-5-rhel8
=> rhceph/rhceph-5-dashboard-rhel8
=> openshift4/ose-prometheus
=> openshift4/ose-prometheus-alertmanager
=> openshift4/ose-prometheus-node-exporter
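
The reporter used skopeo sync to mirror all tags; for a single image, an equivalent skopeo copy looks like the sketch below (the credentials and TLS flag are assumptions and depend on how registry.redhat.io access and the local registry are set up):

# Copy one tag of the rhceph image from registry.redhat.io to the local registry
skopeo copy --src-creds '<rh-user>:<rh-password>' --dest-tls-verify=false \
    docker://registry.redhat.io/rhceph/rhceph-5-rhel8:latest \
    docker://registry.lab/rhceph/rhceph-5-rhel8:latest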

2. perform the cephadm bootstrap installation 

Install cephadm :
dnf install cephadm
rpm -qi cephadm
Name        : cephadm
Epoch       : 2
Version     : 16.2.0
Release     : 117.el8cp
Architecture: noarch
Install Date: Sun 10 Oct 2021 05:04:44 PM CEST
Group       : Unspecified
Size        : 301088
License     : LGPL-2.1 and LGPL-3.0 and CC-BY-SA-3.0 and GPL-2.0 and BSL-1.0 and BSD-3-Clause and MIT
Signature   : RSA/SHA256, Wed 18 Aug 2021 11:17:49 PM CEST, Key ID 199e2f91fd431d51
Source RPM  : ceph-16.2.0-117.el8cp.src.rpm
Build Date  : Wed 18 Aug 2021 08:26:50 PM CEST
Build Host  : x86-vm-06.build.eng.bos.redhat.com
Relocations : (not relocatable)
Packager    : Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>
Vendor      : Red Hat, Inc.
URL         : http://ceph.com/
Summary     : Utility to bootstrap Ceph clusters
Description :
Utility to bootstrap a Ceph cluster and manage Ceph daemons deployed
with systemd and podman.

Launch bootstrap (note that the local registry required no username/password, but cephadm made it mandatory to provide one):
cephadm --image registry.lab/rhceph/rhceph-5-rhel8:latest  bootstrap --fsid c3a016e5-cbfb-4539-963d-75bf160f6d6a --mon-ip 10.2.41.250 --initial-dashboard-user admin --initial-dashboard-password redhat123 --dashboard-password-noupdate --no-minimize-config --registry-url registry.lab --registry-username admin --registry-password admin
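
Registry access from the node can also be sanity-checked by hand (a sketch; note that bootstrap writes the registry credentials to /etc/ceph/podman-auth.json, the same authfile that appears in the failing command further below):

podman login registry.lab --username admin --password admin
podman pull --authfile /etc/ceph/podman-auth.json registry.lab/rhceph/rhceph-5-rhel8:latest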

Verify the cluster has started :

[root@ceph1 ~]# ceph status
  cluster:
    id:     c3a016e5-cbfb-4539-963d-75bf160f6d6a
    health: HEALTH_WARN
            failed to probe daemons or devices
            OSD count 0 < osd_pool_default_size 3

  services:
    mon: 1 daemons, quorum ceph1 (age 32m)
    mgr: ceph1.rmjhul(active, since 31m)
    osd: 0 osds: 0 up, 0 in (since 22h)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

Modify the Ceph config so that the Ceph Dashboard monitoring components are pulled from the local registry:
ceph config set mgr mgr/cephadm/container_image_base registry.lab/rhceph/rhceph-5-rhel8:latest
ceph config set mgr mgr/cephadm/container_image_alertmanager registry.lab/openshift4/ose-prometheus-alertmanager:v4.6
ceph config set mgr mgr/cephadm/container_image_prometheus registry.lab/openshift4/ose-prometheus:v4.6
ceph config set mgr mgr/cephadm/container_image_grafana registry.lab/rhceph/rhceph-5-dashboard-rhel8:latest
ceph config set mgr mgr/cephadm/container_image_node_exporter registry.lab/openshift4/ose-prometheus-no
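
The resulting values can be read back with ceph config get, for example:

ceph config get mgr mgr/cephadm/container_image_base
ceph config get mgr mgr/cephadm/container_image_prometheus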

Verify that prometheus, grafana, alertmanager and node-exporter are running fine (it seems that the crash service has an issue and does not start):

# ceph orch ps
NAME                 HOST   STATUS         REFRESHED  AGE  PORTS          VERSION           IMAGE ID      CONTAINER ID
alertmanager.ceph1   ceph1  running (38m)  2m ago     23h  *:9093 *:9094  0.21.0            cfa7ac9e2c00  38a4bb8d163b
grafana.ceph1        ceph1  running (38m)  2m ago     23h  *:3000         6.7.4             09cf77100f6a  c53950b23b99
mgr.ceph1.rmjhul     ceph1  running (38m)  2m ago     23h  *:9283         16.2.0-117.el8cp  2142b60d7974  437dbf146288
mon.ceph1            ceph1  running (38m)  2m ago     23h  -              16.2.0-117.el8cp  2142b60d7974  09feeaa61cb2
node-exporter.ceph1  ceph1  running (38m)  2m ago     23h  *:9100         1.0.1             4afad9935fbf  fca595401f65
prometheus.ceph1     ceph1  running (38m)  2m ago     23h  *:9095         2.22.2            ed805e9dbe13  bf4598bff2c5

# ceph orch ls
NAME           RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager       1/1  2m ago     23h  count:1
crash              0/1  -          23h  ceph1
grafana            1/1  2m ago     23h  count:1
mgr                1/1  2m ago     23h  count:1
mon                1/1  2m ago     23h  count:1
node-exporter      1/1  2m ago     23h  ceph1
prometheus         1/1  2m ago     23h  count:1


3. Let's add an OSD (local /dev/sdc) to the Ceph cluster 

# ceph orch daemon add osd ceph1:/dev/sdc
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1345, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 167, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 390, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 794, in _daemon_add_osd
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 224, in raise_if_exception
    raise e
RuntimeError: cephadm exited with an error code: 1, stderr:Non-zero exit code 125 from /bin/podman run --rm --ipc=host --authfile=/etc/ceph/podman-auth.json --net=host --entrypoint stat --init -e CONTAINER_IMAGE=docker.io/ceph/daemon-base:latest-pacific-devel -e NODE_NAME=ceph1 -e CEPH_USE_RANDOM_NONCE=1 docker.io/ceph/daemon-base:latest-pacific-devel -c %u %g /var/lib/ceph
stat: stderr Trying to pull docker.io/ceph/daemon-base:latest-pacific-devel...
stat: stderr Error: Error initializing source docker://ceph/daemon-base:latest-pacific-devel: error pinging docker registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp 3.209.182.229:443: i/o timeout
Traceback (most recent call last):
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 8140, in <module>
    main()
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 8128, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 1624, in _infer_fsid
    return func(ctx)
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 1708, in _infer_image
    return func(ctx)
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 4518, in command_ceph_volume
    make_log_dir(ctx, ctx.fsid)
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 1810, in make_log_dir
    uid, gid = extract_uid_gid(ctx)
  File "/var/lib/ceph/c3a016e5-cbfb-4539-963d-75bf160f6d6a/cephadm.d7a73386d1e46cffff151775b8e1d098069c88b89aea56cab15b079c1a1f555f", line 2514, in extract_uid_gid
    raise RuntimeError('uid/gid not found')
RuntimeError: uid/gid not found


Actual results:
The OSD is never created: cephadm falls back to docker.io/ceph/daemon-base:latest-pacific-devel, which cannot be pulled in the disconnected environment.

Expected results:
The OSD should be created using the local container image registry.lab/rhceph/rhceph-5-rhel8.


Additional info:
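
A possible interim workaround (my assumption only, not verified against this code path) is to also point the cluster-wide container_image option at the local registry, so that host-side cephadm calls have no reason to fall back to the upstream default image:

# Assumed workaround, not verified: override the default container image cluster-wide
ceph config set global container_image registry.lab/rhceph/rhceph-5-rhel8:latest
# then retry
ceph orch daemon add osd ceph1:/dev/sdc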

Comment 1 RHEL Program Management 2021-10-12 12:01:31 UTC
Please specify the severity of this bug. Severity is defined here:
https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity.

Comment 2 Sebastian Wagner 2021-10-12 12:23:12 UTC
Vasishta, is this BZ related to your BZ rhbz#1935044 ?

Comment 3 Vasishta 2021-10-12 12:44:40 UTC
Sebastian, 
No, the issue tracked in BZ 1935044 seems to be different from the one reported here.

Comment 12 Akash Raj 2022-06-02 05:48:33 UTC
*** Bug 2038414 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-09 17:36:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997

