Bug 1719013 - [OSP 15] failed to scale up HCI Ceph all nodes, no container with name or ID ceph-mon-hci-ceph-all-3 found: no such container
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Ceph-Ansible
Version: 4.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 4.0
Assignee: Guillaume Abrioux
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-06-10 19:05 UTC by Yogev Rabl
Modified: 2020-01-31 12:46 UTC (History)
CC List: 14 users

Fixed In Version: ceph-ansible-4.0.0-0.1.rc10.el8cp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-31 12:46:18 UTC
Target Upstream Version:


Attachments
sos report from the failed node (6.34 MB, application/x-xz)
2019-06-11 13:35 UTC, Yogev Rabl


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-container pull 1410 0 'None' closed daemon/start_mon.sh: add mon without port value 2020-03-01 20:30:11 UTC
Github ceph ceph-container pull 1412 0 'None' closed daemon/start_mon.sh: add mon without port value (bp #1410) 2020-03-01 20:30:11 UTC
Red Hat Bugzilla 1722066 0 urgent CLOSED Replace controller scenario - RUNNING HANDLER [ceph-handler : restart ceph mon daemon(s) - container] failed with "unabl... 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHBA-2020:0312 0 None None None 2020-01-31 12:46:41 UTC

Description Yogev Rabl 2019-06-10 19:05:37 UTC
Description of problem:
An update of an Overcloud failed with the error: 
fatal: [hci-ceph-all-3 -> 192.168.24.15]: FAILED! => changed=true ",
        "  - ceph-mon-hci-ceph-all-3",
        "  - --name",
        "  - mon.",
        "  - -k",
        "  - /var/lib/ceph/mon/ceph-hci-ceph-all-0/keyring",
        "  - auth",
        "  - get-key",
        "  delta: '0:00:00.065085'",
        "  end: '2019-06-10 18:22:21.486676'",
        "  rc: 125",
        "  start: '2019-06-10 18:22:21.421591'",
        "  stderr: 'unable to exec into ceph-mon-hci-ceph-all-3: no container with name or ID ceph-mon-hci-ceph-all-3 found: no such container'",

The overcloud was initially deployed with 3 controller nodes and 3 HCI Ceph All nodes (each running Monitors, Managers, and OSDs). The update should have scaled the cluster up by 2 additional HCI Ceph All nodes.

Version-Release number of selected component (if applicable):
openstack containers tag: 20190604.1
ceph-ansible-4.0.0-0.1.rc7.el8cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with 3 controller nodes and 3 HCI Ceph All nodes
2. Update the overcloud with 2 additional HCI Ceph All nodes


Actual results:
Update fails

Expected results:
The update succeeds and all the nodes are set up

Additional info:

Comment 1 Yogev Rabl 2019-06-11 13:35:53 UTC
Created attachment 1579383 [details]
sos report from the failed node

Comment 3 Yogev Rabl 2019-06-18 01:37:15 UTC
It seems that the monitor container starts on the newly created nodes:

[root@hci-ceph-all-3 ~]# podman ps
CONTAINER ID  IMAGE                                           COMMAND               CREATED      STATUS            PORTS  NAMES
6bad6480325e  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  3 hours ago  Up 6 minutes ago         ceph-mon-hci-ceph-all-3

When running the ceph -s command from the container, we get:

[root@hci-ceph-all-3 ~]# podman exec ceph-mon-hci-ceph-all-3 ceph -s
  cluster:
    id:     fce928a8-913f-11e9-8da9-525400a19e82
    health: HEALTH_WARN
            2/5 mons down, quorum hci-ceph-all-2,hci-ceph-all-1,hci-ceph-all-0
 
  services:
    mon: 5 daemons, quorum hci-ceph-all-2,hci-ceph-all-1,hci-ceph-all-0 (age 2h), out of quorum: hci-ceph-all-4, hci-ceph-all-3
    mgr: hci-ceph-all-2(active, since 4h), standbys: hci-ceph-all-0, hci-ceph-all-1
    osd: 15 osds: 15 up (since 4h), 15 in (since 4h)
    rgw: 3 daemons active (hci-ceph-all-0.rgw0, hci-ceph-all-1.rgw0, hci-ceph-all-2.rgw0)
 
  data:
    pools:   9 pools, 288 pgs
    objects: 222 objects, 2.7 KiB
    usage:   15 GiB used, 225 GiB / 240 GiB avail
    pgs:     288 active+clean

It is also registered in Ceph's configuration file. The issue is that it has not been mapped properly in the monmap:

[root@hci-ceph-all-3 ~]# podman exec ceph-mon-hci-ceph-all-3 monmaptool --print /etc/ceph/monmap
monmaptool: monmap file /etc/ceph/monmap
epoch 3
fsid fce928a8-913f-11e9-8da9-525400a19e82
last_changed 2019-06-17 22:37:06.373854
created 2019-06-17 21:13:52.852328
min_mon_release 14 (nautilus)
0: [v2:172.17.3.11:3300/0,v1:172.17.3.11:6789/0] mon.hci-ceph-all-2
1: [v2:172.17.3.90:3300/0,v1:172.17.3.90:6789/0] mon.hci-ceph-all-1
2: [v2:172.17.3.98:3300/0,v1:172.17.3.98:6789/0] mon.hci-ceph-all-0
3: v2:172.17.3.52:3300/0 mon.hci-ceph-all-4
4: v2:172.17.3.136:3300/0 mon.hci-ceph-all-3
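The symptom is visible in the monmap entries above: the three original monitors are registered with both a v2 (port 3300) and a legacy v1 (port 6789) endpoint, while the two new, out-of-quorum monitors were added with a v2-only address. A minimal sketch of that check (a hypothetical helper for illustration, not part of ceph-ansible or ceph-container):

```python
import re

def v2_only_mons(monmap_text):
    """Return names of monitors registered with only a v2 address.

    Healthy entries look like:
        0: [v2:IP:3300/0,v1:IP:6789/0] mon.NAME
    while the broken scale-up entries look like:
        3: v2:IP:3300/0 mon.NAME
    """
    broken = []
    for line in monmap_text.splitlines():
        m = re.match(r"\s*\d+:\s+(\S+)\s+mon\.(\S+)", line)
        if not m:
            continue
        addrs, name = m.groups()
        if "v1:" not in addrs:  # missing the legacy v1 (port 6789) endpoint
            broken.append(name)
    return broken

# Abbreviated monmap from the comment above
monmap = """\
0: [v2:172.17.3.11:3300/0,v1:172.17.3.11:6789/0] mon.hci-ceph-all-2
3: v2:172.17.3.52:3300/0 mon.hci-ceph-all-4
4: v2:172.17.3.136:3300/0 mon.hci-ceph-all-3
"""
print(v2_only_mons(monmap))  # → ['hci-ceph-all-4', 'hci-ceph-all-3']
```

Consistent with this reading, the linked ceph-container pull requests ("daemon/start_mon.sh: add mon without port value") change how the monitor is added to the monmap so that an explicit port is no longer forced.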

Comment 4 John Fulton 2019-06-19 13:19:02 UTC
Guillaume, any update on what you found on Yogev's system?

Comment 5 John Fulton 2019-06-19 13:19:41 UTC
Possibly related: bug 1722066

Comment 6 Yogev Rabl 2019-06-19 14:04:42 UTC
*** Bug 1722066 has been marked as a duplicate of this bug. ***

Comment 17 Yogev Rabl 2020-01-21 18:47:37 UTC
Verified

Comment 19 errata-xmlrpc 2020-01-31 12:46:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0312

