Bug 1719013
| Summary: | [OSP 15] failed to scale up HCI Ceph all nodes, no container with name or ID ceph-mon-hci-ceph-all-3 found: no such container | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Yogev Rabl <yrabl> |
| Component: | Ceph-Ansible | Assignee: | Guillaume Abrioux <gabrioux> |
| Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.0 | CC: | aschoen, aschultz, ceph-eng-bugs, gabrioux, gcharot, gfidente, gmeno, hgurav, johfulto, mburns, nthomas, tenobreg, tserlin, vashastr |
| Target Milestone: | rc | Keywords: | Regression, Triaged |
| Target Release: | 4.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ceph-ansible-4.0.0-0.1.rc10.el8cp | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-01-31 12:46:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1579383 [details]
sos report from the failed node
It seems that the monitor container starts on the newly created nodes:
```
[root@hci-ceph-all-3 ~]# podman ps
CONTAINER ID  IMAGE                                           COMMAND               CREATED      STATUS            PORTS  NAMES
6bad6480325e  192.168.24.1:8787/ceph/rhceph-4.0-rhel8:latest  /opt/ceph-contain...  3 hours ago  Up 6 minutes ago         ceph-mon-hci-ceph-all-3
```
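As an aside, the mon daemon inside that container can be checked directly as well. The following is a hedged diagnostic sketch using the container and monitor names from the output above; it is not part of the original triage.

```
# Hedged diagnostic sketch (container/monitor names taken from the podman ps output above).
# Tail the container log to see what the mon daemon is reporting:
podman logs --tail 50 ceph-mon-hci-ceph-all-3

# Ask the mon daemon for its own view of its state via the admin socket:
podman exec ceph-mon-hci-ceph-all-3 ceph daemon mon.hci-ceph-all-3 mon_status
```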
When running the ceph -s command from the container, we get:
```
[root@hci-ceph-all-3 ~]# podman exec ceph-mon-hci-ceph-all-3 ceph -s
  cluster:
    id:     fce928a8-913f-11e9-8da9-525400a19e82
    health: HEALTH_WARN
            2/5 mons down, quorum hci-ceph-all-2,hci-ceph-all-1,hci-ceph-all-0

  services:
    mon: 5 daemons, quorum hci-ceph-all-2,hci-ceph-all-1,hci-ceph-all-0 (age 2h), out of quorum: hci-ceph-all-4, hci-ceph-all-3
    mgr: hci-ceph-all-2(active, since 4h), standbys: hci-ceph-all-0, hci-ceph-all-1
    osd: 15 osds: 15 up (since 4h), 15 in (since 4h)
    rgw: 3 daemons active (hci-ceph-all-0.rgw0, hci-ceph-all-1.rgw0, hci-ceph-all-2.rgw0)

  data:
    pools:   9 pools, 288 pgs
    objects: 222 objects, 2.7 KiB
    usage:   15 GiB used, 225 GiB / 240 GiB avail
    pgs:     288 active+clean
```
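Since client commands clearly reach the cluster from this container, the quorum membership and the monitor addresses the cluster knows about can be inspected from the same place. A hedged sketch, reusing the container name from the output above:

```
# Hedged sketch: list quorum members and the cluster's view of the monitors
# (container name taken from the output above).
podman exec ceph-mon-hci-ceph-all-3 ceph quorum_status --format json-pretty
podman exec ceph-mon-hci-ceph-all-3 ceph mon dump
```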
It is also registered in Ceph's configuration file. The issue is that it has not been mapped properly in the monmap:
```
[root@hci-ceph-all-3 ~]# podman exec ceph-mon-hci-ceph-all-3 monmaptool --print /etc/ceph/monmap
monmaptool: monmap file /etc/ceph/monmap
epoch 3
fsid fce928a8-913f-11e9-8da9-525400a19e82
last_changed 2019-06-17 22:37:06.373854
created 2019-06-17 21:13:52.852328
min_mon_release 14 (nautilus)
0: [v2:172.17.3.11:3300/0,v1:172.17.3.11:6789/0] mon.hci-ceph-all-2
1: [v2:172.17.3.90:3300/0,v1:172.17.3.90:6789/0] mon.hci-ceph-all-1
2: [v2:172.17.3.98:3300/0,v1:172.17.3.98:6789/0] mon.hci-ceph-all-0
3: v2:172.17.3.52:3300/0 mon.hci-ceph-all-4
4: v2:172.17.3.136:3300/0 mon.hci-ceph-all-3
```
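Note that the original three monitors are registered with both msgr v2 and v1 addresses, while the two newly added monitors only carry a v2 address. Purely as an illustration (this is not the ceph-ansible fix shipped for this bug), a monmap entry could be rebuilt with both addresses roughly as follows, assuming the Nautilus-era monmaptool --addv syntax and the addresses from the output above:

```
# Illustrative sketch only, not the ceph-ansible fix: rebuild the monmap entry for
# hci-ceph-all-3 so it carries both the msgr v2 and v1 addresses.
# Assumes monmaptool --addv is available (Nautilus); addresses come from the output above.
podman exec ceph-mon-hci-ceph-all-3 ceph mon getmap -o /tmp/monmap
podman exec ceph-mon-hci-ceph-all-3 monmaptool --rm hci-ceph-all-3 /tmp/monmap
podman exec ceph-mon-hci-ceph-all-3 monmaptool \
    --addv hci-ceph-all-3 '[v2:172.17.3.136:3300,v1:172.17.3.136:6789]' /tmp/monmap
podman exec ceph-mon-hci-ceph-all-3 monmaptool --print /tmp/monmap
# Actually injecting a corrected map would also require stopping the mon and using
# ceph-mon --inject-monmap, which is out of scope for this sketch.
```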
Any update, Guillaume, on what you found on Yogev's system?

Possibly related: bug 1722066.

*** Bug 1722066 has been marked as a duplicate of this bug. ***

Verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0312
Description of problem:
An update of an overcloud failed with the following error (the command this task attempted is reconstructed under Additional info below):

```
fatal: [hci-ceph-all-3 -> 192.168.24.15]: FAILED! => changed=true
    - ceph-mon-hci-ceph-all-3
    - --name
    - mon.
    - -k
    - /var/lib/ceph/mon/ceph-hci-ceph-all-0/keyring
    - auth
    - get-key
  delta: '0:00:00.065085'
  end: '2019-06-10 18:22:21.486676'
  rc: 125
  start: '2019-06-10 18:22:21.421591'
  stderr: 'unable to exec into ceph-mon-hci-ceph-all-3: no container with name or ID ceph-mon-hci-ceph-all-3 found: no such container'
```

The overcloud was initially deployed with 3 controller nodes and 3 HCI Ceph All nodes (each running Monitors, Managers, and OSDs). The update should have scaled up an additional 2 HCI Ceph All nodes.

Version-Release number of selected component (if applicable):
openstack containers tag: 20190604.1
ceph-ansible-4.0.0-0.1.rc7.el8cp.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with 3 controller nodes and 3 HCI Ceph All nodes
2. Update the overcloud with an additional 2 HCI Ceph All nodes

Actual results:
The update fails.

Expected results:
The update is successful and all the nodes are set up.

Additional info:
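For reference, below is a hedged reconstruction of the command the failing task appears to have run, pieced together from the argv fragments in the error output above. The target of auth get-key is not visible in the truncated output, so <entity> is a placeholder, and the exact task in ceph-ansible may differ.

```
# Hedged reconstruction from the argv fragments above; <entity> is a placeholder
# for the auth entity that the truncated error output does not show.
podman exec ceph-mon-hci-ceph-all-3 \
    ceph --name mon. -k /var/lib/ceph/mon/ceph-hci-ceph-all-0/keyring \
    auth get-key <entity>
```

The rc of 125 and the stderr come from podman itself: the exec fails before ceph ever runs, because no container named ceph-mon-hci-ceph-all-3 exists on the host the task was delegated to (192.168.24.15).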