Bug 1871035

Summary: [Ceph-Ansible]: ceph-ansible (3.2) deployment fails on pool creation because of exceeding max pgs value
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: RPietrzak <rpietrza>
Component: Ceph-Ansible
Assignee: Guillaume Abrioux <gabrioux>
Status: CLOSED ERRATA
QA Contact: Ameena Suhani S H <amsyedha>
Severity: low
Priority: unspecified
Version: 3.3
CC: aschoen, ceph-eng-bugs, gmeno, nthomas, tserlin, ykaul
Target Release: 3.3z7
Fixed In Version: RHEL: ceph-ansible-3.2.52-1.el7cp; Ubuntu: ceph-ansible_3.2.52-2redhat1
Last Closed: 2021-05-06 18:32:04 UTC
Type: Bug

Description RPietrzak 2020-08-21 08:01:40 UTC
Description of problem:
Deployment of a new Ceph cluster fails with the following error:

TASK [ceph-osd : create openstack pool(s)]

["Error ERANGE:  pg_num 512 size 3 would mean 1728 total pgs, which exceeds max 750 (mon_max_pg_per_osd 250 * num_in_osds 3)"]`

Actual number of OSDs in the cluster: 32. The monitor counted only the 3 OSDs that were up and in when the pools were created (250 * 3 = 750 max PGs) instead of all 32 (250 * 32 = 8000), which is why a pool that would bring the total to 1728 PGs was rejected.

Cause:
The [ceph-osd : wait for all osd to be up] task is skipped because `run_once: true` is combined with the condition `inventory_hostname == ansible_play_hosts_all | last`. With `run_once`, Ansible evaluates the task (and its `when` condition) only on the first host of the play; that host is not the last one whenever the `osds` group contains more than one host, so the task is always skipped. A sketch of the pattern follows.
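
A simplified sketch of the problematic pattern (illustrative only; the command, the JSON field names and the retry values are assumptions, not the exact stable-3.2 task):

- name: wait for all osd to be up
  command: "ceph --cluster {{ cluster }} osd stat -f json"       # illustrative health query
  register: osd_stat
  retries: 60
  delay: 10
  until: (osd_stat.stdout | from_json).num_osds > 0 and
         (osd_stat.stdout | from_json).num_osds == (osd_stat.stdout | from_json).num_up_osds
  delegate_to: "{{ groups[mon_group_name][0] }}"                 # check runs from a monitor
  run_once: true                                                 # Ansible evaluates this task on the first host only...
  when: inventory_hostname == ansible_play_hosts_all | last      # ...but the condition matches only the last host, so
                                                                 # with more than one OSD host the task is always skipped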


Version-Release number of selected component (if applicable):
ceph-ansible-3.2.43-1.el7cp (stable-3.2)
ceph rhcs3.3z5 (12.2.12-115.el7cp)
ansible 2.6.18

How reproducible:
easy

Steps to Reproduce:
1. Deploy Ceph with more than one OSD host in the `osds` group.
2. To actually hit the "max pgs exceeded" error, the openstack pools have to request large enough pg_num, size, etc. that they cannot be created while some of the OSDs are still down at pool-creation time; this is only a consequence of the wait-for-osd task not running in the first place.

example config:
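
The original example configuration was not attached. As a purely hypothetical illustration (pool names, pg_num and size values are made up; it assumes the stock `openstack_config` / `openstack_pools` group_vars of ceph-ansible), something along these lines would trigger the failure while only a few OSDs are up:

openstack_config: true            # ask ceph-ansible to create the pools below
openstack_pools:
  - name: images                  # hypothetical pool
    pg_num: 512
    size: 3
    application: rbd
  - name: volumes                 # hypothetical pool
    pg_num: 512
    size: 3
    application: rbd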

Actual results:
TASK [ceph-osd : wait for all osd to be up]
skipping: [OSD-1]

Expected results:
TASK [ceph-osd : wait for all osd to be up]
skipping: [OSD-1]
skipping: [OSD-2]
skipping: [OSD-3]
FAILED - RETRYING: wait for all osd to be up (60 retries left).
FAILED - RETRYING: wait for all osd to be up (59 retries left).
ok: [OSD-4 -> MON-1]
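
One way to obtain the behaviour above is to drop `run_once` so that every host evaluates the `when` condition and the task actually runs on the last OSD host, delegated to a monitor. This is only a sketch (same assumptions as the one in the Cause section); the change that actually shipped is referenced in the Additional info below:

- name: wait for all osd to be up
  command: "ceph --cluster {{ cluster }} osd stat -f json"
  register: osd_stat
  retries: 60
  delay: 10
  until: (osd_stat.stdout | from_json).num_osds > 0 and
         (osd_stat.stdout | from_json).num_osds == (osd_stat.stdout | from_json).num_up_osds
  delegate_to: "{{ groups[mon_group_name][0] }}"                 # hence "OSD-4 -> MON-1" above
  when: inventory_hostname == ansible_play_hosts_all | last      # evaluated on every host; only the last one runs it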

Additional info:
This issue is present only in ceph-ansible stable-3.2 (stable-4.0 and stable-5.0 look fine).

In the master branch it was fixed (along with some other changes) by:
https://github.com/ceph/ceph-ansible/commit/af6875706af93f133299156403f51d3ad48d17d3

Comment 1 RPietrzak 2020-08-21 08:17:22 UTC
https://github.com/ceph/ceph-ansible/pull/5704

Comment 7 errata-xmlrpc 2021-05-06 18:32:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 3.3 Security and Bug Fix Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:1518