Bug 1876447

Summary: [16.1.0->16.1.1][DCN] Ceph upgrade is failing as ansible is not considering the ceph cluster name
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Khomesh Thakre <kthakre>
Component: Ceph-Ansible
Assignee: Francesco Pantano <fpantano>
Status: CLOSED ERRATA
QA Contact: Ameena Suhani S H <amsyedha>
Severity: urgent
Docs Contact: Aron Gunn <agunn>
Priority: urgent
Version: 4.1
CC: agunn, aschoen, ceph-eng-bugs, fpantano, gabrioux, gfidente, gmeno, gsitlani, johfulto, mbultel, nchandek, nthomas, owalsh, schhabdi, tlapierr, tserlin, vereddy, ykaul, yrabl
Target Milestone: z2
Flags: yrabl: automate_bug-
Target Release: 4.1   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: ceph-ansible-4.0.30-1.el8cp, ceph-ansible-4.0.30-1.el7cp
Doc Type: Bug Fix
Doc Text:
.The {storage-product} rolling update fails when multiple storage clusters exist
Running the Ceph Ansible `rolling_update.yml` playbook when multiple storage clusters are configured would cause the rolling update to fail because a storage cluster name could not be specified. With this release, the `rolling_update.yml` playbook uses the `--cluster` option to allow for a specific storage cluster name.
Last Closed: 2020-09-30 17:26:56 UTC
Type: Bug
Bug Blocks: 1760354, 1816167    

Description Khomesh Thakre 2020-09-07 08:56:11 UTC
Description of problem:
During the upgrade of OSP with Ceph from 16.0 to 16.1 in a DCN environment, the Ceph upgrade is failing with the error below:

~~~

        "fatal: [rhosp-con1 -> 172.10.10.101]: FAILED! => {\"changed\": true, \"cmd\": [\"podman\", \"exec\", \"ceph-mon-rhosp-con1\", \"ceph\", \"osd\", \"require-osd-release\", \"nautilus\"], \"delta\": \"0:00:00.435547\", \"end\": \"2020-09-06 15:51:37.055340\", \"msg\": \"non-zero return code\", \"rc\": 1, \"start\": \"2020-09-06 15:51:36.619793\", \"stderr\": \"Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)\\nError: non zero exit code: 1: OCI runtime error\", \"stderr_lines\": [\"Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)\", \"Error: non zero exit code: 1: OCI runtime error\"], \"stdout\": \"\", \"stdout_lines\": []}",
        "NO MORE HOSTS LEFT *************************************************************",
        "PLAY RECAP *********************************************************************",
        "localhost                  : ok=1    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   ",
        "rhosp-con1                 : ok=295  changed=30   unreachable=0    failed=1    skipped=491  rescued=0    ignored=1   ",
        "rhosp-con2                 : ok=213  changed=20   unreachable=0    failed=0    skipped=395  rescued=0    ignored=1   ",
        "rhosp-con3                 : ok=213  changed=20   unreachable=0    failed=0    skipped=395  rescued=0    ignored=1   ",
        "rhosp-hci1                 : ok=207  changed=13   unreachable=0    failed=0    skipped=359  rescued=0    ignored=0   ",
        "rhosp-hci2                 : ok=202  changed=12   unreachable=0    failed=0    skipped=348  rescued=0    ignored=0   ",
        "rhosp-hci3                 : ok=203  changed=12   unreachable=0    failed=0    skipped=347  rescued=0    ignored=0   ",
        "rhosp-nfv1                 : ok=105  changed=5    unreachable=0    failed=0    skipped=225  rescued=0    ignored=0   ",
        "rhosp-nfv2                 : ok=105  changed=5    unreachable=0    failed=0    skipped=225  rescued=0    ignored=0   ",
        "Sunday 06 September 2020  15:51:37 +0200 (0:00:00.747)       0:18:25.016 ****** ",
~~~

Ansible is running the wrong command:

~~~
[root@rhosp-con1 ~]# podman exec ceph-mon-rhosp-con1 ceph osd require-osd-release nautilus
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
Error: non zero exit code: 1: OCI runtime error
~~~

The correct command is:
~~~
[root@rhosp-con1 ~]# podman exec ceph-mon-rhosp-con1 ceph -c /etc/ceph/central.conf osd require-osd-release nautilus
~~~
This is because the Ceph cluster name is "central".
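
Equivalently, and in line with the fix described in the Doc Text above, the ceph CLI's `--cluster` option can be used instead of pointing at the config file directly; `--cluster central` makes the client read /etc/ceph/central.conf. A minimal sketch using the same host and container names:

~~~
[root@rhosp-con1 ~]# podman exec ceph-mon-rhosp-con1 ceph --cluster central osd require-osd-release nautilus
~~~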


Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform 16.1 Train 

How reproducible:


Steps to Reproduce:
1. Deploy OSP with Ceph at the central site using a custom cluster name such as "central"
2. Perform an upgrade of the environment
3. The Ceph upgrade playbook will fail (see the sketch below)
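
A minimal sketch of how the failure in step 3 can be confirmed by hand on a monitor host, reusing the host and container names from the log above; it only assumes that the custom cluster name is reflected in the config file name under /etc/ceph:

~~~
# The config file is named after the custom cluster, so there is no default ceph.conf
[root@rhosp-con1 ~]# ls /etc/ceph/*.conf
/etc/ceph/central.conf

# Re-running the playbook's command by hand fails for the same reason as the task
[root@rhosp-con1 ~]# podman exec ceph-mon-rhosp-con1 ceph osd require-osd-release nautilus
Error initializing cluster client: ObjectNotFound('error calling conf_read_file',)
Error: non zero exit code: 1: OCI runtime error
~~~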

Actual results:
Ansible does not take the custom Ceph cluster name into account.

Expected results:
Ansible should take the custom Ceph cluster name into account.

Additional info:

Comment 20 errata-xmlrpc 2020-09-30 17:26:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 4.1 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4144