Bug 2228783

Summary: [OSP 17] overcloud deployment is failing trying to use a non-existing ceph admin socket
Product: Red Hat OpenStack Reporter: Flavio Piccioni <fpiccion>
Component: ceph-ansible Assignee: Teoman ONAY <tonay>
Status: CLOSED DUPLICATE QA Contact: Yogev Rabl <yrabl>
Severity: high Docs Contact:
Priority: high    
Version: 17.0 (Wallaby) CC: fpantano, gfidente, tonay
Target Milestone: --- Flags: ifrangs: needinfo? (tonay)
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-08-04 10:58:36 UTC Type: Bug

Description Flavio Piccioni 2023-08-03 09:04:06 UTC
Description of problem:
After successfully deploying the Ceph cluster, the customer moved on to the overcloud deployment, which includes Ceph tuning steps such as creating additional CRUSH rules and pools. The deployment fails when it tries to set the (new) default CRUSH rule.


Version-Release number of selected component (if applicable):
RHOSP 17 + RHCS 5

openstack-tripleo-heat-templates.noarch          14.3.1-0.20221208160327.feca772.el9ost   @openstack-17-for-rhel-9-x86_64-rpms 
tripleo-ansible.noarch                           3.3.1-0.20221208161844.fa5422f.el9ost    @openstack-17-for-rhel-9-x86_64-rpms 
cephadm.noarch                                   2:16.2.10-94.el9cp                       @rhos-17.0-RHCS-5              


How reproducible:
Tune Ceph as described in the steps below and run the deployment.


Steps to Reproduce:
1) Tune ceph-config.yaml in order to create two new CRUSH rules, setting one of them as the default:

parameter_defaults:
  CephCrushRules:
    - name: HDD
      root: default
      type: host
      class: hdd
      default: true
    - name: SSD
      root: default
      type: host
      class: ssd
      default: false
  CinderRbdExtraPools: ssdpool 
  CephPools:
    - name: ssdpool
      rule_name: SSD
      application: rbd
    - name: volumes
      target_size_ratio: 0.4
      application: rbd
    - name: images
      target_size_ratio: 0.1
      application: rbd
    - name: vms
      target_size_ratio: 0.3
      application: rbd


2) Run the overcloud deployment.
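
The configuration from step 1 can be sanity-checked before deploying. The following is only a minimal sketch, not part of TripleO or ceph-ansible: the `check_config` helper is hypothetical, the dictionaries simply mirror the YAML above, and it assumes that a pool without `rule_name` ends up on the rule flagged `default: true`.

```python
# Plain-Python mirror of the parameter_defaults from step 1 (no YAML parser needed).
crush_rules = [
    {"name": "HDD", "root": "default", "type": "host", "class": "hdd", "default": True},
    {"name": "SSD", "root": "default", "type": "host", "class": "ssd", "default": False},
]
pools = [
    {"name": "ssdpool", "rule_name": "SSD", "application": "rbd"},
    {"name": "volumes", "target_size_ratio": 0.4, "application": "rbd"},
    {"name": "images", "target_size_ratio": 0.1, "application": "rbd"},
    {"name": "vms", "target_size_ratio": 0.3, "application": "rbd"},
]


def check_config(crush_rules, pools):
    """Hypothetical helper: return (default_rule_name, pool_to_rule) or raise ValueError."""
    defaults = [rule["name"] for rule in crush_rules if rule.get("default")]
    if len(defaults) != 1:
        raise ValueError(f"expected exactly one default rule, found {defaults}")
    known = {rule["name"] for rule in crush_rules}
    pool_to_rule = {}
    for pool in pools:
        # Assumption: pools without an explicit rule_name use the default rule.
        rule_name = pool.get("rule_name", defaults[0])
        if rule_name not in known:
            raise ValueError(f"pool {pool['name']} references unknown rule {rule_name}")
        pool_to_rule[pool["name"]] = rule_name
    return defaults[0], pool_to_rule


default_rule, pool_to_rule = check_config(crush_rules, pools)
print(default_rule)   # HDD
print(pool_to_rule)
```

With the values from this report the check passes, so the configuration itself is consistent; the failure occurs later, when the new default rule is injected into the running monitor.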


Actual results:
The deployment fails because the task tries to use a non-existent socket (/var/run/ceph/ceph-mon.overcloud-controller-0.mydomain.com.asok)...

2023-08-02 21:25:37,195 p=150594 u=stack n=ansible | 2023-08-02 21:25:37.193048 | {uuid} |      FATAL | insert new default crush rule into daemon to prevent restart | overcloud-controller-0 -> {ip} | item=overcloud-controller-0 | error={"ansible_loop_var": "item", "changed": false, "cmd": ["podman", "run", "--rm", "--net=host", "--ipc=host", "--volume", "/etc/ceph:/etc/ceph:z", "--volume", "/home/ceph-admin/assimilate_ceph.conf:/home/assimilate_ceph.conf:z", "--volume", "/var/run/ceph/{fsid}:/var/run/ceph:z", "--entrypoint", "ceph", "director17.ctlplane.mydomain.com:8787/rhceph/rhceph-5-rhel8:latest", "--admin-daemon", "/var/run/ceph/ceph-mon.overcloud-controller-0.mydomain.com.asok", "config", "set", "osd_pool_default_crush_rule", "1"], "delta": "0:00:00.456442", "end": "2023-08-02 21:25:37.152697", "item": "overcloud-controller-0", "msg": "non-zero return code", "rc": 22, "start": "2023-08-02 21:25:36.696255", "stderr": "admin_socket: exception getting command descriptions: [Errno 2] No such file or directory", "stderr_lines": ["admin_socket: exception getting command descriptions: [Errno 2] No such file or directory"], "stdout": "", "stdout_lines": []}


...while the admin socket file is: /var/run/ceph/ceph-mon.overcloud-controller-0.asok


[root@overcloud-controller-0 {fsid}]# ls -las
total 0
0 drwxrwx---. 2  167  167 80 Aug  1 23:11 .
0 drwxr-xr-x. 3 root root 60 Aug  1 23:10 ..
0 srwxr-xr-x. 1  167  167  0 Aug  1 23:11 ceph-mgr.overcloud-controller-0.schoqv.asok
0 srwxr-xr-x. 1  167  167  0 Aug  1 23:10 ceph-mon.overcloud-controller-0.asok
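
The mismatch can be reproduced offline from the data above. The following is a rough sketch under stated assumptions: the `cmd` list is an abridged copy of the failed task's command, `option_values` and `host_path` are hypothetical helpers, and the volume translation only handles a socket that sits directly in the mounted directory. The `{fsid}` placeholder is kept exactly as it appears in the report.

```python
import posixpath

# Abridged copy of the "cmd" list from the failed task.
cmd = [
    "podman", "run", "--rm",
    "--volume", "/etc/ceph:/etc/ceph:z",
    "--volume", "/var/run/ceph/{fsid}:/var/run/ceph:z",
    "--entrypoint", "ceph",
    "--admin-daemon",
    "/var/run/ceph/ceph-mon.overcloud-controller-0.mydomain.com.asok",
    "config", "set", "osd_pool_default_crush_rule", "1",
]

# Socket files shown by `ls -las` in /var/run/ceph/{fsid} on the controller.
listing = [
    "ceph-mgr.overcloud-controller-0.schoqv.asok",
    "ceph-mon.overcloud-controller-0.asok",
]


def option_values(argv, flag):
    """Return every value that directly follows `flag` in argv."""
    return [argv[i + 1] for i, arg in enumerate(argv[:-1]) if arg == flag]


def host_path(container_path, volumes):
    """Map a container path to the host using host:container[:opts] volume specs."""
    for spec in volumes:
        host_dir, container_dir = spec.split(":")[:2]
        if posixpath.dirname(container_path) == container_dir:
            return posixpath.join(host_dir, posixpath.basename(container_path))
    return None


requested = option_values(cmd, "--admin-daemon")[0]
requested_on_host = host_path(requested, option_values(cmd, "--volume"))
requested_name = posixpath.basename(requested)

# Strip the "ceph-mon." prefix and ".asok" suffix to recover the host name used.
used_host = requested_name[len("ceph-mon."):-len(".asok")]
short_host = used_host.split(".", 1)[0]
short_name = f"ceph-mon.{short_host}.asok"

print(requested_on_host)
print(requested_name in listing)  # False: the FQDN-based name does not exist
print(short_name in listing)      # True: the short-hostname-based name does
```

The comparison shows that the task builds the socket name from the FQDN, while the monitor created its socket under the short host name.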


Additional info:

cat etc/hosts |grep controller-0
{ip} overcloud-controller-0.mydomain.com overcloud-controller-0
{ip} overcloud-controller-0.storage.mydomain.com overcloud-controller-0.storage
{ip} overcloud-controller-0.storagemgmt.mydomain.com overcloud-controller-0.storagemgmt
{ip} overcloud-controller-0.internalapi.mydomain.com overcloud-controller-0.internalapi
{ip} overcloud-controller-0.tenant.mydomain.com overcloud-controller-0.tenant
{ip} overcloud-controller-0.external.mydomain.com overcloud-controller-0.external
{ip} overcloud-controller-0.ctlplane.mydomain.com overcloud-controller-0.ctlplane