Deployment with internal Ceph fails with the following message:

  TASK [ceph : Get OSD stat percentage] ******************************************************************
  Friday 05 June 2020 20:09:42 +0000 (0:00:00.298) 0:33:33.740 ***********
  fatal: [undercloud -> 192.168.24.14]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": true, "cmd": "\"podman\" exec \"ceph-mon-oc0-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'", "delta": "0:00:00.664333", "end": "2020-06-05 20:09:43.389273", "msg": "non-zero return code", "rc": 5, "start": "2020-06-05 20:09:42.724940", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}
Reproduced this problem on a 16.1 build with RHCS 4.1:

  [root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'
  jq: error (at <stdin>:1): null (null) and null (null) cannot be divided
  [root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
  100
  [root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central --version
  ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)
  [root@central-controller0-0 ~]# podman images | grep ceph
  site-undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhceph   ceph-4.1-rhel-8-containers-candidate-19505-20200528060838-x86_64   680c9c0d38c3   11 days ago   957 MB
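The failure is easy to reproduce without a cluster. The sample JSON below is an assumption modeled on the Nautilus `ceph osd stat -f json` output shown above (only the nesting under `.osdmap` matters); it shows why the old filter divides null by null while the new one works:

```shell
# Hypothetical sample of Nautilus-style output: the OSD counters are
# nested under .osdmap instead of sitting at the top level.
NAUTILUS_JSON='{"osdmap":{"num_osds":3,"num_in_osds":3,"num_up_osds":3}}'

# Old filter: .num_in_osds and .num_osds are both null at the top level,
# so jq fails with "null (null) and null (null) cannot be divided".
echo "$NAUTILUS_JSON" | jq '( (.num_in_osds) / (.num_osds) ) * 100' 2>&1 || true

# New filter: reads the nested counters, so the division succeeds.
echo "$NAUTILUS_JSON" | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
```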
If you're deploying with validations enabled, you will hit this bug. The in-flight validation is designed to fail the deployment early if the requested OSDs were not configured. However, the mechanism (in openstack-validations) that checks whether the requested OSDs are running was obsoleted by a change in the JSON output of the 'ceph osd stat' command: Nautilus nests the OSD counters under an 'osdmap' key instead of exposing them at the top level. The mechanism needs to be updated to deal with the new output.
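One way the check could be made to tolerate both layouts is jq's alternative operator `//`, which falls back to the old top-level keys when the `.osdmap` path resolves to null. This is only a sketch of the idea, not necessarily the fix that landed in openstack-validations:

```shell
# Backward-compatible filter: prefer the Nautilus .osdmap.* path, fall
# back to the pre-Nautilus top-level keys when it resolves to null.
FILTER='( (.osdmap.num_in_osds // .num_in_osds) / (.osdmap.num_osds // .num_osds) ) * 100'

# Both JSON shapes yield the same percentage:
echo '{"osdmap":{"num_osds":4,"num_in_osds":4}}' | jq "$FILTER"   # Nautilus layout
echo '{"num_osds":4,"num_in_osds":4}' | jq "$FILTER"              # older layout
```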
WORKAROUND: Create a disable_osd_validation.yaml with the following content:

  parameter_defaults:
    CephOsdPercentageMin: 0

Then re-run your 'openstack overcloud deploy ...' command with "-e disable_osd_validation.yaml" added as the last argument.

More detail: as the template itself notes, "Set this value to 0 to disable this check."
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242
Hi,

We are hitting this during an update from OSP16.0/latest_cdn to OSP16.1, and it breaks:

  openstack overcloud external-update run \
    --stack qe-Cloud-0 \
    --tags ceph 2>&1

  2020-06-10 21:02:14 | TASK [ceph : Get OSD stat percentage] ******************************************
  2020-06-10 21:02:14 | Wednesday 10 June 2020 21:02:11 +0000 (0:00:00.300) 0:22:19.514 ********
  2020-06-10 21:02:14 | fatal: [undercloud -> 192.168.24.47]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'", "delta": "0:00:01.066946", "end": "2020-06-10 21:02:13.417533", "msg": "non-zero return code", "rc": 5, "start": "2020-06-10 21:02:12.350587", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

So this is definitely a blocker for the GA of OSP16.1.
Looking at the workaround in #c6: in the update context this means one has to add -e disable_osd_validation.yaml to the overcloud update prepare command, which runs before all of the overcloud update steps:

  openstack overcloud update prepare \
    <DEPLOY OPTIONS> \
    -e disable_osd_validation.yaml

If the failure happens during the ceph update run (just before the converge step), re-run the overcloud update prepare command and then re-run the ceph update command mentioned in #9.
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text. If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.
Verified on CI
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3148