Bug 1845079 - TASK [ceph : Get OSD stat percentage]: null and null cannot be divided
Summary: TASK [ceph : Get OSD stat percentage]: null and null cannot be divided
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-08 12:39 UTC by John Fulton
Modified: 2023-10-06 20:28 UTC (History)
10 users

Fixed In Version: openstack-tripleo-validations-11.3.2-0.20200512143424.aeea5d7.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, the data structure that the `ceph osd stat -f json` command returns changed format. As a result, the validation that stops the deployment unless a certain percentage of Red Hat Ceph Storage (RHCS) OSDs are running did not function correctly, and stopped the deployment regardless of how many OSDs were running.
With this update, the new version of `openstack-tripleo-validations` computes the percentage of running RHCS OSDs correctly, and the deployment stops early if the required percentage of RHCS OSDs is not running. You can use the `CephOsdPercentageMin` parameter to customize the percentage of RHCS OSDs that must be running. The default value is 66%. Set this parameter to `0` to disable the validation.
Clone Of:
Environment:
Last Closed: 2020-07-29 07:52:57 UTC
Target Upstream Version:
Embargoed:
yrabl: automate_bug+




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1882387 0 None None None 2020-06-08 12:39:35 UTC
OpenStack gerrit 734069 0 None MERGED Update Ceph role's Get OSD stat to use new data structure 2020-12-24 10:34:53 UTC
Red Hat Product Errata RHBA-2020:3148 0 None None None 2020-07-29 07:53:17 UTC

Description John Fulton 2020-06-08 12:39:05 UTC
Deployment with internal Ceph fails with the following message:

TASK [ceph : Get OSD stat percentage] ******************************************************************
Friday 05 June 2020 20:09:42 +0000 (0:00:00.298) 0:33:33.740 ***********
fatal: [undercloud -> 192.168.24.14]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": true, "cmd": "\"podman\" exec \"ceph-mon-oc0-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'", "delta": "0:00:00.664333", "end": "2020-06-05 20:09:43.389273", "msg": "non-zero return code", "rc": 5, "start": "2020-06-05 20:09:42.724940", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

Comment 1 John Fulton 2020-06-08 12:40:53 UTC
Reproduced this problem on a 16.1 build with RHCS 4.1:

[root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central osd stat  -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'
jq: error (at <stdin>:1): null (null) and null (null) cannot be divided
[root@central-controller0-0 ~]#

[root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central osd stat  -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
100
[root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central --version
ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)
[root@central-controller0-0 ~]# podman images | grep ceph 
site-undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhceph                                        ceph-4.1-rhel-8-containers-candidate-19505-20200528060838-x86_64   680c9c0d38c3   11 days ago   957 MB
[root@central-controller0-0 ~]#

Comment 2 John Fulton 2020-06-08 12:44:13 UTC
If you're deploying with validations enabled, you will hit this bug.
The in-flight validation is designed to fail the deployment early if the requested OSDs were not configured.
However, the mechanism (in openstack-tripleo-validations) that checks whether the requested OSDs are running was obsoleted by a change in the JSON output of the 'ceph osd stat' command.
The mechanism needs to be updated to deal with the new output.
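As a hedged illustration of the format change (the sample JSON below is constructed from the commands in comment 1, not captured from a live cluster), the newer `ceph osd stat -f json` output nests the OSD counters under an `osdmap` key, so the old top-level jq paths resolve to null:

```shell
# Sample JSON assumed to mimic the new `ceph osd stat -f json` structure,
# where the counters live under .osdmap instead of at the top level.
sample='{"osdmap":{"epoch":42,"num_osds":3,"num_up_osds":3,"num_in_osds":3}}'

# Old expression: .num_in_osds and .num_osds are now null, so jq fails
# with "null (null) and null (null) cannot be divided" (exit code 5).
echo "$sample" | jq '( (.num_in_osds) / (.num_osds) ) * 100' 2>/dev/null \
  || echo "old expression fails"

# Fixed expression from the merged patch: read the counters from .osdmap.
echo "$sample" | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
```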

Comment 6 John Fulton 2020-06-10 14:27:30 UTC
WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

parameter_defaults:
  CephOsdPercentageMin: 0


Re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument.



More detail:

As per the template "Set this value to 0 to disable this check."

https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242
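The workaround above can be sketched as a single shell snippet (the deploy options are placeholders for your existing command):

```shell
# Write the environment file from comment 6 that disables the OSD
# percentage check by setting CephOsdPercentageMin to 0.
cat > disable_osd_validation.yaml <<'EOF'
parameter_defaults:
  CephOsdPercentageMin: 0
EOF

# Then re-run your existing deploy command with the file appended last, e.g.:
#   openstack overcloud deploy --templates ... -e disable_osd_validation.yaml
```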

Comment 9 Sofer Athlan-Guyot 2020-06-11 06:17:32 UTC
Hi,

So we are hitting this during the update of OSP16.0/latest_cdn to OSP16.1, and it breaks:

openstack overcloud external-update run \
--stack qe-Cloud-0 \
--tags ceph 2>&1

2020-06-10 21:02:14 | TASK [ceph : Get OSD stat percentage] ******************************************
2020-06-10 21:02:14 | Wednesday 10 June 2020  21:02:11 +0000 (0:00:00.300)       0:22:19.514 ********
2020-06-10 21:02:14 | fatal: [undercloud -> 192.168.24.47]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'", "delta": "0:00:01.066946", "end": "2020-06-10 21:02:13.417533", "msg": "non-zero return code", "rc": 5, "start": "2020-06-10 21:02:12.350587", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}
2020-06-10 21:02:14 |

So this is definitely a blocker for GA of OSP16.1.

Comment 10 Sofer Athlan-Guyot 2020-06-11 07:48:49 UTC
Looking at the workaround in #c6, in the update context that means one has to add

 -e disable_osd_validation.yaml

to the overcloud update prepare command, which happens before all other overcloud update steps:

  openstack overcloud update prepare \
   <DEPLOY OPTIONS> \
   -e disable_osd_validation.yaml

If the failure happens during the ceph update run (just before the converge step), one has to re-run the overcloud update prepare command and then re-run the ceph update command mentioned in #9.

Comment 15 Alex McLeod 2020-06-16 12:34:09 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 16 Yogev Rabl 2020-06-22 13:46:06 UTC
Verified on CI

Comment 18 errata-xmlrpc 2020-07-29 07:52:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

