Bug 1845079 - TASK [ceph : Get OSD stat percentage]: null and null cannot be divided
Summary: TASK [ceph : Get OSD stat percentage]: null and null cannot be divided
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: John Fulton
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-08 12:39 UTC by John Fulton
Modified: 2023-10-06 20:28 UTC (History)
10 users

Fixed In Version: openstack-tripleo-validations-11.3.2-0.20200512143424.aeea5d7.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, the data structure that the `ceph osd stat -f json` command returns changed format. As a result, the validation that stops the deployment unless a certain percentage of Red Hat Ceph Storage (RHCS) OSDs are running did not function correctly, and stopped the deployment regardless of how many OSDs were running.
With this update, the new version of `openstack-tripleo-validations` computes the percentage of running RHCS OSDs correctly, and the deployment stops early if the required percentage of RHCS OSDs is not running. You can use the `CephOsdPercentageMin` parameter to customize the percentage of RHCS OSDs that must be running. The default value is 66%. Set this parameter to `0` to disable the validation.
Clone Of:
Environment:
Last Closed: 2020-07-29 07:52:57 UTC
Target Upstream Version:
Embargoed:
yrabl: automate_bug+




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1882387 0 None None None 2020-06-08 12:39:35 UTC
OpenStack gerrit 734069 0 None MERGED Update Ceph role's Get OSD stat to use new data structure 2020-12-24 10:34:53 UTC
Red Hat Product Errata RHBA-2020:3148 0 None None None 2020-07-29 07:53:17 UTC

Description John Fulton 2020-06-08 12:39:05 UTC
Deployment with internal Ceph fails with the following message:

TASK [ceph : Get OSD stat percentage] ******************************************************************
Friday 05 June 2020 20:09:42 +0000 (0:00:00.298) 0:33:33.740 ***********
fatal: [undercloud -> 192.168.24.14]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": true, "cmd": "\"podman\" exec \"ceph-mon-oc0-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'", "delta": "0:00:00.664333", "end": "2020-06-05 20:09:43.389273", "msg": "non-zero return code", "rc": 5, "start": "2020-06-05 20:09:42.724940", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

Comment 1 John Fulton 2020-06-08 12:40:53 UTC
Reproduced this problem on a 16.1 build with RHCS 4.1:

[root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central osd stat  -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'
jq: error (at <stdin>:1): null (null) and null (null) cannot be divided
[root@central-controller0-0 ~]#

[root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central osd stat  -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
100
[root@central-controller0-0 ~]# podman exec -ti ceph-mon-$HOSTNAME ceph --cluster central --version
ceph version 14.2.8-59.el8cp (53387608e81e6aa2487c952a604db06faa5b2cd0) nautilus (stable)
[root@central-controller0-0 ~]# podman images | grep ceph 
site-undercloud-0.ctlplane.localdomain:8787/rh-osbs/rhceph                                        ceph-4.1-rhel-8-containers-candidate-19505-20200528060838-x86_64   680c9c0d38c3   11 days ago   957 MB
[root@central-controller0-0 ~]#

Comment 2 John Fulton 2020-06-08 12:44:13 UTC
If you're deploying with validations enabled, you will hit this bug.
The in-flight validation is designed to fail the deployment early if the requested OSDs were not configured.
However, the mechanism (in openstack-tripleo-validations) that checks whether the requested OSDs are running was obsoleted by a change in the JSON output of the 'ceph osd stat' command.
The mechanism needs to be updated to deal with the new output.
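As a hedged illustration of the format change (the sample JSON below is constructed from the commands in comment 1, not captured from a live cluster), the newer `ceph osd stat -f json` output nests the OSD counters under an `osdmap` key, so the old top-level jq paths resolve to null:

```shell
# Sample JSON assumed to mimic the new `ceph osd stat -f json` structure,
# where the counters live under .osdmap instead of at the top level.
sample='{"osdmap":{"epoch":42,"num_osds":3,"num_up_osds":3,"num_in_osds":3}}'

# Old expression: .num_in_osds and .num_osds are now null, so jq fails
# with "null (null) and null (null) cannot be divided" (exit code 5).
echo "$sample" | jq '( (.num_in_osds) / (.num_osds) ) * 100' 2>/dev/null \
  || echo "old expression fails"

# Fixed expression from the merged patch: read the counters from .osdmap.
echo "$sample" | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
```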

Comment 6 John Fulton 2020-06-10 14:27:30 UTC
WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

parameter_defaults:
  CephOsdPercentageMin: 0


Re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument.



More detail:

As per the template "Set this value to 0 to disable this check."

https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242
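The workaround above can be sketched as a single shell snippet (the deploy options are placeholders for your existing command):

```shell
# Write the environment file from comment 6 that disables the OSD
# percentage check by setting CephOsdPercentageMin to 0.
cat > disable_osd_validation.yaml <<'EOF'
parameter_defaults:
  CephOsdPercentageMin: 0
EOF

# Then re-run your existing deploy command with the file appended last, e.g.:
#   openstack overcloud deploy --templates ... -e disable_osd_validation.yaml
```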

Comment 9 Sofer Athlan-Guyot 2020-06-11 06:17:32 UTC
Hi,

So we are hitting this during the update of OSP16.0/latest_cdn to OSP16.1, and it breaks:

openstack overcloud external-update run \
--stack qe-Cloud-0 \
--tags ceph 2>&1

2020-06-10 21:02:14 | TASK [ceph : Get OSD stat percentage] ******************************************
2020-06-10 21:02:14 | Wednesday 10 June 2020  21:02:11 +0000 (0:00:00.300)       0:22:19.514 ********
2020-06-10 21:02:14 | fatal: [undercloud -> 192.168.24.47]: FAILED! => {"ansible_facts": {"discovered_interpreter_python": "/usr/libexec/platform-python"}, "changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.num_in_osds) / (.num_osds) ) * 100'", "delta": "0:00:01.066946", "end": "2020-06-10 21:02:13.417533", "msg": "non-zero return code", "rc": 5, "start": "2020-06-10 21:02:12.350587", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}
2020-06-10 21:02:14 |

So this is definitely a blocker for GA of OSP16.1.

Comment 10 Sofer Athlan-Guyot 2020-06-11 07:48:49 UTC
Looking at the workaround in #c6, in the update context that means one has to add

 -e disable_osd_validation.yaml

to the overcloud update prepare command, which happens before all other overcloud update steps:

  openstack overcloud update prepare \
   <DEPLOY OPTIONS> \
   -e disable_osd_validation.yaml

If the failure happens during the ceph update run (just before the converge step), one has to re-run the overcloud update prepare command and then re-run the ceph update command mentioned in #9.

Comment 15 Alex McLeod 2020-06-16 12:34:09 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 16 Yogev Rabl 2020-06-22 13:46:06 UTC
Verified on CI

Comment 18 errata-xmlrpc 2020-07-29 07:52:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

