Description of problem:

We are trying to upgrade from RHOSP 13 to RHOSP 16.1. While running:

openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0

the upgrade step fails with:

+++
TASK [ceph : Get OSD stat percentage] ******************************************
Wednesday 01 July 2020 09:11:52 -0400 (0:00:00.241) 0:01:40.860 ********
changed: [undercloud -> 10.10.0.104] => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-overcloud-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'", "delta": "0:00:00.435386", "end": "2020-07-01 13:11:53.353584", "rc": 0, "start": "2020-07-01 13:11:52.918198", "stderr": "jq: error: null and null cannot be divided", "stderr_lines": ["jq: error: null and null cannot be divided"], "stdout": "", "stdout_lines": []}

TASK [ceph : Fail if there is an unacceptable percentage of in OSDs] ***********
Wednesday 01 July 2020 09:11:53 -0400 (0:00:00.898) 0:01:41.759 ********
fatal: [undercloud -> 10.10.0.104]: FAILED! => {"changed": false, "msg": "Only 0.0% of OSDs are in, but 66% are required"}
+++

So the command that the validation runs is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
jq: error: null and null cannot be divided
+++

While the command that should be run is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.num_in_osds) / (.num_osds) ) * 100'
100
[heat-admin@overcloud-controller-0 tmp]$
+++

We can see that the OSDs are all up and running fine:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph -s
  cluster:
    id:     999157a6-ba94-11ea-9cd3-fa163e7b60c7
    health: HEALTH_WARN
            too few PGs per OSD (26 < min 30)

  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
    mgr: overcloud-controller-2(active), standbys: overcloud-controller-1, overcloud-controller-0
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   5 pools, 80 pgs
    objects: 325 objects, 251MiB
    usage:   487MiB used, 284GiB / 285GiB avail
    pgs:     80 active+clean
+++

This validation comes from the file /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:

+++
- when:
    - osd_percentage_min|default(0) > 0
  block:
    - name: set jq osd percentage filter
      set_fact:
        jq_osd_percentage_filter: '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'

    - name: Get OSD stat percentage
      become: true
      shell: >-
        "{{ container_client }}" exec "{{ ceph_mon_container.stdout }}" ceph --cluster "{{ ceph_cluster_name.stdout }}" osd stat -f json | jq '{{ jq_osd_percentage_filter }}'
      register: ceph_osd_in_percentage
+++

Version-Release number of selected component (if applicable):

[root@undercloud ceph-ansible]# rpm -qa | grep -i openstack-tripleo-validations
openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch
[root@undercloud ceph-ansible]#

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
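The mismatch described above is between two JSON layouts of `ceph osd stat -f json`: the validation filter reads the counters under an `osdmap` key, while the cluster in this report returns them at the top level. As a rough illustration only (this is not the upstream fix, and the function name `osd_in_percentage` is hypothetical), a layout-tolerant version of the same calculation could look like this in Python:

```python
import json


def osd_in_percentage(osd_stat_json: str) -> float:
    """Return the percentage of OSDs that are 'in'.

    Accepts the JSON printed by `ceph osd stat -f json` in either layout:
    counters nested under "osdmap", or counters at the top level.
    """
    stat = json.loads(osd_stat_json)
    # Fall back to the top-level object when "osdmap" is missing or null.
    counters = stat.get("osdmap") or stat
    num_osds = counters.get("num_osds")
    num_in_osds = counters.get("num_in_osds")
    if not num_osds or num_in_osds is None:
        # Mirrors the jq failure mode (null / null) with a clearer message.
        raise ValueError("num_osds / num_in_osds not found in osd stat output")
    return num_in_osds / num_osds * 100


# Flat layout, as returned by the cluster in this report (3 of 3 OSDs in).
print(osd_in_percentage('{"num_osds": 3, "num_in_osds": 3}'))
# Nested layout, as expected by the validation filter.
print(osd_in_percentage('{"osdmap": {"num_osds": 4, "num_in_osds": 2}}'))
```

The sample JSON strings contain only the two fields the filter uses; real `osd stat` output carries more fields, which this sketch ignores.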
WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

parameter_defaults:
  CephOsdPercentageMin: 0

Then re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument.

More detail: As per the template, "Set this value to 0 to disable this check."
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242
Facing the same issue while running the Ceph upgrade (FFU 13-16):

$ openstack overcloud external-upgrade run --stack overcloud --tags ceph_systemd -e ceph_ansible_limit=computehci0

~~~
TASK [ceph : Get OSD stat percentage] ******************************************
Thursday 02 July 2020 12:19:36 +0000 (0:00:00.220) 0:01:18.448 *********
fatal: [undercloud -> 192.168.24.12]: FAILED! => {"changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'", "delta": "0:00:01.104584", "end": "2020-07-02 12:19:37.468272", "msg": "non-zero return code", "rc": 5, "start": "2020-07-02 12:19:36.363688", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
compute-0    : ok=4  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
compute-1    : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
computehci-0 : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
computehci-1 : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
computehci-2 : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
controller-0 : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
controller-1 : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
controller-2 : ok=3  changed=1  unreachable=0 failed=0 skipped=0  rescued=0 ignored=0
undercloud   : ok=51 changed=12 unreachable=0 failed=1 skipped=60 rescued=0 ignored=0
~~~

Re-running with "--skip-tags opendev-validation-ceph".
Upstream patch merged in main branch but we need backports: https://review.opendev.org/738855
We tried the patch submitted as a fix in our FFU manual testing, and it failed with:

TASK [ceph : set jq osd percentage filter] *************************************
Tuesday 28 July 2020 10:22:16 -0400 (0:00:00.270) 0:01:11.723 **********
ok: [undercloud -> 192.168.24.15] => {"ansible_facts": {"jq_osd_percentage_filter": "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100"}, "changed": false}

TASK [ceph : Get OSD stat percentage] ******************************************
Tuesday 28 July 2020 10:22:17 -0400 (0:00:00.272) 0:01:11.996 **********
fatal: [undercloud -> 192.168.24.15]: FAILED! => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100'", "delta": "0:00:00.404082", "end": "2020-07-27 15:43:28.260232", "msg": "non-zero return code", "rc": 1, "start": "2020-07-27 15:43:27.856150", "stderr": "error: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\n4 compile errors", "stderr_lines": ["error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "4 compile errors"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

The jq package version found in the controller is:

[root@controller-0 ~]# rpm -qa | grep jq
python-XStatic-jquery-ui-1.10.4.1-1.el7ost.1.noarch
jq-1.3-4.el7ost.x86_64

The try-catch syntax was added in jq-1.5 and we have jq-1.3, so the proposed solution won't work.

Moving the BZ back to ASSIGNED so the Ceph Squad can re-work the fix.
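To make the version constraint above concrete, here is a small Python sketch that decides from an RPM package name whether the installed jq is new enough for `try`. It relies only on the statement in this comment that try-catch arrived in jq-1.5; the helper names (`jq_version_from_rpm`, `supports_try`) and the `jq-1.5-...` package name in the usage lines are hypothetical.

```python
import re

# Per the comment above, try-catch syntax first appeared in jq 1.5.
TRY_MIN_VERSION = (1, 5)


def jq_version_from_rpm(package_name: str) -> tuple:
    """Extract (major, minor) from an RPM name such as 'jq-1.3-4.el7ost.x86_64'."""
    # re.match anchors at the start, so unrelated packages whose names merely
    # contain "jq" (e.g. python-XStatic-jquery-ui-...) are rejected.
    match = re.match(r"jq-(\d+)\.(\d+)", package_name)
    if match is None:
        raise ValueError("not a jq package name: %r" % package_name)
    return (int(match.group(1)), int(match.group(2)))


def supports_try(package_name: str) -> bool:
    """Return True when the packaged jq is at least the first version with try."""
    return jq_version_from_rpm(package_name) >= TRY_MIN_VERSION


# Package found on the controller in this report.
print(supports_try("jq-1.3-4.el7ost.x86_64"))
# Hypothetical newer package name, for comparison only.
print(supports_try("jq-1.5-1.el7.x86_64"))
```

Comparing version tuples rather than strings avoids ordering mistakes once a minor version reaches two digits.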
The fix was not implemented in openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch.
Verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3542