Bug 1852868
Summary: | [RHOSP 16.1 Upgrades] openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0 fails with validation error for task "Get OSD stat percentage" | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Punit Kundal <pkundal> |
Component: | openstack-tripleo-validations | Assignee: | Francesco Pantano <fpantano> |
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 16.1 (Train) | CC: | cjeanner, dmacpher, fpantano, gfidente, jamsmith, jfrancoa, jjoyce, johfulto, jpretori, jschluet, pgrist, sbandyop, slinaber, spower, svigan, tvignaud |
Target Milestone: | z1 | Keywords: | Triaged |
Target Release: | 16.1 (Train on RHEL 8.2) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | openstack-tripleo-validations-11.3.2-0.20200611115253.08f469d.el8ost | Doc Type: | Bug Fix |
Doc Text: |
This update fixes a Red Hat Ceph Storage (RHCS) version compatibility issue that caused failures during upgrades from Red Hat OpenStack Platform 13 to 16.1. Before this fix, validations performed during the upgrade worked with RHCS3 clusters but not with RHCS4 clusters. Now the validation works with both RHCS3 and RHCS4 clusters.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2020-08-27 15:19:10 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Punit Kundal
2020-07-01 13:46:50 UTC
WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

parameter_defaults:
  CephOsdPercentageMin: 0

re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument

More detail:

As per the template "Set this value to 0 to disable this check."
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242

Facing the same issue while running ceph upgrade (FFU 13-16)

$ openstack overcloud external-upgrade run --stack overcloud --tags ceph_systemd -e ceph_ansible_limit=computehci0

~~~
TASK [ceph : Get OSD stat percentage] ******************************************
Thursday 02 July 2020 12:19:36 +0000 (0:00:00.220) 0:01:18.448 *********
fatal: [undercloud -> 192.168.24.12]: FAILED! => {"changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'", "delta": "0:00:01.104584", "end": "2020-07-02 12:19:37.468272", "msg": "non-zero return code", "rc": 5, "start": "2020-07-02 12:19:36.363688", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
compute-0    : ok=4 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
compute-1    : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
computehci-0 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
computehci-1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
computehci-2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
controller-0 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
controller-1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
controller-2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
undercloud   : ok=51 changed=12 unreachable=0 failed=1 skipped=60 rescued=0 ignored=0
~~~

Re-running with "--skip-tags opendev-validation-ceph"

Upstream patch merged in main branch but we need backports:
https://review.opendev.org/738855

So, we did try the patch submitted as a fix in our FFU manual testing and it failed with:

TASK [ceph : set jq osd percentage filter] *************************************
Tuesday 28 July 2020 10:22:16 -0400 (0:00:00.270) 0:01:11.723 **********
ok: [undercloud -> 192.168.24.15] => {"ansible_facts": {"jq_osd_percentage_filter": "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100"}, "changed": false}

TASK [ceph : Get OSD stat percentage] ******************************************
Tuesday 28 July 2020 10:22:17 -0400 (0:00:00.272) 0:01:11.996 **********
fatal: [undercloud -> 192.168.24.15]: FAILED! => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100'", "delta": "0:00:00.404082", "end": "2020-07-27 15:43:28.260232", "msg": "non-zero return code", "rc": 1, "start": "2020-07-27 15:43:27.856150", "stderr": "error: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\n4 compile errors", "stderr_lines": ["error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "4 compile errors"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

The jq package version found in the controller is:

[root@controller-0 ~]# rpm -qa | grep jq
python-XStatic-jquery-ui-1.10.4.1-1.el7ost.1.noarch
jq-1.3-4.el7ost.x86_64

The try-catch syntax was added in jq-1.5 and we have jq-1.3, so the proposed solution won't work. Moving the BZ back to ASSIGNED so the Ceph Squad can re-work the fix.
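For illustration only, here is a minimal sketch of the percentage calculation that the "Get OSD stat percentage" task performs, written so that it does not depend on jq's try syntax. It assumes, based solely on the two jq filters quoted in this report, that num_osds and num_in_osds appear either under "osdmap" or at the top level of the `ceph osd stat -f json` output. The function name and the sample JSON values are hypothetical, and this is not the fix that was eventually shipped.

```python
import json


def osd_in_percentage(osd_stat_json: str) -> float:
    """Return the percentage of OSDs that are 'in'.

    Assumption: the counters live either under "osdmap" or at the
    top level of the JSON document, as the two filters in this
    report suggest.
    """
    data = json.loads(osd_stat_json)

    # Prefer the nested layout when present, otherwise use the top level.
    nested = data.get("osdmap")
    source = nested if isinstance(nested, dict) else data

    num_osds = source.get("num_osds")
    num_in = source.get("num_in_osds")

    # Fail with a clear message instead of dividing null by null.
    if not num_osds or num_in is None:
        raise ValueError("num_osds or num_in_osds missing (or zero) in osd stat output")

    return 100 * num_in / num_osds


# Hypothetical sample inputs, one per layout.
nested_layout = '{"osdmap": {"num_osds": 15, "num_in_osds": 15}}'
flat_layout = '{"num_osds": 15, "num_in_osds": 12}'

print(osd_in_percentage(nested_layout))  # 100.0
print(osd_in_percentage(flat_layout))    # 80.0
```

Raising an explicit error when neither layout matches avoids the opaque "null (null) and null (null) cannot be divided" failure seen with the original filter.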
The fix was not implemented in openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch

Verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3542