Bug 1852868
| Summary: | [RHOSP 16.1 Upgrades] openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0 fails with validation error for task "Get OSD stat percentage" | ||
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Punit Kundal <pkundal> |
| Component: | openstack-tripleo-validations | Assignee: | Francesco Pantano <fpantano> |
| Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 16.1 (Train) | CC: | cjeanner, dmacpher, fpantano, gfidente, jamsmith, jfrancoa, jjoyce, johfulto, jpretori, jschluet, pgrist, sbandyop, slinaber, spower, svigan, tvignaud |
| Target Milestone: | z1 | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | openstack-tripleo-validations-11.3.2-0.20200611115253.08f469d.el8ost | Doc Type: | Bug Fix |
| Doc Text: | This update fixes a Red Hat Ceph Storage (RHCS) version compatibility issue that caused failures during upgrades from Red Hat OpenStack Platform 13 to 16.1. Before this fix, validations performed during the upgrade worked with RHCS3 clusters but not RHCS4 clusters. Now the validation works with both RHCS3 and RHCS4 clusters. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2020-08-27 15:19:10 UTC | Type: | Bug |
WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

    parameter_defaults:
      CephOsdPercentageMin: 0

Re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument.

More detail: As per the template, "Set this value to 0 to disable this check."
https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242

Facing the same issue while running ceph upgrade (FFU 13-16)
$ openstack overcloud external-upgrade run --stack overcloud --tags ceph_systemd -e ceph_ansible_limit=computehci0
~~~
TASK [ceph : Get OSD stat percentage] ******************************************
Thursday 02 July 2020 12:19:36 +0000 (0:00:00.220) 0:01:18.448 *********
fatal: [undercloud -> 192.168.24.12]: FAILED! => {"changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'", "delta": "0:00:01.104584", "end": "2020-07-02 12:19:37.468272", "msg": "non-zero return code", "rc": 5, "start": "2020-07-02 12:19:36.363688", "stderr": "jq: error (at <stdin>:1): null (null) and null (null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
compute-0 : ok=4 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
compute-1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
computehci-0 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
computehci-1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
computehci-2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
controller-0 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
controller-1 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
controller-2 : ok=3 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
undercloud : ok=51 changed=12 unreachable=0 failed=1 skipped=60 rescued=0 ignored=0
~~~
Re-running with "--skip-tags opendev-validation-ceph"
The upstream patch has merged in the main branch, but we need backports: https://review.opendev.org/738855

We tried the patch submitted as a fix in our FFU manual testing, and it failed with:
TASK [ceph : set jq osd percentage filter] *************************************
Tuesday 28 July 2020 10:22:16 -0400 (0:00:00.270) 0:01:11.723 **********
ok: [undercloud -> 192.168.24.15] => {"ansible_facts": {"jq_osd_percentage_filter": "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100"}, "changed": false}
TASK [ceph : Get OSD stat percentage] ******************************************
Tuesday 28 July 2020 10:22:17 -0400 (0:00:00.272) 0:01:11.996 **********
fatal: [undercloud -> 192.168.24.15]: FAILED! => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100'", "delta": "0:00:00.404082", "end": "2020-07-27 15:43:28.260232", "msg": "non-zero return code", "rc": 1, "start": "2020-07-27 15:43:27.856150", "stderr": "error: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n ^^^\n4 compile errors", "stderr_lines": ["error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", " ^^^", "4 compile errors"], "stdout": "", "stdout_lines": []}
NO MORE HOSTS LEFT *************************************************************
The jq package version found in the controller is:
[root@controller-0 ~]# rpm -qa | grep jq
python-XStatic-jquery-ui-1.10.4.1-1.el7ost.1.noarch
jq-1.3-4.el7ost.x86_64
The try-catch syntax was added in jq-1.5 and we have jq-1.3, so the proposed solution won't work. Moving the BZ back to ASSIGNED so the Ceph Squad can re-work the fix.
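For illustration only, the lookup the patched filter attempts (read the counters from under `osdmap` when present, otherwise from the top level) can be sketched without any `try` construct. The sketch below is in Python rather than jq and is not the fix that was shipped. The function name and the sample payloads are hypothetical, and the payloads contain only the two fields referenced in this report.

```python
import json


def osd_in_percentage(osd_stat_json: str) -> float:
    """Return the percentage of OSDs that are "in", given `ceph osd stat -f json` output."""
    data = json.loads(osd_stat_json)
    # The counters may be nested under "osdmap" or sit at the top level,
    # depending on the Ceph release; fall back to the top level if "osdmap" is absent.
    stats = data.get("osdmap") or data
    num_osds = stats.get("num_osds")
    num_in_osds = stats.get("num_in_osds")
    if not num_osds or num_in_osds is None:
        # Fail loudly instead of dividing null by null as the original filter did.
        raise ValueError("num_osds / num_in_osds not found in osd stat output")
    return num_in_osds / num_osds * 100


# Hypothetical sample payloads, one per layout.
nested_layout = '{"osdmap": {"num_osds": 3, "num_in_osds": 3}}'
flat_layout = '{"num_osds": 3, "num_in_osds": 2}'

print(osd_in_percentage(nested_layout))
print(osd_in_percentage(flat_layout))
```

Raising on missing counters is a design choice of this sketch: it keeps an unexpected output format from being reported as "0% of OSDs are in".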
The fix was not implemented in openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch

Verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:3542
Description of problem:

We are trying to perform an upgrade from RHOSP 13 to RHOSP 16.1. While running:

openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0

the upgrade step fails with:

+++
TASK [ceph : Get OSD stat percentage] ******************************************
Wednesday 01 July 2020 09:11:52 -0400 (0:00:00.241) 0:01:40.860 ********
changed: [undercloud -> 10.10.0.104] => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-overcloud-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'", "delta": "0:00:00.435386", "end": "2020-07-01 13:11:53.353584", "rc": 0, "start": "2020-07-01 13:11:52.918198", "stderr": "jq: error: null and null cannot be divided", "stderr_lines": ["jq: error: null and null cannot be divided"], "stdout": "", "stdout_lines": []}

TASK [ceph : Fail if there is an unacceptable percentage of in OSDs] ***********
Wednesday 01 July 2020 09:11:53 -0400 (0:00:00.898) 0:01:41.759 ********
fatal: [undercloud -> 10.10.0.104]: FAILED! => {"changed": false, "msg": "Only 0.0% of OSDs are in, but 66% are required"}
+++

So the command that the validation is running is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'
jq: error: null and null cannot be divided
+++

While the command that should be run is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.num_in_osds) / (.num_osds) ) * 100'
100
[heat-admin@overcloud-controller-0 tmp]$
+++

We can see that the OSDs are all up and running fine:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph -s
  cluster:
    id:     999157a6-ba94-11ea-9cd3-fa163e7b60c7
    health: HEALTH_WARN
            too few PGs per OSD (26 < min 30)

  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
    mgr: overcloud-controller-2(active), standbys: overcloud-controller-1, overcloud-controller-0
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   5 pools, 80 pgs
    objects: 325 objects, 251MiB
    usage:   487MiB used, 284GiB / 285GiB avail
    pgs:     80 active+clean
+++

This validation comes from the file /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:

+++
- when:
    - osd_percentage_min|default(0) > 0
  block:
    - name: set jq osd percentage filter
      set_fact:
        jq_osd_percentage_filter: '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'

    - name: Get OSD stat percentage
      become: true
      shell: >-
        "{{ container_client }}" exec "{{ ceph_mon_container.stdout }}" ceph --cluster "{{ ceph_cluster_name.stdout }}" osd stat -f json | jq '{{ jq_osd_percentage_filter }}'
      register: ceph_osd_in_percentage
+++

Version-Release number of selected component (if applicable):

[root@undercloud ceph-ansible]# rpm -qa | grep -i openstack-tripleo-validations
openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch
[root@undercloud ceph-ansible]#

How reproducible: Always
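
The way an empty jq result turns into the "Only 0.0% of OSDs are in" failure can be sketched as follows. This is a minimal Python illustration, not the role's actual implementation: the function name is hypothetical, and treating empty output as 0.0 is an assumption inferred from the failure message above. The "0 disables the check" behaviour mirrors the `osd_percentage_min|default(0) > 0` condition in the task file.

```python
from typing import Optional


def osd_check_failure(jq_stdout: str, osd_percentage_min: float) -> Optional[str]:
    """Return a failure message, or None when the check passes or is disabled."""
    if osd_percentage_min <= 0:
        # A minimum of 0 disables the check, as in the `when:` condition.
        return None
    text = jq_stdout.strip()
    # Assumption: empty jq output (the filter failed) is treated as 0.0 percent.
    percentage = float(text) if text else 0.0
    if percentage < osd_percentage_min:
        return (
            f"Only {percentage}% of OSDs are in, "
            f"but {osd_percentage_min}% are required"
        )
    return None


# Empty stdout, as registered when jq could not divide null by null.
print(osd_check_failure("", 66))
# Output of the corrected filter on a healthy cluster.
print(osd_check_failure("100", 66))
```

Under this assumption a healthy cluster fails the validation only because the filter produced no output, which is consistent with `ceph -s` reporting "3 up, 3 in" while the check reports 0.0%.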