Bug 1852868 - [RHOSP 16.1 Upgrades] openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0 fails with validation error for task "Get OSD stat percentage"
Summary: [RHOSP 16.1 Upgrades] openstack overcloud external-upgrade run --stack STACK ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-validations
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z1
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Francesco Pantano
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-01 13:46 UTC by Punit Kundal
Modified: 2020-08-27 15:19 UTC

Fixed In Version: openstack-tripleo-validations-11.3.2-0.20200611115253.08f469d.el8ost
Doc Type: Bug Fix
Doc Text:
This update fixes a Red Hat Ceph Storage (RHCS) version compatibility issue that caused failures during upgrades from Red Hat OpenStack Platform 13 to 16.1. Before this fix, the validations performed during the upgrade worked with RHCS3 clusters but not with RHCS4 clusters. The validation now works with both RHCS3 and RHCS4 clusters.
Clone Of:
Environment:
Last Closed: 2020-08-27 15:19:10 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Launchpad | 1882387 | 0 | None | None | None | 2020-07-01 14:18:13 UTC
Launchpad | 1889279 | 0 | None | None | None | 2020-07-28 16:42:43 UTC
OpenStack gerrit | 738855 | 0 | None | MERGED | Make Get OSD stat percentage compatible with both Luminous and Nautilus | 2020-12-05 11:34:09 UTC
OpenStack gerrit | 741427 | 0 | None | MERGED | Make Get OSD stat percentage compatible with both Luminous and Nautilus | 2020-12-05 11:34:08 UTC
OpenStack gerrit | 743572 | 0 | None | MERGED | Make Get OSD stat percentage compatible with jq < 1.5 | 2020-12-05 11:34:34 UTC
OpenStack gerrit | 743592 | 0 | None | MERGED | Make Get OSD stat percentage compatible with jq < 1.5 | 2020-12-05 11:34:07 UTC
OpenStack gerrit | 743598 | 0 | None | MERGED | Make Get OSD stat percentage compatible with jq < 1.5 | 2020-12-05 11:34:35 UTC
Red Hat Product Errata | RHBA-2020:3542 | 0 | None | None | None | 2020-08-27 15:19:32 UTC

Description Punit Kundal 2020-07-01 13:46:50 UTC
Description of problem:

We are trying to upgrade from RHOSP 13 to RHOSP 16.1. When running:

openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0

the upgrade step fails with:

+++
TASK [ceph : Get OSD stat percentage] ******************************************
Wednesday 01 July 2020  09:11:52 -0400 (0:00:00.241)       0:01:40.860 ********
changed: [undercloud -> 10.10.0.104] => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-overcloud-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) $
 * 100'", "delta": "0:00:00.435386", "end": "2020-07-01 13:11:53.353584", "rc": 0, "start": "2020-07-01 13:11:52.918198", "stderr": "jq: error: null and null cannot be divided", "stderr_lines": ["jq: error: null
 and null cannot be divided"], "stdout": "", "stdout_lines": []}

TASK [ceph : Fail if there is an unacceptable percentage of in OSDs] ***********
Wednesday 01 July 2020  09:11:53 -0400 (0:00:00.898)       0:01:41.759 ********
fatal: [undercloud -> 10.10.0.104]: FAILED! => {"changed": false, "msg": "Only 0.0% of OSDs are in, but 66% are required"}                                                                                        
+++

So the command that the validation runs is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'                         
jq: error: null and null cannot be divided
+++

The command that should be run instead is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.num_in_osds) / (.num_osds) ) * 100'                                       
100
[heat-admin@overcloud-controller-0 tmp]$
+++
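
The two commands above differ only in where the counters sit in the JSON: nested under "osdmap" in one layout and at the top level in the other. A minimal Python sketch of a lookup that accepts either layout follows. The function name osd_in_percentage and the sample payloads (trimmed to the fields the filter uses) are hypothetical illustrations, not the upstream fix.

```python
import json


def osd_in_percentage(osd_stat_json: str) -> float:
    """Return the percentage of OSDs that are 'in' from `ceph osd stat -f json` output.

    Accepts both layouts seen in this report: counters nested under
    "osdmap" and counters at the top level.
    """
    stat = json.loads(osd_stat_json)
    # Fall back to the top-level object when there is no "osdmap" key.
    counters = stat.get("osdmap", stat)
    num_osds = counters.get("num_osds")
    num_in_osds = counters.get("num_in_osds")
    if not num_osds or num_in_osds is None:
        # Fail loudly instead of reporting a misleading 0%.
        raise ValueError("num_osds/num_in_osds missing or zero in osd stat output")
    return num_in_osds / num_osds * 100


# Hypothetical, trimmed payloads for the two layouts.
nested = '{"osdmap": {"num_osds": 3, "num_up_osds": 3, "num_in_osds": 3}}'
flat = '{"num_osds": 3, "num_up_osds": 3, "num_in_osds": 3}'

print(osd_in_percentage(nested))  # 100.0
print(osd_in_percentage(flat))    # 100.0
```

Raising on missing counters is a design choice in this sketch: it keeps a layout mismatch from being read as "no OSDs are in".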

The cluster status shows that all three OSDs are up and in:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph -s                                                                                                  
  cluster:
    id:     999157a6-ba94-11ea-9cd3-fa163e7b60c7
    health: HEALTH_WARN
            too few PGs per OSD (26 < min 30)

  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
    mgr: overcloud-controller-2(active), standbys: overcloud-controller-1, overcloud-controller-0
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   5 pools, 80 pgs
    objects: 325 objects, 251MiB
    usage:   487MiB used, 284GiB / 285GiB avail
    pgs:     80 active+clean
+++

This validation comes from the file /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:

+++
    - when:
        - osd_percentage_min|default(0) > 0
      block:
        - name: set jq osd percentage filter
          set_fact:
            jq_osd_percentage_filter: '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'

        - name: Get OSD stat percentage
          become: true
          shell: >-
            "{{ container_client }}" exec "{{ ceph_mon_container.stdout }}" ceph
            --cluster "{{ ceph_cluster_name.stdout }}" osd stat -f json | jq '{{ jq_osd_percentage_filter }}'
          register: ceph_osd_in_percentage
+++
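
The task above registers the raw jq output. In the failing run at the start of this report, that output was empty while rc was 0, and the following task reported 0.0%, which suggests the empty result was treated as zero. A minimal sketch of stricter handling is shown here. The function parse_percentage and the result dictionary (mirroring the rc/stdout/stderr keys visible in the log) are hypothetical and are not the validation's actual code.

```python
def parse_percentage(result: dict) -> float:
    """Parse a registered command result into a percentage, or raise."""
    stdout = result.get("stdout", "").strip()
    if result.get("rc", 0) != 0 or not stdout:
        # Surface the jq error instead of silently falling back to 0%.
        raise RuntimeError(
            "could not compute OSD percentage: "
            + (result.get("stderr") or "no output")
        )
    return float(stdout)


# Shapes modelled on the log output in this report.
failed = {"rc": 0, "stdout": "", "stderr": "jq: error: null and null cannot be divided"}
ok = {"rc": 0, "stdout": "100", "stderr": ""}

print(parse_percentage(ok))  # 100.0
try:
    parse_percentage(failed)
except RuntimeError as exc:
    print(exc)
```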



Version-Release number of selected component (if applicable):
[root@undercloud ceph-ansible]# rpm -qa | grep -i openstack-tripleo-validations
openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch
[root@undercloud ceph-ansible]# 


How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 John Fulton 2020-07-01 14:24:54 UTC
WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

parameter_defaults:
  CephOsdPercentageMin: 0


Re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument.



More detail:

As per the template "Set this value to 0 to disable this check."

https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242

Comment 6 Shatadru Bandyopadhyay 2020-07-02 12:37:44 UTC
Facing the same issue while running ceph upgrade (FFU 13-16)

$ openstack overcloud external-upgrade run --stack overcloud --tags ceph_systemd -e ceph_ansible_limit=computehci0
~~~
TASK [ceph : Get OSD stat percentage] ******************************************
Thursday 02 July 2020  12:19:36 +0000 (0:00:00.220)       0:01:18.448 ********* 
fatal: [undercloud -> 192.168.24.12]: FAILED! => {"changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) 
* 100'", "delta": "0:00:01.104584", "end": "2020-07-02 12:19:37.468272", "msg": "non-zero return code", "rc": 5, "start": "2020-07-02 12:19:36.363688", "stderr": "jq: error (at <stdin>:1): null (null) and null (
null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
compute-0                  : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
compute-1                  : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
computehci-0               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
computehci-1               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
computehci-2               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-0               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-1               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-2               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
undercloud                 : ok=51   changed=12   unreachable=0    failed=1    skipped=60   rescued=0    ignored=0   
~~~

Re-running with "--skip-tags opendev-validation-ceph"

Comment 10 John Fulton 2020-07-14 16:49:46 UTC
The upstream patch has merged in the main branch, but we need backports:

 https://review.opendev.org/738855

Comment 13 Jose Luis Franco 2020-07-28 14:47:19 UTC
We tried the patch submitted as a fix in our manual FFU testing, and it failed with:

TASK [ceph : set jq osd percentage filter] *************************************
Tuesday 28 July 2020  10:22:16 -0400 (0:00:00.270)       0:01:11.723 ********** 
ok: [undercloud -> 192.168.24.15] => {"ansible_facts": {"jq_osd_percentage_filter": "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds))
 * 100"}, "changed": false}

TASK [ceph : Get OSD stat percentage] ******************************************

Tuesday 28 July 2020  10:22:17 -0400 (0:00:00.272)       0:01:11.996 **********                                                                                     
fatal: [undercloud -> 192.168.24.15]: FAILED! => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (try .o
sdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100'", "delta": "0:00:00.404082", "end": "2020-07-27 15:43:28.260232", "msg": "non-zero return code", "rc": 1, "start": "2020-07-27 15:43:27.856150", "stderr": "error: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n   ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n                      
       ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n                                      
            ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n                                 
                                        ^^^\n4 compile errors", "stderr_lines": ["error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", "   ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", "
                             ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", "          
                                        ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", 
"                                                                         ^^^", "4 compile errors"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************


The jq package version found on the controller is:

[root@controller-0 ~]# rpm -qa | grep jq
python-XStatic-jquery-ui-1.10.4.1-1.el7ost.1.noarch
jq-1.3-4.el7ost.x86_64

The try-catch syntax was added in jq-1.5 and we have jq-1.3, so the proposed solution won't work. Moving the BZ back to ASSIGNED so the Ceph Squad can re-work the fix.
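
To illustrate the version constraint described above, here is a minimal Python sketch that reads the jq version from an RPM package name and checks it against 1.5. The helper names jq_version and supports_try are hypothetical, and the parsing assumes the name-version-release form shown in the rpm output above.

```python
import re


def jq_version(rpm_name: str) -> tuple:
    """Extract (major, minor) from an RPM name such as 'jq-1.3-4.el7ost.x86_64'."""
    match = re.match(r"^jq-(\d+)\.(\d+)", rpm_name)
    if match is None:
        raise ValueError(f"not a jq package name: {rpm_name!r}")
    return int(match.group(1)), int(match.group(2))


def supports_try(rpm_name: str) -> bool:
    # Per the comment above, the try syntax is available from jq 1.5 onward.
    return jq_version(rpm_name) >= (1, 5)


print(supports_try("jq-1.3-4.el7ost.x86_64"))  # False
```

The anchored pattern also rejects unrelated packages that merely contain "jq" in their name, such as the jquery-ui package matched by the grep above.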

Comment 21 Yogev Rabl 2020-08-06 13:30:43 UTC
The fix was not implemented in openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch.

Comment 22 Yogev Rabl 2020-08-07 16:17:20 UTC
Verified

Comment 27 errata-xmlrpc 2020-08-27 15:19:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3542

