1852868 – [RHOSP 16.1 Upgrades] openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0 fails with validation error for task "Get OSD stat percentage"

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-validations
Sub Component:
Version:	16.1 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	z1
Target Release:	16.1 (Train on RHEL 8.2)
Assignee:	Francesco Pantano
QA Contact:	Yogev Rabl
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-07-01 13:46 UTC by Punit Kundal
Modified:	2020-08-27 15:19 UTC
CC List:	16 users
Fixed In Version:	openstack-tripleo-validations-11.3.2-0.20200611115253.08f469d.el8ost
Doc Type:	Bug Fix
Doc Text:	This update fixes a Red Hat Ceph Storage (RHCS) version compatibility issue that caused failures during upgrades from Red Hat OpenStack Platform 13 to 16.1. Before this fix, validations performed during the upgrade worked with RHCS3 clusters but not RHCS4 clusters. Now the validation works with both RHCS3 and RHCS4 clusters.
Clone Of:
Environment:
Last Closed:	2020-08-27 15:19:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1882387	None	None	None	2020-07-01 14:18:13 UTC
Launchpad	1889279	None	None	None	2020-07-28 16:42:43 UTC
OpenStack gerrit	738855	None	MERGED	Make Get OSD stat percentage compatible with both Luminous and Nautilus	2020-12-05 11:34:09 UTC
OpenStack gerrit	741427	None	MERGED	Make Get OSD stat percentage compatible with both Luminous and Nautilus	2020-12-05 11:34:08 UTC
OpenStack gerrit	743572	None	MERGED	Make Get OSD stat percentage compatible with jq < 1.5	2020-12-05 11:34:34 UTC
OpenStack gerrit	743592	None	MERGED	Make Get OSD stat percentage compatible with jq < 1.5	2020-12-05 11:34:07 UTC
OpenStack gerrit	743598	None	MERGED	Make Get OSD stat percentage compatible with jq < 1.5	2020-12-05 11:34:35 UTC
Red Hat Product Errata	RHBA-2020:3542	None	None	None	2020-08-27 15:19:32 UTC

Description Punit Kundal 2020-07-01 13:46:50 UTC

Description of problem:

We are upgrading from RHOSP 13 to RHOSP 16.1. While running:

openstack overcloud external-upgrade run --stack STACK NAME --tags ceph_systemd -e ceph_ansible_limit=overcloud-controller-0

the upgrade step fails with:

+++
TASK [ceph : Get OSD stat percentage] ******************************************
Wednesday 01 July 2020  09:11:52 -0400 (0:00:00.241)       0:01:40.860 ********
changed: [undercloud -> 10.10.0.104] => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-overcloud-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) $
 * 100'", "delta": "0:00:00.435386", "end": "2020-07-01 13:11:53.353584", "rc": 0, "start": "2020-07-01 13:11:52.918198", "stderr": "jq: error: null and null cannot be divided", "stderr_lines": ["jq: error: null
 and null cannot be divided"], "stdout": "", "stdout_lines": []}

TASK [ceph : Fail if there is an unacceptable percentage of in OSDs] ***********
Wednesday 01 July 2020  09:11:53 -0400 (0:00:00.898)       0:01:41.759 ********
fatal: [undercloud -> 10.10.0.104]: FAILED! => {"changed": false, "msg": "Only 0.0% of OSDs are in, but 66% are required"}                                                                                        
+++

So the command that the validation runs is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'                         
jq: error: null and null cannot be divided
+++

While the command that should be run is:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph osd stat -f json| jq '( (.num_in_osds) / (.num_osds) ) * 100'                                       
100
[heat-admin@overcloud-controller-0 tmp]$
+++
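
The two commands above differ only in where the counters sit in the JSON: the validation reads them under an `osdmap` key, while this cluster returns them at the top level. A minimal Python sketch of a lookup that accepts either layout follows. The function name `osd_in_percentage` and the sample documents are illustrative assumptions, not the code shipped in openstack-tripleo-validations.

```python
import json


def osd_in_percentage(osd_stat_json: str) -> float:
    """Return the percentage of OSDs that are 'in'.

    Accepts `ceph osd stat -f json` output with the counters either at the
    top level or nested under an "osdmap" key.
    """
    stat = json.loads(osd_stat_json)
    # Fall back to the top-level document when there is no "osdmap" wrapper.
    counters = stat.get("osdmap", stat)
    num_osds = counters["num_osds"]
    num_in_osds = counters["num_in_osds"]
    if num_osds == 0:
        # Avoid a division by zero on a cluster that reports no OSDs.
        return 0.0
    return num_in_osds / num_osds * 100


# Hypothetical sample documents, trimmed to the two fields that matter.
flat = '{"num_osds": 3, "num_in_osds": 3}'
nested = '{"osdmap": {"num_osds": 3, "num_in_osds": 2}}'

print(osd_in_percentage(flat))
print(osd_in_percentage(nested))
```

Unlike the jq filter in the failing task, a missing key here raises a `KeyError` instead of silently yielding null, so an unexpected third layout would surface as an explicit error.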

We can see that all OSDs are up and in:

+++
[heat-admin@overcloud-controller-0 tmp]$ sudo docker exec ceph-mon-overcloud-controller-0 ceph --cluster ceph -s                                                                                                  
  cluster:
    id:     999157a6-ba94-11ea-9cd3-fa163e7b60c7
    health: HEALTH_WARN
            too few PGs per OSD (26 < min 30)

  services:
    mon: 3 daemons, quorum overcloud-controller-0,overcloud-controller-1,overcloud-controller-2
    mgr: overcloud-controller-2(active), standbys: overcloud-controller-1, overcloud-controller-0
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   5 pools, 80 pgs
    objects: 325 objects, 251MiB
    usage:   487MiB used, 284GiB / 285GiB avail
    pgs:     80 active+clean
+++

This validation comes from the file /usr/share/openstack-tripleo-validations/roles/ceph/tasks/ceph-health.yaml:

+++
    - when:
        - osd_percentage_min|default(0) > 0
      block:
        - name: set jq osd percentage filter
          set_fact:
            jq_osd_percentage_filter: '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) * 100'

        - name: Get OSD stat percentage
          become: true
          shell: >-
            "{{ container_client }}" exec "{{ ceph_mon_container.stdout }}" ceph
            --cluster "{{ ceph_cluster_name.stdout }}" osd stat -f json | jq '{{ jq_osd_percentage_filter }}'
          register: ceph_osd_in_percentage
+++



Version-Release number of selected component (if applicable):
[root@undercloud ceph-ansible]# rpm -qa | grep -i openstack-tripleo-validations
openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch
[root@undercloud ceph-ansible]# 


How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 5 John Fulton 2020-07-01 14:24:54 UTC

WORKAROUND:

Create a disable_osd_validation.yaml with the following content:

parameter_defaults:
  CephOsdPercentageMin: 0


Re-run your 'openstack overcloud deploy ...' command and add "-e disable_osd_validation.yaml" as the last argument.



More detail:

As the template says: "Set this value to 0 to disable this check."

https://github.com/openstack/tripleo-heat-templates/blob/stable/train/deployment/ceph-ansible/ceph-base.yaml#L237-L242
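
The reason a value of 0 disables the check can be sketched in Python. This mirrors the `osd_percentage_min|default(0) > 0` guard and the failure message quoted in the description. It assumes that `CephOsdPercentageMin` is what ends up in `osd_percentage_min`, and that the task fails when the measured percentage is below the minimum; both are inferred from this report, not read from the role. The function name `check_osd_percentage` is hypothetical.

```python
def check_osd_percentage(in_percentage: float, osd_percentage_min: float = 0) -> str:
    """Mimic the guarded OSD percentage check.

    Returns "skipped" when the minimum is 0 (or unset), "ok" when enough
    OSDs are in, and raises RuntimeError otherwise.
    """
    # Same condition as the `when:` guard on the block: the whole check,
    # including the jq command, only runs for a minimum above 0.
    if not osd_percentage_min > 0:
        return "skipped"
    # Assumption: failure means "measured percentage below the minimum".
    if in_percentage < osd_percentage_min:
        raise RuntimeError(
            f"Only {in_percentage}% of OSDs are in, "
            f"but {osd_percentage_min}% are required"
        )
    return "ok"


print(check_osd_percentage(0.0, 0))     # workaround: check is skipped
print(check_osd_percentage(100.0, 66))  # healthy cluster passes
```

With the workaround in place the broken jq filter is never evaluated, which is why the upgrade step can proceed even though the filter itself is still wrong.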

Comment 6 Shatadru Bandyopadhyay 2020-07-02 12:37:44 UTC

Facing the same issue while running the Ceph upgrade (FFU 13-16):

$ openstack overcloud external-upgrade run --stack overcloud --tags ceph_systemd -e ceph_ansible_limit=computehci0
~~~
TASK [ceph : Get OSD stat percentage] ******************************************
Thursday 02 July 2020  12:19:36 +0000 (0:00:00.220)       0:01:18.448 ********* 
fatal: [undercloud -> 192.168.24.12]: FAILED! => {"changed": true, "cmd": "\"podman\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (.osdmap.num_in_osds) / (.osdmap.num_osds) ) 
* 100'", "delta": "0:00:01.104584", "end": "2020-07-02 12:19:37.468272", "msg": "non-zero return code", "rc": 5, "start": "2020-07-02 12:19:36.363688", "stderr": "jq: error (at <stdin>:1): null (null) and null (
null) cannot be divided", "stderr_lines": ["jq: error (at <stdin>:1): null (null) and null (null) cannot be divided"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
compute-0                  : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
compute-1                  : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
computehci-0               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
computehci-1               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
computehci-2               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-0               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-1               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
controller-2               : ok=3    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
undercloud                 : ok=51   changed=12   unreachable=0    failed=1    skipped=60   rescued=0    ignored=0   
~~~

Re-running with "--skip-tags opendev-validation-ceph"

Comment 10 John Fulton 2020-07-14 16:49:46 UTC

The upstream patch has merged in the main branch, but we need backports:

 https://review.opendev.org/738855

Comment 13 Jose Luis Franco 2020-07-28 14:47:19 UTC

We tried the patch submitted as a fix in our manual FFU testing, and it failed with:

TASK [ceph : set jq osd percentage filter] *************************************
Tuesday 28 July 2020  10:22:16 -0400 (0:00:00.270)       0:01:11.723 ********** 
ok: [undercloud -> 192.168.24.15] => {"ansible_facts": {"jq_osd_percentage_filter": "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds))
 * 100"}, "changed": false}

TASK [ceph : Get OSD stat percentage] ******************************************

Tuesday 28 July 2020  10:22:17 -0400 (0:00:00.272)       0:01:11.996 **********                                                                                     
fatal: [undercloud -> 192.168.24.15]: FAILED! => {"changed": true, "cmd": "\"docker\" exec \"ceph-mon-controller-0\" ceph --cluster \"ceph\" osd stat -f json | jq '( (try .o
sdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100'", "delta": "0:00:00.404082", "end": "2020-07-27 15:43:28.260232", "msg": "non-zero return code", "rc": 1, "start": "2020-07-27 15:43:27.856150", "stderr": "error: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n   ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n                      
       ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n                                      
            ^^^\nerror: try is not defined\n( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100\n                                 
                                        ^^^\n4 compile errors", "stderr_lines": ["error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", "   ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", "
                             ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", "          
                                        ^^^", "error: try is not defined", "( (try .osdmap.num_in_osds + try .num_in_osds) / (try .osdmap.num_osds + try .num_osds)) * 100", 
"                                                                         ^^^", "4 compile errors"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************


The jq package version found on the controller is:

[root@controller-0 ~]# rpm -qa | grep jq
python-XStatic-jquery-ui-1.10.4.1-1.el7ost.1.noarch
jq-1.3-4.el7ost.x86_64

The try-catch syntax was added in jq-1.5 and we have jq-1.3, so the proposed solution won't work. Moving the BZ back to ASSIGNED so the Ceph Squad can re-work the fix.
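
As a rough illustration of the version gate described above, the following Python sketch parses an RPM package name and reports whether that jq build is new enough for `try`. The helper name `jq_supports_try` is hypothetical, and the `jq-1.5-...` package string in the usage lines is made up for comparison. Only `jq-1.3-4.el7ost.x86_64` comes from this report.

```python
import re

# Per the comment above, `try` is available starting with jq 1.5.
TRY_MIN_VERSION = (1, 5)


def jq_supports_try(rpm_name: str) -> bool:
    """Return True if the jq RPM named `rpm_name` is at least version 1.5."""
    # Match "jq-<version>-<release...>" and capture the dotted version.
    match = re.match(r"^jq-(\d+(?:\.\d+)*)-", rpm_name)
    if match is None:
        raise ValueError(f"not a jq package name: {rpm_name!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    return version >= TRY_MIN_VERSION


# Lines as they might come back from `rpm -qa | grep jq`; the last entry
# is a hypothetical newer build.
packages = [
    "python-XStatic-jquery-ui-1.10.4.1-1.el7ost.1.noarch",
    "jq-1.3-4.el7ost.x86_64",
    "jq-1.5-12.el8.x86_64",
]
for name in packages:
    if name.startswith("jq-"):  # skip unrelated packages that merely contain "jq"
        print(name, jq_supports_try(name))
```

Anchoring the match at `jq-` matters because a plain `grep jq` also returns the jquery-ui package, as the output above shows.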

Comment 21 Yogev Rabl 2020-08-06 13:30:43 UTC

The fix was not implemented in openstack-tripleo-validations-11.3.2-0.20200611115252.08f469d.el8ost.noarch

Comment 22 Yogev Rabl 2020-08-07 16:17:20 UTC

Verified

Comment 27 errata-xmlrpc 2020-08-27 15:19:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3542