Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1889159

Summary: 'openstack tripleo validator run' should not run the ceph-pg validation
Product: Red Hat OpenStack
Reporter: Alex Stupnikov <astupnik>
Component: openstack-tripleo-validations
Assignee: John Fulton <johfulto>
Status: CLOSED ERRATA
QA Contact: Yogev Rabl <yrabl>
Severity: low
Docs Contact:
Priority: low
Version: 16.1 (Train)
CC: dhill, emacchi, gfidente, jjoyce, johfulto, jschluet, slinaber, tvignaud
Target Milestone: z4
Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: All
OS: All
Whiteboard:
Fixed In Version: openstack-tripleo-validations-11.3.2-1.20201114040743.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-03-17 15:33:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description: upstream gerrit down. will apply this when it's back up
Flags: none

Description Alex Stupnikov 2020-10-18 15:54:48 UTC
Description of problem:

ceph-pg validation always fails with the same error:

{
    "task": {
        "hosts": {
            "undercloud": {
                "_ansible_no_log": false,
                "action": "fail",
                "changed": false,
                "failed": true,
                "msg": "In order to simulate Tripleo Heat Template behavior this role requires\nthat it be run with Ansible's hash_behaviour set to merge. Please\nre-run with 'export ANSIBLE_HASH_BEHAVIOUR=merge'\"\n"
            }
        },
        "name": "Fail unless ANSIBLE_HASH_BEHAVIOUR=merge",
        "status": "FAILED"
    }
}

Basically, this validation is currently broken when executed by pre-deployment and post-deployment checks and doesn't provide useful information.

Another problem is that, even if the proper variable is exported, this validation still fails for deployments without Ceph (I haven't tried it in a deployment with Ceph; it could also be broken there):

  {
      "task": {
          "hosts": {
              "undercloud": {
                  "_ansible_no_log": false,
                  "action": "fail",
                  "changed": false,
                  "failed": true,
                  "msg": "Please pass the expected number of OSDs, e.g. '-e num_osds=36'"
              }
          },
          "name": "Fail if number of OSDs is not specified",
          "status": "FAILED"
      }
  }

Comment 1 John Fulton 2020-10-19 15:10:53 UTC
You're not using the validation as it was meant to be used. How to use it correctly is described in the upstream documentation:

 https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/ceph_config.html#ceph-placement-group-validation

Comment 2 John Fulton 2020-10-19 15:44:44 UTC
(In reply to Alex Stupnikov from comment #0)
> Description of problem:
> 
> ceph-pg validation always fails with the same error:
> 
> {
>     "task": {
>         "hosts": {
>             "undercloud": {
>                 "_ansible_no_log": false,
>                 "action": "fail",
>                 "changed": false,
>                 "failed": true,
>                 "msg": "In order to simulate Tripleo Heat Template behavior
> this role requires\nthat it be run with Ansible's hash_behaviour set to
> merge. Please\nre-run with 'export ANSIBLE_HASH_BEHAVIOUR=merge'\"\n"
>             }
>         },
>         "name": "Fail unless ANSIBLE_HASH_BEHAVIOUR=merge",
>         "status": "FAILED"
>     }
> }

This is the expected behaviour. By design, it always fails unless, as the above message says, you set ANSIBLE_HASH_BEHAVIOUR=merge. Also note that you can NEVER run the validation during deployment, only before deployment, on the command line, as documented. I think the issue is that you didn't have the documentation.

> Basically, this validation is currently broken when executed by
> pre-deployment and post-deployment checks 

If the validation is run pre-deployment and as documented, then it will tell you in advance whether a deployment will fail the PG overdose protection check.

> and doesn't provide useful information.

If the validation didn't run, then it didn't provide any information. If that information wasn't provided it couldn't be judged useful or not useful.

> Another problem is if proper variable is exported, this validation still
> fails for deployments without ceph (I haven't tried it in deployment with
> Ceph, it could also be broken there):
> 
>   {
>       "task": {
>           "hosts": {
>               "undercloud": {
>                   "_ansible_no_log": false,
>                   "action": "fail",
>                   "changed": false,
>                   "failed": true,
>                   "msg": "Please pass the expected number of OSDs, e.g. '-e
> num_osds=36'"
>               }
>           },
>           "name": "Fail if number of OSDs is not specified",
>           "status": "FAILED"
>     }
>   }

The validation tells you that you have a valid Ceph configuration by succeeding (in the case of a valid Ceph config definition in the YAML files), and that you have an invalid Ceph configuration by failing. That's how it was designed. If you don't have a Ceph configuration, then why would you try to validate it? I.e. if you don't have a Ceph configuration, then don't use this validation.

By the same reasoning, it cannot simulate PG creation unless you pass the expected number of OSDs. That's why it told you to pass the number of OSDs and provided an example of how to do so, e.g. '-e num_osds=36'.

I really think this is a documentation issue. If the validation is run like this:

export ANSIBLE_HASH_BEHAVIOUR=merge
BASE=/usr/share/openstack-tripleo-validations
export THT=/usr/share/openstack-tripleo-heat-templates/
ansible-playbook -i inventory $BASE/playbooks/ceph-pg.yaml \
  -e @$THT/environments/ceph-ansible/ceph-rgw.yaml \
  -e @$THT/environments/ceph-ansible/ceph-mds.yaml \
  -e @$THT/environments/manila-cephfsganesha-config.yaml \
  -e @ceph.yaml -e num_osds=36

Then it can tell if the PG numbers, as set in the ceph.yaml, are correct relative to the rest of the ceph configuration. It is not mandatory to run this validation and it shouldn't block a deployment.

 https://docs.openstack.org/project-deploy-guide/tripleo-docs/latest/features/ceph_config.html#ceph-placement-group-validation

Comment 3 John Fulton 2020-10-19 15:51:54 UTC
Validating Ceph Configuration
-----------------------------

The tripleo-validations framework contains validations for Ceph
which may be run before deployment to save time debugging possible
failures.

Create an inventory on the undercloud which refers to itself::

  echo "undercloud ansible_connection=local" > inventory

Set Ansible environment variables::

  BASE="/usr/share/openstack-tripleo-validations"
  export ANSIBLE_RETRY_FILES_ENABLED=false
  export ANSIBLE_KEEP_REMOTE_FILES=1
  export ANSIBLE_CALLBACK_PLUGINS="${BASE}/callback_plugins"
  export ANSIBLE_ROLES_PATH="${BASE}/roles"
  export ANSIBLE_LOOKUP_PLUGINS="${BASE}/lookup_plugins"
  export ANSIBLE_LIBRARY="${BASE}/library"

See what Ceph validations are available::

  ls $BASE/playbooks | grep ceph

Run a Ceph validation with a command like the following::

  ansible-playbook -i inventory $BASE/playbooks/ceph-ansible-installed.yaml

For Stein and newer it is possible to run validations using the
`openstack tripleo validator run` command with a syntax like the
following::

  openstack tripleo validator run --validation ceph-ansible-installed

The `ceph-ansible-installed` validation warns if the `ceph-ansible`
RPM is not installed on the undercloud. This validation is also run
automatically during deployment unless validations are disabled.

Ceph Placement Group Validation
-------------------------------

Ceph will refuse to take certain actions if they are harmful to the
cluster. E.g. if the placement group numbers are not correct for the
amount of available OSDs, then Ceph will refuse to create pools which
are required for OpenStack. Rather than wait for the deployment to
reach the point where Ceph is going to be configured only to find out
that the deployment failed because the parameters were not correct,
you may run a validation before deployment starts to quickly determine
if Ceph will create your OpenStack pools based on the overrides which
will be passed to the overcloud.

.. note::

   With fewer than 8 OSDs, the TripleO defaults will cause the
   deployment to fail unless you modify the CephPools,
   CephPoolDefaultSize, or CephPoolDefaultPgNum parameters. This
   validation will help you find the appropriate values.

To run the `ceph-pg` validation, configure your environment as
described in the previous section but also run the following
command to switch Ansible's `hash_behaviour` from `replace`
(the default) to `merge`. This is done to make Ansible behave
the same way that TripleO Heat Templates behaves when multiple
environment files are passed with the `-e @file.yaml` syntax::

  export ANSIBLE_HASH_BEHAVIOUR=merge

Then use a command like the following::

  ansible-playbook -i inventory $BASE/playbooks/ceph-pg.yaml -e @ceph.yaml -e num_osds=36
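
The two preconditions of the command above (`ANSIBLE_HASH_BEHAVIOUR=merge` and a `num_osds` value) can be checked before the playbook is invoked. The following is only a minimal sketch: the helper name `ceph_pg_preflight` is hypothetical and is not part of tripleo-validations.

```shell
# Hypothetical helper (not shipped with tripleo-validations): check the two
# preconditions of the ceph-pg playbook before calling ansible-playbook.
ceph_pg_preflight() {
  num_osds="$1"

  # The playbook fails unless Ansible's hash_behaviour is set to "merge".
  if [ "${ANSIBLE_HASH_BEHAVIOUR:-}" != "merge" ]; then
    echo "Run 'export ANSIBLE_HASH_BEHAVIOUR=merge' first" >&2
    return 1
  fi

  # num_osds is required and must consist of digits only.
  case "$num_osds" in
    ''|*[!0-9]*)
      echo "Pass the expected number of OSDs, e.g. 36" >&2
      return 1
      ;;
  esac

  # Zero OSDs cannot hold any placement groups.
  if [ "$num_osds" -le 0 ]; then
    echo "num_osds must be greater than zero" >&2
    return 1
  fi

  return 0
}
```

A possible use is `ceph_pg_preflight 36 && ansible-playbook -i inventory $BASE/playbooks/ceph-pg.yaml -e @ceph.yaml -e num_osds=36`, so that the playbook only starts once both preconditions hold.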

The `num_osds` parameter is required. This value should be the number
of expected OSDs that will be in the Ceph deployment. It should be
equal to the number of devices and lvm_volumes under
`CephAnsibleDisksConfig` multiplied by the number of nodes running the
`CephOSD` service (e.g. nodes in the CephStorage role, nodes in the
ComputeHCI role, and any custom roles, etc.). This value should also
be adjusted to compensate for the number of OSDs used by nodes with
node-specific overrides as covered earlier in this document.
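
As a worked example of the calculation described above, the sketch below uses illustrative numbers only (they are assumptions, not values from a real deployment): 3 nodes running `CephOSD`, each with 12 entries under `CephAnsibleDisksConfig`, and no node-specific overrides.

```shell
# Illustrative values only; replace them with the numbers of your deployment.
devices_per_node=12     # entries in devices/lvm_volumes under CephAnsibleDisksConfig
osd_nodes=3             # nodes running the CephOSD service
override_adjustment=0   # net OSDs added or removed by node-specific overrides

num_osds=$((devices_per_node * osd_nodes + override_adjustment))
echo "num_osds=${num_osds}"   # prints num_osds=36
```

The resulting value is what would be passed as `-e num_osds=36`.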

In the above example, `ceph.yaml` should be the same file passed to
the overcloud deployment, e.g. `openstack overcloud deploy ... -e
ceph.yaml`, as covered earlier in this document. As many files as
required may be passed using `-e @file.yaml` in order to get the
following parameters passed to the `ceph-pg` validation:

* CephPoolDefaultSize
* CephPoolDefaultPgNum
* CephPools

If these parameters are not passed, then the TripleO defaults will
be used for them.

The above example is based only on Ceph pools created for RBD. If Ceph
RGW and/or Manila via NFS Ganesha is also being deployed, then simply
pass the same environment files that enable these services, as you
would when running `openstack overcloud deploy`. For example::

  export THT=/usr/share/openstack-tripleo-heat-templates/
  ansible-playbook -i inventory $BASE/playbooks/ceph-pg.yaml \
    -e @$THT/environments/ceph-ansible/ceph-rgw.yaml \
    -e @$THT/environments/ceph-ansible/ceph-mds.yaml \
    -e @$THT/environments/manila-cephfsganesha-config.yaml \
    -e @ceph.yaml -e num_osds=36

In the above example, the validation will simulate the creation of the
pools required for the RBD, RGW and MDS services and the validation
will fail if the placement group numbers are not correct.
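
The kind of arithmetic behind such a check can be sketched as follows. This is not the code of the `ceph-pg` validation. It only illustrates the idea that Ceph refuses to create a pool once the total number of PG replicas (`pg_num` multiplied by the pool size, summed over all pools) exceeds a per-OSD limit multiplied by the number of OSDs. The function name `pg_check`, the pool count, and the limit of 250 PGs per OSD are assumptions chosen for illustration, and every pool is assumed to share the same `pg_num` and size.

```shell
# Illustrative sketch only; not the implementation used by the validation.
# Arguments: number of pools, pg_num per pool, pool size (replicas),
#            number of OSDs, maximum PGs allowed per OSD.
pg_check() {
  pools="$1"; pg_num="$2"; size="$3"; osds="$4"; max_pg_per_osd="$5"

  projected=$((pools * pg_num * size))   # PG replicas the pools would need
  allowed=$((max_pg_per_osd * osds))     # PG replicas the cluster may hold

  if [ "$projected" -gt "$allowed" ]; then
    echo "FAIL: ${projected} PG replicas requested, ${allowed} allowed"
    return 1
  fi
  echo "OK: ${projected} PG replicas requested, ${allowed} allowed"
}

pg_check 5 128 3 36 250           # OK: 1920 PG replicas requested, 9000 allowed
pg_check 5 128 3 3 250 || true    # FAIL: 1920 PG replicas requested, 750 allowed
```

With these assumed inputs, the same pool settings pass on 36 OSDs and fail on 3, which is the situation the validation is meant to reveal before deployment.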

Comment 4 John Fulton 2020-10-19 15:54:15 UTC
The content of comment #3 may be used as copy to begin documenting the Ceph validations including the PG validation

Comment 5 Alex Stupnikov 2020-10-20 10:18:36 UTC
John, thank you very much for the follow-up. I understand your point, and I wasn't the person who identified this as a problem in the ceph-pg validation. My complaint is that:

- All validations are grouped, and the ceph-pg validation belongs to two groups [1]: pre-deployment and post-ceph. Our official documentation recommends that customers use those groups [2]. AFAIU, all customers will see this validation fail with error [3] when they run the pre-deployment validations.
- AFAIU, the ceph-pg validation will still fail when called by the pre-deployment group if Ceph is not used, even if customers follow the recommendations from comment #3.

That's my understanding of this bug. Sorry if its description looked ambiguous.

As this bug is currently triaged as a doc issue, I will add a needinfo for John so that my comment is not lost.
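
To see which groups a validation declares, its playbook metadata can be inspected. The following is a minimal sketch that assumes the metadata contains a `groups:` key followed by a YAML list, as at the lines referenced in [1]. The helper name `playbook_groups` and the stand-in file are hypothetical.

```shell
# Hypothetical helper: print the entries of the "groups:" list found in a
# validation playbook's metadata, one per line.
playbook_groups() {
  awk '
    /^[ \t]*groups:[ \t]*$/ { in_groups = 1; next }
    in_groups && /^[ \t]*-[ \t]+/ { sub(/^[ \t]*-[ \t]+/, ""); print; next }
    in_groups { exit }
  ' "$1"
}

# Stand-in file mimicking the assumed layout; on an undercloud, point the
# helper at the ceph-pg.yaml playbook of tripleo-validations instead.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
- hosts: undercloud
  vars:
    metadata:
      name: Example validation
      groups:
        - pre-deployment
        - post-ceph
EOF

playbook_groups "$sample"   # prints pre-deployment and post-ceph, one per line
rm -f "$sample"
```

Any validation listed under pre-deployment this way is picked up when that whole group is run, which is how ceph-pg ends up being executed without its required inputs.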

[1]
https://github.com/openstack/tripleo-validations/blob/stable/train/playbooks/ceph-pg.yaml#L17-L19

[2]
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/director_installation_and_usage/index#using-the-validation-framework

[3]
{
    "task": {
        "hosts": {
            "undercloud": {
                "_ansible_no_log": false,
                "action": "fail",
                "changed": false,
                "failed": true,
                "msg": "In order to simulate Tripleo Heat Template behavior this role requires\nthat it be run with Ansible's hash_behaviour set to merge. Please\nre-run with 'export ANSIBLE_HASH_BEHAVIOUR=merge'\"\n"
            }
        },
        "name": "Fail unless ANSIBLE_HASH_BEHAVIOUR=merge",
        "status": "FAILED"
    }
}

Comment 6 John Fulton 2020-10-20 14:15:26 UTC
(In reply to Alex Stupnikov from comment #5)
> John, thank you very much for the follow-up. I understand your point and I
> wasn't the person who identified this as a problem in ceph-pg validation. My
> complain is that:
> 
> - all validations are grouped and ceph-pg validation belongs to two groups
> [1]: pre-deployment and post-ceph. In our official documentation we
> recommend customer to use those groups [2]. AFAIU all customers would see
> this validation fail with error [3] when they would run pre-deployment
> validations.

Yes, I see your point. Thanks for making that more clear. We don't want this validation to be run under that condition so that's the bug. I'll make a patch for it. 

  John

> [1]
> https://github.com/openstack/tripleo-validations/blob/stable/train/playbooks/
> ceph-pg.yaml#L17-L19
> 
> [2]
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.
> 1/html-single/director_installation_and_usage/index#using-the-validation-
> framework
> 
> [3]
> {
>     "task": {
>         "hosts": {
>             "undercloud": {
>                 "_ansible_no_log": false,
>                 "action": "fail",
>                 "changed": false,
>                 "failed": true,
>                 "msg": "In order to simulate Tripleo Heat Template behavior
> this role requires\nthat it be run with Ansible's hash_behaviour set to
> merge. Please\nre-run with 'export ANSIBLE_HASH_BEHAVIOUR=merge'\"\n"
>             }
>         },
>         "name": "Fail unless ANSIBLE_HASH_BEHAVIOUR=merge",
>         "status": "FAILED"
>     }
> }

Comment 7 John Fulton 2020-10-20 14:53:45 UTC
Created attachment 1722914
upstream gerrit down. will apply this when it's back up

Comment 10 John Fulton 2020-11-23 16:47:03 UTC
*** Bug 1898743 has been marked as a duplicate of this bug. ***

Comment 21 Yogev Rabl 2021-03-05 16:15:31 UTC
Verified

Comment 25 errata-xmlrpc 2021-03-17 15:33:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.4 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0817
