Bug 1714922 - Director Ceph LVM deployment fails with DeriveParams "No Ceph OSDs found in the overcloud definition ('ceph::profile::params::osds')"
Summary: Director Ceph LVM deployment fails with DeriveParams "No Ceph OSDs found in the overcloud definition ('ceph::profile::params::osds')"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z8
Target Release: 13.0 (Queens)
Assignee: Francesco Pantano
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-29 08:08 UTC by Nick Satsia
Modified: 2019-09-03 16:55 UTC
CC List: 11 users

Fixed In Version: openstack-tripleo-common-8.6.8-14.el7ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-03 16:55:32 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1831458 0 None None None 2019-06-03 14:13:33 UTC
OpenStack gerrit 663538 0 None MERGED Add the ability to compute osds number counting lvm devices 2020-06-17 21:05:10 UTC
Red Hat Product Errata RHBA-2019:2624 0 None None None 2019-09-03 16:55:53 UTC

Description Nick Satsia 2019-05-29 08:08:09 UTC
Description of problem:
When deploying Ceph with Director using osd_scenario: lvm and Bluestore, OSD/disk devices configured under lvm_volumes are not counted as OSDs and the deployment fails as follows:

(undercloud) [stack@director deployment]$ ./deploy.sh
Wed May 29 14:07:16 AEST 2019
Started Mistral Workflow tripleo.validations.v1.check_pre_deployment_validations. Execution ID: 3cb68871-ceda-4603-b664-9fd63618bd07
Waiting for messages on queue 'tripleo' with no timeout.
Creating Swift container to store the plan
Creating plan from template files in: /tmp/tripleoclient-no1D_g/tripleo-heat-templates
Started Mistral Workflow tripleo.plan_management.v1.create_deployment_plan. Execution ID: 8c5fe859-b307-4be9-8f3c-9367680eb370
Plan created.
Processing templates in the directory /tmp/tripleoclient-no1D_g/tripleo-heat-templates
Invoking workflow (tripleo.derive_params.v1.derive_parameters) specified in plan-environment file
Started Mistral Workflow tripleo.derive_params.v1.derive_parameters. Execution ID: 55b451c5-8df4-43c8-ac87-0cff347383e3
Workflow execution is failed: Role 'ComputeCeph': No Ceph OSDs found in the overcloud definition ('ceph::profile::params::osds').

real    6m17.781s
user    0m4.440s
sys     0m0.482s
Wed May 29 14:13:34 AEST 2019
(undercloud) [stack@director deployment]$




Version-Release number of selected component (if applicable):
RHOSP13z6

How reproducible:
Every time. 100%

Steps to Reproduce:
1. Deploy Ceph with Director using a config file similar to the following, with the deploy command passing a derive-parameters plan environment via -p (a sketch of that plan environment follows the config):

parameter_defaults:
  CephAnsiblePlaybookVerbosity: 1
#  CephPoolDefaultSize: 1
  CephConfigOverrides:
    mon_max_pg_per_osd: 500

  CephAnsibleDisksConfig:
    osd_scenario: lvm
    osd_objectstore: bluestore
    dmcrypt: false
#
    lvm_volumes:
      - data: /dev/sdb
      - data: /dev/sdc
      - data: /dev/sdd
      - data: /dev/sde
        crush_device_class: ssd
      - data: /dev/sdf
        crush_device_class: ssd
      - data: /dev/sdg
        crush_device_class: ssd
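
For context, the derive_parameters workflow that fails here is only invoked when the deploy command passes a plan-environment file requesting it (the -p option discussed in comments 2 and 3). The sketch below is a rough, trimmed illustration modelled on the plan-samples/plan-environment-derived-params.yaml sample shipped with tripleo-heat-templates; the keys shown are limited to the HCI-relevant part and the profile value is the sample default, not something taken from this report.

# Sketch only: a plan environment that asks Mistral to run the derive_parameters
# workflow; structure follows the plan-samples sample in tripleo-heat-templates.
version: 1.0
name: overcloud
template: overcloud.yaml
environments:
  - path: overcloud-resource-registry-puppet.yaml
workflow_parameters:
  tripleo.derive_params.v1.derive_parameters:
    # HCI tuning input consumed by the workflow; 'default' is the sample profile name.
    hci_profile: default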


Actual results:
Fails with: 
    Workflow execution is failed: Role 'ComputeCeph': No Ceph OSDs found in the overcloud definition ('ceph::profile::params::osds').


Expected results:
Should successfully deploy.

Additional info:

The workaround for now is to configure at least one device under "devices":

  CephAnsibleDisksConfig:
    osd_scenario: lvm
    osd_objectstore: bluestore
    dmcrypt: false
    devices:
      - /dev/sdb
#
    lvm_volumes:
#      - data: /dev/sdb
      - data: /dev/sdc
      - data: /dev/sdd
      - data: /dev/sde
        crush_device_class: ssd
      - data: /dev/sdf
        crush_device_class: ssd
      - data: /dev/sdg
        crush_device_class: ssd


I believe the problem is here:
------------------------------
ONLY COUNTS "devices"

/usr/share/openstack-tripleo-common/workbooks/derive_params_formulas.yaml
      get_num_osds:
        publish:
          num_osds: <% $.heat_resource_tree.parameters.get('CephAnsibleDisksConfig', {}).get('default', {}).get('devices', []).count() %>
        on-success:
          - get_memory_mb: <% $.num_osds %>
          # If there's no CephAnsibleDisksConfig then look for OSD configuration in hiera data
          - get_num_osds_from_hiera: <% not $.num_osds %>

Comment 1 John Fulton 2019-05-29 13:53:32 UTC
nsatsia is right. The following only uses the devices list because it was written for a pre-ceph-volume world:

 https://github.com/openstack/tripleo-common/blob/master/workbooks/derive_params_formulas.yaml#L650

With ceph-ansible 3.2 it's also valid to pass lvm_volumes, so the above Mistral workbook needs to work with both. Perhaps start with the following and then account for exceptions (a sketch follows the formula below):

 num_osds = count(devices) + count(lvm_volumes)
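
As a rough illustration only (this is not the merged fix, which is gerrit change 663538 linked above), the get_num_osds task could sum both lists before falling back to the hiera lookup, along these lines:

      # Sketch only; the actual merged change may structure this differently
      # or handle additional edge cases.
      get_num_osds:
        publish:
          num_osds: <% $.heat_resource_tree.parameters.get('CephAnsibleDisksConfig', {}).get('default', {}).get('devices', []).count() + $.heat_resource_tree.parameters.get('CephAnsibleDisksConfig', {}).get('default', {}).get('lvm_volumes', []).count() %>
        on-success:
          - get_memory_mb: <% $.num_osds %>
          # If neither list yields an OSD count, fall back to the hiera data lookup
          - get_num_osds_from_hiera: <% not $.num_osds %>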

Comment 2 John Fulton 2019-05-29 14:12:33 UTC
WORKAROUND:

Until this is fixed, if you're going to use lvm_volumes instead of devices, then deploy without the -p as described in [1] and instead derive the reserved_host_memory and cpu_allocation_ratio values manually and pass them in an env file [2].

[1] https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure_for_cloud/13/html-single/deployment_guide/index#running-the-deploy-command-rhhi
[2] https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure_for_cloud/13/html-single/deployment_guide/index#changing-nova-reserved-memory-and-cpu-allocation-manually
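
For completeness, a minimal sketch of what the manually-derived environment file from [2] could look like, assuming a role named ComputeCeph as in this report; the hiera keys follow the approach in the linked guide, and the numbers are illustrative placeholders that must be derived for your own hardware and workload, not recommendations:

parameter_defaults:
  # Role-specific ExtraConfig for a role named ComputeCeph (assumption based on
  # the role name in this report); values below are placeholders only.
  ComputeCephExtraConfig:
    nova::compute::reserved_host_memory: 75000   # MB reserved for Ceph OSDs and host services
    nova::cpu_allocation_ratio: 8.2              # manually derived vCPU overcommit ratio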

Comment 3 Nick Satsia 2019-05-29 23:55:23 UTC
(In reply to John Fulton from comment #2)
> WORKAROUND:
> 
> Until this is fixed, if you're going to use lvm_volumes instead of devices,
> then deploy without the -p as described in [1] and instead derive the
> reserved_host_memory and cpu_allocation_ratio values manually and pass them
> in an env file [2].
> 
> [1] https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure_for_cloud/13/html-single/deployment_guide/index#running-the-deploy-command-rhhi
> [2] https://access.redhat.com/documentation/en-us/red_hat_hyperconverged_infrastructure_for_cloud/13/html-single/deployment_guide/index#changing-nova-reserved-memory-and-cpu-allocation-manually

Thanks John.
I didn't realise it was triggered by using the "-p" option. Anyway, as I'm only deploying the data block LVM, moving one disk to devices builds the OSDs the same way.

Comment 10 errata-xmlrpc 2019-09-03 16:55:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2624

