Bug 1575623 - Deployment fails with HCI enabled and SchedulerHints
Summary: Deployment fails with HCI enabled and SchedulerHints
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-common
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: z2
Target Release: 13.0 (Queens)
Assignee: Alan Bishop
QA Contact: Yogev Rabl
URL:
Whiteboard:
Depends On:
Blocks: 1552759
Reported: 2018-05-07 12:59 UTC by Alan Bishop
Modified: 2018-12-24 11:40 UTC (History)

Fixed In Version: openstack-tripleo-common-8.6.3-2.el7ost
Doc Type: Bug Fix
Doc Text:
The Derived Parameters workflow now supports the use of SchedulerHints to identify overcloud nodes. Previously, the workflow could not use SchedulerHints to identify the overcloud nodes associated with the corresponding TripleO overcloud role, which caused the overcloud deployment to fail. SchedulerHints support prevents these failures.
Clone Of: 1552759
Environment:
Last Closed: 2018-08-29 16:35:54 UTC
Target Upstream Version:
Embargoed:


Links
System                   ID              Private  Priority  Status  Summary                                              Last Updated
Launchpad                1760659         0        None      None    None                                                 2018-05-07 12:59:08 UTC
OpenStack gerrit         557318          0        None      MERGED  Reduce scope of the lock for image volume cache      2021-01-11 17:55:41 UTC
OpenStack gerrit         558313          0        None      MERGED  Use scheduler hints in derived_parameters workflow   2021-01-11 17:55:42 UTC
OpenStack gerrit         566110          0        None      MERGED  Use scheduler hints in derived_parameters workflow   2021-01-11 17:55:04 UTC
Red Hat Product Errata   RHBA-2018:2574  0        None      None    None                                                 2018-08-29 16:36:50 UTC

Description Alan Bishop 2018-05-07 12:59:09 UTC
+++ This bug was initially created as a clone of Bug #1552759 +++

Description of problem:

Deployment of HCI enabled OpenStack Platform 12 fails when using Nova Scheduler Hints.

(undercloud) [stack@director ~]$ ./deploy-now-hci.sh
Started Mistral Workflow tripleo.validations.v1.check_pre_deployment_validations. Execution ID: 79e884c9-05cb-474b-9f30-6292a70cdba4
Waiting for messages on queue '51484ca6-4916-4d91-acfc-57145bf63494' with no timeout.
Removing the current plan files
Uploading new plan files
Started Mistral Workflow tripleo.plan_management.v1.update_deployment_plan. Execution ID: e71ebf09-9c80-44a9-82d7-64538c6291eb
Plan updated.
Processing templates in the directory /tmp/tripleoclient-koTgrN/tripleo-heat-templates
Invoking workflow (tripleo.derive_params.v1.derive_parameters) specified in plan-environment file
Started Mistral Workflow tripleo.derive_params.v1.derive_parameters. Execution ID: acea8816-4e38-466a-a8e8-cb223eca0ac4
Workflow execution is failed: [{u'status': u'SUCCESS', u'message': u'', u'role_name': u'Controller'}, {u'status': u'FAILED', u'message': u'Unable to determine profile for flavor (flavor name: baremetal)', u'role_name': u'Compute'}]


It doesn't matter whether I use the Compute or ComputeHCI role. As soon as OS::TripleO::Services::CephOSD is added to the role, deployment fails with the error above.


Version-Release number of selected component (if applicable):
[root@director stack]# rpm -qa | grep -i tripleo
python-tripleoclient-7.3.3-7.el7ost.noarch
openstack-tripleo-ui-7.4.3-4.el7ost.noarch
openstack-tripleo-image-elements-7.0.1-1.el7ost.noarch
puppet-tripleo-7.4.3-11.el7ost.noarch
openstack-tripleo-common-containers-7.6.3-10.el7ost.noarch
openstack-tripleo-heat-templates-7.0.3-22.el7ost.noarch
openstack-tripleo-validations-7.4.2-1.el7ost.noarch
openstack-tripleo-puppet-elements-7.0.1-2.el7ost.noarch
openstack-tripleo-common-7.6.3-10.el7ost.noarch

How reproducible:
Every time OS::TripleO::Services::CephOSD is deployed on a Compute node.


Steps to Reproduce:

1. Generate roles_data.yaml file:
[stack@director templates]$ openstack overcloud roles generate -o /home/stack/templates/hci/roles_data.yaml Controller Compute

2. Add OS::TripleO::Services::CephOSD service to the Compute role.

3. Use scheduler hints file to control node placement:
[stack@director templates]$ cat scheduler_hints_env.yaml 
parameter_defaults:
  ControllerSchedulerHints:
    'capabilities:node': 'overcloud-controller-%index%'
  ComputeSchedulerHints:
    'capabilities:node': 'overcloud-compute-%index%'
  CephStorageSchedulerHints:
    'capabilities:node': 'overcloud-ceph-%index%'

4. Run the deployment including customized roles_data.yaml and scheduler_hints_env.yaml

5. Observe error:
Workflow execution is failed: [{u'status': u'SUCCESS', u'message': u'', u'role_name': u'Controller'}, {u'status': u'FAILED', u'message': u'Unable to determine profile for flavor (flavor name: baremetal)', u'role_name': u'Compute'}]

Actual results:
Deployment fails.

Expected results:
Deployment uses scheduler hints instead of flavors/profiles and finishes successfully.

Additional info:

--- Additional comment from Alex Schultz on 2018-03-07 16:14:24 EST ---

Can you please provide a sosreport from the undercloud? Thanks.

--- Additional comment from Rafal Szmigiel on 2018-03-07 17:59:06 EST ---

Hey Alex,

It will take a while because I have to revert the environment to the previous state.

In the meantime, I'm not 100% sure, but I think I found it.

(undercloud) [stack@director ~]$ mistral workflow-get-definition tripleo.derive_params.v1._derive_parameters_per_role | grep -B4 TODO
    # Getting introspection data workflow, which will take care of
    # 1) profile and flavor based mapping
    # 2) Nova placement api based mapping
    # Currently we have implemented profile and flavor based mapping
    # TODO-Nova placement api based mapping is pending, we will enchance it later.

(undercloud) [stack@director ~]$ mistral workflow-get-definition tripleo.derive_params.v1._get_role_info | grep -A8 -E 'check_features:$'
    check_features:
      on-success: build_feature_dict
      publish:
        # TODO: Need to update this logic for ODL integration.
        # The role supports the DPDK feature if the NeutronDatapathType parameter is present.
        dpdk: <% $.role_services.any($.get('parameters', []).contains('NeutronDatapathType')) %>

        # The role supports the HCI feature if it includes both NovaCompute and CephOSD services.
        hci: <% $.role_services.any($.get('type', '').endsWith('::NovaCompute')) and $.role_services.any($.get('type', '').endsWith('::CephOSD')) %>
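
For readers less familiar with YAQL, the two feature checks in the check_features task above can be restated as a minimal Python sketch. The function name role_features and the shape of the sample service dictionaries are assumptions made for illustration; only the 'type' and 'parameters' keys and the matching rules come from the workflow definition quoted above.

```python
def role_features(role_services):
    """Return feature flags for a role, given its list of service dicts.

    Mirrors the two YAQL expressions in check_features. The input shape
    (a list of dicts with optional 'type' and 'parameters' keys) is an
    assumption for this sketch.
    """
    def has_service(suffix):
        return any(s.get("type", "").endswith(suffix) for s in role_services)

    # DPDK: any service exposes the NeutronDatapathType parameter.
    dpdk = any("NeutronDatapathType" in s.get("parameters", [])
               for s in role_services)
    # HCI: the role includes both the NovaCompute and CephOSD services.
    hci = has_service("::NovaCompute") and has_service("::CephOSD")
    return {"dpdk": dpdk, "hci": hci}


# Hypothetical sample data: a Compute role with CephOSD added, and a Controller.
compute_with_osd = [
    {"type": "OS::TripleO::Services::NovaCompute", "parameters": []},
    {"type": "OS::TripleO::Services::CephOSD", "parameters": []},
]
controller = [{"type": "OS::TripleO::Services::Keystone"}]

print(role_features(compute_with_osd))
print(role_features(controller))
```

This is consistent with the behaviour reported here: adding CephOSD to a role that already has NovaCompute turns on the HCI feature, so the workflow then tries to derive parameters for that role, while the Controller role has neither feature.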

--- Additional comment from Rafal Szmigiel on 2018-03-07 18:21:38 EST ---

Uploaded to dropbox.redhat.com (sosreport-director.lab.rhpoc.net-20180307180957.tar.xz).

Thanks in advance,

Rafal

--- Additional comment from Saravanan KR on 2018-03-28 05:52:04 EDT ---

This deployment uses the derive parameters workflow, which is enabled by the "-p" option in the deploy command. To use this feature, the nodes and flavors must be tagged with a matching profile, and the Overcloud<RoleName>Flavor parameters must provide the matching flavor name. In this case no flavor is specified in the parameters, so the flavor defaults to 'baremetal' and the workflow fails. Ensure the correct flavor name is provided.
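
The lookup described in this comment can be sketched in Python as follows. This is an illustration under stated assumptions, not the workflow's actual code: the function profile_for_role and the dictionaries passed to it are hypothetical, while the Overcloud<RoleName>Flavor parameter, the 'baremetal' default, and the error text come from this report. The 'capabilities:profile' key is the flavor property normally used for profile tagging.

```python
def profile_for_role(role_name, parameters, flavors):
    """Resolve the profile for a role via its flavor (hypothetical helper).

    parameters: deployment parameters, e.g. {"OvercloudComputeFlavor": "compute"}
    flavors: flavor name -> flavor properties (assumed shape for this sketch)
    """
    param_name = "Overcloud{}Flavor".format(role_name)
    # With no explicit flavor parameter, the flavor defaults to 'baremetal'.
    flavor_name = parameters.get(param_name, "baremetal")
    profile = flavors.get(flavor_name, {}).get("capabilities:profile")
    if not profile:
        raise ValueError(
            "Unable to determine profile for flavor (flavor name: {})".format(
                flavor_name))
    return profile


# Hypothetical flavors: 'baremetal' carries no profile, 'compute' does.
flavors = {
    "baremetal": {},
    "compute": {"capabilities:profile": "compute"},
}

print(profile_for_role("Compute", {"OvercloudComputeFlavor": "compute"}, flavors))
try:
    profile_for_role("Compute", {}, flavors)
except ValueError as exc:
    print(exc)
```

With scheduler hints and no flavor parameters, the second call is the path taken, which matches the failure reported for the Compute role.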

--- Additional comment from Rafal Szmigiel on 2018-03-28 06:37:21 EDT ---

Hey Saravanan,

This deployment uses SchedulerHints therefore no flavors other than baremetal should be used. Please check https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/advanced_overcloud_customization/sect-controlling_node_placement#sect-Assign_Specific_Node_IDs for more details.

Rafał

--- Additional comment from Saravanan KR on 2018-03-28 06:47:30 EDT ---

(In reply to Rafal Szmigiel from comment #5)
> Hey Saravanan,
> 
> This deployment uses SchedulerHints therefore no flavors other than
> baremetal should be used. Please check
> https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/
> html/advanced_overcloud_customization/sect-controlling_node_placement#sect-
> Assign_Specific_Node_IDs for more details.

The derive parameters workflow supports only role tagging and does NOT support SchedulerHints yet. Support was planned earlier, but work has not started. There are two options from here: either use derive parameters with role tagging, or use scheduler hints and provide the parameters manually without the -p option. I have added Alan Bishop and Jagan, who were working on the current version of derived parameters.

--- Additional comment from Rafal Szmigiel on 2018-03-28 06:51:28 EDT ---

Thanks for the clarification and for looping in Alan and Jagan.

Best Regards,

Rafal

--- Additional comment from Alan Bishop on 2018-03-28 07:56:49 EDT ---

Just to clarify Saravanan's comment, the Derived Parameters workflow relies on role tagging, but this is not incompatible with SchedulerHints. It's OK to continue to specify SchedulerHints, but you also need the nodes for which you want parameters derived (i.e. HCI) to be tagged with a role/profile. This is necessary for the Derived Parameters workflow to identify the nodes so that it can determine their hardware characteristics. This should provide a workaround until we can fix the workflow so that it can use just the SchedulerHints.
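
As a rough illustration of what identifying nodes from SchedulerHints alone could look like, the sketch below turns a hint such as 'overcloud-compute-%index%' into a regular expression and matches it against each node's 'node' capability. It is a sketch under assumptions, not the merged patch: the helper names and the sample node data are hypothetical, and the capabilities strings follow the 'node:<name>,boot_option:local' form used when assigning specific node IDs.

```python
import re


def hint_to_regex(hint):
    """Convert a scheduler hint like 'overcloud-compute-%index%' to a regex."""
    # Escape the literal parts and replace the %index% placeholder with digits.
    parts = [re.escape(part) for part in hint.split("%index%")]
    return re.compile(r"(\d+)".join(parts))


def node_capability(capabilities):
    """Extract the 'node' value from a capabilities string, or None."""
    for item in capabilities.split(","):
        key, sep, value = item.partition(":")
        if sep and key.strip() == "node":
            return value.strip()
    return None


def nodes_for_hint(hint, nodes):
    """Return names of nodes whose 'node' capability matches the hint."""
    pattern = hint_to_regex(hint)
    matched = []
    for name, capabilities in nodes.items():
        value = node_capability(capabilities)
        if value is not None and pattern.fullmatch(value):
            matched.append(name)
    return matched


# Hypothetical node inventory: node name -> capabilities string.
nodes = {
    "node-a": "node:overcloud-compute-0,boot_option:local",
    "node-b": "node:overcloud-controller-0,boot_option:local",
    "node-c": "profile:compute,boot_option:local",
}

print(nodes_for_hint("overcloud-compute-%index%", nodes))
```

Any node matched this way could then be used to read hardware characteristics for the role, without requiring a profile tag on the node.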

--- Additional comment from Alan Bishop on 2018-05-03 12:33:28 EDT ---

Patch merged upstream, and I've begun upstream backports to stable/queens and stable/pike.

Comment 2 Alan Bishop 2018-05-07 13:01:57 UTC
Patch has merged on upstream stable/queens.

Comment 12 Yogev Rabl 2018-08-22 01:54:08 UTC
Verified

Comment 14 errata-xmlrpc 2018-08-29 16:35:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2574

