Bug 1787592

Summary: [OSP16]Sriov minor update fails in controllers
Product: Red Hat OpenStack
Reporter: Candido Campos <ccamposr>
Component: openstack-tripleo-heat-templates
Assignee: Saravanan KR <skramaja>
Status: CLOSED ERRATA
QA Contact: Candido Campos <ccamposr>
Severity: high
Docs Contact:
Priority: urgent
Version: 16.0 (Train)
CC: cfontain, ekuris, kfida, mburns, ramishra, sclewis, skramaja, supadhya
Target Milestone: z1
Keywords: Reopened, Triaged
Target Release: 16.0 (Train on RHEL 8.1)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200131231705.39bf6c2.el8ost
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-03 09:45:05 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Candido Campos 2020-01-03 14:49:21 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:
deploy osp16 with sriov
minor update between:
RHOS_TRUNK-16.0-RHEL-8-20191213.n.5
RHOS_TRUNK-16.0-RHEL-8-20191224.n.0


Steps to Reproduce:

for i in controller-0 controller-1 controller-2; do openstack overcloud update run --stack overcloud --playbook all --limit $i ; done

Actual results:
Fails

Expected results:
Pass

Additional info:


TASK [tuned : Enable tuned profile] ********************************************
Friday 03 January 2020  12:48:19 +0000 (0:00:00.770)       0:11:34.215 ******** 
fatal: [controller-2]: FAILED! => {"changed": true, "cmd": ["tuned-adm", "profile", "cpu-partitioning"], "delta": "0:00:00.433576", "end": "2020-01-03 12:48:20.209823", "msg": "non-zero return code", "rc": 1, "start": "2020-01-03 12:48:19.776247", "stderr": "Cannot load profile(s) 'cpu-partitioning': Assertion 'isolated_cores contains online CPU(s)' failed.", "stderr_lines": ["Cannot load profile(s) 'cpu-partitioning': Assertion 'isolated_cores contains online CPU(s)' failed."], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
controller-2               : ok=182  changed=82   unreachable=0    failed=1    skipped=370  rescued=0    ignored=1   

Friday 03 January 2020  12:48:20 +0000 (0:00:00.994)       0:11:35.209 ******** 
=============================================================================== 

Ansible failed, check log at /var/lib/mistral/0a9a4c4e-c385-476a-bdbc-b11dfcd134ed/ansible.log.
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun [-] Exception occured while running the command: RuntimeError: Update failed with: Ansible failed, check log at /var/lib/mistral/0a9a4c4e-c385-476a-bdbc-b11dfcd134ed/ansible.log.
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun Traceback (most recent call last):
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun   File "/usr/lib/python3.6/site-packages/tripleoclient/command.py", line 32, in run
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun     super(Command, self).run(parsed_args)
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun   File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun     return super(Command, self).run(parsed_args)
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun   File "/usr/lib/python3.6/site-packages/cliff/command.py", line 185, in run
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun     return_code = self.take_action(parsed_args) or 0
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun   File "/usr/lib/python3.6/site-packages/tripleoclient/v1/overcloud_update.py", line 171, in take_action
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun     priv_key=key)
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun   File "/usr/lib/python3.6/site-packages/tripleoclient/utils.py", line 1194, in run_update_ansible_action
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun     verbosity=verbosity, extra_vars=extra_vars)
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun   File "/usr/lib/python3.6/site-packages/tripleoclient/workflows/package_update.py", line 127, in update_ansible
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun     raise RuntimeError('Update failed with: {}'.format(payload['message']))
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun RuntimeError: Update failed with: Ansible failed, check log at /var/lib/mistral/0a9a4c4e-c385-476a-bdbc-b11dfcd134ed/ansible.log.
2020-01-03 12:48:22.097 86514 ERROR tripleoclient.v1.overcloud_update.MinorUpdateRun 
2020-01-03 12:48:22.105 86514 ERROR openstack [-] Update failed with: Ansible failed, check log at /var/lib/mistral/0a9a4c4e-c385-476a-bdbc-b11dfcd134ed/ansible.log.: RuntimeError: Update failed with: Ansible failed, check log at /var/lib/mistral/0a9a4c4e-c385-476a-bdbc-b11dfcd134ed/ansible.log.
2020-01-03 12:48:22.105 86514 INFO osc_lib.shell [-] END return value: 1

Comment 2 Rabi Mishra 2020-01-03 15:30:52 UTC

*** This bug has been marked as a duplicate of bug 1787459 ***

Comment 3 Eran Kuris 2020-01-07 13:55:56 UTC
(In reply to Rabi Mishra from comment #2)
> 
> *** This bug has been marked as a duplicate of bug 1787459 ***

Are you sure it's the same issue as bz 1787459?

Comment 4 Rabi Mishra 2020-01-07 15:15:27 UTC
> Are you sure it's the same issue as bz 1787459?

What makes you think it's not? The traceback and the error are the same. Also I see the last run (on 2nd Jan) of the job is green[1].

[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/neutron/job/DFG-network-neutron-16_director-rhel-virthost-3cont_2comp-ipv4-vlan-sriov/lastBuild/

Comment 9 Saravanan KR 2020-01-10 06:10:12 UTC
Controller Role's tuned_profile is "throughput-performance".
ComputeSriov Role's tuned_profile is "cpu-partitioning".

The variable 'tuned_profile' with value 'cpu-partitioning' is applied to the
Controller role when import_role is used with vars.

    - import_role:
        name: tuned
      vars:
        tuned_profile: 'cpu-partitioning'

Even though the 'cpu-partitioning' profile is defined only for the ComputeSriov
role, under a condition on the 'ComputeSriov' role name, using import_role
applies the variable 'tuned_profile' with value 'cpu-partitioning' to the
whole PLAY. Under TripleO's role-specific implementation, role-specific
variables should be applied only to the specific TripleO role and should not
affect the other roles.

First, is this the expected Ansible behavior for import_role?
  https://docs.ansible.com/ansible/latest/modules/import_role_module.html#notes
According to the Ansible import_role documentation, it is expected: "Since
Ansible 2.7 variables defined in vars and defaults for the role are exposed at
playbook parsing time. Due to this, these variables will be accessible to
roles and tasks executed before the location of the import_role task."


For TripleO's role-specific parameter support, 'include_role' gives the
expected behavior: the variable is applied only to the included role (of that
TripleO role) and does not affect the entire PLAY. I have created a small
gist to illustrate the behavior.

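  - name: play1
    hosts: test
    tasks:
      # Runs before import_role, yet the variable is already visible here.
      - debug: var=test_role_var1
      - import_role:
          name: test-role
        vars:
          test_role_var1: 'test_var1_local'

  - name: play2
    hosts: test
    tasks:
      # Separate PLAY: the variable from play1 is not visible here.
      - debug: var=test_role_var1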

In this playbook, the variable 'test_role_var1' is defined for the entire
PLAY 'play1' and does not affect 'play2'. Changing import_role to
include_role makes the variable 'test_role_var1' defined only inside
'test-role'. The complete sample code is available in this git repo for
better understanding.

  https://github.com/krsacme/ansible-include-vs-import
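
For comparison, here is a sketch of the include_role variant of play1. It
reuses the names from the gist above ('test', 'test-role', 'test_role_var1')
and assumes the variable is not defined anywhere else, such as in inventory or
group_vars. Per the analysis above, both debug tasks would be expected to
report the variable as undefined, since it should only exist inside the role.

```yaml
# Sketch only: same play as above, with the static import_role replaced by
# the dynamic include_role.
- name: play1
  hosts: test
  tasks:
    # Before the include: the variable is expected to be undefined here.
    - debug:
        var: test_role_var1
    - include_role:
        name: test-role
      vars:
        test_role_var1: 'test_var1_local'
    # After the include: the variable should not have leaked into the play.
    - debug:
        var: test_role_var1
```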


But the same import_role is present in deployment too, so why does it affect
only the minor update?

A static import affects only the PLAY in which it is included; it does
not affect other PLAYs. That is the difference between deployment and minor
update.


Deployment (only relevant content of deploy_steps_playbook.yaml):

    - hosts: Controller:overcloud                                                         
      name: Overcloud deploy step tasks for step 0                                        
      tasks:                                                                              
        - import_tasks: deploy_steps_tasks_step_0.yaml                                    
      tags:                                                                               
        - step0  


Minor Update (only relevant content of update_steps_playbook.yaml):

    - hosts: Controller
      name: Run update
      tasks:
        - include_tasks: update_steps_tasks.yaml
          with_sequence: start=0 end=5
          loop_control:
            loop_var: step
        - import_tasks: Controller/host_prep_tasks.yaml
          when: tripleo_role_name == 'Controller'
        - import_tasks: deploy_steps_tasks_step_0.yaml
          vars:
            step: 0
        - import_tasks: common_deploy_steps_tasks_step_1.yaml


In case of deployment, 'deploy_steps_tasks_step_0.yaml' is used in a separate
PLAY, which will affect only the step0.

In case of minor update, 'deploy_steps_tasks_step_0.yaml' is used in the same
PLAY as 'host_prep_tasks.yaml', where tuned is invoked for every role with its
respective tuned profile (for the controller, it should be
throughput-performance). Because import_role is a static import, the variable
'tuned_profile' is imported from the ComputeSriov role inside the
'deploy_steps_tasks_step_0.yaml' file, which sets 'tuned_profile' to
'cpu-partitioning' for the whole PLAY.
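
To confirm this on a live run, a debug task placed just before the tuned role
could print the value each host would receive. This is only a sketch: the task
is not part of tripleo-heat-templates, and it assumes 'tripleo_role_name' and
'tuned_profile' are visible as variables at that point in the play.

```yaml
# Hypothetical diagnostic task, not part of the TripleO templates.
# default() keeps the task from failing if a variable is not set.
- name: Show the tuned profile that would be applied on this host
  debug:
    msg: >-
      {{ inventory_hostname }}
      role={{ tripleo_role_name | default('undefined') }}
      tuned_profile={{ tuned_profile | default('undefined') }}
```

If the analysis holds, a controller would report 'cpu-partitioning' here
instead of 'throughput-performance'.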

Solution to this minor update problem:

option-1) As in deployment, separate the step0 tasks out into a different play.
Though this will solve the problem for now, it may cause trouble again when a
change in step0 adds support for different role-specific variables.

option-2) Change import_role to include_role wherever role-specific parameters
are used. This gives the expected behavior, but may lose the advantages of
static import.
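
A minimal sketch of option-2, applied to the tuned snippet from the top of
this comment. The 'when' condition is an assumed form of the ComputeSriov
check mentioned earlier, and the task name is made up; this is not the actual
patch.

```yaml
# Sketch of option-2: dynamic include, so 'tuned_profile' should stay scoped
# to the tuned role instead of being exposed to the whole PLAY.
- name: Apply the role-specific tuned profile   # hypothetical task name
  include_role:
    name: tuned
  vars:
    tuned_profile: 'cpu-partitioning'
  # Assumed form of the role-name condition described above.
  when: tripleo_role_name == 'ComputeSriov'
```

One trade-off to keep in mind: include_role is resolved at run time, so the
role's tasks are not expanded when the playbook is parsed (for example, they
do not show up in --list-tasks).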

I believe include_role is the appropriate solution for handling TripleO's
role-specific parameters. I will raise a bug upstream to continue the
discussion and settle on the solution.

Comment 15 Alex McLeod 2020-02-19 12:39:18 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 18 errata-xmlrpc 2020-03-03 09:45:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0655