Bug 1975240

Summary: [update] from 16.1 to 16.2, when enabling tsx flag, compute node get restarted during update and ping loss occurs.
Product: Red Hat OpenStack Reporter: Sofer Athlan-Guyot <sathlang>
Component: tripleo-ansible Assignee: David Vallee Delisle <dvd>
Status: CLOSED ERRATA QA Contact: Jason Grosso <jgrosso>
Severity: urgent Docs Contact: Vlada Grosu <vgrosu>
Priority: urgent    
Version: 16.2 (Train) CC: astillma, dvalleed, dvd, jamsmith, jgrosso, jniu, jpateteg, jpretori, mburns, mciecier, mschuppe, shrjoshi, spower, supadhya, vgrosu
Target Milestone: rc Keywords: Patch, Triaged
Target Release: 16.2 (Train on RHEL 8.4)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tripleo-ansible-0.7.1-2.20210603175839.el8ost.7 openstack-tripleo-heat-templates-11.5.1-2.20210603174821.el8ost.7 Doc Type: Known Issue
Doc Text:
Starting with Red Hat Enterprise Linux (RHEL) version 8.3, support for the Intel Transactional Synchronization Extensions (TSX) feature is disabled by default. Currently, this causes instance live migration to fail when migrating from hosts where the TSX kernel argument is enabled to hosts where the TSX kernel argument is disabled. + This impact applies only to Intel hosts that support the TSX feature. For more information about the CPUs that are affected by this issue, see link:https://access.redhat.com/articles/6101171#affected-configurations-17[Affected Configurations]. + For more information, review the following Red Hat Knowledgebase solution link:https://access.redhat.com/solutions/6036141[Guidance on Intel TSX impact on OpenStack guests].
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-15 07:16:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sofer Athlan-Guyot 2021-06-23 10:26:21 UTC
Description of problem:

Hi,

Updating OSP16.1 to OSP16.2 requires the new kernel flag tsx=on so that we are able to migrate VMs[1].

Enabling the new parameter has the side effect of rebooting the compute node *during* the update: this causes a first data plane cut (ping loss). Afterwards, the data plane doesn't recover.

To work around this issue we need to add a preparation step: the /etc/default/grub file must already contain the tsx parameter before the update.

Somewhat related to https://bugzilla.redhat.com/show_bug.cgi?id=1923165, but for updates.

[1] See "Minor update: from RHOSP-16.1 to RHOSP-16.2" there https://access.redhat.com/solutions/6036141

Comment 1 Sofer Athlan-Guyot 2021-06-23 11:13:40 UTC
Hey @skramaja,

do you think there's a way to take this situation into account by mangling https://opendev.org/openstack/tripleo-ansible/src/branch/stable/train/tripleo_ansible/roles/tripleo-kernel/tasks/kernelargs.yml#L21-L27 ?

I think the logic would be too complicated, but I'd rather have your input on this.

The problem now is that customers have to manually update /etc/default/grub before running an update from 16.1 to 16.2. And that could be a lot of nodes.

Thanks,

Comment 2 Sofer Athlan-Guyot 2021-06-23 11:16:34 UTC
Hi @vgrosu,

I've added some more actions to be done for the update from 16.1 to 16.2 in [1]. Basically, one has to do this:

On every node of the compute role:

   grep tsx /etc/default/grub || sed -i -e "s/rhgb/rhgb tsx=on/" /etc/default/grub

and then make sure to update their templates to have:

parameter_defaults:
  ComputeParameters:
    KernelArgs: "tsx=on"

We need a section for this in the warning part of the update from 16.1 to 16.2.

Thanks,

[1] https://access.redhat.com/node/6036141/draft
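
The grep check in the comment above can be wrapped in a small helper to see whether a node already carries a tsx argument, either in /etc/default/grub or in the running kernel's /proc/cmdline. This is only a sketch: the function name `has_tsx_arg` is made up for illustration, and the match is purely textual (it would also match a commented-out line).

```shell
# Hypothetical helper: return 0 if the given file contains a
# tsx=<value> kernel argument as a separate word, 1 otherwise.
# Defaults to /proc/cmdline (the command line of the running kernel).
has_tsx_arg() {
    cmdline_file="${1:-/proc/cmdline}"
    # The argument must start the line or follow whitespace or a
    # double quote, so that e.g. "notsx=on" is not a match.
    grep -Eq '(^|[[:space:]"])tsx=[^[:space:]"]+' "$cmdline_file"
}
```

For example, `has_tsx_arg /etc/default/grub || echo "tsx not configured"` reports nodes that still need the argument, and `has_tsx_arg` with no argument checks the running kernel.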

Comment 3 Sofer Athlan-Guyot 2021-06-24 16:33:11 UTC
Hi,

so the previous solution didn't solve it; we need:


   echo "#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS" |sudo tee -a /etc/default/grub

to be run on every node of the compute role to prevent the reboot during update.

I've updated https://access.redhat.com/node/6036141/draft accordingly.
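
Since the command above appends the marker every time it runs, a guarded variant may be safer when the step is repeated across many compute nodes. The following is a minimal sketch, not part of the KB article: the function name `add_tripleo_marker` is hypothetical, and it assumes that the mere presence of the marker string anywhere in the file is what matters.

```shell
# Hypothetical helper: append the TripleO kernel-args marker to a grub
# defaults file only when the marker string is not already in it.
# Defaults to /etc/default/grub; pass another path to try it on a copy.
add_tripleo_marker() {
    grub_file="${1:-/etc/default/grub}"
    marker="#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS"
    if grep -qF "$marker" "$grub_file"; then
        echo "marker already present in $grub_file"
    else
        echo "$marker" >> "$grub_file"
        echo "marker appended to $grub_file"
    fi
}
```

Run as root on each compute node (for example `add_tripleo_marker` with no argument); running it a second time leaves the file unchanged.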

Comment 4 Vlada Grosu 2021-06-28 10:37:56 UTC
Hi Sofer,

I've opened BZ#1975450 to track all the docs changes required for the 16.2 "Keeping Red Hat OpenStack Platform Updated" guide.

I'll update this ticket with details of the docs draft.

Many thanks,
Vlada

Comment 5 Yaniv Kaul 2021-07-01 09:43:13 UTC
It's unclear to me why this is a TestOnly item - yet it's assigned - can you clarify?

Comment 6 Mikolaj Ciecierski 2021-07-01 13:43:17 UTC
Applying the tsx flag is left to the discretion of the user; there's no automation, but there is a validation.
It's described as a manual procedure in an already existing KB article, so it was assigned to sathlang to test/improve that article.

Comment 9 Sofer Athlan-Guyot 2021-07-06 12:44:40 UTC
Started the review.

Unrelated but I noticed:

BZ#1872404 - restarting nodes in parallel while maintaining quorum creates an unexpected node shutdown
    Until this issue is resolved, for nodes based on composable roles, you must update the Database role first, before you can update Controller, Messaging, Compute, Ceph, and other roles. 

I'm not sure where it comes from but doesn't look right to me. Do you want another bz for this?

Comment 10 Vlada Grosu 2021-07-07 09:27:09 UTC
Hi Sofer,

Thank you for the review. I'll address your comments shortly.

Regarding BZ#1872404 - the bug is still open and it looks like you've reported it. However, if this instruction to update the nodes in a specific sequence is wrong, I'd be happy to remove it. 

(In reply to Sofer Athlan-Guyot from comment #9)
> Started the review.
> 
> Unrelated but I noticed:
> 
> BZ#1872404 - restarting nodes in parallel while maintaining quorum creates
> an unexpected node shutdown
>     Until this issue is resolved, for nodes based on composable roles, you
> must update the Database role first, before you can update Controller,
> Messaging, Compute, Ceph, and other roles. 
> 
> I'm not sure where it comes from but doesn't look right to me. Do you want
> another bz for this?

I have opened BZ#1975450 to update the doc for 16.2 so we can use that bug to track this change. Please feel free to comment on it.

Many thanks,
Vlada

Comment 12 David Vallee Delisle 2021-07-22 15:24:44 UTC
Adding 2 patches to help with this issue:
- 801518 will give the operator the option to opt out of automated reboots in all cases.
- 801509 will prevent a node reboot when the only kernel argument added is tsx=xxx and the node is already provisioned (validated by the presence of nova_libvirt and nova.conf).
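
The combined behaviour of the two patches can be pictured as a small decision function. This is only an illustrative sketch based on the descriptions above, not the code from the patches: the function name `reboot_decision` and its three inputs are hypothetical, and the "already provisioned" check is reduced to a yes/no flag.

```shell
# Hypothetical decision helper (not the patch code).
#   $1: space-separated kernel arguments being added
#   $2: "yes" if the node is already provisioned
#   $3: "yes" if the operator opted out of automated reboots
# Prints "reboot" or "skip".
reboot_decision() {
    added_args="$1"
    provisioned="$2"
    opt_out="$3"

    # Operator opt-out wins in every case.
    if [ "$opt_out" = "yes" ]; then
        echo "skip"
        return 0
    fi

    # Split the added arguments into positional parameters.
    set -- $added_args
    if [ "$#" -eq 0 ]; then
        # Nothing was added, so there is nothing to reboot for.
        echo "skip"
        return 0
    fi

    # Skip only when tsx=<value> is the single added argument
    # and the node is already provisioned.
    if [ "$#" -eq 1 ] && [ "$provisioned" = "yes" ]; then
        case "$1" in
            tsx=*)
                echo "skip"
                return 0
                ;;
        esac
    fi

    echo "reboot"
}
```

For instance, `reboot_decision "tsx=on" yes no` prints `skip`, while `reboot_decision "tsx=on hugepages=64" yes no` prints `reboot`.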

Comment 14 Sofer Athlan-Guyot 2021-07-29 16:25:08 UTC
Hi @vgrosu ,

so with the patch mentioned above, we have to modify the KCS article and remove the manual modification of the /etc/default/grub configuration.

@dvalleed, do you need help with testing the patch mentioned above? I'm on PTO for some time, but I can prep the workaround and start testing this; @mciecier should be able to validate them.

Comment 21 Vlada Grosu 2021-08-03 12:40:20 UTC
Hi folks,

I can confirm I've removed the manual steps to edit `/etc/default/grub` configuration from the solution article: https://access.redhat.com/solutions/6036141 as requested in comment #11 and comment #14.

Thank you.

Comment 33 Steve Baker 2021-08-17 19:51:01 UTC
*** Bug 1993299 has been marked as a duplicate of this bug. ***

Comment 41 errata-xmlrpc 2021-09-15 07:16:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483