Bug 1923165 - [OSP-16.2] [Upgrades][TripleO] Add a config to disable Intel "TSX" on RHEL-8.3 kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.2 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 16.2 (Train on RHEL 8.4)
Assignee: David Vallee Delisle
QA Contact: James Parker
URL:
Whiteboard:
Depends On:
Blocks: 1921070 1965811 1970949 1981432 2002346
Reported: 2021-02-01 13:45 UTC by Kashyap Chamarthy
Modified: 2023-07-28 20:44 UTC

Fixed In Version: python-tripleoclient-12.5.1-2.20210527114807.0a0296f.el8ost openstack-tripleo-heat-templates-11.5.1-2.20210528194817.d7fdfee.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1965811 1970949 2002346
Environment:
Last Closed: 2021-09-15 07:11:33 UTC
Target Upstream Version: Train
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Gerrithub.io | 517855 | 0 | None | None | None | 2021-05-28 16:46:24 UTC
Launchpad | 1916758 | 0 | None | None | None | 2021-02-24 14:26:42 UTC
OpenStack gerrit | 775729 | 0 | None | ABANDONED | Integrating the validation of TSX flag for computes | 2021-05-18 06:18:43 UTC
OpenStack gerrit | 782993 | 0 | None | ABANDONED | Introducing TsxEnabled global config for overcloud | 2021-05-18 06:18:49 UTC
OpenStack gerrit | 783969 | 0 | None | ABANDONED | Computes the TSX Flags for the compute nodes | 2021-05-18 06:18:47 UTC
OpenStack gerrit | 791089 | 0 | None | MERGED | [train-only] post stack creation tsx validation | 2021-05-27 05:53:30 UTC
OpenStack gerrit | 792309 | 0 | None | MERGED | [train-only] Adding ForceNoTsx flag | 2021-06-23 10:26:23 UTC
Red Hat Issue Tracker | OSP-802 | 0 | None | None | None | 2022-09-07 09:06:50 UTC
Red Hat Product Errata | RHEA-2021:3483 | 0 | None | None | None | 2021-09-15 07:12:16 UTC

Description Kashyap Chamarthy 2021-02-01 13:45:06 UTC
Description
-----------

Fast-forward upgrade from OSP-13 (RHEL-7.9) to OSP-16.2 (RHEL-8.3)
fails[1] during live migration with:

    [...] libvirt.libvirtError: operation failed: guest CPU doesn't
    match specification: missing features: hle,rtm

The failure occurs because the RHEL-8.3 kernel (on the destination host)
disables Intel "TSX" by default, and disabling TSX also disables the
'hle' and 'rtm' CPU features.
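
To see whether a given host currently exposes the two features named in
the libvirt error, one can look at the "flags" line of /proc/cpuinfo.
The following is a minimal sketch, not part of the original report; the
function name missing_tsx_flags is hypothetical, and it assumes the
kernel hides 'hle' and 'rtm' from /proc/cpuinfo when TSX is disabled.

```python
def missing_tsx_flags(cpuinfo_text, required=("hle", "rtm")):
    """Return the required CPU flags that are absent from cpuinfo text.

    cpuinfo_text is the content of /proc/cpuinfo (or a sample of it).
    Hypothetical helper for illustration only.
    """
    present = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Each logical CPU has its own "flags" line; collect them all.
        if key.strip() == "flags":
            present.update(value.split())
    return [flag for flag in required if flag not in present]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print("missing features:", missing_tsx_flags(fh.read()))
    except OSError as exc:
        print("could not read /proc/cpuinfo:", exc)
```

An empty list on the source host and ['hle', 'rtm'] on the destination
host would be consistent with the migration error quoted above.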

This was discovered during OSP fast-forward upgrades testing[+] where a
guest was being live-migrated from RHEL-7.9 (with TSX=on) to RHEL-8.3
(breaking change: TSX=off), and the migration failed with the
above-mentioned error.

[+] https://bugzilla.redhat.com/show_bug.cgi?id=1921070#c14 — Live
    migration during OSP16.2 hybrid state from RHEL7.9 to RHEL8.3 not
    working


Why?
----

RHEL-8.3 kernel disabled Intel TSX by default, because it is considered
a potential security risk:

    https://bugzilla.redhat.com/show_bug.cgi?id=1828642
    kernel: Disable Intel TSX by default on newer CPUs

Still, it is not acceptable for RHEL-8.3 kernel to break user-space in a
minor RHEL release.  (See also:
https://bugzilla.redhat.com/show_bug.cgi?id=1921070#c16)


Workaround for OSP upgrades
---------------------------

This is unpalatable, but unfortunately there's no other option currently:

(1) have a TripleO config attribute that will enable TSX on the
    destination RHEL-8.3 host; set the following in /etc/default/grub:

        GRUB_CMDLINE_LINUX_DEFAULT="[...] tsx=on" 

    ... and reboot the 8.3 host;

(2) live-migrate the guests from the RHEL-7.9 host to the RHEL-8.3 host;

(3) now turn off TSX on the RHEL-8.3 host kernel command-line;
    shut down the guests;

(4) reboot the 8.3 host again, and start the guests.
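
The edits to /etc/default/grub in steps (1) and (3) can be sketched as a
small text transformation. This is only an illustration under stated
assumptions: the function name set_tsx is hypothetical, it handles only
a simple double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, and it does not
write the file back, regenerate the bootloader configuration, or reboot.

```python
import re

_KEY = "GRUB_CMDLINE_LINUX_DEFAULT"
_LINE_RE = re.compile(r'^(\s*%s=")(.*)"\s*$' % _KEY)


def set_tsx(grub_text, value):
    """Return grub_text with tsx=<value> set on the default command line.

    value is a string such as "on" or "off"; None removes any tsx=
    argument. Hypothetical helper for illustration only.
    """
    out = []
    found = False
    for line in grub_text.splitlines():
        match = _LINE_RE.match(line)
        if match:
            found = True
            # Drop any existing tsx= argument, then append the new one.
            args = [a for a in match.group(2).split()
                    if not a.startswith("tsx=")]
            if value is not None:
                args.append("tsx=" + value)
            line = '%s%s"' % (match.group(1), " ".join(args))
        out.append(line)
    if not found and value is not None:
        # No such line yet: add one carrying only the tsx argument.
        out.append('%s="tsx=%s"' % (_KEY, value))
    return "\n".join(out) + "\n"
```

For step (1) one would call set_tsx(text, "on"); for step (3), either
set_tsx(text, None) or set_tsx(text, "off"), depending on whether the
kernel default or an explicit setting is wanted.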

Comment 7 Sofer Athlan-Guyot 2021-06-24 16:37:48 UTC
Hi,

I've tested the tsx=on flag during an update from 16.1 to 16.2 according to https://access.redhat.com/node/6036141/ and this fails; see [1].

A reboot of the compute node happens during the update due to tripleo-ansible/.../kernelargs.yml [2].

The workaround is to have:

echo "#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS" |sudo tee -a /etc/default/grub

executed on every compute node before the update.
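
Because "tee -a" appends on every run, repeating the command adds the
marker line again each time. A minimal sketch of an idempotent variant
follows; the function name ensure_marker is hypothetical, and how
tripleo-ansible itself detects the marker is not restated here, so the
exact-line comparison is an assumption.

```python
MARKER = "#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS"


def ensure_marker(grub_text, marker=MARKER):
    """Return grub_text with the marker line appended exactly once.

    grub_text is the content of /etc/default/grub. Hypothetical helper
    for illustration only; it does not write the file back.
    """
    if any(line.strip() == marker for line in grub_text.splitlines()):
        return grub_text
    # Make sure the marker starts on its own line.
    if grub_text and not grub_text.endswith("\n"):
        grub_text += "\n"
    return grub_text + marker + "\n"
```

Applying the function to its own output leaves the text unchanged, so it
is safe to run more than once on the same compute node.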

I've updated the KB article accordingly in [3], but it needs to be reviewed and published.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1975240
[2] https://opendev.org/openstack/tripleo-ansible/src/branch/stable/train/tripleo_ansible/roles/tripleo-kernel/tasks/kernelargs.yml#L89-L103
[3] https://access.redhat.com/node/6036141/draft

Comment 18 errata-xmlrpc 2021-09-15 07:11:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483

