Bug 1923165

Summary: [OSP-16.2] [Upgrades][TripleO] Add a config to disable Intel "TSX" on RHEL-8.3 kernel
Product: Red Hat OpenStack
Reporter: Kashyap Chamarthy <kchamart>
Component: openstack-tripleo-heat-templates
Assignee: David Vallee Delisle <dvd>
Status: CLOSED ERRATA
QA Contact: James Parker <jparker>
Severity: high
Docs Contact:
Priority: high
Version: 16.2 (Train)
CC: amodi, dvd, jamsmith, jparker, jpretori, mburns, mschuppe, sathlang, vgrosu
Target Milestone: rc
Keywords: Patch, Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-tripleoclient-12.5.1-2.20210527114807.0a0296f.el8ost openstack-tripleo-heat-templates-11.5.1-2.20210528194817.d7fdfee.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1965811 1970949 2002346
Environment:
Last Closed: 2021-09-15 07:11:33 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version: Train
Embargoed:
Bug Depends On:
Bug Blocks: 1921070, 1965811, 1970949, 1981432, 2002346

Description Kashyap Chamarthy 2021-02-01 13:45:06 UTC
Description
-----------

Fast-forward upgrade from OSP-13 (RHEL-7.9) to OSP-16.2 (RHEL-8.3)
fails[1] during live migration with:

    [...] libvirt.libvirtError: operation failed: guest CPU doesn't
    match specification: missing features: hle,rtm

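The failure occurs because the RHEL-8.3 kernel (destination host)
disables Intel "TSX" (Transactional Synchronization Extensions) by
default.  With TSX disabled, the 'hle' and 'rtm' CPU features are no
longer available, so the destination cannot provide the CPU that the
guest was started with.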

This was discovered during OSP fast-forward upgrades testing[+] where a
guest was being live-migrated from RHEL-7.9 (with TSX=on) to RHEL-8.3
(breaking change: TSX=off), and the migration failed with the
above-mentioned error.

[+] https://bugzilla.redhat.com/show_bug.cgi?id=1921070#c14
    ("Live migration during OSP16.2 hybrid state from RHEL7.9 to
    RHEL8.3 not working")
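
A quick way to see whether a host still exposes the two features named
in the error is to look at its CPU flags.  The sketch below is only an
illustration: the function name `has_tsx_flags` and its optional file
argument are assumptions made for this example, not part of this bug or
of any TripleO tooling.

```shell
#!/bin/sh
# Sketch: succeed only if a cpuinfo-style file lists both 'hle' and 'rtm'.
# has_tsx_flags and its optional file argument are hypothetical names.
has_tsx_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Use the first "flags" line; all cores normally report the same set.
    flags=$(grep '^flags' "$cpuinfo" | head -n 1)
    [ -n "$flags" ] || return 1
    for feature in hle rtm; do
        # -w matches the feature name as a whole word only.
        printf '%s\n' "$flags" | grep -qw "$feature" || return 1
    done
    return 0
}
```

Running `has_tsx_flags && echo present || echo missing` on the source
and destination hosts should show the mismatch described above, assuming
the CPU itself supports TSX.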


Why?
----

The RHEL-8.3 kernel disables Intel TSX by default because it is
considered a potential security risk:

    https://bugzilla.redhat.com/show_bug.cgi?id=1828642
    kernel: Disable Intel TSX by default on newer CPUs

Still, it is not acceptable for the RHEL-8.3 kernel to break user-space
in a minor RHEL release.  (See also:
https://bugzilla.redhat.com/show_bug.cgi?id=1921070#c16)
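
To tell whether a host runs with the kernel default or with an explicit
override, the boot command line can be inspected for a `tsx=` parameter.
The following is a minimal sketch under that assumption; the function
name `tsx_cmdline_value` is hypothetical, and "unset" is simply the
label this example prints when no `tsx=` parameter is present.

```shell
#!/bin/sh
# Sketch: print the value of tsx= from a kernel command-line string,
# or "unset" if the parameter is absent (the kernel default then applies).
# tsx_cmdline_value is a hypothetical helper name.
tsx_cmdline_value() (
    set -f   # no glob expansion while splitting the command line
    cmdline="${1:-$(cat /proc/cmdline)}"
    value="unset"
    for arg in $cmdline; do
        case "$arg" in
            tsx=*) value="${arg#tsx=}" ;;   # this sketch keeps the last one seen
        esac
    done
    printf '%s\n' "$value"
)
```

Called without an argument, the function reads /proc/cmdline of the
running host.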


Workaround for OSP upgrades
---------------------------

This is unpalatable, but unfortunately there is currently no other option:

(1) add a TripleO config attribute that enables TSX on the destination
    RHEL-8.3 host by setting the following in /etc/default/grub:

        GRUB_CMDLINE_LINUX_DEFAULT="[...] tsx=on"

    ... and reboot the 8.3 host;

(2) live-migrate the guests from the RHEL-7.9 host to the RHEL-8.3 host;

(3) turn TSX off again on the RHEL-8.3 host kernel command line, and
    shut down the guests;

(4) reboot the 8.3 host again, and start the guests.
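
The grub edits in steps (1) and (3) could be scripted roughly as
follows.  This is a hand-rolled sketch, not the TripleO implementation:
the function name `set_grub_tsx` is hypothetical, and the code assumes
that GRUB_CMDLINE_LINUX_DEFAULT already exists as a single
double-quoted line in the file.

```shell
#!/bin/sh
# Sketch: set tsx=<mode> in GRUB_CMDLINE_LINUX_DEFAULT of a grub defaults file.
# set_grub_tsx is a hypothetical helper; it assumes the variable is already
# present on one line, in double quotes.
set_grub_tsx() {
    mode="$1"
    grub_file="${2:-/etc/default/grub}"
    case "$mode" in
        on|off|auto) ;;
        *) echo "usage: set_grub_tsx on|off|auto [file]" >&2; return 2 ;;
    esac
    scratch=$(mktemp) || return 1
    # First expression drops any existing tsx=... token; the second appends
    # the requested one just before the closing quote.
    if sed \
        -e '/^GRUB_CMDLINE_LINUX_DEFAULT="/ s/ *tsx=[a-z]*//g' \
        -e '/^GRUB_CMDLINE_LINUX_DEFAULT="/ s/" *$/ tsx='"$mode"'"/' \
        "$grub_file" > "$scratch"
    then
        # Overwrite in place so ownership and permissions are preserved.
        cat "$scratch" > "$grub_file"
        status=$?
    else
        status=1
    fi
    rm -f "$scratch"
    return "$status"
}
```

For example, `set_grub_tsx on` before step (1) and `set_grub_tsx off`
in step (3), run as root.  Editing /etc/default/grub alone normally does
not change the running configuration; the GRUB configuration still has
to be regenerated (for instance with grub2-mkconfig) before the reboot.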

Comment 7 Sofer Athlan-Guyot 2021-06-24 16:37:48 UTC
Hi,

I've tested the tsx=on flag during an update from 16.1 to 16.2, following https://access.redhat.com/node/6036141/, and it fails; see [1].

The compute node is rebooted during the update because of tripleo-ansible/.../kernelargs.yml [2].

The workaround is to run:

echo "#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS" |sudo tee -a /etc/default/grub

on every compute node before the update.
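
Because the command above appends unconditionally, running it twice
leaves two marker lines in the file.  A guarded variant might look like
the sketch below; the function name `ensure_kernel_args_marker` is
hypothetical, and the exact-line check is an assumption of this example,
not something taken from the tripleo-ansible role.

```shell
#!/bin/sh
# Sketch: append the marker line only if it is not already in the file.
# ensure_kernel_args_marker is a hypothetical helper name. It writes the
# file directly, so it needs to run as root for /etc/default/grub.
ensure_kernel_args_marker() {
    grub_file="${1:-/etc/default/grub}"
    marker='#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS'
    # -x: whole-line match, -F: treat the marker as a fixed string.
    if grep -qxF "$marker" "$grub_file"; then
        return 0
    fi
    printf '%s\n' "$marker" >> "$grub_file"
}
```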

I've updated the KB article accordingly in [3], but it still needs to be reviewed and published.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1975240
[2] https://opendev.org/openstack/tripleo-ansible/src/branch/stable/train/tripleo_ansible/roles/tripleo-kernel/tasks/kernelargs.yml#L89-L103
[3] https://access.redhat.com/node/6036141/draft

Comment 18 errata-xmlrpc 2021-09-15 07:11:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483