Bug 1975240
Summary: | [update] from 16.1 to 16.2, when enabling tsx flag, compute node get restarted during update and ping loss occurs. | ||
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Sofer Athlan-Guyot <sathlang> |
Component: | tripleo-ansible | Assignee: | David Vallee Delisle <dvd> |
Status: | CLOSED ERRATA | QA Contact: | Jason Grosso <jgrosso> |
Severity: | urgent | Docs Contact: | Vlada Grosu <vgrosu> |
Priority: | urgent | ||
Version: | 16.2 (Train) | CC: | astillma, dvalleed, dvd, jamsmith, jgrosso, jniu, jpateteg, jpretori, mburns, mciecier, mschuppe, shrjoshi, spower, supadhya, vgrosu |
Target Milestone: | rc | Keywords: | Patch, Triaged |
Target Release: | 16.2 (Train on RHEL 8.4) | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | tripleo-ansible-0.7.1-2.20210603175839.el8ost.7 openstack-tripleo-heat-templates-11.5.1-2.20210603174821.el8ost.7 | Doc Type: | Known Issue |
Doc Text: |
Starting with Red Hat Enterprise Linux (RHEL) version 8.3, support for the Intel Transactional Synchronization Extensions (TSX) feature is disabled by default. Currently, this causes instance live migration to fail when migrating from hosts where the TSX kernel argument is enabled to hosts where the TSX kernel argument is disabled.
+
This impact applies only to Intel hosts that support the TSX feature. For more information about the CPUs that are affected by this issue, see link:https://access.redhat.com/articles/6101171#affected-configurations-17[Affected Configurations].
+
For more information, review the following Red Hat Knowledgebase solution link:https://access.redhat.com/solutions/6036141[Guidance on Intel TSX impact on OpenStack guests].
|
Last Closed: | 2021-09-15 07:16:23 UTC | Type: | Bug |
Description
Sofer Athlan-Guyot
2021-06-23 10:26:21 UTC
Hey @skramaja, do you think there's a way to take this situation into account by mangling https://opendev.org/openstack/tripleo-ansible/src/branch/stable/train/tripleo_ansible/roles/tripleo-kernel/tasks/kernelargs.yml#L21-L27 ? I think the logic would be too complicated, but I'd rather have your input on this. The problem now is that customers have to manually update /etc/default/grub before running an update from 16.1 to 16.2, and that could be a lot of nodes.

Thanks,

Hi @vgrosu, I've added some more actions to be done for the update from 16.1 to 16.2 in [1]. Basically one has to do this:

On every node from the compute role:

    grep tsx /etc/default/grub || sed -i -e "s/rhgb/rhgb tsx=on/" /etc/default/grub

and then make sure to update their templates to have:

    parameter_defaults:
      ComputeParameters:
        KernelArgs: "tsx=on"

We need a section in the warning part of the update from 16.1 to 16.2.

Thanks,

[1] https://access.redhat.com/node/6036141/draft

Hi, so the previous solution didn't solve it. We need:

    echo "#TRIPLEO_HEAT_TEMPLATE_KERNEL_ARGS" | sudo tee -a /etc/default/grub

to be run on every node of the compute role to prevent the reboot during the update. I've updated https://access.redhat.com/node/6036141/draft accordingly.

Hi Sofer,

I've opened BZ#1975450 to track all the docs changes required for the 16.2 Keeping Red Hat OpenStack Platform Updated guide. I'll update this ticket with details of the docs draft.

Many thanks,
Vlada

It's unclear to me why this is a TestOnly item, yet it's assigned. Can you clarify?

Applying the tsx flag is left at the discretion of the user. There's no automation, but there is a validation. It's described as a manual procedure in an already existing kb article, so it was assigned to sathlang to test/improve the existing kb article.

Started the review.
Unrelated but I noticed:

    BZ#1872404 - restarting nodes in parallel while maintaining quorum creates an unexpected node shutdown
    Until this issue is resolved, for nodes based on composable roles, you must update the Database role first, before you can update Controller, Messaging, Compute, Ceph, and other roles.

I'm not sure where it comes from but it doesn't look right to me. Do you want another bz for this?

Hi Sofer,

Thank you for the review. I'll address your comments shortly. Regarding BZ#1872404: the bug is still open and it looks like you've reported it. However, if this instruction to update the nodes in a specific sequence is wrong, I'd be happy to remove it.

(In reply to Sofer Athlan-Guyot from comment #9)
> Started the review.
>
> Unrelated but I noticed:
>
> BZ#1872404 - restarting nodes in parallel while maintaining quorum creates
> an unexpected node shutdown
> Until this issue is resolved, for nodes based on composable roles, you
> must update the Database role first, before you can update Controller,
> Messaging, Compute, Ceph, and other roles.
>
> I'm not sure where it comes from but doesn't look right to me. Do you want
> another bz for this?

I have opened BZ#1975450 to update the doc for 16.2 so we can use that bug to track this change. Please feel free to comment on it.

Many thanks,
Vlada

Adding 2 patches to help with this issue:
- 801518 will give the operator the possibility to opt out of automated reboots no matter what.
- 801509 will prevent a reboot of the nodes when the only kernelargs added was tsx=xxx and the node is already provisioned (validating the presence of nova_libvirt and nova.conf).

Hi @vgrosu, so with the patch mentioned above we have to modify the kcs article and remove the manual modification of the /etc/default/grub configuration.

@dvalleed, do you need help with testing the patch mentioned above?

I'm on pto for some time, but I can prep the workaround and start testing this. @mciecier should be able to validate them.
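As a rough illustration of the check that patch 801509 is described as performing (no reboot when the only kernel argument being added is tsx=... and the node is already provisioned), here is a minimal Python sketch. It is not the tripleo-ansible implementation, which is written as Ansible tasks. The function names, the whitespace-split comparison of arguments, and the two boolean inputs standing in for the nova_libvirt and nova.conf presence checks are all assumptions made for illustration.

```python
from typing import List


def added_kernel_args(current_args: str, desired_args: str) -> List[str]:
    """Return the desired kernel arguments that are not already present.

    Both inputs are plain space-separated argument strings (hypothetical
    format, e.g. the value of KernelArgs and the current command line).
    """
    current = set(current_args.split())
    return [arg for arg in desired_args.split() if arg not in current]


def tsx_only_change_on_provisioned_node(
    current_args: str,
    desired_args: str,
    has_nova_libvirt: bool,
    has_nova_conf: bool,
) -> bool:
    """Return True when a reboot could be skipped under the described rule.

    The rule, as described in the comment above: the only arguments being
    added are tsx=<value>, and the node already looks provisioned (both
    nova_libvirt and nova.conf are present). How those two facts are
    detected is left to the caller; here they are simply booleans.
    """
    added = added_kernel_args(current_args, desired_args)
    # Require at least one added argument, and every one must be a tsx flag.
    only_tsx = bool(added) and all(arg.startswith("tsx=") for arg in added)
    return only_tsx and has_nova_libvirt and has_nova_conf


if __name__ == "__main__":
    # Already-provisioned compute node that only gains tsx=on.
    print(tsx_only_change_on_provisioned_node("ro rhgb quiet", "tsx=on", True, True))
    # Another argument is added as well, so the rule does not apply.
    print(tsx_only_change_on_provisioned_node("ro rhgb quiet", "tsx=on hugepages=64", True, True))
```

The provisioning checks are passed in as booleans on purpose, so the sketch makes no claim about where nova.conf lives or how the nova_libvirt container is detected on a real node.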
Hi folks,

I can confirm I've removed the manual steps to edit `/etc/default/grub` configuration from the solution article: https://access.redhat.com/solutions/6036141 as requested in comment #11 and comment #14.

Thank you.

*** Bug 1993299 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform (RHOSP) 16.2 enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2021:3483