Bug 1993299 - Changing manually /proc/cmdline on a compute node will cause server reboot during stack update
Keywords:
Status: CLOSED DUPLICATE of bug 1975240
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: tripleo-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: OSP Team
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-12 17:59 UTC by jpateteg
Modified: 2022-08-10 15:10 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-08-17 19:51:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID                       Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker OSP-7081  0        None      None    None     2022-08-10 15:10:07 UTC

Description jpateteg 2021-08-12 17:59:22 UTC
Description of problem:

When the kernel arguments (cmdline / KernelArgs) are changed manually on a compute node and a stack update is then run, Ansible triggers a server reboot, killing all the workloads and causing impact.

When the same kernel arguments are changed in the templates instead, the node is not rebooted. The behaviour is not consistent.


Version-Release number of selected component (if applicable):

16.1.1
How reproducible: Always

1. Deploy 16.1.1 on a compute node with these kernel arguments:

[heat-admin@srvrhpb508-computemme-0 ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 root=UUID=99a3522b-0b91-43d5-9c37-47911cb1682b ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8 skew_tick=1 nohz=on nohz_full=1-13,29-41,15-27,43-55 rcu_nocbs=1-13,29-41,15-27,43-55 tuned.non_isolcpus=00000400,10004001 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=1-13,29-41,15-27,43-55 rcu_nocbs=1-13,29-41,15-27,43-55 tuned.non_isolcpus=00000400,10004001 intel_pstate=disable nosoftlockup


ComputemmeParameters:
    IsolCpusList: "1-13,29-41,15-27,43-55"
    NovaComputeCpuDedicatedSet: ['1-13','29-41','15-27','43-55']
    NovaComputeCpuSharedSet: "0,14,28,42"
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8"
    TunedProfileName: "cpu-partitioning"
    NeutronBridgeMappings: 'Multi:br-Multi,MME:br-MME'
    NovaLibvirtRxQueueSize: 1024
    NovaLibvirtTxQueueSize: 1024

Then manually change the kernel command line (as reported by /proc/cmdline) to remove hugepages and CPU isolation.

Steps to Reproduce:
1. Deploy OSP 16.1.1 with the above kernel arguments and tuned profile.
2. Manually change the kernel command line (as reported by /proc/cmdline), using the available tools to modify GRUB, to remove CPU isolation and hugepages.
3. Run a stack update. Ansible will reboot the compute node in question to apply the "missing" change.
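
The divergence that step 3 reacts to can be illustrated with a small check. This is only a sketch under stated assumptions: it is not the tripleo-ansible implementation, the helper name missing_kernel_args is hypothetical, and the comparison is a plain token match between the declared KernelArgs and the running command line. The sample strings are shortened from the values in this report.

```python
from typing import List


def missing_kernel_args(declared: str, cmdline: str) -> List[str]:
    """Return the declared kernel arguments that are absent from cmdline.

    Hypothetical helper for illustration; a plain whitespace-token comparison.
    """
    running = set(cmdline.split())
    # Keep the declared order so the result is easy to read.
    return [arg for arg in declared.split() if arg not in running]


# Declared state, shortened from the KernelArgs parameter in this report.
declared = (
    "default_hugepagesz=1GB hugepagesz=1G hugepages=120 "
    "intel_iommu=on iommu=pt isolcpus=1-13,29-41,15-27,43-55"
)

# Running state after hugepages and CPU isolation were removed by hand
# (illustrative value, not captured from a real node).
cmdline_after_manual_edit = (
    "ro console=ttyS0 crashkernel=auto rhgb quiet intel_iommu=on iommu=pt"
)

drift = missing_kernel_args(declared, cmdline_after_manual_edit)
print(drift)
```

On a real node the second argument would come from reading /proc/cmdline. A non-empty result corresponds to the "missing" arguments described in step 3.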

Actual results:

The stack update reboots the compute node to apply the "missing" kernel arguments.

Expected results:

Kernel arguments should only be applied during first boot.

Additional info:
Treating this as a new feature, I changed the kernel arguments in the templates by removing hugepages and CPU isolation. In this case the server did not reboot, so I need to apply the changes manually again.

Comment 1 Steve Baker 2021-08-17 19:51:01 UTC
Since kernel arguments are managed by the Director tooling we *strongly* recommend never to manually change kernel arguments, since the tooling is designed to bring nodes into the declared state.

16.2 will have a new role parameter, KernelArgsDeferReboot, so if you need to, you can prevent nodes from rebooting when kernel arguments have diverged. On that basis I'm going to close this as a duplicate of bug #1975240.

*** This bug has been marked as a duplicate of bug 1975240 ***

