1993299 – Changing manually /proc/cmdline on a compute node will cause server reboot during stack update

Keywords:
Status:	CLOSED DUPLICATE of bug 1975240
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	tripleo-ansible
Sub Component:
Version:	16.1 (Train)
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	OSP Team
QA Contact:	Joe H. Rahme
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2021-08-12 17:59 UTC by jpateteg
Modified:	2022-08-10 15:10 UTC
CC List:	3 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-08-17 19:51:01 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-7081	0	None	None	None	2022-08-10 15:10:07 UTC

Description jpateteg 2021-08-12 17:59:22 UTC

Description of problem:

When the kernel arguments on a compute node are changed manually, so that /proc/cmdline no longer matches KernelArgs, and a stack update is then run, Ansible triggers a server reboot. The reboot kills all running workloads and causes a service impact.

When the same kernel arguments are changed in the templates instead, the node is not rebooted. The behaviour is inconsistent.


Version-Release number of selected component (if applicable):

16.1.1

How reproducible: Always

1. Deploy 16.1.1 on a compute node with these kernel arguments:

[heat-admin@srvrhpb508-computemme-0 ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 root=UUID=99a3522b-0b91-43d5-9c37-47911cb1682b ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8 skew_tick=1 nohz=on nohz_full=1-13,29-41,15-27,43-55 rcu_nocbs=1-13,29-41,15-27,43-55 tuned.non_isolcpus=00000400,10004001 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=1-13,29-41,15-27,43-55 rcu_nocbs=1-13,29-41,15-27,43-55 tuned.non_isolcpus=00000400,10004001 intel_pstate=disable nosoftlockup


ComputemmeParameters:
    IsolCpusList: "1-13,29-41,15-27,43-55"
    NovaComputeCpuDedicatedSet: ['1-13','29-41','15-27','43-55']
    NovaComputeCpuSharedSet: "0,14,28,42"
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8"
    TunedProfileName: "cpu-partitioning"
    NeutronBridgeMappings: 'Multi:br-Multi,MME:br-MME'
    NovaLibvirtRxQueueSize: 1024
    NovaLibvirtTxQueueSize: 1024

Then manually change the kernel command line (the arguments reported in /proc/cmdline) to remove hugepages and CPU isolation.

Steps to Reproduce:
1. Deploy OSP 16.1.1 with the above KernelArgs and tuned profile.
2. Manually change the kernel command line (as reported in /proc/cmdline), using the available tools to modify GRUB, and remove CPU isolation and hugepages.
3. Run a stack update. Ansible reboots the compute node in question to apply the "missing" change.

Actual results:

The stack update reboots the compute node to apply the "missing" kernel arguments.
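
To illustrate what "missing" means here, the following is a minimal sketch that compares a running kernel command line against the declared KernelArgs. It is not the check that tripleo-ansible performs; the function name `missing_kernel_args` and the edited command line are hypothetical and only show the kind of divergence described in this report.

```python
def missing_kernel_args(running_cmdline: str, declared_args: str) -> list:
    """Return declared kernel arguments that are absent from the running command line."""
    running = set(running_cmdline.split())
    # Preserve the declared order so the result is easy to read.
    return [arg for arg in declared_args.split() if arg not in running]


# KernelArgs as declared in the role parameters of this report.
declared = (
    "default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on "
    "iommu=pt transparent_hugepage=never "
    "isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8"
)

# Hypothetical /proc/cmdline after hugepages and CPU isolation were removed by hand.
edited_cmdline = (
    "BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 "
    "root=UUID=99a3522b-0b91-43d5-9c37-47911cb1682b ro console=ttyS0 "
    "crashkernel=auto intel_iommu=on iommu=pt transparent_hugepage=never "
    "ixgbe.max_vfs=8"
)

missing = missing_kernel_args(edited_cmdline, declared)
print(missing)
```

On a real node the first argument could be read from /proc/cmdline before running a stack update. A non-empty result means the node has diverged from the declared state.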

Expected results:

Kernel arguments should be applied only during the first boot.

Additional info:
Treating this as a possible new feature, I changed KernelArgs in the templates by removing hugepages and CPU isolation. In this case the server did not reboot, so I have to apply the changes manually again.

Comment 1 Steve Baker 2021-08-17 19:51:01 UTC

Kernel arguments are managed by the Director tooling, so we *strongly* recommend never changing them manually: the tooling is designed to bring nodes into the declared state.

16.2 will have a new role parameter, KernelArgsDeferReboot, so if you need to, you can prevent nodes from rebooting when kernel arguments have diverged. On that basis, I'm going to close this as a duplicate of bug #1975240.

*** This bug has been marked as a duplicate of bug 1975240 ***
