Bug 1993299

Summary: Manually changing /proc/cmdline on a compute node causes a server reboot during stack update
Product: Red Hat OpenStack Reporter: jpateteg
Component: tripleo-ansible Assignee: OSP Team <rhos-maint>
Status: CLOSED DUPLICATE QA Contact: Joe H. Rahme <jhakimra>
Severity: high Docs Contact:
Priority: unspecified    
Version: 16.1 (Train) CC: bshephar, mflusche, sbaker
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2021-08-17 19:51:01 UTC Type: Bug

Description jpateteg 2021-08-12 17:59:22 UTC
Description of problem:

When the kernel command line (KernelArgs) is changed manually on a compute node and a stack update is then run, Ansible triggers a server reboot, killing all workloads and causing a service impact.

When the same kernel arguments are changed in the templates instead, the node is not rebooted. The behaviour is inconsistent.


Version-Release number of selected component (if applicable):

16.1.1
How reproducible: Always

1. Deploy 16.1.1 on a compute node with these kernel arguments:

[heat-admin@srvrhpb508-computemme-0 ~]$ cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos2)/boot/vmlinuz-4.18.0-193.14.3.el8_2.x86_64 root=UUID=99a3522b-0b91-43d5-9c37-47911cb1682b ro console=ttyS0 console=ttyS0,115200n81 no_timer_check crashkernel=auto rhgb quiet default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8 skew_tick=1 nohz=on nohz_full=1-13,29-41,15-27,43-55 rcu_nocbs=1-13,29-41,15-27,43-55 tuned.non_isolcpus=00000400,10004001 intel_pstate=disable nosoftlockup skew_tick=1 nohz=on nohz_full=1-13,29-41,15-27,43-55 rcu_nocbs=1-13,29-41,15-27,43-55 tuned.non_isolcpus=00000400,10004001 intel_pstate=disable nosoftlockup


ComputemmeParameters:
    IsolCpusList: "1-13,29-41,15-27,43-55"
    NovaComputeCpuDedicatedSet: ['1-13','29-41','15-27','43-55']
    NovaComputeCpuSharedSet: "0,14,28,42"
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 ixgbe.max_vfs=8"
    TunedProfileName: "cpu-partitioning"
    NeutronBridgeMappings: 'Multi:br-Multi,MME:br-MME'
    NovaLibvirtRxQueueSize: 1024
    NovaLibvirtTxQueueSize: 1024
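
To make the notion of "missing" kernel arguments concrete, here is a minimal Python sketch that compares a declared KernelArgs string with a running command line. It is an illustration only, not the tripleo-ansible logic: the function names and the EDITED command line are hypothetical, and the comparison is a plain token-by-token membership check.

```python
def parse_kernel_args(text):
    """Split a kernel command line into whitespace-separated tokens."""
    return text.split()


def missing_kernel_args(declared, running):
    """Return declared tokens that are absent from the running command line."""
    running_tokens = set(parse_kernel_args(running))
    return [arg for arg in parse_kernel_args(declared) if arg not in running_tokens]


# KernelArgs value taken from the role parameters above.
DECLARED = (
    "default_hugepagesz=1GB hugepagesz=1G hugepages=120 intel_iommu=on "
    "iommu=pt transparent_hugepage=never isolcpus=1-13,29-41,15-27,43-55 "
    "ixgbe.max_vfs=8"
)

# Hypothetical command line after hugepages and CPU isolation were removed by hand.
EDITED = (
    "ro console=ttyS0 crashkernel=auto rhgb quiet intel_iommu=on iommu=pt "
    "transparent_hugepage=never ixgbe.max_vfs=8"
)

if __name__ == "__main__":
    # Lists the hugepages and isolcpus tokens that are no longer present.
    print(missing_kernel_args(DECLARED, EDITED))
```

On a live node, the second argument would be the contents of /proc/cmdline rather than a literal string.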

Then manually change the kernel command line (as reported by /proc/cmdline) to remove hugepages and CPU isolation.

Steps to Reproduce:
1. Deploy OSP 16.1.1 with the above kernel arguments and tuned profile.
2. Manually change the kernel command line, using the available tools to modify grub, and remove CPU isolation and hugepages.
3. Run a stack update. Ansible reboots the compute node in question to apply the "missing" change.

Actual results:

The stack update reboots the compute node to apply the "missing" kernel arguments.

Expected results:

Kernel arguments should be applied only during the first boot.

Additional info:
Treating this as a possible new feature, I changed the kernel arguments in the templates by removing hugepages and CPU isolation. In that case the server did not reboot, so I had to apply the changes manually again.

Comment 1 Steve Baker 2021-08-17 19:51:01 UTC
Kernel arguments are managed by the Director tooling, which is designed to bring nodes into the declared state, so we *strongly* recommend never changing kernel arguments manually.

16.2 will have a new role parameter, KernelArgsDeferReboot, so that, if you need to, you can prevent nodes from rebooting when kernel arguments have diverged. On that basis I'm going to close this as a duplicate of bug #1975240.
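
The behaviour discussed in this report can be modelled with a small Python sketch. It assumes, purely for illustration, that the tooling reboots whenever a declared argument is absent from the running command line, and that a deferral flag suppresses that reboot. The function name, the returned strings and the defer_reboot argument are hypothetical stand-ins, not the tripleo-ansible implementation or the exact semantics of KernelArgsDeferReboot.

```python
def plan_kernel_args_action(declared, running, defer_reboot=False):
    """Decide what to do when declared kernel args are compared to the running ones.

    Returns "none" if every declared token is present, "defer" if tokens are
    missing but the reboot is deferred, and "reboot" otherwise.
    """
    running_tokens = set(running.split())
    missing = [arg for arg in declared.split() if arg not in running_tokens]
    if not missing:
        return "none"
    return "defer" if defer_reboot else "reboot"


if __name__ == "__main__":
    declared = "hugepages=120 intel_iommu=on"

    # Node still matches the declared state.
    print(plan_kernel_args_action(declared, "ro quiet hugepages=120 intel_iommu=on"))

    # Argument removed by hand on the node: the declared state has diverged.
    print(plan_kernel_args_action(declared, "ro quiet intel_iommu=on"))

    # Same divergence, with the hypothetical deferral flag set.
    print(plan_kernel_args_action(declared, "ro quiet intel_iommu=on", defer_reboot=True))

    # Argument removed from the templates only: nothing declared is missing.
    print(plan_kernel_args_action("intel_iommu=on", "ro quiet hugepages=120 intel_iommu=on"))
```

Under this assumption, the last case matches the reporter's observation that trimming the templates alone did not trigger a reboot: a one-way check only notices arguments that are declared but absent, not arguments that are present but no longer declared.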

*** This bug has been marked as a duplicate of bug 1975240 ***