Bug 2029371

Summary: patch pipeline--worker nodes unexpectedly reboot during scale out
Product: OpenShift Container Platform Reporter: Jiří Mencák <jmencak>
Component: Node Tuning Operator Assignee: Jiří Mencák <jmencak>
Status: CLOSED ERRATA QA Contact: liqcui
Severity: high Docs Contact:
Priority: high    
Version: 4.10 CC: aos-bugs, asoto, dagray, jerzhang, liqcui, mkrejci, ndabhi, tsedovic
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2024682 Environment:
Last Closed: 2022-03-10 16:32:18 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2029693    

Comment 2 Jiří Mencák 2021-12-07 13:29:21 UTC
A note to QE.  I believe this is fixed as of 4.10.0-0.nightly-2021-12-07-095056.  Just tested by the method described in
https://bugzilla.redhat.com/show_bug.cgi?id=2024682#c7

None of the nodes sharing the same MCP rebooted, apart from the one being scaled up, and there was no trace of
updated MachineConfig 50-nto-worker-rt with ignition and kernel parameters: []

in the logs.
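
For anyone repeating this check, a minimal sketch of scanning saved log output for that message. The helper name has_empty_kargs_update is hypothetical (it is not part of the operator), and the pod name in the usage comment is a placeholder; which pod's logs to inspect is an assumption left to the tester.

```shell
# Sketch only: has_empty_kargs_update is a hypothetical helper, not part of NTO.
# Reads log text on stdin and returns 0 if the empty-kernel-parameters
# MachineConfig update message is present, non-zero otherwise.
has_empty_kargs_update() {
    # -F matches the string literally, so "[]" is not treated as a bracket expression.
    grep -qF 'with ignition and kernel parameters: []'
}

# Possible usage (pod name is a placeholder, not taken from this bug):
#   oc logs OPERATOR_POD -n openshift-cluster-node-tuning-operator \
#     | has_empty_kargs_update && echo "empty kernel parameters update seen"
```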

Can we please get this VERIFIED so that we can backport this down to 4.8 ASAP?  Thank you!

Comment 4 liqcui 2021-12-08 07:13:32 UTC
Verified Result:

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         175m    Cluster version is 4.10.0-0.nightly-2021-12-06-201335

node1=ip-10-0-170-206.us-east-2.compute.internal
node2=ip-10-0-221-240.us-east-2.compute.internal
profileRealtime="../testing_manifests/stalld.yaml"
mcpRealtime="../../../examples/realtime-mcp.yaml"
nodeLabelRealtime="node-role.kubernetes.io/worker-rt"

oc label no $node1 ${nodeLabelRealtime}= --overwrite
node/ip-10-0-170-206.us-east-2.compute.internal not labeled

oc label no $node2 ${nodeLabelRealtime}= --overwrite

oc create -f $profileRealtime
oc create -f $mcpRealtime
machineconfigpool.machineconfiguration.openshift.io/worker-rt created

oc scale machineset/liqcui-ocaws410-v9vjb-worker-us-east-2a --replicas=1 -n openshift-machine-api

 oc logs tuned-zn9wd  -n openshift-cluster-node-tuning-operator |tail -5
E1208 06:54:40.904915    1833 tuned.go:776] unable to sync(daemon/) requeued (3)
E1208 06:54:40.905075    1833 tuned.go:776] unable to sync(daemon/) requeued (4)
2021-12-08 06:54:40,953 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/realtime/script.sh' with arguments '['start']'
E1208 06:54:41.190305    1833 tuned.go:776] unable to sync(daemon/) requeued (5)
2021-12-08 06:54:41,372 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied
[ocpadmin@ec2-18-217-45-133 ~]$ oc logs tuned-86lpp  -n openshift-cluster-node-tuning-operator |tail -5
E1208 06:29:10.384597    1789 tuned.go:776] unable to sync(daemon/) requeued (3)
E1208 06:29:10.384706    1789 tuned.go:776] unable to sync(daemon/) requeued (4)
2021-12-08 06:29:10,433 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/realtime/script.sh' with arguments '['start']'
E1208 06:29:10.727631    1789 tuned.go:776] unable to sync(daemon/) requeued (5)
2021-12-08 06:29:11,064 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied
[ocpadmin@ec2-18-217-45-133 ~]$ oc logs tuned-6txlp  -n openshift-cluster-node-tuning-operator |tail -5
E1208 06:26:51.002255    2135 tuned.go:776] unable to sync(daemon/) requeued (4)
E1208 06:26:51.002400    2135 tuned.go:776] unable to sync(daemon/) requeued (5)
2021-12-08 06:26:51,029 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/realtime/script.sh' with arguments '['start']'
E1208 06:26:51.618703    2135 tuned.go:776] unable to sync(daemon/) requeued (6)
2021-12-08 06:26:52,039 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied

No existing node rebooted; only the newly scaled-up node rebooted.
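
One way to double-check the reboot result, as a rough sketch: snapshot each node's boot ID before and after the scale out and compare the two. The helper rebooted_nodes and the snapshot file names are hypothetical; the snapshot format ("<node-name> <boot-id>" per line) is an assumption of this sketch, not something used in the verification above.

```shell
# Sketch only: rebooted_nodes is a hypothetical helper.
# Usage: rebooted_nodes BEFORE_FILE AFTER_FILE
# Each file holds lines of "<node-name> <boot-id>". Prints the nodes present in
# both snapshots whose boot ID changed, i.e. pre-existing nodes that rebooted.
rebooted_nodes() {
    awk 'FILENAME == ARGV[1] { before[$1] = $2; next }
         ($1 in before) && before[$1] != $2 { print $1 }' "$1" "$2"
}

# A snapshot could be taken like this (run once before and once after scaling):
#   oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.bootID}{"\n"}{end}' > before.txt
```

Nodes that appear only in the second snapshot, such as the newly scaled node, are not reported, so empty output corresponds to "no existing node rebooted".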

Comment 7 errata-xmlrpc 2022-03-10 16:32:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056