Bug 2029371

Summary:	patch pipeline--worker nodes unexpectedly reboot during scale out
Product:	OpenShift Container Platform	Reporter:	Jiří Mencák <jmencak>
Component:	Node Tuning Operator	Assignee:	Jiří Mencák <jmencak>
Status:	CLOSED ERRATA	QA Contact:	liqcui
Severity:	high	Docs Contact:
Priority:	high
Version:	4.10	CC:	aos-bugs, asoto, dagray, jerzhang, liqcui, mkrejci, ndabhi, tsedovic
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	2024682	Environment:
Last Closed:	2022-03-10 16:32:18 UTC	Type:	---
Bug Depends On:
Bug Blocks:	2029693

Comment 2 Jiří Mencák 2021-12-07 13:29:21 UTC

A note to QE.  I believe this is fixed as of 4.10.0-0.nightly-2021-12-07-095056.  Just tested by the method described in
https://bugzilla.redhat.com/show_bug.cgi?id=2024682#c7

Apart from the node being scaled up, none of the nodes sharing the same MCP rebooted, and the logs contain no trace of the message:

updated MachineConfig 50-nto-worker-rt with ignition and kernel parameters: []

Can we please get this VERIFIED so that we can backport this down to 4.8 ASAP?  Thank you!
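
For anyone repeating the log check above, a minimal sketch of how it could be scripted. The helper name no_mc_update_logged is hypothetical, and the usage line assumes the operator runs as deployment/cluster-node-tuning-operator in the openshift-cluster-node-tuning-operator namespace; adjust both to your cluster.

```shell
# Hypothetical helper: reads log text on stdin and succeeds only if no
# "updated MachineConfig <name>" line is present for the given name.
no_mc_update_logged() {
  mc_name="$1"
  # grep -F matches a fixed string, -q suppresses output.
  # The leading "!" turns "no match found" into a success status.
  ! grep -qF "updated MachineConfig ${mc_name}"
}

# Usage (the deployment name is an assumption, see above):
#   oc logs deployment/cluster-node-tuning-operator \
#       -n openshift-cluster-node-tuning-operator \
#     | no_mc_update_logged 50-nto-worker-rt \
#     && echo "no spurious MachineConfig update logged"
```

The check only covers the log lines still retained by the pod, so it is best run shortly after the scale-out completes.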

Comment 4 liqcui 2021-12-08 07:13:32 UTC

Verified Result:

oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         175m    Cluster version is 4.10.0-0.nightly-2021-12-06-201335

node1=ip-10-0-170-206.us-east-2.compute.internal
node2=ip-10-0-221-240.us-east-2.compute.internal
profileRealtime="../testing_manifests/stalld.yaml"
mcpRealtime="../../../examples/realtime-mcp.yaml"
nodeLabelRealtime="node-role.kubernetes.io/worker-rt"

oc label no $node1 ${nodeLabelRealtime}= --overwrite
node/ip-10-0-170-206.us-east-2.compute.internal not labeled

oc label no $node2 ${nodeLabelRealtime}= --overwrite

oc create -f $profileRealtime

oc create -f $mcpRealtime
machineconfigpool.machineconfiguration.openshift.io/worker-rt created

oc scale machineset/liqcui-ocaws410-v9vjb-worker-us-east-2a --replicas=1 -n openshift-machine-api

oc logs tuned-zn9wd -n openshift-cluster-node-tuning-operator | tail -5
E1208 06:54:40.904915    1833 tuned.go:776] unable to sync(daemon/) requeued (3)
E1208 06:54:40.905075    1833 tuned.go:776] unable to sync(daemon/) requeued (4)
2021-12-08 06:54:40,953 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/realtime/script.sh' with arguments '['start']'
E1208 06:54:41.190305    1833 tuned.go:776] unable to sync(daemon/) requeued (5)
2021-12-08 06:54:41,372 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied

oc logs tuned-86lpp -n openshift-cluster-node-tuning-operator | tail -5
E1208 06:29:10.384597    1789 tuned.go:776] unable to sync(daemon/) requeued (3)
E1208 06:29:10.384706    1789 tuned.go:776] unable to sync(daemon/) requeued (4)
2021-12-08 06:29:10,433 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/realtime/script.sh' with arguments '['start']'
E1208 06:29:10.727631    1789 tuned.go:776] unable to sync(daemon/) requeued (5)
2021-12-08 06:29:11,064 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied

oc logs tuned-6txlp -n openshift-cluster-node-tuning-operator | tail -5
E1208 06:26:51.002255    2135 tuned.go:776] unable to sync(daemon/) requeued (4)
E1208 06:26:51.002400    2135 tuned.go:776] unable to sync(daemon/) requeued (5)
2021-12-08 06:26:51,029 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/realtime/script.sh' with arguments '['start']'
E1208 06:26:51.618703    2135 tuned.go:776] unable to sync(daemon/) requeued (6)
2021-12-08 06:26:52,039 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied

No existing node rebooted; only the newly scaled-up node rebooted.
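
One possible way to confirm this result without reading each tuned pod log is to compare node boot IDs before and after the scale-out. This is a sketch only, not the method used in the verification above: the helper names snapshot_boot_ids and rebooted_nodes are hypothetical, and it relies on .status.nodeInfo.bootID changing whenever a node reboots.

```shell
# Print "name<TAB>bootID" for every node in the cluster.
snapshot_boot_ids() {
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.bootID}{"\n"}{end}'
}

# Hypothetical helper: given a "before" and an "after" snapshot file,
# print the nodes present in both whose bootID differs, i.e. the
# pre-existing nodes that rebooted. Nodes that appear only in the
# "after" snapshot (the newly scaled node) are ignored.
rebooted_nodes() {
  awk -F'\t' '
    FILENAME == ARGV[1] { before[$1] = $2; next }
    ($1 in before) && before[$1] != $2 { print $1 }
  ' "$1" "$2"
}

# Usage:
#   snapshot_boot_ids > before.tsv
#   oc scale machineset/<name> --replicas=<n> -n openshift-machine-api
#   ... wait for the new node to become Ready ...
#   snapshot_boot_ids > after.tsv
#   rebooted_nodes before.tsv after.tsv   # empty output: no old node rebooted
```

Take the first snapshot only after the worker-rt pool has settled, otherwise the reboot expected from the initial profile rollout will be reported as well.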

Comment 7 errata-xmlrpc 2022-03-10 16:32:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056