1906228 – tuned and openshift-tuned sometimes do not terminate gracefully, slowing reboots

Summary: tuned and openshift-tuned sometimes do not terminate gracefully, slowing reboots

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node Tuning Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Jiří Mencák
QA Contact:	Simon
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-12-10 00:21 UTC by Yu Qi Zhang
Modified:	2021-02-24 15:42 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-24 15:41:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-node-tuning-operator pull 192	0	None	closed	Bug 1906228: openshift-tuned and Tuned daemon signal handling fixes.	2021-01-25 14:28:43 UTC
Red Hat Product Errata	RHSA-2020:5633	0	None	None	None	2021-02-24 15:42:08 UTC

Description Yu Qi Zhang 2020-12-10 00:21:41 UTC

Description of problem:
Very rarely during a reboot, the openshift-tuned and tuned processes do not terminate gracefully on SIGTERM, so systemd-shutdown has to send SIGKILL after waiting ~90 seconds:

[  OK  ] Reached target Final Step.
         Starting Reboot...
[  257.426056] systemd-shutdown[1]: Syncing filesystems and block devices.
[  257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  257.495153] systemd-journald[856]: Received SIGTERM from PID 1 (systemd-shutdow).
[  347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[  347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
[  347.588196] device veth4824fb76 left promiscuous mode
[  347.594175] kauditd_printk_skb: 7 callbacks suppressed
[  347.594177] audit: type=1700 audit(1607556350.659:160): dev=veth4824fb76 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[  347.657166] systemd-shutdown[1]: Unmounting file systems.
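
The delay can be read directly off the kernel timestamps in the excerpt above. As a minimal sketch (the helper name `summarize_shutdown` and the regular expressions are illustrative assumptions, not part of any tuned or systemd tooling), the following Python extracts the gap between the SIGTERM and SIGKILL broadcasts and lists the processes that had to be killed:

```python
import re

# Matches console lines emitted by systemd-shutdown, for example:
# "[  257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes..."
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+systemd-shutdown\[1\]: (?P<msg>.*)$")
# Matches the per-process kill message, for example:
# "Sending SIGKILL to PID 4119 (tuned)."
KILL_RE = re.compile(r"Sending SIGKILL to PID (?P<pid>\d+) \((?P<name>[^)]+)\)")


def summarize_shutdown(console_log):
    """Return (seconds between the SIGTERM and SIGKILL broadcasts, [(pid, name), ...])."""
    term_ts = kill_ts = None
    killed = []
    for line in console_log.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a systemd-shutdown line
        ts, msg = float(m.group("ts")), m.group("msg")
        if msg.startswith("Sending SIGTERM to remaining processes"):
            term_ts = ts
        elif msg.startswith("Sending SIGKILL to remaining processes"):
            kill_ts = ts
        else:
            k = KILL_RE.match(msg)
            if k:
                killed.append((int(k.group("pid")), k.group("name")))
    delay = None if term_ts is None or kill_ts is None else kill_ts - term_ts
    return delay, killed


# Lines copied from the console log in this report.
SAMPLE = """\
[  257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[  347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
"""

delay, killed = summarize_shutdown(SAMPLE)
print(f"waited {delay:.1f}s before SIGKILL; killed: {killed}")
```

For the lines above, the computed gap is roughly 90 seconds, and the killed processes are openshift-tuned (PID 2638) and tuned (PID 4119).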

This is from the GCP console logs. The journal captures nothing because journald has already received SIGTERM by this point, so nothing else shows why a particular reboot is slow.

Logs from the tuned pod on that node around the reboot timing:

2020-12-09 23:21:04,824 ERROR    tuned.utils.commands: Executing x86_energy_perf_policy error: x86_energy_perf_policy: /dev/cpu/1/msr offset 0x1ad read failed: Input/output error
2020-12-09 23:21:04,825 WARNING  tuned.plugins.plugin_cpu: your CPU doesn't support MSR_IA32_ENERGY_PERF_BIAS, ignoring CPU energy performance bias
2020-12-09 23:21:04,827 INFO     tuned.plugins.base: instance disk: assigning devices sda
2020-12-09 23:21:04,831 INFO     tuned.plugins.base: instance net: assigning devices ens4
2020-12-09 23:21:04,880 INFO     tuned.plugins.plugin_sysctl: reapplying system sysctl
2020-12-09 23:21:04,891 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
E1209 23:26:59.719096    2840 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host

Must-gather from the cluster after the latest reboot is here: https://drive.google.com/file/d/1RnJXARFrU7l8wjEI0oFai3o6QHs85j1o/view?usp=sharing

The node that saw the issue was jerzhang-201208-1-dpfdj-worker-a-p7p7z, during its most recent reboot.

Version-Release number of selected component (if applicable):
4.7


How reproducible:
Very rare (seen only twice so far, both times on GCP), perhaps 1 in 20 to 1 in 50 reboots. Marked as low severity because of the low occurrence.


Steps to Reproduce:
1. Spin up a 4.7 IPI cluster on GCP (default install).
2. Trigger reboots with MachineConfigs.
3. Observe the console logs via the GCP console.

Expected results:
tuned and openshift-tuned terminate gracefully on SIGTERM.

Actual results:
tuned and openshift-tuned have to be killed with SIGKILL.
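
To illustrate the expected behaviour, here is a minimal Python sketch of a supervisor that reacts to SIGTERM by stopping its child within a bounded grace period. It is not the fix from the linked pull request, and the names `stop_child` and `supervise` as well as the 5-second grace period are assumptions made for this example only:

```python
import signal
import subprocess
import sys
import threading


def stop_child(proc, grace_seconds=5.0):
    """Send SIGTERM to a child, wait up to grace_seconds, then fall back to SIGKILL.

    Returns the child's return code (a negative signal number if a signal ended it).
    """
    if proc.poll() is not None:
        return proc.returncode  # child has already exited
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort: SIGKILL the child ourselves
        return proc.wait()


def supervise(child_argv, grace_seconds=5.0):
    """Hypothetical supervisor: run a child and stop it promptly when SIGTERM arrives.

    Must be called from the main thread, because signal.signal() requires it.
    """
    stop = threading.Event()
    # The handler only sets a flag; the actual cleanup happens in the loop below.
    signal.signal(signal.SIGTERM, lambda signum, frame: stop.set())
    child = subprocess.Popen(child_argv)
    while not stop.is_set() and child.poll() is None:
        stop.wait(0.2)  # wake up periodically to notice a child that exited by itself
    return stop_child(child, grace_seconds)
```

A call such as `supervise([sys.executable, "-c", "import time; time.sleep(600)"])` would run until the supervisor receives SIGTERM and then return within the grace period. Keeping that period well below the ~90 seconds that systemd-shutdown waits is the point of the design.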

Comment 6 Simon 2021-01-25 14:30:42 UTC

I was not able to reproduce the problem with version 4.7.0-0.nightly-2021-01-22-120049.

Comment 9 errata-xmlrpc 2021-02-24 15:41:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
