Bug 1906228

Summary: tuned and openshift-tuned sometimes do not terminate gracefully, slowing reboots
Product: OpenShift Container Platform Reporter: Yu Qi Zhang <jerzhang>
Component: Node Tuning Operator Assignee: Jiří Mencák <jmencak>
Status: CLOSED ERRATA QA Contact: Simon <skordas>
Severity: low Docs Contact:
Priority: medium    
Version: 4.7 CC: sejug, wking
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-02-24 15:41:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Yu Qi Zhang 2020-12-10 00:21:41 UTC
Description of problem:
Very rarely during a reboot, the openshift-tuned and tuned processes do not terminate gracefully and have to be killed with SIGKILL after ~90 seconds of inactivity:

[  OK  ] Reached target Final Step.
         Starting Reboot...
[  257.426056] systemd-shutdown[1]: Syncing filesystems and block devices.
[  257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[  257.495153] systemd-journald[856]: Received SIGTERM from PID 1 (systemd-shutdow).
[  347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[  347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[  347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
[  347.588196] device veth4824fb76 left promiscuous mode
[  347.594175] kauditd_printk_skb: 7 callbacks suppressed
[  347.594177] audit: type=1700 audit(1607556350.659:160): dev=veth4824fb76 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[  347.657166] systemd-shutdown[1]: Unmounting file systems.

This is from the GCP console logs. The journal captures none of this because journald has already terminated by that point, so nothing else shows why a particular reboot is slow.
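Since the console log is the only place the stall is visible, a small script can flag affected reboots. The following is a minimal sketch, not part of any OpenShift tooling: the function name `shutdown_stall` and both regular expressions are assumptions, written against the systemd-shutdown message format shown in the excerpt above.

```python
import re

# Matches console lines such as
# "[  257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+systemd-shutdown\[1\]: (.*)$")
# Matches the per-process message "Sending SIGKILL to PID 2638 (openshift-tuned)."
KILL_RE = re.compile(r"^Sending SIGKILL to PID (\d+) \((.+)\)\.$")


def shutdown_stall(console_log):
    """Return (seconds between the SIGTERM and SIGKILL broadcasts, [(pid, name), ...]).

    The first element is None when either broadcast line is missing.
    """
    term_ts = None
    kill_ts = None
    killed = []
    for raw in console_log.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # not a systemd-shutdown line
        ts, msg = float(m.group(1)), m.group(2)
        if msg.startswith("Sending SIGTERM to remaining processes"):
            term_ts = ts
        elif msg.startswith("Sending SIGKILL to remaining processes"):
            kill_ts = ts
        else:
            k = KILL_RE.match(msg)
            if k:
                killed.append((int(k.group(1)), k.group(2)))
    if term_ts is None or kill_ts is None:
        return None, killed
    return kill_ts - term_ts, killed
```

Fed the console excerpt above, this reports a gap of roughly 90 seconds and lists openshift-tuned (PID 2638) and tuned (PID 4119) as the processes that had to be killed.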

Logs from the tuned pod on that node around the reboot timing:

2020-12-09 23:21:04,824 ERROR    tuned.utils.commands: Executing x86_energy_perf_policy error: x86_energy_perf_policy: /dev/cpu/1/msr offset 0x1ad read failed: Input/output error
2020-12-09 23:21:04,825 WARNING  tuned.plugins.plugin_cpu: your CPU doesn't support MSR_IA32_ENERGY_PERF_BIAS, ignoring CPU energy performance bias
2020-12-09 23:21:04,827 INFO     tuned.plugins.base: instance disk: assigning devices sda
2020-12-09 23:21:04,831 INFO     tuned.plugins.base: instance net: assigning devices ens4
2020-12-09 23:21:04,880 INFO     tuned.plugins.plugin_sysctl: reapplying system sysctl
2020-12-09 23:21:04,891 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
E1209 23:26:59.719096    2840 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host

Must-gather from the cluster after the latest reboot is here: https://drive.google.com/file/d/1RnJXARFrU7l8wjEI0oFai3o6QHs85j1o/view?usp=sharing

The node that saw the issue was jerzhang-201208-1-dpfdj-worker-a-p7p7z, during its most recent reboot.

Version-Release number of selected component (if applicable):
4.7


How reproducible:
Very rare (seen only twice so far, and only on GCP). Perhaps 1 in 20 to 1 in 50 reboots. Marking as low severity due to its low occurrence.


Steps to Reproduce:
1. Spin up a GCP 4.7 IPI cluster (default install)
2. Trigger reboots with MachineConfigs
3. Observe the console logs via the GCP console

Expected results:
Tuned terminates gracefully

Actual results:
Tuned has to be killed with SIGKILL
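For illustration only, the expected behaviour for a wrapper that supervises a child daemon can be sketched as below. This is not the openshift-tuned code (which is written in Go) and not the fix that shipped; the wrapper/child layout, the function name `run_supervised`, and the `grace_seconds` and `poll_interval` parameters are all assumptions. The idea is to relay SIGTERM to the child, wait a bounded time, and escalate to SIGKILL itself, so that the wrapper never outlives the grace period and leaves systemd-shutdown waiting.

```python
import signal
import subprocess
import time


def run_supervised(cmd, grace_seconds=10.0, poll_interval=0.1):
    """Run cmd as a child process and return its exit status.

    SIGTERM/SIGINT received by this process are relayed to the child.
    If the child is still alive grace_seconds after the first relayed
    signal, it is killed. Must be called from the main thread.
    """
    pending = []

    def note_signal(signum, _frame):
        # Only record the signal here; the main loop does the real work.
        pending.append(signum)

    handled = (signal.SIGTERM, signal.SIGINT)
    previous = {s: signal.signal(s, note_signal) for s in handled}
    child = subprocess.Popen(cmd)
    deadline = None
    try:
        while True:
            try:
                return child.wait(timeout=poll_interval)
            except subprocess.TimeoutExpired:
                pass
            while pending:
                # Relay each signal so the child can begin its own shutdown.
                child.send_signal(pending.pop(0))
                if deadline is None:
                    deadline = time.monotonic() + grace_seconds
            if deadline is not None and time.monotonic() >= deadline:
                # Grace period is over: stop waiting and kill the child.
                child.kill()
                return child.wait()
    finally:
        # Restore whatever handlers were installed before the call.
        for s, handler in previous.items():
            signal.signal(s, handler)
```

A wrapper built this way exits within about `grace_seconds` of receiving SIGTERM even if the child ignores the signal, which keeps it well inside the ~90 second window seen in the console log.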

Comment 6 Simon 2021-01-25 14:30:42 UTC
I was not able to reproduce the problem with version 4.7.0-0.nightly-2021-01-22-120049.

Comment 9 errata-xmlrpc 2021-02-24 15:41:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633