Description of problem:
Very rarely during a reboot, the openshift-tuned and tuned processes do not terminate gracefully on SIGTERM and have to be killed with SIGKILL after ~90 seconds of inactivity:
[  OK  ] Reached target Final Step.
[ 257.426056] systemd-shutdown: Syncing filesystems and block devices.
[ 257.475893] systemd-shutdown: Sending SIGTERM to remaining processes...
[ 257.495153] systemd-journald: Received SIGTERM from PID 1 (systemd-shutdow).
[ 347.492635] systemd-shutdown: Sending SIGKILL to remaining processes...
[ 347.504276] systemd-shutdown: Sending SIGKILL to PID 2638 (openshift-tuned).
[ 347.512782] systemd-shutdown: Sending SIGKILL to PID 4119 (tuned).
[ 347.588196] device veth4824fb76 left promiscuous mode
[ 347.594175] kauditd_printk_skb: 7 callbacks suppressed
[ 347.594177] audit: type=1700 audit(1607556350.659:160): dev=veth4824fb76 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[ 347.657166] systemd-shutdown: Unmounting file systems.
The excerpt above is from the GCP console logs. The journal captures none of this because journald has already terminated by that point, so nothing else records why a particular reboot was slow.
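Since the console log is the only place the delay shows up, a small script can help spot slow reboots in saved console output. The following is a minimal sketch, not part of the original report: the function name `summarize_shutdown` and the regular expressions are illustrative assumptions based only on the systemd-shutdown message format seen in the excerpt above.

```python
import re

# Assumed line format, taken from the console excerpt in this report, e.g.
# "[ 347.504276] systemd-shutdown: Sending SIGKILL to PID 2638 (openshift-tuned)."
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+systemd-shutdown: (?P<msg>.*)$")
KILL_RE = re.compile(r"Sending SIGKILL to PID (?P<pid>\d+) \((?P<name>[^)]+)\)")


def summarize_shutdown(console_log):
    """Return (seconds from SIGTERM to SIGKILL broadcast, [(pid, name), ...]).

    The delay is None when either broadcast message is missing from the log.
    """
    term_ts = None
    kill_ts = None
    killed = []
    for line in console_log.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a systemd-shutdown line
        ts = float(match.group("ts"))
        msg = match.group("msg")
        if msg.startswith("Sending SIGTERM to remaining processes"):
            term_ts = ts
        elif msg.startswith("Sending SIGKILL to remaining processes"):
            kill_ts = ts
        else:
            kill = KILL_RE.match(msg)
            if kill:
                killed.append((int(kill.group("pid")), kill.group("name")))
    delay = None if term_ts is None or kill_ts is None else kill_ts - term_ts
    return delay, killed


# Sample input copied from the console excerpt above.
SAMPLE = """\
[ 257.475893] systemd-shutdown: Sending SIGTERM to remaining processes...
[ 257.495153] systemd-journald: Received SIGTERM from PID 1 (systemd-shutdow).
[ 347.492635] systemd-shutdown: Sending SIGKILL to remaining processes...
[ 347.504276] systemd-shutdown: Sending SIGKILL to PID 2638 (openshift-tuned).
[ 347.512782] systemd-shutdown: Sending SIGKILL to PID 4119 (tuned).
"""

if __name__ == "__main__":
    delay, killed = summarize_shutdown(SAMPLE)
    print(f"SIGTERM -> SIGKILL delay: {delay:.1f}s")
    for pid, name in killed:
        print(f"killed PID {pid} ({name})")
```

For the excerpt in this report, the sketch reports a delay of roughly 90 seconds and lists openshift-tuned and tuned as the processes that had to be killed.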
Logs from the tuned pod on that node around the time of the reboot:
2020-12-09 23:21:04,824 ERROR tuned.utils.commands: Executing x86_energy_perf_policy error: x86_energy_perf_policy: /dev/cpu/1/msr offset 0x1ad read failed: Input/output error
2020-12-09 23:21:04,825 WARNING tuned.plugins.plugin_cpu: your CPU doesn't support MSR_IA32_ENERGY_PERF_BIAS, ignoring CPU energy performance bias
2020-12-09 23:21:04,827 INFO tuned.plugins.base: instance disk: assigning devices sda
2020-12-09 23:21:04,831 INFO tuned.plugins.base: instance net: assigning devices ens4
2020-12-09 23:21:04,880 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2020-12-09 23:21:04,891 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
E1209 23:26:59.719096 2840 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host
Must-gather from the cluster after the latest reboot is here: https://drive.google.com/file/d/1RnJXARFrU7l8wjEI0oFai3o6QHs85j1o/view?usp=sharing
The node that saw the issue was jerzhang-201208-1-dpfdj-worker-a-p7p7z, during its most recent reboot.
Version-Release number of selected component (if applicable):
How reproducible:
Very rare (seen only twice so far, both times on GCP), perhaps 1 in 20 to 1 in 50 reboots. Marking as low severity because it occurs so rarely.
Steps to Reproduce:
1. Spin up a GCP 4.7 IPI cluster (default install)
2. Trigger reboots with MachineConfigs
3. Observe the console logs via the GCP console
Expected results:
Tuned terminates gracefully
Actual results:
Tuned needs to be killed with SIGKILL
I was not able to reproduce the problem with version 4.7.0-0.nightly-2021-01-22-120049.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.