Description of problem:
Very rarely during a reboot, the openshift-tuned and tuned processes do not terminate gracefully, and systemd-shutdown has to send them SIGKILL after ~90 seconds of inactivity:

[  OK  ] Reached target Final Step.
         Starting Reboot...
[ 257.426056] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 257.495153] systemd-journald[856]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[ 347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
[ 347.588196] device veth4824fb76 left promiscuous mode
[ 347.594175] kauditd_printk_skb: 7 callbacks suppressed
[ 347.594177] audit: type=1700 audit(1607556350.659:160): dev=veth4824fb76 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[ 347.657166] systemd-shutdown[1]: Unmounting file systems.

This is from the GCP console logs. The journal captures none of this because journald has already received SIGTERM by this point, so nothing else shows why a particular reboot is slow.
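For anyone triaging similar console output, here is a minimal sketch (not part of the original report) that extracts the SIGTERM-to-SIGKILL delay and the list of force-killed processes from a systemd-shutdown console log. The function name summarize_shutdown and the SAMPLE constant are hypothetical; the regular expressions assume the message wording shown in the console log above and may need adjusting for other systemd versions.

```python
import re

# Matches kernel-console lines emitted by systemd-shutdown (PID 1),
# e.g. "[ 257.475893] systemd-shutdown[1]: Sending SIGTERM ..."
LINE_RE = re.compile(r"\[\s*(?P<ts>\d+\.\d+)\]\s+systemd-shutdown\[1\]: (?P<msg>.*)")
# Matches the per-process kill message, e.g. "Sending SIGKILL to PID 4119 (tuned)."
KILL_RE = re.compile(r"Sending SIGKILL to PID (?P<pid>\d+) \((?P<name>[^)]+)\)")

# Excerpt copied from the console log in this report.
SAMPLE = """\
[ 257.426056] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 257.495153] systemd-journald[856]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[ 347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
[ 347.657166] systemd-shutdown[1]: Unmounting file systems.
"""


def summarize_shutdown(console_log):
    """Return (delay_seconds, [(pid, name), ...]) for a console log.

    delay_seconds is the time between the broadcast SIGTERM and the
    broadcast SIGKILL, or None if either message is missing.
    """
    sigterm_ts = None
    sigkill_ts = None
    killed = []
    for line in console_log.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not a systemd-shutdown line
        ts = float(m.group("ts"))
        msg = m.group("msg")
        if msg.startswith("Sending SIGTERM to remaining processes"):
            sigterm_ts = ts
        elif msg.startswith("Sending SIGKILL to remaining processes"):
            sigkill_ts = ts
        else:
            k = KILL_RE.match(msg)
            if k:
                killed.append((int(k.group("pid")), k.group("name")))
    if sigterm_ts is None or sigkill_ts is None:
        return None, killed
    return sigkill_ts - sigterm_ts, killed


if __name__ == "__main__":
    delay, killed = summarize_shutdown(SAMPLE)
    print(f"SIGTERM -> SIGKILL delay: {delay:.1f}s")
    for pid, name in killed:
        print(f"force-killed: {name} (PID {pid})")
```

Run against console logs saved from several reboots, a non-empty list of force-killed processes together with a delay close to 90 seconds would flag the slow reboots described here.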
Logs from the tuned pod on that node around the time of the reboot:

2020-12-09 23:21:04,824 ERROR tuned.utils.commands: Executing x86_energy_perf_policy error: x86_energy_perf_policy: /dev/cpu/1/msr offset 0x1ad read failed: Input/output error
2020-12-09 23:21:04,825 WARNING tuned.plugins.plugin_cpu: your CPU doesn't support MSR_IA32_ENERGY_PERF_BIAS, ignoring CPU energy performance bias
2020-12-09 23:21:04,827 INFO tuned.plugins.base: instance disk: assigning devices sda
2020-12-09 23:21:04,831 INFO tuned.plugins.base: instance net: assigning devices ens4
2020-12-09 23:21:04,880 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2020-12-09 23:21:04,891 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
E1209 23:26:59.719096 2840 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host

Must-gather from the cluster after the latest reboot is here:
https://drive.google.com/file/d/1RnJXARFrU7l8wjEI0oFai3o6QHs85j1o/view?usp=sharing

The node that saw the issue was jerzhang-201208-1-dpfdj-worker-a-p7p7z, during its most recent reboot.

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Very rare (seen only twice so far, and only on GCP). Maybe 1/20 or 1/50 reboots? Marking as low severity due to its low occurrence.

Steps to Reproduce:
1. Spin up a GCP 4.7 IPI cluster (default install)
2. Trigger reboots with MachineConfigs
3. Observe the console logs via the GCP console

Expected results:
Tuned terminates gracefully

Actual results:
Tuned needs to be SIGKILLed
I was not able to reproduce the problem with version: 4.7.0-0.nightly-2021-01-22-120049
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633