Bug 1906228
| Summary: | tuned and openshift-tuned sometimes do not terminate gracefully, slowing reboots | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Yu Qi Zhang <jerzhang> |
| Component: | Node Tuning Operator | Assignee: | Jiří Mencák <jmencak> |
| Status: | CLOSED ERRATA | QA Contact: | Simon <skordas> |
| Severity: | low | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.7 | CC: | sejug, wking |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:41:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I was not able to reproduce the problem with version: 4.7.0-0.nightly-2021-01-22-120049

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
Description of problem:

Very rarely during a reboot, the openshift-tuned and tuned processes do not terminate gracefully and require a SIGKILL after ~90 seconds of inactivity:

[  OK  ] Reached target Final Step.
Starting Reboot...
[ 257.426056] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 257.495153] systemd-journald[856]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[ 347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
[ 347.588196] device veth4824fb76 left promiscuous mode
[ 347.594175] kauditd_printk_skb: 7 callbacks suppressed
[ 347.594177] audit: type=1700 audit(1607556350.659:160): dev=veth4824fb76 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[ 347.657166] systemd-shutdown[1]: Unmounting file systems.

This is from the GCP console logs. The journal does not capture any of this because it has already terminated by this point, so nothing else shows why a particular reboot is slow.
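For anyone triaging similar console output, the gap between the SIGTERM and SIGKILL broadcasts, and the list of processes that had to be killed, can be extracted with a short script. The following is only a sketch: `summarize_shutdown`, `LINE_RE`, `KILL_RE`, and `SAMPLE` are hypothetical names made up for illustration and are not part of tuned or the Node Tuning Operator. It assumes one console message per line in the format shown above.

```python
import re

# Matches systemd-shutdown console lines such as
# "[ 347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned)."
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+systemd-shutdown\[1\]:\s+(.*)")
KILL_RE = re.compile(r"Sending SIGKILL to PID (\d+) \(([^)]+)\)")


def summarize_shutdown(console_log):
    """Return (seconds from SIGTERM to SIGKILL broadcast, [(pid, name), ...]).

    The delay is None if either broadcast message is missing.
    """
    sigterm_at = None
    sigkill_at = None
    killed = []
    for line in console_log.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not a systemd-shutdown message
        ts, msg = float(m.group(1)), m.group(2)
        if msg.startswith("Sending SIGTERM to remaining processes"):
            sigterm_at = ts
        elif msg.startswith("Sending SIGKILL to remaining processes"):
            sigkill_at = ts
        else:
            k = KILL_RE.match(msg)
            if k:
                killed.append((int(k.group(1)), k.group(2)))
    delay = None
    if sigterm_at is not None and sigkill_at is not None:
        delay = sigkill_at - sigterm_at
    return delay, killed


# Lines copied from the console log in this report.
SAMPLE = """\
[ 257.475893] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 347.492635] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 347.504276] systemd-shutdown[1]: Sending SIGKILL to PID 2638 (openshift-tuned).
[ 347.512782] systemd-shutdown[1]: Sending SIGKILL to PID 4119 (tuned).
"""

delay, killed = summarize_shutdown(SAMPLE)
print(f"SIGTERM -> SIGKILL delay: {delay:.1f}s")
for pid, name in killed:
    print(f"killed PID {pid} ({name})")
```

Run against the lines above, this reports a delay of about 90 seconds and the two killed processes, openshift-tuned and tuned, which matches the description.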
Logs from the tuned pod on that node around the time of the reboot:

2020-12-09 23:21:04,824 ERROR tuned.utils.commands: Executing x86_energy_perf_policy error: x86_energy_perf_policy: /dev/cpu/1/msr offset 0x1ad read failed: Input/output error
2020-12-09 23:21:04,825 WARNING tuned.plugins.plugin_cpu: your CPU doesn't support MSR_IA32_ENERGY_PERF_BIAS, ignoring CPU energy performance bias
2020-12-09 23:21:04,827 INFO tuned.plugins.base: instance disk: assigning devices sda
2020-12-09 23:21:04,831 INFO tuned.plugins.base: instance net: assigning devices ens4
2020-12-09 23:21:04,880 INFO tuned.plugins.plugin_sysctl: reapplying system sysctl
2020-12-09 23:21:04,891 INFO tuned.daemon.daemon: static tuning from profile 'openshift-node' applied
E1209 23:26:59.719096 2840 reflector.go:127] github.com/openshift/cluster-node-tuning-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.Tuned: failed to list *v1.Tuned: Get "https://172.30.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: connect: no route to host

Must-gather from the cluster after the latest reboot is here: https://drive.google.com/file/d/1RnJXARFrU7l8wjEI0oFai3o6QHs85j1o/view?usp=sharing

The node that saw the issue was jerzhang-201208-1-dpfdj-worker-a-p7p7z, during its most recent reboot.

Version-Release number of selected component (if applicable):
4.7

How reproducible:
Very rare (seen only twice so far, and only on GCP). Maybe 1/20 or 1/50 reboots? Marking as low severity due to its low occurrence.

Steps to Reproduce:
1. Spin up a GCP 4.7 IPI cluster (default install)
2. Trigger reboots with MachineConfigs
3. Observe the console logs via the GCP console

Expected results:
Tuned terminates gracefully

Actual results:
Tuned needs to be SIGKILLed