Description of problem:
The stalld service runs at a higher scheduler priority than the ksoftirqd and rcu{b,c} kernel threads, which can freeze the system when load-intensive processes such as oslat are running.

Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-07-25-091217
Server Version: 4.7.0-0.ci-2020-12-08-050547
Kubernetes Version: v1.19.2-1007+ad738ba548b6d6-dirty

How reproducible:
Always

Steps to Reproduce:
1. Run the oslat tool on the node to create pressure on the real-time kernel environment.

Actual results:
The host freezes until oslat finishes running.

Expected results:
The host should continue to work as expected.

Additional info:
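One way to confirm the priority inversion described above is to compare the real-time scheduling attributes of stalld against the kernel threads it can starve. This is a sketch using standard procps `ps` output fields; thread names may vary slightly by kernel version:

```shell
# Show scheduling class and real-time priority of stalld and the kernel
# threads affected by it. CLS: TS=SCHED_OTHER, FF=SCHED_FIFO, RR=SCHED_RR;
# RTPRIO ranges 1-99, higher preempts lower. If stalld's RTPRIO is above
# that of ksoftirqd/rcuc, those threads can be starved under load.
ps -eLo tid,cls,rtprio,comm --sort=-rtprio | \
    awk 'NR == 1 || $4 ~ /^(stalld|ksoftirqd|rcuc|rcub)/'
```

On an affected node, stalld should appear with a higher RTPRIO than the ksoftirqd and rcuc/rcub entries; on a node where no such threads run, the command prints only the header.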
Version:
--------
[root@dell-r730-009 cnf-internal-deploy]# oc version
Client Version: 4.7.0-0.nightly-2020-12-04-013308
Server Version: 4.7.0-0.nightly-2020-12-21-131655
Kubernetes Version: v1.20.0+87544c5

Version of Performance Operator:
---------------------------------
rh-osbs/openshift4-performance-addon-operator-bundle-registry-container-rhel8:v4.7.0-285
Index image v4.7: registry-proxy.engineering.redhat.com/rh-osbs/iib:34633

[root@dell-r730-009 cnf-internal-deploy]# oc logs pods/performance-operator-767ddb449-lknqr -n openshift-performance-addon-operator
I1228 07:50:49.908341       1 main.go:72] Operator Version:
I1228 07:50:49.908453       1 main.go:73] Git Commit:
I1228 07:50:49.908461       1 main.go:74] Build Date: 2020-12-22T13:00:05+0000
I1228 07:50:49.908475       1 main.go:75] Go Version: go1.13.15
I1228 07:50:49.908481       1 main.go:76] Go OS/Arch: linux/amd64
I1228 07:50:50.962708       1 request.go:621] Throttling request took 1.036451752s, request: GET:https://172.30.0.1:443/apis/tuned.openshift.io/v1?timeout=32s

Deployed the performance profile and checked the tuned profile pod; it contains the relevant information:

[main]
summary=Openshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)
include=openshift-node,cpu-partitioning

[variables]
isolated_cores=2-15
not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}

[cpu]
force_latency=cstate.id:1|3 # latency-performance (override)
governor=performance # latency-performance
energy_perf_bias=performance # latency-performance
min_perf_pct=100 # latency-performance

[service]
service.stalld=start,enable

[vm]
transparent_hugepages=never # network-latency

[scheduler]
group.ksoftirqd=0:f:11:*:ksoftirqd.*
group.rcuc=0:f:11:*:rcuc.*

Marking it verified.
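For reference, the `group.*` values in the tuned `[scheduler]` section above use the scheduler plugin's colon-separated format, `rule_prio:policy:rt_prio:affinity:regex` (field naming per the tuned scheduler plugin; the decoder below is only an illustrative sketch, not part of tuned):

```shell
# Decode a tuned scheduler group value of the form
# <rule_prio>:<policy>:<rt_prio>:<affinity>:<regex>
# Policy letters: f=SCHED_FIFO, r=SCHED_RR, o=SCHED_OTHER.
decode_group() {
    IFS=: read -r rule policy prio affinity regex <<EOF
$1
EOF
    case "$policy" in
        f) policy=SCHED_FIFO ;;
        r) policy=SCHED_RR ;;
        o) policy=SCHED_OTHER ;;
    esac
    echo "rule_prio=$rule policy=$policy rt_prio=$prio affinity=$affinity regex=$regex"
}

decode_group "0:f:11:*:ksoftirqd.*"
# prints: rule_prio=0 policy=SCHED_FIFO rt_prio=11 affinity=* regex=ksoftirqd.*
```

So both lines pin threads matching `ksoftirqd.*` and `rcuc.*` to SCHED_FIFO at real-time priority 11, which is the fix that keeps them above stalld and prevents the freeze described in this bug.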
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633