Description of problem:
The stalld service crashed on the worker node with the error:

Oct 06 18:28:30 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 237 seconds)
Oct 06 18:28:42 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 249 seconds)
Oct 06 18:28:54 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 261 seconds)
Oct 06 18:29:06 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 273 seconds)
Oct 06 18:29:19 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 286 seconds)
Oct 06 18:29:31 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 298 seconds)
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: Invalid ID '(null)'
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=255/n/a
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: stalld.service: Consumed 4min 799ms CPU time

Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-10-03-051134
Server Version: 4.6.0-0.nightly-2020-10-03-051134
Kubernetes Version: v1.19.0+db1fc96

How reproducible:
Sometimes; it happens without a known trigger.

Steps to Reproduce:
1.
2.
3.

Actual results:
stalld crashed.

Expected results:
stalld should not crash.

Additional info:
The bug is probably fixed upstream in stalld by this commit:
https://gitlab.com/rt-linux-tools/stalld/-/commit/2a6dbc98147e8645f5b385322465c7f173df8607
We only need to install the new binary under the NTO (Node Tuning Operator).
On pod:

$ oc project openshift-cluster-node-tuning-operator
Now using project "openshift-cluster-node-tuning-operator" on server "https://api.skordas1015.qe.devcluster.openshift.com:6443".
$ oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-67dbdbf885-lmd7m   1/1     Running   0          4h45m
tuned-7pzdp                                     1/1     Running   0          5h9m
tuned-8qttl                                     1/1     Running   0          5h9m
tuned-jpb24                                     1/1     Running   0          5h9m
tuned-mkl86                                     1/1     Running   0          5h2m
tuned-ntz7f                                     1/1     Running   0          5h1m
tuned-td9rs                                     1/1     Running   0          5h
$ oc rsh tuned-7pzdp
sh-4.4# stalld -h 2>&1|grep force_fifo
 -F/--force_fifo: use SCHED_FIFO for boosting

On host node with enabled stalld:

$ oc get pods -n openshift-cluster-node-tuning-operator -o wide
NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-node-tuning-operator-67dbdbf885-lmd7m   1/1     Running   0          4h47m   10.130.0.15    ip-10-0-139-57.us-east-2.compute.internal    <none>           <none>
tuned-7pzdp                                     1/1     Running   0          5h10m   10.0.201.78    ip-10-0-201-78.us-east-2.compute.internal    <none>           <none>
tuned-8qttl                                     1/1     Running   0          5h10m   10.0.181.44    ip-10-0-181-44.us-east-2.compute.internal    <none>           <none>
tuned-jpb24                                     1/1     Running   0          5h10m   10.0.139.57    ip-10-0-139-57.us-east-2.compute.internal    <none>           <none>
tuned-mkl86                                     1/1     Running   0          5h4m    10.0.199.83    ip-10-0-199-83.us-east-2.compute.internal    <none>           <none>
tuned-ntz7f                                     1/1     Running   0          5h2m    10.0.173.192   ip-10-0-173-192.us-east-2.compute.internal   <none>           <none>
tuned-td9rs                                     1/1     Running   0          5h2m    10.0.159.212   ip-10-0-159-212.us-east-2.compute.internal   <none>           <none>
$ oc debug node/ip-10-0-201-78.us-east-2.compute.internal
Starting pod/ip-10-0-201-78us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.201.78
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ps -ef | grep stalld
root 298592 297867 0 19:07 pts/0 00:00:00 stalld -p 1000000000 -r 10000 -d 3 -t 30 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid

clusterversion 4.7.0-0.nightly-2020-10-15-051208
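The checks above show that the new binary is present and that stalld is running. As a complementary check, the node's journal could be scanned for the crash signature from the description. The following is only a rough sketch: the helper name count_stalld_invalid_id is hypothetical (not part of stalld or NTO), and it assumes the crash always logs the same "Invalid ID" message from the stalld process.

```shell
#!/bin/sh
# Hypothetical helper (not part of stalld or NTO): reads journal text on
# stdin and prints how many stalld lines carry the "Invalid ID" message
# seen right before the crash in the description.
count_stalld_invalid_id() {
    # grep -c prints the match count; it exits non-zero on zero matches,
    # so "|| true" keeps the helper's exit status clean in that case.
    grep -c 'stalld\[[0-9]*]: Invalid ID' || true
}

# Assumed usage on the node, after `chroot /host`:
#   journalctl -u stalld --no-pager | count_stalld_invalid_id
```

A count of 0 over a long enough uptime would suggest the fix holds, although, since the crash is intermittent, this cannot prove it.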
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633