Bug 1885864 - Stalld service crashed under the worker node
Summary: Stalld service crashed under the worker node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Tuning Operator
Version: 4.6
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Jiří Mencák
QA Contact: Simon
URL:
Whiteboard:
Depends On:
Blocks: 1886511
Reported: 2020-10-07 08:22 UTC by Artyom
Modified: 2021-02-24 15:24 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1886511
Environment:
Last Closed: 2021-02-24 15:23:22 UTC
Target Upstream Version:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:24:10 UTC

Description Artyom 2020-10-07 08:22:34 UTC
Description of problem:
The stalld service crashed on the worker node with the following error:
Oct 06 18:28:30 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 237 seconds)
Oct 06 18:28:42 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 249 seconds)
Oct 06 18:28:54 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 261 seconds)
Oct 06 18:29:06 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 273 seconds)
Oct 06 18:29:19 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 286 seconds)
Oct 06 18:29:31 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]: rcuc/48-414 might starve on CPU 48 (waiting for 298 seconds)
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com stalld[695415]:   Invalid ID '(null)'
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: stalld.service: Main process exited, code=exited, status=255/n/a
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: stalld.service: Failed with result 'exit-code'.
Oct 06 18:30:22 cnfdc2.clus2.t5g.lab.eng.bos.redhat.com systemd[1]: stalld.service: Consumed 4min 799ms CPU time
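
For triage, a log like the one above can be condensed into the longest reported starvation wait and whether the service exited. The following is only a minimal sketch: the function name summarize_stalld_log is hypothetical, and it assumes the message formats are exactly those shown in the excerpt.

```shell
# Summarize stalld journal text read from stdin.
# Assumption: lines look like the excerpt in this report, i.e.
#   "... might starve on CPU N (waiting for S seconds)"
#   "... stalld.service: Main process exited, ..."
summarize_stalld_log() {
    awk '
        /might starve on CPU/ {
            # The wait time is the field just before the word "seconds)".
            for (i = 2; i <= NF; i++)
                if ($i == "seconds)" && $(i - 1) + 0 > max)
                    max = $(i - 1) + 0
        }
        /stalld\.service: Main process exited/ { crashed = 1 }
        END {
            printf "max_wait_seconds=%d crashed=%s\n", max, (crashed ? "yes" : "no")
        }
    '
}
```

On the affected node it could be fed from the journal, for example with `journalctl -u stalld.service | summarize_stalld_log`.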

Version-Release number of selected component (if applicable):
Client Version: 4.6.0-0.nightly-2020-10-03-051134
Server Version: 4.6.0-0.nightly-2020-10-03-051134
Kubernetes Version: v1.19.0+db1fc96

How reproducible:
Intermittent; it sometimes just happens.

Steps to Reproduce:
1.
2.
3.

Actual results:
Stalld crashed

Expected results:
It should not crash.

Additional info:
The bug is probably fixed upstream in stalld by commit https://gitlab.com/rt-linux-tools/stalld/-/commit/2a6dbc98147e8645f5b385322465c7f173df8607, so we only need to ship the new binary in the NTO.

Comment 4 Simon 2020-10-15 19:11:44 UTC
On pod:
$ oc project openshift-cluster-node-tuning-operator 
Now using project "openshift-cluster-node-tuning-operator" on server "https://api.skordas1015.qe.devcluster.openshift.com:6443".
$ oc get pods
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-67dbdbf885-lmd7m   1/1     Running   0          4h45m
tuned-7pzdp                                     1/1     Running   0          5h9m
tuned-8qttl                                     1/1     Running   0          5h9m
tuned-jpb24                                     1/1     Running   0          5h9m
tuned-mkl86                                     1/1     Running   0          5h2m
tuned-ntz7f                                     1/1     Running   0          5h1m
tuned-td9rs                                     1/1     Running   0          5h
$ oc rsh tuned-7pzdp
sh-4.4# stalld -h 2>&1|grep force_fifo
          -F/--force_fifo: use SCHED_FIFO for boosting

On a host node with stalld enabled:
$ oc get pods -n openshift-cluster-node-tuning-operator -o wide
NAME                                            READY   STATUS    RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
cluster-node-tuning-operator-67dbdbf885-lmd7m   1/1     Running   0          4h47m   10.130.0.15    ip-10-0-139-57.us-east-2.compute.internal    <none>           <none>
tuned-7pzdp                                     1/1     Running   0          5h10m   10.0.201.78    ip-10-0-201-78.us-east-2.compute.internal    <none>           <none>
tuned-8qttl                                     1/1     Running   0          5h10m   10.0.181.44    ip-10-0-181-44.us-east-2.compute.internal    <none>           <none>
tuned-jpb24                                     1/1     Running   0          5h10m   10.0.139.57    ip-10-0-139-57.us-east-2.compute.internal    <none>           <none>
tuned-mkl86                                     1/1     Running   0          5h4m    10.0.199.83    ip-10-0-199-83.us-east-2.compute.internal    <none>           <none>
tuned-ntz7f                                     1/1     Running   0          5h2m    10.0.173.192   ip-10-0-173-192.us-east-2.compute.internal   <none>           <none>
tuned-td9rs                                     1/1     Running   0          5h2m    10.0.159.212   ip-10-0-159-212.us-east-2.compute.internal   <none>           <none>
$ oc debug node/ip-10-0-201-78.us-east-2.compute.internal
Starting pod/ip-10-0-201-78us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.201.78
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# ps -ef | grep stalld
root      298592  297867  0 19:07 pts/0    00:00:00 stalld -p 1000000000 -r 10000 -d 3 -t 30 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid

clusterversion 4.7.0-0.nightly-2020-10-15-051208
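
The two manual checks in this comment (the help text lists -F/--force_fifo, and a stalld process is running on the host) could be scripted roughly as follows. This is a sketch only: both function names are hypothetical, and treating the presence of --force_fifo as a sign of the updated binary is an assumption based on the check performed above, not something the stalld documentation is quoted as stating here.

```shell
# Both helpers read previously captured command output on stdin,
# so they can be tried without cluster access.

# Succeeds if help text (captured from `stalld -h 2>&1`) lists --force_fifo.
stalld_has_force_fifo() {
    grep -q -e '--force_fifo'
}

# Succeeds if a `ps -ef` listing has stalld as the command itself
# (8th column), which skips the `grep stalld` line from the search.
stalld_is_running() {
    awk '$8 ~ /(^|\/)stalld$/ { found = 1 } END { exit(found ? 0 : 1) }'
}
```

For example, inside a tuned pod: `stalld -h 2>&1 | stalld_has_force_fifo && echo "force_fifo supported"`, and on the host after `chroot /host`: `ps -ef | stalld_is_running && echo "stalld running"`.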

Comment 7 errata-xmlrpc 2021-02-24 15:23:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

