Bug 1949739 - Re-enable stalld
Summary: Re-enable stalld
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Performance Addon Operator
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.7.z
Assignee: Yanir Quinn
QA Contact: Niranjan Mallapadi Raghavender
URL:
Whiteboard:
Depends On: 1947773
Blocks: 1949027
Reported: 2021-04-15 00:26 UTC by OpenShift BugZilla Robot
Modified: 2021-07-02 11:13 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The stalld service was disabled by default for performance profiles because a bug in the kernel's HRTICK subsystem caused the system to hang.
Consequence: stalld is required to prevent starvation of operating system threads. With stalld disabled by default, achievable latency suffers.
Fix: Kernel versions 4.18.0-240.22.1.el8_3.x86_64 (non-real-time) and 4.18.0-240.22.1.rt7.77.el8_3.x86_64 (real-time) contain a fix for the HRTICK bug, which allows stalld to be re-enabled by default for all performance profiles.
Result: With stalld re-enabled by default, tuning the system through the Performance Addon Operator again delivers the required latency results.
Important Note: The kernel version on all nodes must be 4.18.0-240.22.1.el8_3.x86_64 (non-real-time) or 4.18.0-240.22.1.rt7.77.el8_3.x86_64 (real-time), or higher. This means the underlying RHCOS version must be updated BEFORE PAO itself.
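
The kernel requirement in the note above can be checked mechanically. The following is a minimal Python sketch, not part of PAO: the helper names parse_kernel and meets_minimum are hypothetical, and the comparison is a simplified numeric ordering of the release string rather than a full RPM version comparison.

```python
# Minimum kernel builds named in the Doc Text.
MIN_STANDARD = "4.18.0-240.22.1.el8_3.x86_64"
MIN_REALTIME = "4.18.0-240.22.1.rt7.77.el8_3.x86_64"


def parse_kernel(release):
    """Split a kernel release string into (base version, build numbers, rt version).

    Hypothetical helper: only the numeric fields are kept, the dist tag
    (el8_3) and architecture are ignored.
    """
    version, _, build = release.partition("-")
    base = tuple(int(p) for p in version.split("."))
    tokens = build.split(".")
    nums = []
    i = 0
    # Leading numeric tokens form the build number, e.g. 240.22.1.
    while i < len(tokens) and tokens[i].isdigit():
        nums.append(int(tokens[i]))
        i += 1
    rt = ()
    # A real-time kernel carries an "rtN.M" marker right after the build number.
    if i < len(tokens) and tokens[i].startswith("rt") and tokens[i][2:].isdigit():
        minor = tokens[i + 1] if i + 1 < len(tokens) else ""
        rt = (int(tokens[i][2:]), int(minor) if minor.isdigit() else 0)
    return base, tuple(nums), rt


def meets_minimum(release):
    """Return True if the release is at or above the matching minimum build."""
    parsed = parse_kernel(release)
    minimum = parse_kernel(MIN_REALTIME if parsed[2] else MIN_STANDARD)
    return parsed >= minimum


if __name__ == "__main__":
    for release in (MIN_STANDARD, MIN_REALTIME, "4.18.0-240.15.1.el8_3.x86_64"):
        print(release, meets_minimum(release))
```

Real-time and non-real-time kernels are compared against separate minimums because the Doc Text names one build for each.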
Clone Of:
Environment:
Last Closed: 2021-04-27 05:10:52 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift-kni/performance-addon-operators/blob/master/functests/1_performance/performance.go#L216 0 None None None 2021-07-02 11:13:23 UTC
Github openshift-kni performance-addon-operators pull 613 0 None open [release-4.7] Bug 1949739: Re-enable stalld 2021-04-21 12:51:03 UTC
Red Hat Product Errata RHBA-2021:1349 0 None None None 2021-04-27 05:10:55 UTC

Comment 4 Niranjan Mallapadi Raghavender 2021-04-26 15:10:35 UTC
1. Install ocp-4.7.7 and pao from prod (pao-4.7.2-1)

[root@dell-r640-015 performance]# /root/img-nvr.sh 
f31fcaa57d82c50cd2a4b10ca0420940b7d2c24edee4a199b5161bd24c9783a7
NVR=v4.7.2-1

2. Check the tuned profile (stalld is still disabled; the [service] section is commented out):

[root@dell-r640-015 performance]# oc describe tuned/openshift-node-performance-performance
Name:         openshift-node-performance-performance
Namespace:    openshift-cluster-node-tuning-operator
Labels:       <none>
Annotations:  <none>
API Version:  tuned.openshift.io/v1
Kind:         Tuned
Metadata:
  Creation Timestamp:  2021-04-26T07:07:44Z
  Generation:          1
  Managed Fields:
    API Version:  tuned.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"a61ec883-ec20-4862-a44e-71afd641e657"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:profile:
        f:recommend:
      f:status:
    Manager:    performance-operator
    Operation:  Update
    Time:       2021-04-26T07:07:44Z
  Owner References:
    API Version:           performance.openshift.io/v2
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  PerformanceProfile
    Name:                  performance
    UID:                   a61ec883-ec20-4862-a44e-71afd641e657
  Resource Version:        61672
  Self Link:               /apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds/openshift-node-performance-performance
  UID:                     2fc4d332-c6a6-4e27-9324-5e5121cd47fd
Spec:
  Profile:
    Data:  [main]
summary=Openshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)
include=openshift-node,cpu-partitioning

# Inheritance of base profiles legend:
# cpu-partitioning -> network-latency -> latency-performance
# https://github.com/redhat-performance/tuned/blob/master/profiles/latency-performance/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/network-latency/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/cpu-partitioning/tuned.conf

# All values are mapped with a comment where a parent profile contains them.
# Different values will override the original values in parent profiles.

[variables]
# isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7

isolated_cores=1-3


not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}

[cpu]
force_latency=cstate.id:1|3                   #  latency-performance  (override)
governor=performance                          #  latency-performance
energy_perf_bias=performance                  #  latency-performance
min_perf_pct=100                              #  latency-performance

# Comment the stalld service section to prevent stalld installation
# until bugs https://bugzilla.redhat.com/show_bug.cgi?id=1912118 and
# https://bugzilla.redhat.com/show_bug.cgi?id=1903302 will be fixed
#[service]
#service.stalld=stop,disable


3. Upgrade to ocp-4.7.8 


 oc adm upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:7456516a64edf63268522565cf00dc581f1d7ad22355ffab8157a9e106cf607f --allow-explicit-upgrade --force
warning: The requested upgrade image is not one of the available updates.  You have used --allow-explicit-upgrade to the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:7456516a64edf63268522565cf00dc581f1d7ad22355ffab8157a9e106cf607f
[root@dell-r640-015 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.7     True        True          7s      Working towards registry.ci.openshift.org/ocp/release@sha256:7456516a64edf63268522565cf00dc581f1d7ad22355ffab8157a9e106cf607f: downloading update

[root@dell-r640-015 ~]# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.8     True        False         44s     Cluster version is 4.7.8
[root@dell-r640-015 ~]# oc get nodes
NAME                             STATUS   ROLES               AGE     VERSION
ocp47-master-0.demo.lab.shanks   Ready    master              4h34m   v1.20.0+7d0a2b2
ocp47-master-1.demo.lab.shanks   Ready    master              4h34m   v1.20.0+7d0a2b2
ocp47-master-2.demo.lab.shanks   Ready    master              4h34m   v1.20.0+7d0a2b2
ocp47-worker-0.demo.lab.shanks   Ready    worker,worker-cnf   4h22m   v1.20.0+7d0a2b2
ocp47-worker-1.demo.lab.shanks   Ready    worker,worker-cnf   4h22m   v1.20.0+7d0a2b2
ocp47-worker-2.demo.lab.shanks   Ready    worker              4h20m   v1.20.0+7d0a2b2


4. Upgrade the Performance Addon Operator to 4.7.3-1

[root@dell-r640-015 performance]# oc get csv 
NAME                                DISPLAY                      VERSION   REPLACES                            PHASE
performance-addon-operator.v4.7.3   Performance Addon Operator   4.7.3     performance-addon-operator.v4.7.2   Succeeded


[root@dell-r640-015 ~]# oc get nodes -o wide
NAME                             STATUS   ROLES               AGE     VERSION           INTERNAL-IP       EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                        CONTAINER-RUNTIME
ocp47-master-0.demo.lab.shanks   Ready    master              4h34m   v1.20.0+7d0a2b2   192.168.122.17    <none>        Red Hat Enterprise Linux CoreOS 47.83.202104161442-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64          cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
ocp47-master-1.demo.lab.shanks   Ready    master              4h34m   v1.20.0+7d0a2b2   192.168.122.233   <none>        Red Hat Enterprise Linux CoreOS 47.83.202104161442-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64          cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
ocp47-master-2.demo.lab.shanks   Ready    master              4h34m   v1.20.0+7d0a2b2   192.168.122.220   <none>        Red Hat Enterprise Linux CoreOS 47.83.202104161442-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64          cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
ocp47-worker-0.demo.lab.shanks   Ready    worker,worker-cnf   4h22m   v1.20.0+7d0a2b2   192.168.122.26    <none>        Red Hat Enterprise Linux CoreOS 47.83.202104161442-0 (Ootpa)   4.18.0-240.22.1.rt7.77.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
ocp47-worker-1.demo.lab.shanks   Ready    worker,worker-cnf   4h22m   v1.20.0+7d0a2b2   192.168.122.241   <none>        Red Hat Enterprise Linux CoreOS 47.83.202104161442-0 (Ootpa)   4.18.0-240.22.1.rt7.77.el8_3.x86_64   cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
ocp47-worker-2.demo.lab.shanks   Ready    worker              4h20m   v1.20.0+7d0a2b2   192.168.122.183   <none>        Red Hat Enterprise Linux CoreOS 47.83.202104161442-0 (Ootpa)   4.18.0-240.22.1.el8_3.x86_64          cri-o://1.20.2-6.rhaos4.7.gitf1d5201.el8
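
The wide table above is awkward to parse because the OS-IMAGE column contains spaces. As a hedged alternative, the sketch below reads the same information from the JSON form of the node list (for example, the output of `oc get nodes -o json`), where each node reports its kernel under status.nodeInfo.kernelVersion. The function name kernel_versions is hypothetical, and the sample document is trimmed to the fields used here.

```python
import json


def kernel_versions(nodes_json):
    """Map each node name to its reported kernel version.

    Expects the JSON text of a Kubernetes node list.
    """
    nodes = json.loads(nodes_json)
    return {
        item["metadata"]["name"]: item["status"]["nodeInfo"]["kernelVersion"]
        for item in nodes["items"]
    }


# Trimmed sample with two of the nodes from the transcript above.
sample = json.dumps({
    "items": [
        {
            "metadata": {"name": "ocp47-master-0.demo.lab.shanks"},
            "status": {"nodeInfo": {"kernelVersion": "4.18.0-240.22.1.el8_3.x86_64"}},
        },
        {
            "metadata": {"name": "ocp47-worker-0.demo.lab.shanks"},
            "status": {"nodeInfo": {"kernelVersion": "4.18.0-240.22.1.rt7.77.el8_3.x86_64"}},
        },
    ]
})

if __name__ == "__main__":
    for name, kernel in kernel_versions(sample).items():
        print(name, kernel)
```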


5. Check the tuned profile and verify stalld is enabled


[root@dell-r640-015 performance]# oc describe tuned/openshift-node-performance-performance
Name:         openshift-node-performance-performance
Namespace:    openshift-cluster-node-tuning-operator
Labels:       <none>
Annotations:  <none>
API Version:  tuned.openshift.io/v1
Kind:         Tuned
Metadata:
  Creation Timestamp:  2021-04-26T07:07:44Z
  Generation:          3
  Managed Fields:
    API Version:  tuned.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"a61ec883-ec20-4862-a44e-71afd641e657"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:profile:
        f:recommend:
      f:status:
    Manager:    performance-operator
    Operation:  Update
    Time:       2021-04-26T07:07:44Z
  Owner References:
    API Version:           performance.openshift.io/v2
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  PerformanceProfile
    Name:                  performance
    UID:                   a61ec883-ec20-4862-a44e-71afd641e657
  Resource Version:        212538
  Self Link:               /apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/tuneds/openshift-node-performance-performance
  UID:                     2fc4d332-c6a6-4e27-9324-5e5121cd47fd
Spec:
  Profile:
    Data:  [main]
summary=Openshift node optimized for deterministic performance at the cost of increased power consumption, focused on low latency network performance. Based on Tuned 2.11 and Cluster node tuning (oc 4.5)
include=openshift-node,cpu-partitioning

# Inheritance of base profiles legend:
# cpu-partitioning -> network-latency -> latency-performance
# https://github.com/redhat-performance/tuned/blob/master/profiles/latency-performance/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/network-latency/tuned.conf
# https://github.com/redhat-performance/tuned/blob/master/profiles/cpu-partitioning/tuned.conf

# All values are mapped with a comment where a parent profile contains them.
# Different values will override the original values in parent profiles.

[variables]
# isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7

isolated_cores=4-46


not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}

[cpu]
force_latency=cstate.id:1|3                   #  latency-performance  (override)
governor=performance                          #  latency-performance
energy_perf_bias=performance                  #  latency-performance
min_perf_pct=100                              #  latency-performance

[service]
service.stalld=start,enable
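
The difference between the profile in step 2 and the one above comes down to whether an uncommented [service] section sets service.stalld. A minimal Python sketch of that check follows; stalld_setting is a hypothetical helper that does a simple line scan of the profile text and is not how the Node Tuning Operator itself reads profiles.

```python
def stalld_setting(profile_data):
    """Return the value of service.stalld in the [service] section, or None.

    Lines starting with '#' are treated as comments, so a commented-out
    section (as in the 4.7.2 profile) yields None.
    """
    section = None
    for raw in profile_data.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and commented-out lines
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "service" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "service.stalld":
                return value.strip()
    return None


# Excerpts from the two profiles shown in this comment.
before_upgrade = """[cpu]
governor=performance
#[service]
#service.stalld=stop,disable
"""

after_upgrade = """[cpu]
governor=performance

[service]
service.stalld=start,enable
"""

if __name__ == "__main__":
    print(stalld_setting(before_upgrade))
    print(stalld_setting(after_upgrade))
```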

[root@ocp47-worker-0 ~]# systemctl status stalld
● stalld.service - Stall Monitor
   Loaded: loaded (/etc/systemd/system/stalld.service; static; vendor preset: disabled)
   Active: active (running) since Mon 2021-04-26 13:52:55 UTC; 1h 16min ago
 Main PID: 10639 (stalld)
    Tasks: 1 (limit: 205142)
   Memory: 4.2M
      CPU: 2min 10.809s
   CGroup: /system.slice/stalld.service
           └─10639 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid

Apr 26 13:52:48 ocp47-worker-0.demo.lab.shanks systemd[1]: Starting Stall Monitor...
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks systemd[1]: Started Stall Monitor.
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks stalld[10639]: dl_runtime is shorter than 1ms, setting HRTICK
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks stalld[10639]: boosted pid 0 using SCHED_DEADLINE
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks stalld[10639]: using SCHED_DEADLINE for boosting
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks stalld[10639]: initial config_buffer_size set to 535500
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks stalld[10639]: detected new task format
Apr 26 13:52:55 ocp47-worker-0.demo.lab.shanks stalld[10639]: sched_debug is getting larger, increasing the buffer to 1071000
Apr 26 14:48:07 ocp47-worker-0.demo.lab.shanks stalld[10639]: sched_debug is getting larger, increasing the buffer to 2142000

Comment 6 errata-xmlrpc 2021-04-27 05:10:52 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.8 low-latency extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1349

