Bug 2055267 - OSLAT runner uses both sibling threads causing latency spikes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: CNF Platform Validation
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.10.z
Assignee: Talor Itzhak
QA Contact: Dwaine Gonyier
URL:
Whiteboard:
Depends On: 2051443
Blocks:
 
Reported: 2022-02-16 14:42 UTC by OpenShift BugZilla Robot
Modified: 2022-12-06 13:00 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-12-06 13:00:39 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift-kni cnf-features-deploy pull 972 (open): [release-4.10] Bug 2055267: cnf-tests: filter main thread CPU siblings under the oslat test (last updated 2022-02-16 14:42:26 UTC)
Red Hat Product Errata RHEA-2022:8760 (last updated 2022-12-06 13:00:40 UTC)

Description OpenShift BugZilla Robot 2022-02-16 14:42:14 UTC
+++ This bug was initially created as a clone of Bug #2051443 +++

Description of problem:

The CNF tests oslat runner configures oslat to use all available CPUs, even when some of them are SMT siblings that share a physical core. Running measurement threads on both siblings of a core causes false latency spikes.

Version-Release number of selected component (if applicable):

All

How reproducible:

Always, by following the documentation on HT-enabled nodes


Steps to Reproduce:
1. Configure an OCP cluster using PAO, keeping hyper-threading enabled
2. Run oslat according to the docs, requesting 7 CPUs
3. Observe that oslat uses one CPU for the control thread (expected) and 6 for testing. Those 6 CPUs are guaranteed to include both thread 0 and thread 1 of roughly 3 physical cores

https://docs.openshift.com/container-platform/4.9/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-performing-end-to-end-tests-running-oslat

Actual results:

On a machine configured like this:

```
  cpu:
    reserved: "0,1,32,33"
    isolated: "2-31,34-63"

```

Here CPUs 0-31 are thread 0 of cores 0-31, and CPUs 32-63 are the secondary threads (thread 1) of the same cores.
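
For reference, the kernel exposes each CPU's SMT siblings in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list (the same file used in the verification further down). Below is a minimal Go sketch, not part of the runner, that prints the sibling list for a few of the isolated CPUs; the helper name and the CPU numbers are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// siblingsOf returns the kernel-reported SMT sibling list for a CPU,
// e.g. "2,34" on the machine described above (thread 0 and thread 1
// of the same physical core).
func siblingsOf(cpu int) (string, error) {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu)
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	// CPUs 2-5 are part of the isolated set in the profile above.
	for cpu := 2; cpu <= 5; cpu++ {
		s, err := siblingsOf(cpu)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("cpu%d siblings: %s\n", cpu, s)
	}
}
```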


```
cat <<EOF > run-cnf-tests.sh
sudo -E podman run --authfile ./pull_secret.txt -v $(pwd)/:/kubeconfig \
  --dns 10.20.129.82 \
  -e ROLE_WORKER_CNF=master \
  -e CLEAN_PERFORMANCE_PROFILE="false" \
  -e KUBECONFIG=/kubeconfig/kubeconfig \
  -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
  registry.redhat.com/openshift4/cnf-tests-rhel8:v4.9 \
  /usr/bin/test-run.sh -ginkgo.focus="oslat"
EOF


################# OUTPUT ######################################
$ oc logs oslat-kwks5
I0203 14:55:43.363056       1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/vmlinuz-4.18.0-305.30.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/0 ip=ens43f0:dhcp root=UUID=efae0a7f-874e-411d-b4ce-a9a51a1622a7 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=2-31,34-63 tuned.non_isolcpus=00000003,00000003 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,2-31,34-63 systemd.cpu_affinity=0,1,32,33 default_hugepagesz=1G hugepagesz=1G hugepages=50 idle=poll rcupdate.rcu_normal_after_boot=0 nohz_full=2-31,34-63
I0203 14:55:43.363322       1 node.go:44] Environment information: kernel version 4.18.0-305.30.1.el8_4.x86_64
I0203 14:55:43.363363       1 main.go:53] Running the oslat command with arguments [--duration 600 --rtprio 1 --cpu-list 3-5,34-36 --cpu-main-thread 2]

[admin@server poc-sno-01]$ oc logs oslat-kwks5 -f
I0203 14:55:43.363056       1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/vmlinuz-4.18.0-305.30.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/0 ip=ens43f0:dhcp root=UUID=efae0a7f-874e-411d-b4ce-a9a51a1622a7 rw rootflags=prjquota ...
Workload:               no
Workload mem:           0 (KiB)
Preheat cores:          6

Pre-heat for 1 seconds...
Test starts...
Test completed.

```

Notice this line:

Running the oslat command with arguments [--duration 600 --rtprio 1 --cpu-list 3-5,34-36 --cpu-main-thread 2]

CPUs 34-36 are the SMT siblings of CPUs 2-4, so the measurement threads share physical cores with each other and with the main (control) thread on CPU 2.

Expected results:

Only CPUs 3-5 should be used for the test; the remaining CPUs should be left idle.

Additional info:

Here is the runner: https://github.com/openshift-kni/cnf-features-deploy/blob/953dcd664f12d39116039e76fead9d83d1d33afb/cnf-tests/pod-utils/oslat-runner/main.go

Here is a similar runner from the performance team: https://github.com/redhat-nfvpe/container-perf-tools/blob/f641d725ffa694b735561b837abc4219753c93d8/oslat/cmd.sh#L70

--- Additional comment from msivak on 2022-02-07 10:06:01 UTC ---

After further discussion, we should start by filtering out the sibling of the control thread.
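
A minimal sketch of that first step, assuming the runner can read sysfs from inside the test pod; the function name is hypothetical and this is not the code from the linked pull request:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// filterMainThreadSiblings drops the SMT siblings of the main-thread CPU
// from the CPU list passed to oslat, so that the core running the control
// thread is otherwise left idle during the measurement.
func filterMainThreadSiblings(mainCPU int, testCPUs []int) ([]int, error) {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", mainCPU)
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	// Assumes a comma-separated sibling list such as "2,34"; a complete
	// parser would also handle the range syntax some kernels emit ("2-3").
	siblings := map[int]bool{}
	for _, field := range strings.Split(strings.TrimSpace(string(data)), ",") {
		n, err := strconv.Atoi(field)
		if err != nil {
			return nil, err
		}
		siblings[n] = true
	}
	filtered := make([]int, 0, len(testCPUs))
	for _, cpu := range testCPUs {
		if !siblings[cpu] {
			filtered = append(filtered, cpu)
		}
	}
	return filtered, nil
}

func main() {
	// With main CPU 2 and the allocation from the report, only CPU 34
	// (the sibling of 2) is removed, leaving 3-5,35,36 for the test.
	cpus, err := filterMainThreadSiblings(2, []int{3, 4, 5, 34, 35, 36})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(cpus)
}
```

Filtering the siblings of the test CPUs themselves (35 and 36 in the example above) would be the natural follow-up to reach the expected result of using only CPUs 3-5.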

Comment 3 Shereen Haj Makhoul 2022-11-22 13:06:30 UTC
Verification:

cnf-tests: registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.10.10-6

The machine has HT enabled; as can be seen below, CPU 0 has sibling CPU 40:
sh-4.4# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,40

Also from lscpu:
sh-4.4# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80   
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
...
CPU(s) = Socket(s) * Core(s) per socket * Thread(s) per core = 2 * 20 * 2 = 80

Configure the performance profile (PP) as follows:

```
spec:
  cpu:
    isolated: "2-39,42-80"
    reserved: "0,1,40,41"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```

Run oslat as follows, requesting 6 CPUs:

 podman run --net=host -v /home/kni/clusterconfigs/auth:/kc:z -e KUBECONFIG=/kc/kubeconfig -e IMAGE_REGISTRY=registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/ -e CNF_TESTS_IMAGE=openshift4-cnf-tests:v4.10.10-6 -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker -e LATENCY_TEST_RUNTIME=100 -e LATENCY_TEST_CPUS=6 -e MAXIMUM_LATENCY=10000000 registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/openshift4-cnf-tests:v4.10.10-6 /usr/bin/test-run.sh -ginkgo.focus="oslat"
Trying to pull registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/openshift4-cnf-tests:v4.10.10-6...
Getting image source signatures
...
Will run 1 of 173 specs

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSLog file created at: 2022/11/22 12:00:21
Running on machine: oslat-d56kq
Binary: Built with gc go1.17.12 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1122 12:00:21.564781       1 node.go:39] Environment information: /proc/cmdline: BOOT_IMAGE=(hd1,gpt3)/ostree/rhcos-b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/vmlinuz-4.18.0-425.3.1.rt7.213.el8.x86_64 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/0 ip=dhcp root=UUID=10f5e35f-f897-41e5-ab22-a4e811b9d4eb rw rootflags=prjquota boot=UUID=97cd8797-d71c-4b73-a9f3-5b8f41e0a473
I1122 12:00:21.564989       1 node.go:46] Environment information: kernel version 4.18.0-425.3.1.rt7.213.el8.x86_64
I1122 12:00:21.565010       1 main.go:58] Running the oslat command with arguments [--duration 100 --rtprio 1 --cpu-list 3-4,43-44 --cpu-main-thread 2]
I1122 12:02:02.603193       1 main.go:64] Succeeded to run the oslat command: oslat V 1.10
Total runtime: 		100 seconds
Thread priority: 	SCHED_FIFO:1
CPU list: 		3-4,43-44
CPU for main thread: 	2
Workload: 		no
Workload mem: 		0 (KiB)
Preheat cores: 		4

Pre-heat for 1 seconds...
Test starts...
Test completed.

        Core:	 3 4 43 44
    CPU Freq:	 2392 2392 2398 2392 (Mhz)
    001 (us):	 3334852986 3331365521 3335019579 3334277484
    002 (us):	 86 88 71 79
    003 (us):	 10595 310 11890 37009
    004 (us):	 63866 37955 65905 51737
    005 (us):	 18945 44141 15939 5529
    006 (us):	 1077 11262 770 235
    007 (us):	 87 794 67 64
    008 (us):	 45 63 46 46
    009 (us):	 55 41 46 53
    010 (us):	 32 57 30 43
    011 (us):	 578 424 728 439
    012 (us):	 870 1031 723 1014
    013 (us):	 3166 2057 3255 2454
    014 (us):	 191 1346 99 885
    015 (us):	 102 73 106 88
    016 (us):	 118 103 248 195
    017 (us):	 3709 2250 3637 2267
    018 (us):	 872 2370 806 2250
    019 (us):	 11 16 14 13
    020 (us):	 37 18 32 32
    021 (us):	 25 23 24 30
    022 (us):	 4 35 7 3
    023 (us):	 5 6 5 3
    024 (us):	 2 3 2 2
    025 (us):	 1 3 1 2
    026 (us):	 3 3 3 4
    027 (us):	 4 2 4 2
    028 (us):	 4 5 4 6
    029 (us):	 3 4 3 3
    030 (us):	 0 0 0 0
    031 (us):	 0 0 0 0
    032 (us):	 109 109 109 109 (including overflows)
     Minimum:	 1 1 1 1 (us)
     Average:	 1.002 1.002 1.002 1.002 (us)
     Maximum:	 50167 50168 50058 50184 (us)
     Max-Min:	 50166 50167 50057 50183 (us)
    Duration:	 100.095 100.095 99.845 100.095 (sec)


------------------------------
• [SLOW TEST:173.351 seconds]
[performance] Latency Test
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:62
  with the oslat image
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:132
    should succeed
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:157
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSI1122 12:03:21.567281      27 request.go:665] Waited for 1.171689031s due to client-side throttling, not priority and fairness, request: GET:https://api.hlxcl6.lab.eng.tlv2.redhat.com:6443/apis/flowcontrol.apiserver.k8s.io/v1beta1?timeout=32s

JUnit report was created: /junit.xml/cnftests-junit.xml

Ran 1 of 173 Specs in 215.647 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 172 Skipped


The pod logs are as follows:
 oc logs oslat-d56kq
I1122 12:00:21.564781       1 node.go:39] Environment information: /proc/cmdline: BOOT_IMAGE=(hd1,gpt3)/ostree/rhcos-b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/vmlinuz-4.18.0-425.3.1.rt7.213.el8.x86_64 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/0 ip=dhcp root=UUID=10f5e35f-f897-41e5-ab22-a4e811b9d4eb rw rootflags=prjquota boot=UUID=97cd8797-d71c-4b73-a9f3-5b8f41e0a473
I1122 12:00:21.564989       1 node.go:46] Environment information: kernel version 4.18.0-425.3.1.rt7.213.el8.x86_64
I1122 12:00:21.565010       1 main.go:58] Running the oslat command with arguments [--duration 100 --rtprio 1 --cpu-list 3-4,43-44 --cpu-main-thread 2]
[root@registry ~]# ^C

CPU 2 is the main CPU for the test, and its sibling, CPU 42, is now excluded from the test list; --cpu-list contains 4 CPUs, so together they account for all 6 requested CPUs.

Comment 6 errata-xmlrpc 2022-12-06 13:00:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.10 low-latency extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2022:8760

