Description of problem:

The CNF tests oslat runner configures oslat to use all available CPUs, even when some of them belong to the same physical CPU core. This causes false latency spikes.

Version-Release number of selected component (if applicable): All

How reproducible: Always, by following the documentation on HT-enabled nodes

Steps to Reproduce:
1. Configure an OCP cluster using PAO, keeping hyper-threading enabled.
2. Run oslat according to the docs, requesting 7 CPUs:
https://docs.openshift.com/container-platform/4.9/scalability_and_performance/cnf-performance-addon-operator-for-low-latency-nodes.html#cnf-performing-end-to-end-tests-running-oslat
3. Observe that oslat uses one CPU for the control thread (OK) and 6 for testing. Those 6 are guaranteed to include both thread 0 and thread 1 of roughly 3 physical cores.

Actual results:

On a machine configured like this:
```
cpu:
  reserved: "0,1,32,33"
  isolated: "2-31,34-63"
```
where CPUs 0-31 are thread 0 of cores 0-31 and CPUs 32-63 are the secondary threads (thread 1) of the same cores, run:
```
cat <<EOF > run-cnf-tests.sh
sudo -E podman run --authfile ./pull_secret.txt -v $(pwd)/:/kubeconfig \
--dns 10.20.129.82 \
-e ROLE_WORKER_CNF=master \
-e CLEAN_PERFORMANCE_PROFILE="false" \
-e KUBECONFIG=/kubeconfig/kubeconfig \
-e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e MAXIMUM_LATENCY=20 \
registry.redhat.com/openshift4/cnf-tests-rhel8:v4.9 \
/usr/bin/test-run.sh -ginkgo.focus="oslat"
EOF
```

Output:
```
$ oc logs oslat-kwks5
I0203 14:55:43.363056 1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/vmlinuz-4.18.0-305.30.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/0 ip=ens43f0:dhcp root=UUID=efae0a7f-874e-411d-b4ce-a9a51a1622a7 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=2-31,34-63 tuned.non_isolcpus=00000003,00000003 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,2-31,34-63 systemd.cpu_affinity=0,1,32,33 default_hugepagesz=1G hugepagesz=1G hugepages=50 idle=poll rcupdate.rcu_normal_after_boot=0 nohz_full=2-31,34-63
I0203 14:55:43.363322 1 node.go:44] Environment information: kernel version 4.18.0-305.30.1.el8_4.x86_64
I0203 14:55:43.363363 1 main.go:53] Running the oslat command with arguments [--duration 600 --rtprio 1 --cpu-list 3-5,34-36 --cpu-main-thread 2]

[admin@server poc-sno-01]$ oc logs oslat-kwks5 -f
I0203 14:55:43.363056 1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/vmlinuz-4.18.0-305.30.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/22582544502f700ca41c79366c3dee5737c2ee491485c0cecb5ba19d7151b5e7/0 ip=ens43f0:dhcp root=UUID=efae0a7f-874e-411d-b4ce-a9a51a1622a7 rw rootflags=prjq
Workload: no
Workload mem: 0 (KiB)
Preheat cores: 6
Pre-heat for 1 seconds...
Test starts...
Test completed.
```

Notice this line:

Running the oslat command with arguments [--duration 600 --rtprio 1 --cpu-list 3-5,34-36 --cpu-main-thread 2]

Given the topology above, CPUs 34, 35 and 36 are the hyper-thread siblings of CPUs 2, 3 and 4: the test threads on 3 and 4 share physical cores with the test threads on 35 and 36, and the test thread on 34 shares a core with the control thread on CPU 2.

Expected results:

Only CPUs 3-5 should be used for the test; the remaining CPUs (the sibling threads 34-36) should be left idle.
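To see the pairing directly, the kernel exposes each CPU's hyper-thread siblings in sysfs. The following standalone Go sketch (illustrative only, not code from the runner) prints every CPU's thread_siblings_list; on the machine above it would print pairs such as "cpu3 siblings: 3,35", which is exactly why a --cpu-list of 3-5,34-36 doubles up on cores:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	// Each cpuN directory exposes the set of hyper-threads sharing its
	// physical core, e.g. "3,35" on the machine described above.
	paths, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")
	if err != nil {
		panic(err)
	}
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // offline CPUs may have no readable topology
		}
		// p looks like /sys/devices/system/cpu/cpu3/topology/thread_siblings_list
		cpu := filepath.Base(filepath.Dir(filepath.Dir(p)))
		fmt.Printf("%s siblings: %s\n", cpu, strings.TrimSpace(string(data)))
	}
}
```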
Additional info:

Here is the runner:
https://github.com/openshift-kni/cnf-features-deploy/blob/953dcd664f12d39116039e76fead9d83d1d33afb/cnf-tests/pod-utils/oslat-runner/main.go

Here is a similar runner from the performance team:
https://github.com/redhat-nfvpe/container-perf-tools/blob/f641d725ffa694b735561b837abc4219753c93d8/oslat/cmd.sh#L70
After further discussion, we should start by filtering out the sibling of the control thread.
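A minimal sketch of that filtering in Go, assuming 2-way SMT and a plain comma-separated thread_siblings_list (the names here, such as filterControlSiblings, are illustrative and not the runner's actual API; the real change belongs in the oslat-runner linked above):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// siblingsOf reads the hyper-thread siblings of the given CPU from sysfs.
// It assumes the file is a plain comma-separated list such as "2,42";
// range syntax ("2-3") would need extra handling.
func siblingsOf(cpu int) (map[int]bool, error) {
	path := fmt.Sprintf("/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", cpu)
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	siblings := map[int]bool{}
	for _, field := range strings.Split(strings.TrimSpace(string(data)), ",") {
		n, err := strconv.Atoi(field)
		if err != nil {
			return nil, err
		}
		siblings[n] = true
	}
	return siblings, nil
}

// filterControlSiblings drops the control thread's CPU and its siblings
// from the set of CPUs handed to oslat's --cpu-list.
func filterControlSiblings(cpus []int, mainCPU int) ([]int, error) {
	siblings, err := siblingsOf(mainCPU) // includes mainCPU itself
	if err != nil {
		return nil, err
	}
	out := make([]int, 0, len(cpus))
	for _, c := range cpus {
		if !siblings[c] {
			out = append(out, c)
		}
	}
	return out, nil
}

func main() {
	// Example matching the verification below: pod CPU set {2,3,4,42,43,44},
	// control thread on CPU 2 whose sibling is CPU 42.
	testCPUs, err := filterControlSiblings([]int{2, 3, 4, 42, 43, 44}, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(testCPUs) // [3 4 43 44] -> --cpu-list 3-4,43-44
}
```

Starting with only the control thread's siblings, as agreed, keeps the change minimal; filtering siblings among the measurement CPUs as well would be the fuller fix for the false spikes described above.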
Verification:

cnf-tests: registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.12.0-60

The machine has HT enabled; as can be seen below, CPU 0 has the sibling 40:
```
sh-4.4# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,40
```
Also from lscpu:
```
sh-4.4# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                80
On-line CPU(s) list:   0-79
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          2
...
```
(CPU(s) = Socket(s) x Core(s) per socket x Thread(s) per core = 2 x 20 x 2 = 80.)

Configure the PP as follows:
```
spec:
  cpu:
    isolated: "2-39,42-79"
    reserved: "0,1,40,41"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```

Run oslat as follows, requesting 6 CPUs:
```
[root@registry ~]# podman run --net=host -v /home/kni/clusterconfigs/auth:/kc:z \
  -e KUBECONFIG=/kc/kubeconfig \
  -e IMAGE_REGISTRY=registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/ \
  -e CNF_TESTS_IMAGE=openshift4-cnf-tests:v4.12.0-60 \
  -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker \
  -e LATENCY_TEST_RUNTIME=100 -e LATENCY_TEST_CPUS=6 -e MAXIMUM_LATENCY=10000000 \
  registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/openshift4-cnf-tests:v4.12.0-60 \
  /usr/bin/test-run.sh -ginkgo.focus="oslat"
```

Observe the pod logs:
```
[root@registry ~]# oc logs oslat-rdr77
I1122 09:53:32.779769 1 main.go:35] oslat main thread cpu: %d2
I1122 09:53:32.780267 1 main.go:41] oslat main thread's cpu siblings: %v[2 42]
I1122 09:53:32.780331 1 node.go:39] Environment information: /proc/cmdline: BOOT_IMAGE=(hd1,gpt3)/ostree/rhcos-b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/vmlinuz-4.18.0-425.3.1.rt7.213.el8.x86_64 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/0 ip=dhcp root=UUID=10f5e35f-f897-41e5-ab22-a4e811b9d4eb rw rootflags=prjquota boot=UUID=97cd8797-d71c-4b73-a9f3-5b8f41e0a473
I1122 09:53:32.780349 1 node.go:46] Environment information: kernel version 4.18.0-425.3.1.rt7.213.el8.x86_64
I1122 09:53:32.780375 1 main.go:73] running oslat command with arguments [--duration 100 --rtprio 1 --cpu-list 3-4,43-44 --cpu-main-thread 2]
```
(The stray "%d" and "%v" in the first two lines appear to be format-verb artifacts in the runner's log statements; the logged values are 2 and [2 42].)

CPU 2 is the main CPU that runs the test's control thread, and its sibling, CPU 42, is excluded and left idle; --cpu-list contains 4 CPUs (3-4,43-44), so in total 6 CPUs are accounted for, matching LATENCY_TEST_CPUS=6.
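The accounting in that last paragraph can be written as a small worked check (purely illustrative; the numbers are taken from the log above):

```go
package main

import "fmt"

func main() {
	requested := 6        // LATENCY_TEST_CPUS
	controlThreads := 1   // CPU 2, oslat's main/control thread
	excludedSiblings := 1 // CPU 42, sibling of CPU 2, left idle
	measurementCPUs := 4  // --cpu-list 3-4,43-44

	// All requested CPUs must be accounted for: control + idle siblings + test.
	fmt.Println(controlThreads+excludedSiblings+measurementCPUs == requested) // true
}
```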
OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary