Bug 2051443
| Summary: | OSLAT runner uses both sibling threads causing latency spikes | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Martin Sivák <msivak> |
| Component: | CNF Platform Validation | Assignee: | Talor Itzhak <titzhak> |
| Status: | CLOSED WONTFIX | QA Contact: | Dwaine Gonyier <dgonyier> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.9 | CC: | aos-bugs, dgonyier, kquinn, shajmakh, stevsmit, titzhak |
| Target Milestone: | --- | ||
| Target Release: | 4.12.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | * Previously, the `oslat` runner configured `oslat` to use all available CPUs, which caused false spikes. With this update, the `oslat` runner reserves one CPU for the control thread. As a result, false spikes no longer occur. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2051443[*BZ#2051443*]) | | |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-04-30 18:04:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 2055267 | ||
Description
Martin Sivák
2022-02-07 09:21:57 UTC
After further discussion, we should start by filtering out the sibling of the control thread.
Verification:
cnf-tests: registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.12.0-60
The machine has Hyper-Threading (HT) enabled. For example, CPU 0 has CPU 40 as its sibling:
sh-4.4# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0,40
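The sibling lookup above can be scripted. The following is a minimal sketch, not the cnf-tests implementation: the helper names `parse_cpu_list` and `thread_siblings` are hypothetical, and it assumes the file uses the usual kernel cpulist notation (comma-separated ids and `a-b` ranges), as in the `0,40` output shown.

```python
from __future__ import annotations

from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a cpulist string such as "0,40" or "2-39,42-79" into sorted CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            start, end = part.split("-", 1)
            cpus.update(range(int(start), int(end) + 1))  # ranges are inclusive
        else:
            cpus.add(int(part))
    return sorted(cpus)


def thread_siblings(cpu: int, sysfs_root: str = "/sys/devices/system/cpu") -> list[int]:
    """Read the hardware threads sharing a core with `cpu` (the result includes `cpu` itself)."""
    path = Path(sysfs_root) / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())
```

On the machine described here, `thread_siblings(0)` would be expected to return `[0, 40]`.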
The lscpu output also shows two threads per core:
sh-4.4# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
...
CPU(s) = Socket(s) * Core(s) per socket * Thread(s) per core = 2 * 20 * 2 = 80
Configure the performance profile (PP) as follows:
```
spec:
  cpu:
    isolated: "2-39,42-80"
    reserved: "0,1,40,41"
  realTimeKernel:
    enabled: true
  nodeSelector:
    node-role.kubernetes.io/worker: ""
```
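A quick sanity check of the `isolated` and `reserved` sets against the on-line CPU list can be sketched as follows. This is an illustration only: `expand` and `check_profile` are hypothetical helpers and not part of any OpenShift tooling. Taking the values exactly as posted, the check reports CPU 80 as outside the on-line range `0-79` that lscpu shows. The report does not say whether that was a typo or was accepted by the cluster.

```python
from __future__ import annotations


def expand(cpuset: str) -> set[int]:
    """Expand a cpuset string such as "2-39,42-80" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in cpuset.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))  # single id when there is no "-"
    return cpus


def check_profile(online: str, isolated: str, reserved: str) -> dict[str, list[int]]:
    """Report CPUs that are in both sets, not on-line, or in neither set."""
    on, iso, res = expand(online), expand(isolated), expand(reserved)
    return {
        "overlap": sorted(iso & res),            # CPUs listed as both isolated and reserved
        "not_online": sorted((iso | res) - on),  # CPUs the node does not report as on-line
        "unassigned": sorted(on - (iso | res)),  # on-line CPUs missing from both sets
    }


if __name__ == "__main__":
    # Values copied from the lscpu output and the profile in this report.
    print(check_profile("0-79", "2-39,42-80", "0,1,40,41"))
```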
Run oslat as follows, requesting 6 CPUs:
[root@registry ~]# podman run --net=host -v /home/kni/clusterconfigs/auth:/kc:z -e KUBECONFIG=/kc/kubeconfig -e IMAGE_REGISTRY=registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/ -e CNF_TESTS_IMAGE=openshift4-cnf-tests:v4.12.0-60 -e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker -e LATENCY_TEST_RUNTIME=100 -e LATENCY_TEST_CPUS=6 -e MAXIMUM_LATENCY=10000000 registry.hlxcl6.lab.eng.tlv2.redhat.com:5000/openshift4-cnf-tests:v4.12.0-60 /usr/bin/test-run.sh -ginkgo.focus="oslat"
Observe the pod logs:
[root@registry ~]# oc logs oslat-rdr77
I1122 09:53:32.779769 1 main.go:35] oslat main thread cpu: %d2
I1122 09:53:32.780267 1 main.go:41] oslat main thread's cpu siblings: %v[2 42]
I1122 09:53:32.780331 1 node.go:39] Environment information: /proc/cmdline: BOOT_IMAGE=(hd1,gpt3)/ostree/rhcos-b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/vmlinuz-4.18.0-425.3.1.rt7.213.el8.x86_64 ignition.platform.id=metal ostree=/ostree/boot.0/rhcos/b594aea28251da3b472da2adba0a57d5fcf82c28c87897a88eb26e6db542b18b/0 ip=dhcp root=UUID=10f5e35f-f897-41e5-ab22-a4e811b9d4eb rw rootflags=prjquota boot=UUID=97cd8797-d71c-4b73-a9f3-5b8f41e0a473
I1122 09:53:32.780349 1 node.go:46] Environment information: kernel version 4.18.0-425.3.1.rt7.213.el8.x86_64
I1122 09:53:32.780375 1 main.go:73] running oslat command with arguments [--duration 100 --rtprio 1 --cpu-list 3-4,43-44 --cpu-main-thread 2]
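CPU 2 is the main-thread CPU for the test, and its sibling is CPU 42, which is left out of --cpu-list.
--cpu-list contains 4 CPUs, so together with CPU 2 and its sibling CPU 42 all 6 requested CPUs are accounted for.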
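The selection seen in the log can be reproduced with a short sketch. It is not the runner's actual code (the runner is a Go program, per the `main.go` log lines). It rests on three assumptions: the pod was allocated CPUs 2, 3, 4, 42, 43 and 44 (inferred from the log), the lowest allocated CPU becomes the main thread, and sibling ids differ by 40 (inferred from the pairs 0/40 and 2/42). All function names are hypothetical. The `oslat` flags are the ones that appear in the log.

```python
from __future__ import annotations

from typing import Callable, Iterable


def format_cpu_list(cpus: Iterable[int]) -> str:
    """Compress CPU ids into range notation, e.g. [3, 4, 43, 44] -> "3-4,43-44"."""
    ids = sorted(set(cpus))
    parts: list[str] = []
    i = 0
    while i < len(ids):
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1  # extend the run of consecutive ids
        parts.append(str(ids[i]) if i == j else f"{ids[i]}-{ids[j]}")
        i = j + 1
    return ",".join(parts)


def ht_siblings(cpu: int, offset: int = 40) -> list[int]:
    """Assumed topology: threads of one core are `offset` apart (0/40, 2/42)."""
    base = cpu % offset
    return [base, base + offset]


def build_oslat_args(
    allowed: Iterable[int],
    siblings_of: Callable[[int], list[int]],
    duration: int,
    rtprio: int = 1,
) -> list[str]:
    """Pick a main-thread CPU and keep its sibling out of the measurement list."""
    cpus = sorted(set(allowed))
    if not cpus:
        raise ValueError("no CPUs allocated")
    main_cpu = cpus[0]  # assumption: lowest allocated CPU runs the main thread
    excluded = set(siblings_of(main_cpu)) | {main_cpu}
    workers = [c for c in cpus if c not in excluded]
    if not workers:
        raise ValueError("no CPUs left for measurement threads")
    return [
        "--duration", str(duration),
        "--rtprio", str(rtprio),
        "--cpu-list", format_cpu_list(workers),
        "--cpu-main-thread", str(main_cpu),
    ]


if __name__ == "__main__":
    print(build_oslat_args([2, 3, 4, 42, 43, 44], ht_siblings, duration=100))
```

With these inputs the sketch yields `--cpu-list 3-4,43-44` and `--cpu-main-thread 2`, matching the arguments in the pod log.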
OCP is no longer using Bugzilla and this bug appears to have been left in an orphaned state. If the bug is still relevant, please open a new issue in the OCPBUGS Jira project: https://issues.redhat.com/projects/OCPBUGS/summary