Bug 1775834
Summary:          High cyclictest results for RHCOS on OCP
Product:          Red Hat Enterprise Linux 8
Component:        kernel-rt
Sub component:    Core-Kernel
Version:          8.1
Status:           CLOSED NOTABUG
Reporter:         Andrew Theurer <atheurer>
Assignee:         Marcelo Tosatti <mtosatti>
QA Contact:       Kernel Realtime QE <rt-qe>
CC:               augol, bhu, dblack, dshaks, fsimonce, jianzzha, lcapitulino, lgoncalv, mtosatti, peterx, pmyers, qzhao, williams, yquinn
Flags:            jianzzha: needinfo-
Severity:         unspecified
Priority:         unspecified
Target Milestone: rc
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         If docs needed, set a value
Type:             Bug
Regression:       ---
Last Closed:      2020-01-21 13:58:04 UTC
Bug Blocks:       1771572, 1803965
Description    Andrew Theurer    2019-11-22 22:23:43 UTC
Comment 2 (Marcelo Tosatti):

Hi Andrew,

The process which seems to be running, and should not, is ksoftirqd. It should not be running because of:

echo "1" > /sys/kernel/ktimer_lockless_check

So if it is running, there must be a timer on that CPU.

Please add "timer_expire_entry timer_expire_exit" to the

echo "smp_apic_timer_interrupt mod_timer mod_timer_pending add_timer_on add_timer" > set_ftrace_filter

line.

Another thing: please run cyclictest with -b X --tracemark, where X is a number of microseconds to fail; --tracemark enables a trace entry to be written. That way, if cyclictest takes more than X microseconds, it stops after writing a trace entry, which makes it easier to pinpoint where the problem is. You want to set X lower than the maximum you are getting. For example, if your maximum is 35us, use -b 30 to catch the lowest-hanging fruit.

Also, please switch from

echo sched_switch > set_event

to

echo "sys_enter_nanosleep sys_exit_nanosleep sched_switch" > set_event

This will give us additional information. Thanks.

Comment 3 (Marcelo Tosatti):

(All the changes in comment #2 are still valid; please perform them.)

Forgot to mention that ktimer_lockless_check=1 works only with isolated CPUs (as in the isolcpus= kernel command line). We can easily change that to include a runtime switch, though. Can you please confirm, just for testing's sake, whether isolcpus= (together with ktimer_lockless_check=1) allows ksoftirqd not to wake up? Please grab a new trace with this and the above changes.

Comment 4 (Marcelo Tosatti):

Nothing in the traces could explain 40us for cyclictest on bare metal (apparently there are no old-style add_timer timers running, otherwise the add_timer/__mod_timer function tracepoints would have caught them). There are no tasks other than ktimersoftd and cyclictest running.

That said, has hwlatdetect been executed on this machine?

(In reply to Marcelo Tosatti from comment #4)
> That said, has hwlatdetect been executed on this machine?

Not yet, but I agree we should do this next.
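Putting Marcelo's tracing instructions from comment #2 together, the whole capture sequence would look roughly like the sketch below. The tracefs path, the function tracer selection, and the -m/-p99/-b 30 cyclictest options are illustrative assumptions, not values from this bug; note also that timer_expire_entry/timer_expire_exit are normally exposed as timer:* trace events, so they may belong in set_event rather than set_ftrace_filter:

# cd /sys/kernel/debug/tracing
# echo "smp_apic_timer_interrupt mod_timer mod_timer_pending add_timer_on add_timer" > set_ftrace_filter
# echo function > current_tracer
# echo "sys_enter_nanosleep sys_exit_nanosleep sched_switch timer:timer_expire_entry timer:timer_expire_exit" > set_event
# echo 1 > /sys/kernel/ktimer_lockless_check   # only effective with isolcpus= (see comment 3 above)
# cyclictest -m -p99 -b 30 --tracemark         # stop and leave a trace mark past 30us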
Comment 6 (jianzzha):

On the machine with k8s and RHEL 8, which has the better cyclictest result:

[root@cyclictest /]# hwlatdetect --threshold=1
hwlatdetect:  test duration 120 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0
Samples exceeding threshold: 0

On the machine with OCP and RHCOS, which has the worse cyclictest result:

[root@worker0 yum.repos.d]# hwlatdetect --threshold 1
hwlatdetect:  test duration 120 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: 16us
Samples recorded: 9
Samples exceeding threshold: 9
ts: 1575551831.138503004, inner:16, outer:16
ts: 1575551833.150908799, inner:0, outer:7
ts: 1575551837.182918603, inner:0, outer:15
ts: 1575551840.206894031, inner:6, outer:0
ts: 1575551856.342915291, inner:13, outer:0
ts: 1575551899.686891703, inner:7, outer:0
ts: 1575551915.814925120, inner:11, outer:0
ts: 1575551941.014890798, inner:8, outer:0
ts: 1575551944.038903896, inner:7, outer:0

Is there any tuning required at all to run hwlatdetect?

(In reply to jianzzha from comment #6)
> Max Latency: Below threshold
> Samples recorded: 0
> Samples exceeding threshold: 0

That is fine. What is the cyclictest result from within the container in this case?

> is there any tuning required at all to run the hwlatdetect?

Yes, you should consult the manufacturer's manual for details. For example, for Dell:
https://www.dell.com/downloads/global/products/pedge/en/configuring_dell_powerEdge_servers_for_low_latency_12132010_final.pdf
and for HP servers:
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01804533&lang=en-us&cc=us

Comment 8 (jianzzha):

I didn't find any difference between the good RHEL 8 machine and the bad CoreOS machine, so I decided to install CoreOS on the good machine. Once CoreOS was installed, hwlatdetect did badly.
Here is the CoreOS result from the originally "good" (RHEL) machine:

# hwlatdetect --thresh 1
hwlatdetect:  test duration 120 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: 12us
Samples recorded: 9
Samples exceeding threshold: 9
ts: 1575640721.282947371, inner:6, outer:0
ts: 1575640730.353945716, inner:0, outer:5
ts: 1575640744.465945319, inner:0, outer:5
ts: 1575640757.570946521, inner:12, outer:0
ts: 1575640758.578946264, inner:0, outer:5
ts: 1575640759.585947521, inner:6, outer:0
ts: 1575640777.729944667, inner:0, outer:6
ts: 1575640780.754944737, inner:8, outer:0
ts: 1575640793.857946790, inner:0, outer:8

Does hwlatdetect have anything to do with the kernel?

(In reply to jianzzha from comment #8)
> Once coreos is installed, the hwlatdetect does badly.

What is the kernel on CoreOS? It could be that, depending on the hardware configuration performed by the (different) kernels, SMIs are enabled or disabled. Clark and Luis know better, though. Did you follow the recommendations of the manufacturer? (Which machine is this again?)

Clark Williams:

Jianzhu, can you provide a dump of the BIOS settings? Was hwlatdetect run without a container on RHEL but in a container with RHCOS?

I remember seeing some hwlatdetect latencies a while back that were caused by C-state transitions. We try to mitigate that by opening /dev/cpu_dma_latency and holding it open for the duration of the hwlatdetect run, which should prevent C-state transitions. But if that's not working properly (i.e. if RHCOS doesn't provide /dev/cpu_dma_latency?), then the spike could be that the measurement core went idle, transitioned to a deeper C-state, and then took time to come back out of it.

Created attachment 1642666 [details]
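For reference, the mitigation Clark describes works because a cpu_dma_latency PM QoS request only stays active while the file is held open. A minimal shell sketch of the same trick, assuming bash and root (the fd number and the 4-byte binary zero, meaning "0us tolerated", are illustrative):

# exec 3<> /dev/cpu_dma_latency     # open and hold a fd on the QoS interface
# printf '\x00\x00\x00\x00' >&3     # request 0us wakeup latency: blocks deep C-states
# ...run hwlatdetect or cyclictest here...
# exec 3>&-                         # closing the fd drops the request again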
Dell R740xd BIOS settings
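On a Dell machine like this one, a BIOS dump such as the attachment can typically be collected out-of-band through iDRAC. A sketch, assuming racadm access (the exact attribute group names vary by server generation):

# racadm get BIOS.SysProfileSettings   # SysProfile, ProcPwrPerf, ProcCStates, ProcC1E, ...
# racadm get BIOS.ProcSettings         # other processor-related attributes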
Comment 13 (jianzzha):

@Clark the /dev/cpu_dma_latency point is a good one. It is also required for cyclictest, correct?

Comment 14 (Marcelo Tosatti):

(In reply to jianzzha from comment #13)
> It is also required for cyclictest, correct?

Yes, cyclictest opens it. You can confirm whether the machine is entering deep C-states with the command:

# cpupower idle-info

(In reply to Marcelo Tosatti from comment #14)
> You can confirm whether the machine is entering deep C-states

But deep sleep states can't be the cause, since the kernel command line contains intel_idle.max_cstate=1, correct?

Comment 16 (Andrew Theurer):

Clark, Marcelo, does it have to be an rt-kernel to run hwlatdetect properly?

I don't know if Jianzhu has all of the tuning on this particular host where we went from RHEL to RHCOS, so we should verify what the settings are (and what the kernel is).

Also, I thought hwlatdetect ran on a CPU 100% of the time, with no idle, until it moved to another CPU(?). If so, the C-state should not matter.

Things I saw in the BIOS that make me believe it is configured OK:

EnergyPerformanceBias:    MaxPower
ProcTurboMode:            Disabled
ProcCStates:              Disabled
NodeInterleave:           Disabled
MemPatrolScrub:           Disabled
OsWatchdogTimer:          Disabled
ProcC1E:                  Disabled
CpuInterconnectBusSpeed:  MaxDataRate
SysProfile:               Custom
ProcPwrPerf:              MaxPerf
ControlledTurbo:          Disabled

Comment 18 (jianzzha):

I changed one server back to RHEL 8 to ease tuning and testing. Here is what I found so far:

1) The rt-kernel does matter: hwlatdetect with the rt-kernel has a better result than with the non-rt kernel.
2) The tuned profile matters too. Using the cpu-partitioning profile and running hwlatdetect in a container on isolated CPUs gives us the best result.

So I guess the hwlatdetect result is impacted by software, which makes me wonder about the value of hwlatdetect if it can be influenced by software.

(In reply to jianzzha from comment #18)
> This makes me wonder about the value of hwlatdetect - if it is impacted by software.

Jianzzha,

https://www.kernel.org/doc/html/latest/trace/hwlat_detector.html

Software might modify whether or not the BIOS triggers SMIs. What cyclictest numbers do you get with kernel-rt?
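For the RHEL 8 side of the comparison, the cpu-partitioning setup Jianzhu mentions in comment #18 is applied roughly as follows; the 4-23 core range here is an assumption for illustration:

# yum install -y tuned-profiles-cpu-partitioning
# echo "isolated_cores=4-23" > /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning
# systemctl reboot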
(In reply to Andrew Theurer from comment #16)
> does it have to be an rt-kernel to run hwlatdetect properly?

It shouldn't matter which kernel is hosting a hwlatdetect run. The thread runs at FIFO:99 with interrupts off for its specified window, so theoretically the only ways the CPU can be interrupted are an SMI or an NMI.

OK, so either we have a problem with hwlatdetect not giving the same output with different kernels and/or tuning, or we have inconsistency in the test, perhaps because of the test length. Jianzhu, can you run on both RHEL and RHCOS builds, on the exact same system (not just the same model on different systems, but the exact same host), but this time run hwlatdetect for 1 hour each? Hopefully that will eliminate any possibility of inconsistent test conditions (if there are SMIs, our confidence is high that they will always happen within an hour).

Comment 22 (Marcelo Tosatti):

The SMI counter does not increase on all processors. Do we know how reliable this counter is? The Intel documentation is short: "Counter of SMI events" in the MSR section of the four volumes, and nothing about it in the SMI section of the volumes.

Comment 23 (Clark Williams):

(In reply to Marcelo Tosatti from comment #22)
> Do we know how reliable this counter is?

As reliable as any hardware we work with :)

It's pretty straightforward, since SMIs go through a single handler and are just the mechanism to transition to SMM (System Management Mode) and run system firmware (i.e. BIOS code).

I'd say we *don't* have an SMI problem, and that means we should be able to see something in an ftrace traceback. Have we run cyclictest with the --breaktrace option? The idea is to launch cyclictest from trace-cmd and pass --breaktrace=<usec> to it, causing it to write a trace marker to the ftrace buffer and stop active tracing. When it stops, we use trace-cmd to create a trace.dat file, and then we start digging for the tracemark and work back from that.

Comment 24 (Marcelo Tosatti):

I suspect Non-Maskable Interrupts from the hardlockup detector still execute while interrupts are disabled.

Jianzhu will add "nosoftlockup nowatchdog" to confirm the hypothesis.
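A concrete form of Clark's suggestion might look like the following; the event list, the -m/-p99 options, and the 30us threshold are illustrative assumptions, not values from this bug:

# trace-cmd record -e sched -e timer -e irq \
    cyclictest -m -p99 -b 30 --tracemark
# trace-cmd report | tail -60

Since cyclictest stops active tracing when it writes the marker, the events immediately preceding the end of trace.dat are the ones surrounding the latency hit.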
Marcelo Tosatti:

(In reply to Marcelo Tosatti from comment #24)
> Jianzhu will add "nosoftlockup nowatchdog" to confirm the hypothesis.

Jianzhu confirmed this was the problem. He will perform 24h tests to confirm the cyclictest values. hwlatdetect reports no hits with threshold=1us.

jianzzha:

With nosoftlockup in the CoreOS kargs, hwlatdetect looks good:

[root@worker1 core]# hwlatdetect --duration 1d --threshold 1
hwlatdetect:  test duration 86400 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0
Samples exceeding threshold: 0
SMIs during run: 0

nosoftlockup was missing and was the cause of the bad hwlatdetect numbers on the fresh CoreOS install. We now need to continue investigating the bad cyclictest result on a tuned CoreOS (with all the tuning, including nosoftlockup).

Comment 27 (Luiz Capitulino):

Very well done, guys! I think we've solved two problems:

1. You found the reason why hwlatdetect requires tuning
2. You confirmed we don't have SMIs on that machine

Jianzhu, what's the current cyclictest result? Also, how are you applying the tuning? Since I guess you're not able to run tuned in CoreOS, is that correct?

Very nice on the nosoftlockup. Hopefully we can add that to the hwlatdetect documentation, and have it output a warning if it is not present. However, it does look like we were already using nosoftlockup in the tuning for cyclictest. Just to be certain, we should run cyclictest again with this tuning. If we still get high max latencies, then I suspect we will need to increase the set of things we trace, or perhaps trace all kernel function entry/exit.

jianzzha:

For the sake of simplicity, we used isolcpus for CoreOS and ran cyclictest on bare metal (not from a container), on the isolated CPUs. The result is still no good; it is very similar to what Andrew got. Further tracing is needed.

Comment 30 (jianzzha):

@lcapitulino, we need to get this ball rolling? What info is needed from our end?
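On RHCOS the kernel arguments live in the ostree deployment rather than in a grub config you edit directly, so for a single test node the missing flags can be added with rpm-ostree. A sketch, assuming direct host access (on a managed OCP cluster this would normally be done through a MachineConfig instead):

# rpm-ostree kargs --append=nosoftlockup --append=nowatchdog
# systemctl reboot
# cat /proc/cmdline   # confirm both flags survived the reboot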
(In reply to Luiz Capitulino from comment #27)
> What's the current cyclictest result? Also, how are you applying the tuning?
> Since I guess you're not able to run tuned in CoreOS, is that correct?

It is 30+ us on bare-metal CoreOS (no container). isolcpus is used, and cyclictest runs on the isolated cores (with no other user processes sharing the same CPUs).

Comment 32 (Luiz Capitulino):

(In reply to jianzzha from comment #30)
> @lcapitulino, we need to get this ball rolling? What info is needed from our end?

I thought you and Marcelo were still debugging this issue. Are you, Marcelo?

Comment 33 (Marcelo Tosatti):

Yes. However:

1) It's not clear what the interface for the k8s user should be (in the container description). IMHO we should not hurry to merge an interface which has not been validated to work with the use cases in question (it's not clear what the use cases are in the first place, from the customer interested in using this).

2) There are discussions in Red Hat and in k8s upstream about these interfaces.

3) The worker node is down, so I will continue debugging with podman.

Marcelo Tosatti:

(In reply to Marcelo Tosatti from comment #33)
> 2) There are discussions in Red Hat and in k8s upstream about these interfaces.
> 3) The worker node is down, so I will continue debugging with podman.

So for example, for O-RAN (4G/5G L1 processing, including HARQ ACK processing), very aggressive options are used:

https://docs.o-ran-sc.org/projects/o-ran-sc-pti-rtp/en/latest/installation-guide.html#deploy-cmk-cpu-manager-for-kubernetes

# In this example, cores 1-16 are isolated for real-time processes
root@intel-x86-64:~# rt_tuning="crashkernel=auto biosdevname=0 iommu=pt usbcore.autosuspend=-1 nmi_watchdog=0 softlockup_panic=0 intel_iommu=on cgroup_enable=memory skew_tick=1 hugepagesz=1G hugepages=4 default_hugepagesz=1G isolcpus=1-16 rcu_nocbs=1-16 kthread_cpus=0 irqaffinity=0 nohz=on nohz_full=1-16 intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable nosoftlockup idle=poll mce=ignore_ce"

Some examples of interface requirements:

-> Requirement from a user application: https://www.youtube.com/watch?v=7DuKzaBaRwY (minute 13:00 and forward).
-> https://www.youtube.com/watch?v=QbfjlNfLGxM (the requirements should probably be specified at the "Call to NFV orchestration" level, minute 9:50).

Comment 35 (jianzzha):

This BZ doesn't appear to have much to do with k8s or even containers, as we have observed similarly bad cyclictest results even with isolcpus in place and cyclictest running on the isolated CPUs. Can we solve the problem on the bare-metal machine first?

Marcelo Tosatti:

(In reply to jianzzha from comment #35)
> Can we solve the problem on the baremetal machine first?

Hi Jianzhu,

The problem is that in production (which is where this interface is supposed to be used), different latency requirements will exist. In the initial request from Andrew, he mentions:

"Desired result: Max latencies in the low-20 usecs"

But given the different use cases (I am building a list of them in KVM-RT's wiki and will post it to kvm-rt@ as soon as it's finished), and the fact that customers want automation, I would rather build the whole picture. But OK, this work can be done elsewhere, if that is what is preferred.
Can you please bring the worker node up so I can check which timer it is, so that the low-20-usec latency can be achieved?

> Can you please bring the worker node up so I can check which timer it is, so
> that the low-20-usec latency can be achieved?
[root@e25-h23-740xd ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master0 Ready master 18d v1.16.2
master1 Ready master 18d v1.16.2
master2 Ready master 18d v1.16.2
worker1 NotReady worker,worker-rt 18d v1.16.2
Comment 38 (Marcelo Tosatti):

Using SCHED_FIFO priority 99, with -p99, should fix the timer problem. Jianzhu, can you please check the latency values with that option? TIA

(In reply to Marcelo Tosatti from comment #38)
> Using SCHED_FIFO priority 99, with -p99, should fix the timer problem.

We have been using -p 99 in the cyclictest runs. This is how we test for 5 minutes on CoreOS bare metal (no container involved):

cyclictest -D 5m -p 99 -t 4 -a 10,4,6,8 -h 30 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 2.61 2.02 2.48 1/1979 3296755

T: 0 (3295845) P:99 I:1000 C: 299996 Min: 2 Act: 2 Avg: 3 Max: 26
T: 1 (3295846) P:99 I:1000 C: 299996 Min: 2 Act: 3 Avg: 2 Max: 49
T: 2 (3295847) P:99 I:1000 C: 299995 Min: 2 Act: 3 Avg: 3 Max: 32
T: 3 (3295848) P:99 I:1000 C: 299995 Min: 2 Act: 3 Avg: 3 Max: 31

# Histogram
000000 000000 000000 000000 000000
000001 000000 000000 000000 000000
000002 005034 008376 000142 003936
000003 292207 288920 296385 293464
000004 002080 002000 002447 001802
000005 000243 000228 000345 000322
000006 000085 000093 000131 000113
000007 000067 000068 000101 000056
000008 000037 000051 000081 000039
000009 000029 000028 000073 000029
000010 000026 000022 000050 000030
000011 000013 000026 000036 000018
000012 000022 000023 000022 000023
000013 000018 000014 000017 000015
000014 000023 000024 000023 000013
000015 000009 000010 000022 000009
000016 000015 000012 000014 000013
000017 000013 000014 000015 000008
000018 000011 000019 000016 000000
000019 000010 000018 000018 000009
000020 000015 000010 000012 000022
000021 000012 000015 000019 000045
000022 000015 000016 000015 000006
000023 000010 000008 000011 000017
000024 000004 000003 000002 000009
000025 000001 000000 000001 000000
000026 000001 000000 000001 000000
000027 000000 000001 000000 000000
000028 000000 000000 000000 000000
000029 000000 000000 000000 000000
# Total: 000300000 000299999 000299999 000299998
# Min Latencies: 00002 00002 00002 00002
# Avg Latencies: 00003 00002 00003 00003
# Max Latencies: 00026 00049 00032 00031
# Histogram Overflows: 00000 00001 00001 00001
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1: 194177
# Thread 2: 194177
# Thread 3: 194205

There is no other workload on these processors, and isolated CPUs are used:

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-464d92a710185aa15005e946175a6900b0bd16bbefc60c9e71e19cc5f513b675/vmlinuz-4.18.0-147.8.rt13.1.el8locklesspendingtimermap4.x86_64 root=/dev/disk/by-label/root rootflags=defaults,prjquota rw console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/464d92a710185aa15005e946175a6900b0bd16bbefc60c9e71e19cc5f513b675/0 intel_iommu=on iommu=pt default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_pstate=disable skew_tick=1 nohz=on nohz_full=1,3-31 rcu_nocbs=1,3-31 non_iso_cpumask=00000005 nosoftlockup intel_idle.max_cstate=1 transparent_hugepage=never numa_balancing=disable mitigations=off nosmt tsc=nowatchdog nowatchdog isolcpus=1,3-31

Marcelo, this CoreOS worker node is back online, so feel free to continue the debugging. The access info is the same as before.
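Before a run like this, it is worth sanity-checking that the isolation from the command line actually took effect. A short check, assuming the cmdline above (the expected values are simply what that cmdline should produce):

# cat /sys/devices/system/cpu/isolated    # expect: 1,3-31
# cat /sys/devices/system/cpu/nohz_full   # expect: 1,3-31
# cat /proc/sys/kernel/watchdog           # expect: 0 (from nowatchdog)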
A quick 5-minute run from inside an OCP container:

# Min Latencies: 00003 00003 00003 00003
# Avg Latencies: 00003 00003 00003 00003
# Max Latencies: 00008 00007 00006 00007

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-c40bc0eface9378448e03d886391fcf4087a966617700dc3efe2653b14f4e6cc/vmlinuz-4.18.0-171.rt13.28.el8nolpoll.x86_64 root=/dev/disk/by-label/root rootflags=defaults,prjquota rw console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/c40bc0eface9378448e03d886391fcf4087a966617700dc3efe2653b14f4e6cc/0 nosmt intel_iommu=on iommu=pt default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_pstate=disable skew_tick=1 nohz=on nohz_full=4-23 rcu_nocbs=4-23 nosoftlockup nosoftlockup isolcpus=4-23 tsc=nowatchdog