Bug 1775834
Summary:          High cyclictest results for RHCOS on OCP
Product:          Red Hat Enterprise Linux 8
Component:        kernel-rt
Sub component:    Core-Kernel
Version:          8.1
Status:           CLOSED NOTABUG
Reporter:         Andrew Theurer <atheurer>
Assignee:         Marcelo Tosatti <mtosatti>
QA Contact:       Kernel Realtime QE <rt-qe>
CC:               augol, bhu, dblack, dshaks, fsimonce, jianzzha, lcapitulino, lgoncalv, mtosatti, peterx, pmyers, qzhao, williams, yquinn
Flags:            jianzzha: needinfo-
Severity:         unspecified
Priority:         unspecified
Target Milestone: rc
Target Release:   ---
Hardware:         Unspecified
OS:               Unspecified
Doc Type:         If docs needed, set a value
Type:             Bug
Regression:       ---
Last Closed:      2020-01-21 13:58:04 UTC
Bug Blocks:       1771572, 1803965
Description    Andrew Theurer    2019-11-22 22:23:43 UTC
Comment 2 (Marcelo Tosatti):

Hi Andrew,

The process which seems to be running, and should not, is ksoftirqd. It should not be running because of:

echo "1" > /sys/kernel/ktimer_lockless_check

So if it is running, there must be a timer on that CPU.

Please add "timer_expire_entry timer_expire_exit" to the

echo "smp_apic_timer_interrupt mod_timer mod_timer_pending add_timer_on add_timer" > set_ftrace_filter

line.

Another thing: please run cyclictest with -b X --tracemark, where X is a number of microseconds to fail; --tracemark enables a trace entry to be written. That way, if cyclictest takes more than X microseconds, it stops after writing a trace entry, which makes it easier to pinpoint where the problem is. You want to set X lower than the maximum you are getting. For example, if your maximum is 35us, use -b 30 to catch the lowest-hanging fruit.

Also, please switch from

echo sched_switch > set_event

to

echo "sys_enter_nanosleep sys_exit_nanosleep sched_switch" > set_event

This will give us additional information. Thanks.

Comment 3 (Marcelo Tosatti):

(All the changes in comment #2 are still valid; please perform them.)

Forgot to mention that ktimer_lockless_check=1 works only with isolated CPUs (as in the isolcpus= kernel command line). We can easily change that to include a runtime switch, though. Can you please confirm, just for testing's sake, whether isolcpus= (together with ktimer_lockless_check=1) allows ksoftirqd not to wake up? Please grab a new trace with this and the above changes.

Comment 4 (Marcelo Tosatti):

Nothing in the traces could explain 40us for cyclictest on bare metal (apparently there are no old-style add_timer timers running, otherwise the add_timer/__mod_timer function tracepoints would have caught them). There are no tasks other than ktimersoftd and cyclictest running.

That said, has hwlatdetect been executed on this machine?

(In reply to Marcelo Tosatti from comment #4)
> That said, has hwlatdetect been executed on this machine?

Not yet, but I agree we should do this next.
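Putting Marcelo's tracing instructions from comment #2 together, the whole capture sequence would look roughly like the sketch below. The tracefs path, the function tracer selection, and the -m/-p99/-b 30 cyclictest options are illustrative assumptions, not values from this bug; note also that timer_expire_entry/timer_expire_exit are normally exposed as timer:* trace events, so they may belong in set_event rather than set_ftrace_filter:

# cd /sys/kernel/debug/tracing
# echo "smp_apic_timer_interrupt mod_timer mod_timer_pending add_timer_on add_timer" > set_ftrace_filter
# echo function > current_tracer
# echo "sys_enter_nanosleep sys_exit_nanosleep sched_switch timer:timer_expire_entry timer:timer_expire_exit" > set_event
# echo 1 > /sys/kernel/ktimer_lockless_check   # only effective with isolcpus= (see comment 3 above)
# cyclictest -m -p99 -b 30 --tracemark         # stop and leave a trace mark past 30us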
Comment 6 (jianzzha):

On the machine with k8s and RHEL 8, which has the better cyclictest result:

[root@cyclictest /]# hwlatdetect --threshold=1
hwlatdetect:  test duration 120 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0
Samples exceeding threshold: 0

On the machine with OCP and RHCOS, which has the worse cyclictest result:

[root@worker0 yum.repos.d]# hwlatdetect --threshold 1
hwlatdetect:  test duration 120 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: 16us
Samples recorded: 9
Samples exceeding threshold: 9
ts: 1575551831.138503004, inner:16, outer:16
ts: 1575551833.150908799, inner:0, outer:7
ts: 1575551837.182918603, inner:0, outer:15
ts: 1575551840.206894031, inner:6, outer:0
ts: 1575551856.342915291, inner:13, outer:0
ts: 1575551899.686891703, inner:7, outer:0
ts: 1575551915.814925120, inner:11, outer:0
ts: 1575551941.014890798, inner:8, outer:0
ts: 1575551944.038903896, inner:7, outer:0

Is there any tuning required at all to run hwlatdetect?

(In reply to jianzzha from comment #6)
> Max Latency: Below threshold
> Samples recorded: 0
> Samples exceeding threshold: 0

That is fine. What is the cyclictest result from within the container in this case?

> is there any tuning required at all to run the hwlatdetect?

Yes, you should consult the manufacturer's manual for details. For example, for Dell:
https://www.dell.com/downloads/global/products/pedge/en/configuring_dell_powerEdge_servers_for_low_latency_12132010_final.pdf
and for HP servers:
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c01804533&lang=en-us&cc=us

Comment 8 (jianzzha):

I didn't find any difference between the good RHEL 8 machine and the bad CoreOS machine, so I decided to install CoreOS on the good machine. Once CoreOS was installed, hwlatdetect did badly.
Here is the CoreOS result from the originally "good" (RHEL) machine:

# hwlatdetect --thresh 1
hwlatdetect:  test duration 120 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: 12us
Samples recorded: 9
Samples exceeding threshold: 9
ts: 1575640721.282947371, inner:6, outer:0
ts: 1575640730.353945716, inner:0, outer:5
ts: 1575640744.465945319, inner:0, outer:5
ts: 1575640757.570946521, inner:12, outer:0
ts: 1575640758.578946264, inner:0, outer:5
ts: 1575640759.585947521, inner:6, outer:0
ts: 1575640777.729944667, inner:0, outer:6
ts: 1575640780.754944737, inner:8, outer:0
ts: 1575640793.857946790, inner:0, outer:8

Does hwlatdetect have anything to do with the kernel?

(In reply to jianzzha from comment #8)
> Once coreos is installed, the hwlatdetect does badly.

What is the kernel on CoreOS? It could be that, depending on the hardware configuration performed by the (different) kernels, SMIs are enabled or disabled. Clark and Luis know better, though. Did you follow the recommendations of the manufacturer? (Which machine is this again?)

Clark Williams:

Jianzhu, can you provide a dump of the BIOS settings? Was hwlatdetect run without a container on RHEL but in a container with RHCOS?

I remember seeing some hwlatdetect latencies a while back that were caused by C-state transitions. We try to mitigate that by opening /dev/cpu_dma_latency and holding it open for the duration of the hwlatdetect run, which should prevent C-state transitions. But if that's not working properly (i.e. if RHCOS doesn't provide /dev/cpu_dma_latency?), then the spike could be that the measurement core went idle, transitioned to a deeper C-state, and then took time to come back out of it.

Created attachment 1642666 [details]
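For reference, the mitigation Clark describes works because a cpu_dma_latency PM QoS request only stays active while the file is held open. A minimal shell sketch of the same trick, assuming bash and root (the fd number and the 4-byte binary zero, meaning "0us tolerated", are illustrative):

# exec 3<> /dev/cpu_dma_latency     # open and hold a fd on the QoS interface
# printf '\x00\x00\x00\x00' >&3     # request 0us wakeup latency: blocks deep C-states
# ...run hwlatdetect or cyclictest here...
# exec 3>&-                         # closing the fd drops the request again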
Dell R740xd BIOS settings
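On a Dell machine like this one, a BIOS dump such as the attachment can typically be collected out-of-band through iDRAC. A sketch, assuming racadm access (the exact attribute group names vary by server generation):

# racadm get BIOS.SysProfileSettings   # SysProfile, ProcPwrPerf, ProcCStates, ProcC1E, ...
# racadm get BIOS.ProcSettings         # other processor-related attributes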
Comment 13 (jianzzha):

@Clark the /dev/cpu_dma_latency point is a good one. It is also required for cyclictest, correct?

Comment 14 (Marcelo Tosatti):

(In reply to jianzzha from comment #13)
> It is also required for cyclictest, correct?

Yes, cyclictest opens it. You can confirm whether the machine is entering deep C-states with the command:

# cpupower idle-info

(In reply to Marcelo Tosatti from comment #14)
> You can confirm whether the machine is entering deep C-states

But deep sleep states can't be the cause, since the kernel command line contains intel_idle.max_cstate=1, correct?

Comment 16 (Andrew Theurer):

Clark, Marcelo, does it have to be an rt-kernel to run hwlatdetect properly?

I don't know if Jianzhu has all of the tuning on this particular host where we went from RHEL to RHCOS, so we should verify what the settings are (and what the kernel is).

Also, I thought hwlatdetect ran on a CPU 100% of the time, with no idle, until it moved to another CPU(?). If so, the C-state should not matter.

Things I saw in the BIOS that make me believe it is configured OK:

EnergyPerformanceBias:    MaxPower
ProcTurboMode:            Disabled
ProcCStates:              Disabled
NodeInterleave:           Disabled
MemPatrolScrub:           Disabled
OsWatchdogTimer:          Disabled
ProcC1E:                  Disabled
CpuInterconnectBusSpeed:  MaxDataRate
SysProfile:               Custom
ProcPwrPerf:              MaxPerf
ControlledTurbo:          Disabled

Comment 18 (jianzzha):

I changed one server back to RHEL 8 to ease tuning and testing. Here is what I found so far:

1) The rt-kernel does matter: hwlatdetect with the rt-kernel has a better result than with the non-rt kernel.
2) The tuned profile matters too. Using the cpu-partitioning profile and running hwlatdetect in a container on isolated CPUs gives us the best result.

So I guess the hwlatdetect result is impacted by software, which makes me wonder about the value of hwlatdetect if it can be influenced by software.

(In reply to jianzzha from comment #18)
> This makes me wonder about the value of hwlatdetect - if it is impacted by software.

Jianzzha,

https://www.kernel.org/doc/html/latest/trace/hwlat_detector.html

Software might modify whether or not the BIOS triggers SMIs. What cyclictest numbers do you get with kernel-rt?
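For the RHEL 8 side of the comparison, the cpu-partitioning setup Jianzhu mentions in comment #18 is applied roughly as follows; the 4-23 core range here is an assumption for illustration:

# yum install -y tuned-profiles-cpu-partitioning
# echo "isolated_cores=4-23" > /etc/tuned/cpu-partitioning-variables.conf
# tuned-adm profile cpu-partitioning
# systemctl reboot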
(In reply to Andrew Theurer from comment #16)
> does it have to be an rt-kernel to run hwlatdetect properly?

It shouldn't matter which kernel is hosting a hwlatdetect run. The thread runs at FIFO:99 with interrupts off for its specified window, so theoretically the only ways the CPU can be interrupted are an SMI or an NMI.

OK, so either we have a problem with hwlatdetect not giving the same output with different kernels and/or tuning, or we have inconsistency in the test, perhaps because of the test length. Jianzhu, can you run on both RHEL and RHCOS builds, on the exact same system (not just the same model on different systems, but the exact same host), but this time run hwlatdetect for 1 hour each? Hopefully that will eliminate any possibility of inconsistent test conditions (if there are SMIs, our confidence is high that they will always happen within an hour).

Comment 22 (Marcelo Tosatti):

The SMI counter does not increase on all processors. Do we know how reliable this counter is? The Intel documentation is short: "Counter of SMI events" in the MSR section of the four volumes, and nothing about it in the SMI section of the volumes.

Comment 23 (Clark Williams):

(In reply to Marcelo Tosatti from comment #22)
> Do we know how reliable this counter is?

As reliable as any hardware we work with :)

It's pretty straightforward, since SMIs go through a single handler and are just the mechanism to transition to SMM (System Management Mode) and run system firmware (i.e. BIOS code).

I'd say we *don't* have an SMI problem, and that means we should be able to see something in an ftrace traceback. Have we run cyclictest with the --breaktrace option? The idea is to launch cyclictest from trace-cmd and pass --breaktrace=<usec> to it, causing it to write a trace marker to the ftrace buffer and stop active tracing. When it stops, we use trace-cmd to create a trace.dat file, and then we start digging for the tracemark and work back from that.

Comment 24 (Marcelo Tosatti):

I suspect Non-Maskable Interrupts from the hardlockup detector still execute while interrupts are disabled.

Jianzhu will add "nosoftlockup nowatchdog" to confirm the hypothesis.
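A concrete form of Clark's suggestion might look like the following; the event list, the -m/-p99 options, and the 30us threshold are illustrative assumptions, not values from this bug:

# trace-cmd record -e sched -e timer -e irq \
    cyclictest -m -p99 -b 30 --tracemark
# trace-cmd report | tail -60

Since cyclictest stops active tracing when it writes the marker, the events immediately preceding the end of trace.dat are the ones surrounding the latency hit.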
Marcelo Tosatti:

(In reply to Marcelo Tosatti from comment #24)
> Jianzhu will add "nosoftlockup nowatchdog" to confirm the hypothesis.

Jianzhu confirmed this was the problem. He will perform 24h tests to confirm the cyclictest values. hwlatdetect reports no hits with threshold=1us.

jianzzha:

With nosoftlockup in the CoreOS kargs, hwlatdetect looks good:

[root@worker1 core]# hwlatdetect --duration 1d --threshold 1
hwlatdetect:  test duration 86400 seconds
   detector: tracer
   parameters:
        Latency threshold:    1us
        Sample window:        1000000us
        Sample width:         500000us
        Non-sampling period:  500000us
        Output File:          None

Starting test
test finished
Max Latency: Below threshold
Samples recorded: 0
Samples exceeding threshold: 0
SMIs during run: 0

nosoftlockup was missing and was the cause of the bad hwlatdetect numbers on the fresh CoreOS install. We now need to continue investigating the bad cyclictest result on a tuned CoreOS (with all the tuning, including nosoftlockup).

Comment 27 (Luiz Capitulino):

Very well done, guys! I think we've solved two problems:

1. You found the reason why hwlatdetect requires tuning
2. You confirmed we don't have SMIs on that machine

Jianzhu, what's the current cyclictest result? Also, how are you applying the tuning? Since I guess you're not able to run tuned in CoreOS, is that correct?

Very nice on the nosoftlockup. Hopefully we can add that to the hwlatdetect documentation, and have it output a warning if it is not present. However, it does look like we were already using nosoftlockup in the tuning for cyclictest. Just to be certain, we should run cyclictest again with this tuning. If we still get high max latencies, then I suspect we will need to increase the set of things we trace, or perhaps trace all kernel function entry/exit.

jianzzha:

For the sake of simplicity, we used isolcpus for CoreOS and ran cyclictest on bare metal (not from a container), on the isolated CPUs. The result is still no good; it is very similar to what Andrew got. Further tracing is needed.

Comment 30 (jianzzha):

@lcapitulino, we need to get this ball rolling? What info is needed from our end?
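On RHCOS the kernel arguments live in the ostree deployment rather than in a grub config you edit directly, so for a single test node the missing flags can be added with rpm-ostree. A sketch, assuming direct host access (on a managed OCP cluster this would normally be done through a MachineConfig instead):

# rpm-ostree kargs --append=nosoftlockup --append=nowatchdog
# systemctl reboot
# cat /proc/cmdline   # confirm both flags survived the reboot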
(In reply to Luiz Capitulino from comment #27)
> What's the current cyclictest result? Also, how are you applying the tuning?
> Since I guess you're not able to run tuned in CoreOS, is that correct?

It is 30+ us on bare-metal CoreOS (no container). isolcpus is used, and cyclictest runs on the isolated cores (with no other user processes sharing the same CPUs).

Comment 32 (Luiz Capitulino):

(In reply to jianzzha from comment #30)
> @lcapitulino, we need to get this ball rolling? What info is needed from our end?

I thought you and Marcelo were still debugging this issue. Are you, Marcelo?

Comment 33 (Marcelo Tosatti):

Yes. However:

1) It's not clear what the interface for the k8s user should be (in the container description). IMHO we should not hurry to merge an interface which has not been validated to work with the use cases in question (it's not clear what the use cases are in the first place, from the customer interested in using this).

2) There are discussions in Red Hat and in k8s upstream about these interfaces.

3) The worker node is down, so I will continue debugging with podman.

Marcelo Tosatti:

(In reply to Marcelo Tosatti from comment #33)
> 2) There are discussions in Red Hat and in k8s upstream about these interfaces.
> 3) The worker node is down, so I will continue debugging with podman.

So for example, for O-RAN (4G/5G L1 processing, including HARQ ACK processing), very aggressive options are used:

https://docs.o-ran-sc.org/projects/o-ran-sc-pti-rtp/en/latest/installation-guide.html#deploy-cmk-cpu-manager-for-kubernetes

# In this example, cores 1-16 are isolated for real-time processes
root@intel-x86-64:~# rt_tuning="crashkernel=auto biosdevname=0 iommu=pt usbcore.autosuspend=-1 nmi_watchdog=0 softlockup_panic=0 intel_iommu=on cgroup_enable=memory skew_tick=1 hugepagesz=1G hugepages=4 default_hugepagesz=1G isolcpus=1-16 rcu_nocbs=1-16 kthread_cpus=0 irqaffinity=0 nohz=on nohz_full=1-16 intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable nosoftlockup idle=poll mce=ignore_ce"

Some examples of interface requirements:

-> Requirement from a user application: https://www.youtube.com/watch?v=7DuKzaBaRwY (minute 13:00 and forward).
-> https://www.youtube.com/watch?v=QbfjlNfLGxM (the requirements should probably be specified at the "Call to NFV orchestration" level, minute 9:50).

Comment 35 (jianzzha):

This BZ doesn't appear to have much to do with k8s or even containers, as we have observed similarly bad cyclictest results even with isolcpus in place and cyclictest running on the isolated CPUs. Can we solve the problem on the bare-metal machine first?

Marcelo Tosatti:

(In reply to jianzzha from comment #35)
> Can we solve the problem on the baremetal machine first?

Hi Jianzhu,

The problem is that in production (which is where this interface is supposed to be used), different latency requirements will exist. In the initial request from Andrew, he mentions:

"Desired result: Max latencies in the low-20 usecs"

But given the different use cases (I am building a list of them in KVM-RT's wiki and will post it to kvm-rt@ as soon as it's finished), and the fact that customers want automation, I would rather build the whole picture. But OK, this work can be done elsewhere, if that is what is preferred.
Can you please bring the worker node up so I can check which timer it is, so that the low-20-usec latency can be achieved?

> Can you please bring the worker node up so I can check which timer it is, so
> that the low-20-usec latency can be achieved?
[root@e25-h23-740xd ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master0 Ready master 18d v1.16.2
master1 Ready master 18d v1.16.2
master2 Ready master 18d v1.16.2
worker1 NotReady worker,worker-rt 18d v1.16.2
Comment 38 (Marcelo Tosatti):

Using SCHED_FIFO priority 99, with -p99, should fix the timer problem. Jianzhu, can you please check the latency values with that option? TIA

(In reply to Marcelo Tosatti from comment #38)
> Using SCHED_FIFO priority 99, with -p99, should fix the timer problem.

We have been using -p 99 in the cyclictest runs. This is how we test for 5 minutes on CoreOS bare metal (no container involved):

cyclictest -D 5m -p 99 -t 4 -a 10,4,6,8 -h 30 -m
# /dev/cpu_dma_latency set to 0us
policy: fifo: loadavg: 2.61 2.02 2.48 1/1979 3296755

T: 0 (3295845) P:99 I:1000 C: 299996 Min: 2 Act: 2 Avg: 3 Max: 26
T: 1 (3295846) P:99 I:1000 C: 299996 Min: 2 Act: 3 Avg: 2 Max: 49
T: 2 (3295847) P:99 I:1000 C: 299995 Min: 2 Act: 3 Avg: 3 Max: 32
T: 3 (3295848) P:99 I:1000 C: 299995 Min: 2 Act: 3 Avg: 3 Max: 31

# Histogram
000000 000000 000000 000000 000000
000001 000000 000000 000000 000000
000002 005034 008376 000142 003936
000003 292207 288920 296385 293464
000004 002080 002000 002447 001802
000005 000243 000228 000345 000322
000006 000085 000093 000131 000113
000007 000067 000068 000101 000056
000008 000037 000051 000081 000039
000009 000029 000028 000073 000029
000010 000026 000022 000050 000030
000011 000013 000026 000036 000018
000012 000022 000023 000022 000023
000013 000018 000014 000017 000015
000014 000023 000024 000023 000013
000015 000009 000010 000022 000009
000016 000015 000012 000014 000013
000017 000013 000014 000015 000008
000018 000011 000019 000016 000000
000019 000010 000018 000018 000009
000020 000015 000010 000012 000022
000021 000012 000015 000019 000045
000022 000015 000016 000015 000006
000023 000010 000008 000011 000017
000024 000004 000003 000002 000009
000025 000001 000000 000001 000000
000026 000001 000000 000001 000000
000027 000000 000001 000000 000000
000028 000000 000000 000000 000000
000029 000000 000000 000000 000000
# Total: 000300000 000299999 000299999 000299998
# Min Latencies: 00002 00002 00002 00002
# Avg Latencies: 00003 00002 00003 00003
# Max Latencies: 00026 00049 00032 00031
# Histogram Overflows: 00000 00001 00001 00001
# Histogram Overflow at cycle number:
# Thread 0:
# Thread 1: 194177
# Thread 2: 194177
# Thread 3: 194205

There is no other workload on these processors, and isolated CPUs are used:

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-464d92a710185aa15005e946175a6900b0bd16bbefc60c9e71e19cc5f513b675/vmlinuz-4.18.0-147.8.rt13.1.el8locklesspendingtimermap4.x86_64 root=/dev/disk/by-label/root rootflags=defaults,prjquota rw console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/464d92a710185aa15005e946175a6900b0bd16bbefc60c9e71e19cc5f513b675/0 intel_iommu=on iommu=pt default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_pstate=disable skew_tick=1 nohz=on nohz_full=1,3-31 rcu_nocbs=1,3-31 non_iso_cpumask=00000005 nosoftlockup intel_idle.max_cstate=1 transparent_hugepage=never numa_balancing=disable mitigations=off nosmt tsc=nowatchdog nowatchdog isolcpus=1,3-31

Marcelo, this CoreOS worker node is back online, so feel free to continue the debugging. The access info is the same as before.
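Before a run like this, it is worth sanity-checking that the isolation from the command line actually took effect. A short check, assuming the cmdline above (the expected values are simply what that cmdline should produce):

# cat /sys/devices/system/cpu/isolated    # expect: 1,3-31
# cat /sys/devices/system/cpu/nohz_full   # expect: 1,3-31
# cat /proc/sys/kernel/watchdog           # expect: 0 (from nowatchdog)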
A quick 5-minute run from inside an OCP container:

# Min Latencies: 00003 00003 00003 00003
# Avg Latencies: 00003 00003 00003 00003
# Max Latencies: 00008 00007 00006 00007

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-c40bc0eface9378448e03d886391fcf4087a966617700dc3efe2653b14f4e6cc/vmlinuz-4.18.0-171.rt13.28.el8nolpoll.x86_64 root=/dev/disk/by-label/root rootflags=defaults,prjquota rw console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal ostree=/ostree/boot.1/rhcos/c40bc0eface9378448e03d886391fcf4087a966617700dc3efe2653b14f4e6cc/0 nosmt intel_iommu=on iommu=pt default_hugepagesz=1GB hugepagesz=1G hugepages=32 intel_pstate=disable skew_tick=1 nohz=on nohz_full=4-23 rcu_nocbs=4-23 nosoftlockup nosoftlockup isolcpus=4-23 tsc=nowatchdog