Bug 1833196
Summary: | adaptive-tick appears to not be functioning as expected on RHEL 8.2 RT | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 8 | Reporter: | Troy Wilson <trwilson> | ||||||||
Component: | kernel-rt | Assignee: | Juri Lelli <jlelli> | ||||||||
kernel-rt sub component: | Scheduler | QA Contact: | Qiao Zhao <qzhao> | ||||||||
Status: | CLOSED ERRATA | Docs Contact: | |||||||||
Severity: | medium | ||||||||||
Priority: | unspecified | CC: | aklimov, atheurer, bhu, chuhu, daolivei, jlelli, jmario, jskarvad, kcarcia, lcapitulino, lgoncalv, mtosatti, pauld, peterx, pezhang, qzhao, williams | ||||||||
Version: | 8.2 | ||||||||||
Target Milestone: | rc | ||||||||||
Target Release: | 8.0 | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | 4.18.0-233 | Doc Type: | If docs needed, set a value | ||||||||
Doc Text: | Story Points: | --- | |||||||||
Clone Of: | Environment: | ||||||||||
Last Closed: | 2020-11-04 02:25:17 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | 1833096 | ||||||||||
Bug Blocks: | 1817732, 1823810 | ||||||||||
Attachments: |
|
Description
Troy Wilson
2020-05-08 03:23:31 UTC
Hi Troy, did you already try running with the corresponding (non RT) RHEL kernel? Trying with RHEL might help us isolating where the problem is. Also, what about extending tracing to other interesting events (e.g. scheduling, other types or irqs)? So that we can collect more info about what caused the tick to start again. (In reply to Juri Lelli from comment #1) > Hi Troy, > > did you already try running with the corresponding (non RT) RHEL > kernel? Trying with RHEL might help us isolating where the problem > is. > > Also, what about extending tracing to other interesting events > (e.g. scheduling, other types or irqs)? So that we can collect > more info about what caused the tick to start again. I do not believe I am seeing the situation with a non-RT kernel [root@e25-h33-740xd ~]# perf stat -C 5 -e irq_vectors:local_timer_entry taskset --cpu-list 5 ./sysjitter/sysjitter --runtime 600 200 core_i: 5 threshold(ns): 200 cpu_mhz: 2096 runtime(ns): 599729456412 runtime(s): 599.729 int_n: 66 int_n_per_sec: 0.110 int_min(ns): 233 int_median(ns): 2935 int_mean(ns): 4474 int_90(ns): 8770 int_99(ns): 14707 int_999(ns): 14707 int_9999(ns): 14707 int_99999(ns): 14707 int_max(ns): 14707 int_total(ns): 295333 int_total(%): 0.000 Performance counter stats for 'CPU(s) 5': 35 irq_vectors:local_timer_entry 601.011019915 seconds time elapsed [root@e25-h33-740xd ~]# uname -a Linux e25-h33-740xd.alias.bos.scalelab.redhat.com 4.18.0-167.el8.x86_64 #1 SMP Sun Dec 15 01:24:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux [root@e25-h33-740xd ~]# I will look at tracing other events-- please let me know if there are any specific events you'd like traced. OK. I could replicate the issue on another test box. On RT, tick seems to still be disabled at some point, but then it is turned back on. Will first see if upstream shows the same behavior. Created attachment 1687600 [details]
RHEL trace with tick_stop events
Created attachment 1687601 [details]
RHEL-RT trace with tick_stop events
Created attachment 1687937 [details]
UPSTREAM (v5.6-rt) trace with tick_stop events
Upstream seems to be working OK. I noticed bz1833096. Wonder if the two issues are related. My current understanding is that we are missing a chance to enter full_no_hz when softirqs are serviced. In RT, sched tick activates ksoftirqd(s), the NOHZ SCHED dep is correctly cleared after ksoftirqd goes to sleep (leaving only 1 task on the cpu, the sysjitter one(s)), but tick stop is not evaluated again after this happens. Something like the following (not completely correct, but might be a start) seems to fix the issue. --->8--- diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index cfeab6ef0f163..d4c4f860562e4 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1894,10 +1894,12 @@ static inline void sched_update_tick_dependency(struct rq *rq) if (!tick_nohz_full_cpu(cpu)) return; - if (sched_can_stop_tick(rq)) + if (sched_can_stop_tick(rq)) { tick_nohz_dep_clear_cpu(cpu, TICK_DEP_BIT_SCHED); - else + tick_nohz_irq_exit(); + } else { tick_nohz_dep_set_cpu(cpu, TICK_DEP_BIT_SCHED); + } } #else static inline int sched_tick_offload_init(void) { return 0; } --->8--- Could you please test if things get better with following kernel? http://brew-task-repos.usersys.redhat.com/repos/scratch/jlelli/kernel-rt/4.18.0/198.rt5.10.el8.dyntick.1833196.test.cki.kt1/x86_64/ I've encountered a similar issue even using fifo:1 priority when testing with sysjitter2 (https://github.com/xzpeter/sysjitter2) in a RT virtual machine, where I can observe 1000hz * 10us latency conditionally. The brew in comment 8 seems to have fixed that for me (for a 1H run). [root@localhost ~]# ./sysjitter2 --cpu-list 2,3,4,5,6,7,8,9 --rtprio 1 --runtime 3600 Version: 2.01 core_i: 2 3 4 5 6 7 8 9 cpu_mhz: 2197 2197 2197 2197 2197 2197 2197 2197 001 (us): 93068541026 93068541249 93068541429 93068541411 93068541565 93068541296 93068541320 93068541409 002 (us): 51 51 51 51 51 51 51 51 003 (us): 2 2 2 2 2 2 2 2 004 (us): 4 4 8 3 26 3 4 2 005 (us): 40 39 35 41 18 42 40 41 006 (us): 1 1 1 0 0 0 1 1 007 (us): 0 0 0 0 0 0 0 0 008 (us): 0 0 0 0 0 0 0 0 009 (us): 0 0 0 0 0 0 0 0 010 (us): 0 0 0 0 0 0 0 0 011 (us): 0 0 0 0 0 0 0 0 012 (us): 0 0 0 0 0 0 0 0 013 (us): 0 0 0 0 0 0 0 0 014 (us): 0 0 0 0 0 0 0 0 015 (us): 0 0 0 0 0 0 0 0 016 (us): 0 0 0 0 0 0 0 0 017 (us): 0 0 0 0 0 0 0 0 018 (us): 0 0 0 0 0 0 0 0 019 (us): 0 0 0 0 0 0 0 0 020 (us): 0 0 0 0 0 0 0 0 021 (us): 0 0 0 0 0 0 0 0 022 (us): 0 0 0 0 0 0 0 0 023 (us): 0 0 0 0 0 0 0 0 024 (us): 0 0 0 0 0 0 0 0 025 (us): 0 0 0 0 0 0 0 0 026 (us): 0 0 0 0 0 0 0 0 027 (us): 0 0 0 0 0 0 0 0 028 (us): 0 0 0 0 0 0 0 0 029 (us): 0 0 0 0 0 0 0 0 030 (us): 0 0 0 0 0 0 0 0 031 (us): 0 0 0 0 0 0 0 0 032 (us): 0 0 0 0 0 0 0 0 maxlat: 6 6 6 5 5 5 6 6 (us) runtime: 3600.740 3600.740 3600.740 3600.740 3600.740 3600.740 3600.740 3600.740 (sec) The proposed fix in comment 8 also works for me: [root@e25-h33-740xd sysjitter]# ./sysjitter --runtime 600 --rtprio 95 --accept-cpuset --cores 6,8,10,12,14,16,18,20 --master-core 4 200 core_i: 6 8 10 12 14 16 18 20 threshold(ns): 200 200 200 200 200 200 200 200 cpu_mhz: 2096 2096 2096 2096 2096 2096 2096 2096 runtime(ns): 599734002777 599734002948 599734002886 599734003307 599734002693 599734002740 599734002851 599734002759 runtime(s): 599.734 599.734 599.734 599.734 599.734 599.734 599.734 599.734 int_n: 11 10 7 599315 8 7 5 6 int_n_per_sec: 0.018 0.017 0.012 999.301 0.013 0.012 0.008 0.010 int_min(ns): 208 211 225 205 243 211 216 203 int_median(ns): 284 326 638 2759 582 283 291 312 int_mean(ns): 15629734 10795656 6283980 2871 124946153 344 375 351 int_90(ns): 4072 107949676 43981736 3291 999562041 595 565 615 int_99(ns): 171920095 107949676 43981736 4104 999562041 595 565 615 int_999(ns): 171920095 107949676 43981736 4641 999562041 595 565 615 int_9999(ns): 171920095 107949676 43981736 5074 999562041 595 565 615 int_99999(ns): 171920095 107949676 43981736 5513 999562041 595 565 615 int_max(ns): 171920095 107949676 43981736 5647 999562041 595 565 615 int_total(ns): 171927085 107956573 43987864 1720828725 999569226 2415 1875 2106 int_total(%): 0.029 0.018 0.007 0.287 0.167 0.000 0.000 0.000 [root@e25-h33-740xd sysjitter]# RFC patch sent upstream https://lore.kernel.org/lkml/20200520140402.358880-1-juri.lelli@redhat.com/ *** Bug 1833096 has been marked as a duplicate of this bug. *** Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:4609 The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days |