Note: This bug is displayed in read-only format because
the product is no longer active in Red Hat Bugzilla.
Description of problem:
The soft timers for non-RT tasks are being placed on the per-CPU soft clock bases on isolcpus instead of being moved to the housekeeping CPUs, and the APIC timer for them is not being reprogrammed. So these timers can expire late or, depending on what is happening on the CPU, not expire at all.
Version-Release number of selected component (if applicable):
How reproducible:
Happens often
Steps to Reproduce:
1. N/A
Actual results:
Expected results:
Additional info:
We are seeing this in the RT kernel starting at version
4.18.0-305.14.1.rt7.86.el8_4.x86_64.
The version kernel-rt-4.18.0-305.12.1.rt7.84.el8_4 does not have this problem.
There are 31 patches between the two versions with these being the ones of concern.
$ git log --pretty=oneline kernel-rt-4.18.0-305.12.1.rt7.84.el8_4..kernel-rt-4.18.0-305.14.1.rt7.86.el8_4 | grep -e tick -e time
a0fed4acf70e2c5c2c07339819c65b0a528e2549 Merge: time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
59ca0686bd4d16011aeb26661ae158d6591932a3 Merge: tick/nohz: Avoid latency spike on nohz full cpus
b2d3e52448290572b7a09310d4026e039203fcbf time/hrtimer: Embed hrtimer mode into hrtimer_sleeper
cab277b7d32a3403801a3718367a0ce86faf9322 time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
4472d079cc1427a837e3f3bdb7bd958f3e60e12a tick/nohz: Kick only _queued_ task whose tick dependency is updated
693443d4c7b4664908b86927fae3227f4360248c tick/nohz: Change signal tick dependency to wake up CPUs of member tasks
52fdcac7277b837ca9943542a0903d497fe56c73 tick/nohz: Only wake up a single target cpu when kicking a task
5ec28bffbb49176c2176b90934f589a230973d20 tick/nohz: Narrow down noise while setting current task's tick dependency
$
crash> sys
KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/4.18.0-305.19.1.rt7.91.el8_4.x86_64/vmlinux [TAINTED]
DUMPFILE: /cores/retrace/tasks/510256984/crash/vmcore [PARTIAL DUMP]
CPUS: 36
DATE: Mon Dec 20 12:25:51 EST 2021
UPTIME: 13:21:54
LOAD AVERAGE: 10.70, 11.13, 11.27
TASKS: 1083
NODENAME:
RELEASE: 4.18.0-305.19.1.rt7.91.el8_4.x86_64
VERSION: #1 SMP PREEMPT_RT Wed Sep 8 13:32:17 EDT 2021
MACHINE: x86_64 (3000 Mhz)
MEMORY: 63.6 GB
PANIC: "sysrq: SysRq : Trigger a crash"
crash> sys -i
DMI_BIOS_VENDOR: Dell Inc.
DMI_BIOS_VERSION: 2.10.0
DMI_BIOS_DATE: 11/12/2020
DMI_SYS_VENDOR: Dell Inc.
DMI_PRODUCT_NAME: PowerEdge R740
DMI_PRODUCT_VERSION:
DMI_PRODUCT_SERIAL:
DMI_PRODUCT_UUID:
DMI_PRODUCT_SKU: SKU=0715;ModelName=PowerEdge R740
DMI_PRODUCT_FAMILY: PowerEdge
DMI_BOARD_VENDOR: Dell Inc.
DMI_BOARD_NAME: 0JMK61
DMI_BOARD_VERSION: A00
DMI_BOARD_SERIAL:
DMI_CHASSIS_VENDOR: Dell Inc.
DMI_CHASSIS_TYPE: 23
DMI_CHASSIS_VERSION:
DMI_CHASSIS_SERIAL:
DMI_CHASSIS_ASSET_TAG:
crash>
The application is hung and the customer has triggered the dump.
They have identified that the application is waiting for pid 124891 to run before it can continue.
crash> bt 124891
PID: 124891 TASK: ffff9a6df5a49f40 CPU: 11 COMMAND: " asyncRun"
#0 [ffffae5889f67d70] __schedule at ffffffff96f5711d
#1 [ffffae5889f67e10] schedule at ffffffff96f576a6
#2 [ffffae5889f67e20] do_nanosleep at ffffffff96f5a4d6
#3 [ffffae5889f67e68] hrtimer_nanosleep at ffffffff96772212
#4 [ffffae5889f67f10] __x64_sys_nanosleep at ffffffff96772586
#5 [ffffae5889f67f38] do_syscall_64 at ffffffff96602bb7
#6 [ffffae5889f67f50] entry_SYSCALL_64_after_hwframe at ffffffff970000ad
RIP: 00007f531965cd98 RSP: 00007f52f3a4fe00 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 00007f52f3a4fe30 RCX: 00007f531965cd98
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f52f3a4fe30
RBP: 0000000000000000 R8: 00007f52ec006910 R9: 0000000000000007
R10: 0000000000000010 R11: 0000000000000293 R12: 00007f52f3a4fe90
R13: 00007f520801a640 R14: 00007f52f3a4fea0 R15: 0000000000000025
ORIG_RAX: 0000000000000023 CS: 0033 SS: 002b
crash>
We see that it is in a sleep call and has been there for 65 seconds, though the sleep should only last 1 millisecond.
crash> ps -m 124891
[0 00:01:05.659] [IN] PID: 124891 TASK: ffff9a6df5a49f40 CPU: 11 COMMAND: " asyncRun"
crash>
The timeout for this nanosleep is 1 ms (1000000 ns):
crash> timespec64 ffffae5889f67f18 -d
struct timespec64 {
tv_sec = 0,
tv_nsec = 1000000 <-----
}
Looking at the hrtimer_sleeper struct on the stack:
#3 [ffffae5889f67e68] hrtimer_nanosleep at ffffffff96772212
crash> hrtimer_sleeper ffffae5889f67e78 <- from r14
struct hrtimer_sleeper {
timer = {
node = {
node = {
__rb_parent_color = 0x1,
rb_right = 0x0,
rb_left = 0x0
},
expires = 0x2bb31813370e <------ 48048203052814
},
_softexpires = 0x2bb3181273be, <- 48048203002814
function = 0xffffffff96770fa0, <- hrtimer_wakeup
base = 0xffff9a775f95f080, <- clock index 4
state = 0x1,
is_rel = 0x0,
is_soft = 0x1,
is_hard = 0x0,
hrtimer_size_rh = 0x0,
_rh = 0x0
},
task = 0xffff9a6df5a49f40,
mode = HRTIMER_MODE_REL
}
crash>
ffffae5889f67e70: ffff9a76693a9a48 0000000000000001
^-- start of hrtimer_sleeper struct
ffffae5889f67e80: 0000000000000000 0000000000000000
ffffae5889f67e90: 00002bb31813370e 00002bb3181273be
ffffae5889f67ea0: ffffffff96770fa0 ffff9a775f95f080
ffffae5889f67eb0: 0000000000010001 0000000000000000
ffffae5889f67ec0: 0000000000000000 ffff9a6df5a49f40
+68=can
ffffae5889f67ed0: ffffffff00000001 fa62ab6d46cc8e00
rbx rbp
ffffae5889f67ee0: 0000000000000000 ffffae5889f67f58
r12 r13
ffffae5889f67ef0: 0000000000000000 0000000000000000
r14 r15
ffffae5889f67f00: 0000000000000000 0000000000000000
ffffae5889f67f10: ffffffff96772586
#4 [ffffae5889f67f10] __x64_sys_nanosleep at ffffffff96772586
Searching for it with the timer command, it is not listed.
There are now 8 per-CPU clock bases for each CPU in RHEL 8, but the timer -r command only displays 3 of them.
Using a modified version of crash that shows all 8 clock bases, we can see the timer on CPU 11, clock 4.
crash-vl> timer -r | grep ffffae5889f67e78
48048203002814 48048203052814 -65843947186 ffffae5889f67e78 ffffffff96770fa0 <hrtimer_wakeup>
crash-vl>
And we see that it is on CPU 11 clock 4.
crash-vl> timer -r -C 11
CPU: 11 HRTIMER_CPU_BASE: ffff9a775f95ee00
CLOCK: 0 HRTIMER_CLOCK_BASE: ffff9a775f95ee80 [ktime_get]
(empty)
CLOCK: 1 HRTIMER_CLOCK_BASE: ffff9a775f95ef00 [ktime_get_real]
(empty)
CLOCK: 2 HRTIMER_CLOCK_BASE: ffff9a775f95ef80 [ktime_get_boottime]
(empty)
CLOCK: 3 HRTIMER_CLOCK_BASE: ffff9a775f95f000 [ktime_get_clocktai]
(empty)
CLOCK: 4 HRTIMER_CLOCK_BASE: ffff9a775f95f080 [ktime_get]
CURRENT
48114047000000
SOFTEXPIRES EXPIRES TTE HRTIMER FUNCTION
48048203002814 48048203052814 -65843947186 ffffae5889f67e78 ffffffff96770fa0 <hrtimer_wakeup> <------------
CLOCK: 5 HRTIMER_CLOCK_BASE: ffff9a775f95f100 [ktime_get_real]
CURRENT
1640021151290259551
SOFTEXPIRES EXPIRES TTE HRTIMER FUNCTION
1640021064148260307 1640021064148310307 -87141949244 ffffae588a9c7d30 ffffffff96770fa0 <hrtimer_wakeup>
CLOCK: 6 HRTIMER_CLOCK_BASE: ffff9a775f95f180 [ktime_get_boottime]
(empty)
CLOCK: 7 HRTIMER_CLOCK_BASE: ffff9a775f95f200 [ktime_get_clocktai]
(empty)
crash-vl>
This task is running on isolcpu 11 and is not realtime.
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.19.1.rt7.91.el8_4.x86_64 root=/dev/mapper/rootvg-root ro rd.lvm.lv=rootvg/root rd.lvm.lv=rootvg/swap rd.driver.blacklist=sfc rd.driver.blacklist=qla2xxx rd.driver.blacklist=lpfc crashkernel=auto biosdevname=0 net.ifnames=1 console=tty0 console=ttyS1,115200 consoleblank=0 noibrs noibpb nopti tsc=reliable ipmi_si.kipmid_max_busy_us=100 processor.max_cstate=0 intel_idle.max_cstate=0
isolcpus=1,3,5,7,9,11,13,15-35 <----------------
nohz_full=1,3,5,7,9,11,13,15-35 rcu_nocbs=1,3,5,7,9,11,13,15-35 rd.earlykdump intel_pstate=disable
crash-vl> task -R policy -R sched_class 124891
PID: 124891 TASK: ffff9a6df5a49f40 CPU: 11 COMMAND: " asyncRun"
policy = 0x0, <---------
sched_class = 0xffffffff97423700,
crash-vl> sym 0xffffffff97423700,
ffffffff97423700 (R) fair_sched_class
crash-vl>
From the code, it looks to me like this timer should have been moved to one of the housekeeping CPUs.
crash-vl> p housekeeping_mask
housekeeping_mask = $7 =
{{
bits = {0x5555,
crash-vl> eval -b 0x5555 | grep bits
bits set: 14 12 10 8 6 4 2 0 <- one of these cpus
crash-vl>
190 enum hrtimer_base_type {
191 HRTIMER_BASE_MONOTONIC,
192 HRTIMER_BASE_REALTIME,
193 HRTIMER_BASE_BOOTTIME,
194 HRTIMER_BASE_TAI,
195 HRTIMER_BASE_MONOTONIC_SOFT, <- clock 4
196 HRTIMER_BASE_REALTIME_SOFT,
197 HRTIMER_BASE_BOOTTIME_SOFT,
198 HRTIMER_BASE_TAI_SOFT,
199 HRTIMER_MAX_CLOCK_BASES,
200 };
In the dump we can see an example of a timer that was moved to a housekeeping CPU, for pid 138409.
crash> bt 138409
PID: 138409 TASK: ffff9a6e6d43be80 CPU: 9 COMMAND: "IFPGA_refill"
#0 [ffffae588d50fd70] __schedule at ffffffff96f5711d
#1 [ffffae588d50fe10] schedule at ffffffff96f576a6
#2 [ffffae588d50fe20] do_nanosleep at ffffffff96f5a4d6
#3 [ffffae588d50fe68] hrtimer_nanosleep at ffffffff96772212
#4 [ffffae588d50ff10] __x64_sys_nanosleep at ffffffff96772586
#5 [ffffae588d50ff38] do_syscall_64 at ffffffff96602bb7
#6 [ffffae588d50ff50] entry_SYSCALL_64_after_hwframe at ffffffff970000ad
RIP: 00007f531965cd98 RSP: 00007f52822eaf50 RFLAGS: 00000293
RAX: ffffffffffffffda RBX: 00007f52822eaf80 RCX: 00007f531965cd98
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f52822eaf80
RBP: 0000000000000000 R8: 0000000000000012 R9: 0000000000000011
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 00007f52822eb000 R14: 000000000918a2f8 R15: 00007f52822eb380
ORIG_RAX: 0000000000000023 CS: 0033 SS: 002b
crash>
crash> hrtimer_sleeper.task,mode ffffae588d50fe78
task = 0xffff9a6e6d43be80
mode = HRTIMER_MODE_REL
crash>
crash-vl> timer -r -C 0 | grep -e CLOCK: -e ffffae588d50fe78
CLOCK: 0 HRTIMER_CLOCK_BASE: ffff9a6f5fc1ee80 [ktime_get]
CLOCK: 1 HRTIMER_CLOCK_BASE: ffff9a6f5fc1ef00 [ktime_get_real]
CLOCK: 2 HRTIMER_CLOCK_BASE: ffff9a6f5fc1ef80 [ktime_get_boottime]
CLOCK: 3 HRTIMER_CLOCK_BASE: ffff9a6f5fc1f000 [ktime_get_clocktai]
CLOCK: 4 HRTIMER_CLOCK_BASE: ffff9a6f5fc1f080 [ktime_get]
48114131624069 48114131674069 84674069 ffffae588d50fe78 ffffffff96770fa0 <hrtimer_wakeup> <--- timer for pid 138409 (running on CPU 9), queued on housekeeping CPU 0
CLOCK: 5 HRTIMER_CLOCK_BASE: ffff9a6f5fc1f100 [ktime_get_real]
CLOCK: 6 HRTIMER_CLOCK_BASE: ffff9a6f5fc1f180 [ktime_get_boottime]
CLOCK: 7 HRTIMER_CLOCK_BASE: ffff9a6f5fc1f200 [ktime_get_clocktai]
crash-vl>
Looking at the code for tick_program_event(), it appears that the clock event device on CPU 11 has not been programmed for a timer interrupt.
27 int tick_program_event(ktime_t expires, int force)
28 {
29 struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
30
31 if (unlikely(expires == KTIME_MAX)) {
32 /*
33 * We don't need the clock event device any more, stop it.
34 */
35 clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED); <---------
36 dev->next_event = KTIME_MAX; <-------
37 return 0; <---------------------------------------------------------
38 }
39
40 if (unlikely(clockevent_state_oneshot_stopped(dev))) {
41 /*
42 * We need the clock event again, configure it in ONESHOT mode
43 * before using it.
44 */
45 clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
46 }
47
48 return clockevents_program_event(dev, expires, force);
49 }
crash-vl> p lapic_events:11 | grep -e state_use_accessors -e next_event | paste - - - | pr -Tn -N 0 | column -t
0 set_next_event = 0xffffffff96651b60, next_event = 0x7fffffffffffffff, state_use_accessors = CLOCK_EVT_STATE_ONESHOT_STOPPED,
crash-vl>
crash-vl> p hrtimer_bases:11 | grep -e softirq_activated -e softirq_expires_next -e softirq_next_timer -e expires_next | paste - - - - | pr -Tn -N 0 | column -t
0 softirq_activated = 0x0, expires_next = 0x7fffffffffffffff, softirq_expires_next = 0x2ba5dc165ee1, softirq_next_timer = 0x0,
crash-vl>
So no APIC timer interrupts are scheduled for CPU 11, and this timer won't be serviced until something triggers hrtimer_run_softirq() to run from softirq_vec[HRTIMER_SOFTIRQ], such as another timer being queued to run on CPU 11.
The occurrences of HRTIMER_SOFTIRQ in the source:
0 interrupt.h <global> 525 HRTIMER_SOFTIRQ,
1 hrtimer.c hrtimer_interrupt 1653 raise_softirq_irqoff(HRTIMER_SOFTIRQ);
2 hrtimer.c hrtimer_run_queues 1767 raise_softirq_irqoff(HRTIMER_SOFTIRQ);
3 hrtimer.c hrtimers_init 2146 open_softirq(HRTIMER_SOFTIRQ, hrtimer_run_softirq);
1626 void hrtimer_interrupt(struct clock_event_device *dev)
1627 {
. . .
1650 if (!ktime_before(now, cpu_base->softirq_expires_next)) {
1651 cpu_base->softirq_expires_next = KTIME_MAX;
1652 cpu_base->softirq_activated = 1;
1653 raise_softirq_irqoff(HRTIMER_SOFTIRQ); <-------
1654 }
The interrupt path is apic_timer_interrupt() -> smp_apic_timer_interrupt() -> hrtimer_interrupt().
The customer has noted two possible workarounds.
1: Running "taskset -c 11 bash" will trigger the timer.
2: Running the app as realtime (RR 1) avoids the issue.
As I read the code, the path when setting the timer is:
hrtimer_nanosleep()->
do_nanosleep()->
hrtimer_sleeper_start_expires()->
hrtimer_start_expires()->
hrtimer_start_range_ns()->
__hrtimer_start_range_ns()
1095 static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
1096 u64 delta_ns, const enum hrtimer_mode mode,
1097 struct hrtimer_clock_base *base)
1098 {
. . .
1111 /* Switch the timer base, if necessary: */
1112 new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED); <--- base is decided here
1113
1114 return enqueue_hrtimer(timer, new_base, mode); <---
235 switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
236 int pinned)
237 {
238 struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
239 struct hrtimer_clock_base *new_base;
240 int basenum = base->index;
241
242 this_cpu_base = this_cpu_ptr(&hrtimer_bases);
243 new_cpu_base = get_target_base(this_cpu_base, pinned); <----
If sysctl kernel.timer_migration=0, then get_target_base() returns this_cpu_base as new_cpu_base for both RT and non-RT tasks.
211 static inline
212 struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
213 int pinned)
214 {
215 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
216 if (static_branch_likely(&timers_migration_enabled) && !pinned) <--
217 return &per_cpu(hrtimer_bases, get_nohz_timer_target()); <-- call if not pinned/ not rt
218 #endif
219 return base;
If it were off, we would see the hrtimer placed on the isolcpu's timer list, just as we see in the dumps.
But kernel.timer_migration is on in the dumps, so I am not sure this applies.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:7444