Bug 2037769 - Softirq hrtimers are being placed on the per-CPU softirq clocks on isolcpus.
Summary: Softirq hrtimers are being placed on the per-CPU softirq clocks on isolcpus.
Keywords:
Status: CLOSED ERRATA
Alias: None
Deadline: 2022-06-13
Product: Red Hat Enterprise Linux 8
Classification: Red Hat
Component: kernel-rt
Version: 8.4
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Clark Williams
QA Contact: Qiao Zhao
URL:
Whiteboard:
Depends On: 2071776
Blocks: 2045902
 
Reported: 2022-01-06 14:24 UTC by vlovejoy
Modified: 2022-11-08 09:18 UTC
CC List: 11 users

Fixed In Version: kernel-rt-4.18.0-391.rt7.176.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2071776 (view as bug list)
Environment:
Last Closed: 2022-11-08 09:10:42 UTC
Type: Bug
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-106966 0 None None None 2022-01-06 14:31:38 UTC
Red Hat Product Errata RHSA-2022:7444 0 None None None 2022-11-08 09:12:16 UTC

Description vlovejoy 2022-01-06 14:24:08 UTC
Description of problem:

The softirq timers for non-RT tasks are being placed on the per-CPU softirq clock bases on isolated CPUs (isolcpus) instead of being moved to the housekeeping CPUs, and the APIC timer for them is not being reprogrammed. So these timers can expire late or, depending on what is happening on the CPU, not expire at all.

Version-Release number of selected component (if applicable):


How reproducible:
Happens often

Steps to Reproduce:
1. N/A

Actual results:


Expected results:


Additional info:

We are seeing this in the RT kernel starting at version 
4.18.0-305.14.1.rt7.86.el8_4.x86_64.

The version kernel-rt-4.18.0-305.12.1.rt7.84.el8_4 does not have this problem. 

There are 31 patches between the two versions; these are the ones of concern:

$ git log --pretty=oneline  kernel-rt-4.18.0-305.12.1.rt7.84.el8_4..kernel-rt-4.18.0-305.14.1.rt7.86.el8_4   | grep -e tick  -e time
a0fed4acf70e2c5c2c07339819c65b0a528e2549 Merge: time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
59ca0686bd4d16011aeb26661ae158d6591932a3 Merge: tick/nohz: Avoid latency spike on nohz full cpus
b2d3e52448290572b7a09310d4026e039203fcbf time/hrtimer: Embed hrtimer mode into hrtimer_sleeper
cab277b7d32a3403801a3718367a0ce86faf9322 time/hrtimer: Add PINNED_HARD mode for realtime hrtimers
4472d079cc1427a837e3f3bdb7bd958f3e60e12a tick/nohz: Kick only _queued_ task whose tick dependency is updated
693443d4c7b4664908b86927fae3227f4360248c tick/nohz: Change signal tick dependency to wake up CPUs of member tasks
52fdcac7277b837ca9943542a0903d497fe56c73 tick/nohz: Only wake up a single target cpu when kicking a task
5ec28bffbb49176c2176b90934f589a230973d20 tick/nohz: Narrow down noise while setting current task's tick dependency
$



crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/4.18.0-305.19.1.rt7.91.el8_4.x86_64/vmlinux  [TAINTED]
    DUMPFILE: /cores/retrace/tasks/510256984/crash/vmcore  [PARTIAL DUMP]
        CPUS: 36
        DATE: Mon Dec 20 12:25:51 EST 2021
      UPTIME: 13:21:54
LOAD AVERAGE: 10.70, 11.13, 11.27
       TASKS: 1083
    NODENAME: 
     RELEASE: 4.18.0-305.19.1.rt7.91.el8_4.x86_64
     VERSION: #1 SMP PREEMPT_RT Wed Sep 8 13:32:17 EDT 2021
     MACHINE: x86_64  (3000 Mhz)
      MEMORY: 63.6 GB
       PANIC: "sysrq: SysRq : Trigger a crash"
crash> sys -i
        DMI_BIOS_VENDOR: Dell Inc.
       DMI_BIOS_VERSION: 2.10.0
          DMI_BIOS_DATE: 11/12/2020
         DMI_SYS_VENDOR: Dell Inc.
       DMI_PRODUCT_NAME: PowerEdge R740
    DMI_PRODUCT_VERSION: 
     DMI_PRODUCT_SERIAL: 
       DMI_PRODUCT_UUID: 
        DMI_PRODUCT_SKU: SKU=0715;ModelName=PowerEdge R740
     DMI_PRODUCT_FAMILY: PowerEdge
       DMI_BOARD_VENDOR: Dell Inc.
         DMI_BOARD_NAME: 0JMK61
      DMI_BOARD_VERSION: A00
       DMI_BOARD_SERIAL: 
     DMI_CHASSIS_VENDOR: Dell Inc.
       DMI_CHASSIS_TYPE: 23
    DMI_CHASSIS_VERSION: 
     DMI_CHASSIS_SERIAL: 
  DMI_CHASSIS_ASSET_TAG: 
crash> 



The application is hung and the customer has triggered the dump.

They have identified that the app is waiting for pid 124891 to run before it can continue.




crash> bt 124891
PID: 124891  TASK: ffff9a6df5a49f40  CPU: 11  COMMAND: " asyncRun"
 #0 [ffffae5889f67d70] __schedule at ffffffff96f5711d
 #1 [ffffae5889f67e10] schedule at ffffffff96f576a6
 #2 [ffffae5889f67e20] do_nanosleep at ffffffff96f5a4d6
 #3 [ffffae5889f67e68] hrtimer_nanosleep at ffffffff96772212
 #4 [ffffae5889f67f10] __x64_sys_nanosleep at ffffffff96772586
 #5 [ffffae5889f67f38] do_syscall_64 at ffffffff96602bb7
 #6 [ffffae5889f67f50] entry_SYSCALL_64_after_hwframe at ffffffff970000ad
    RIP: 00007f531965cd98  RSP: 00007f52f3a4fe00  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 00007f52f3a4fe30  RCX: 00007f531965cd98
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007f52f3a4fe30
    RBP: 0000000000000000   R8: 00007f52ec006910   R9: 0000000000000007
    R10: 0000000000000010  R11: 0000000000000293  R12: 00007f52f3a4fe90
    R13: 00007f520801a640  R14: 00007f52f3a4fea0  R15: 0000000000000025
    ORIG_RAX: 0000000000000023  CS: 0033  SS: 002b
crash> 

We see that it is in a sleep call and has been there for 65 seconds, though the requested sleep was only 1 millisecond.

crash> ps -m 124891
[0 00:01:05.659] [IN]  PID: 124891  TASK: ffff9a6df5a49f40  CPU: 11  COMMAND: " asyncRun"
crash> 



The timeout for this nanosleep is 1ms (1000000 ns)

crash> timespec64 ffffae5889f67f18 -d
struct timespec64 {
  tv_sec = 0, 
  tv_nsec = 1000000  <-----
}
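For context, the failing scenario is simply a non-RT task on an isolated CPU doing short nanosleeps. A minimal userspace sketch of that scenario (hypothetical code, not the customer's application; the CPU number and sleep length are assumptions taken from this dump):

    /* sleeper.c - hypothetical sketch: a non-RT (SCHED_OTHER) task pinned
     * to an isolated CPU doing a 1 ms nanosleep and reporting how badly
     * it oversleeps.  Build: gcc -o sleeper sleeper.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        cpu_set_t set;
        struct timespec req = { .tv_sec = 0, .tv_nsec = 1000000 };  /* 1 ms */
        struct timespec t0, t1;

        /* Pin ourselves to CPU 11 (one of the isolcpus in this case). */
        CPU_ZERO(&set);
        CPU_SET(11, &set);
        if (sched_setaffinity(0, sizeof(set), &set) == -1) {
            perror("sched_setaffinity");
            return 1;
        }

        for (;;) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            nanosleep(&req, NULL);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            long long slept = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                            + (t1.tv_nsec - t0.tv_nsec);
            /* With the bug, "slept" can be seconds instead of ~1.05 ms. */
            printf("slept %lld ns\n", slept);
        }
    }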

Looking at the hrtimer_sleeper struct on the stack. Note that expires is _softexpires plus 50000 ns, the default 50 µs timer slack applied to non-RT tasks.



#3 [ffffae5889f67e68] hrtimer_nanosleep at ffffffff96772212

crash> hrtimer_sleeper ffffae5889f67e78 <- from r14
struct hrtimer_sleeper {
  timer = {
    node = {
      node = {
        __rb_parent_color = 0x1, 
        rb_right = 0x0, 
        rb_left = 0x0
      }, 
      expires = 0x2bb31813370e <------ 48048203052814  
    }, 
    _softexpires = 0x2bb3181273be,  <- 48048203002814   
    function = 0xffffffff96770fa0, <- hrtimer_wakeup
    base = 0xffff9a775f95f080,   <- clock index 4
    state = 0x1, 
    is_rel = 0x0, 
    is_soft = 0x1, 
    is_hard = 0x0, 
    hrtimer_size_rh = 0x0, 
    _rh = 0x0
  }, 
  task = 0xffff9a6df5a49f40, 
  mode = HRTIMER_MODE_REL
}
crash> 



    ffffae5889f67e70: ffff9a76693a9a48 0000000000000001 
                                       ^-- start of hrtimer_sleeper struct
    ffffae5889f67e80: 0000000000000000 0000000000000000 
    ffffae5889f67e90: 00002bb31813370e 00002bb3181273be 
    ffffae5889f67ea0: ffffffff96770fa0 ffff9a775f95f080 
    ffffae5889f67eb0: 0000000000010001 0000000000000000 
    ffffae5889f67ec0: 0000000000000000 ffff9a6df5a49f40 
                                       +68=can
    ffffae5889f67ed0: ffffffff00000001 fa62ab6d46cc8e00 
                      rbx              rbp
    ffffae5889f67ee0: 0000000000000000 ffffae5889f67f58 
                      r12              r13
    ffffae5889f67ef0: 0000000000000000 0000000000000000
                      r14              r15 
    ffffae5889f67f00: 0000000000000000 0000000000000000 
    ffffae5889f67f10: ffffffff96772586 
 #4 [ffffae5889f67f10] __x64_sys_nanosleep at ffffffff96772586

Looking for it with the timer command, it's not there.

In RHEL 8 there are now 8 per-CPU hrtimer clock bases for each CPU, but the timer -r command only displays 3 of them.


Using a modified version of crash that shows all 8 clock bases, we can see the timer on CPU 11, clock 4.

crash-vl> timer -r | grep ffffae5889f67e78
  48048203002814  48048203052814  -65843947186  ffffae5889f67e78  ffffffff96770fa0  <hrtimer_wakeup>
crash-vl> 

And we see that it is on CPU 11, clock 4:
crash-vl> timer -r -C 11
CPU: 11  HRTIMER_CPU_BASE: ffff9a775f95ee00
  CLOCK: 0  HRTIMER_CLOCK_BASE: ffff9a775f95ee80  [ktime_get]
  (empty)

  CLOCK: 1  HRTIMER_CLOCK_BASE: ffff9a775f95ef00  [ktime_get_real]
  (empty)

  CLOCK: 2  HRTIMER_CLOCK_BASE: ffff9a775f95ef80  [ktime_get_boottime]
  (empty)

  CLOCK: 3  HRTIMER_CLOCK_BASE: ffff9a775f95f000  [ktime_get_clocktai]
  (empty)

  CLOCK: 4  HRTIMER_CLOCK_BASE: ffff9a775f95f080  [ktime_get]
      CURRENT   
  48114047000000
    SOFTEXPIRES       EXPIRES          TTE          HRTIMER           FUNCTION    
  48048203002814  48048203052814  -65843947186  ffffae5889f67e78  ffffffff96770fa0  <hrtimer_wakeup>  <------------

  CLOCK: 5  HRTIMER_CLOCK_BASE: ffff9a775f95f100  [ktime_get_real]
        CURRENT      
  1640021151290259551
      SOFTEXPIRES            EXPIRES             TTE          HRTIMER           FUNCTION    
  1640021064148260307  1640021064148310307  -87141949244  ffffae588a9c7d30  ffffffff96770fa0  <hrtimer_wakeup>

  CLOCK: 6  HRTIMER_CLOCK_BASE: ffff9a775f95f180  [ktime_get_boottime]
  (empty)

  CLOCK: 7  HRTIMER_CLOCK_BASE: ffff9a775f95f200  [ktime_get_clocktai]
  (empty)
crash-vl> 
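For reference, the TTE (time-to-expire) column is just EXPIRES minus CURRENT; for our timer:

    48048203052814 - 48114047000000 = -65843947186 ns  (about -65.8 s)

which matches the ~65 seconds this task has been sleeping per ps -m above.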



This task is running on isolcpu 11 and is not realtime.

[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-305.19.1.rt7.91.el8_4.x86_64 root=/dev/mapper/rootvg-root ro rd.lvm.lv=rootvg/root rd.lvm.lv=rootvg/swap rd.driver.blacklist=sfc rd.driver.blacklist=qla2xxx rd.driver.blacklist=lpfc crashkernel=auto biosdevname=0 net.ifnames=1 console=tty0 console=ttyS1,115200 consoleblank=0 noibrs noibpb nopti tsc=reliable ipmi_si.kipmid_max_busy_us=100 processor.max_cstate=0 intel_idle.max_cstate=0

 isolcpus=1,3,5,7,9,11,13,15-35 <----------------


nohz_full=1,3,5,7,9,11,13,15-35 rcu_nocbs=1,3,5,7,9,11,13,15-35 rd.earlykdump intel_pstate=disable

crash-vl> task -R policy -R sched_class 124891
PID: 124891  TASK: ffff9a6df5a49f40  CPU: 11  COMMAND: " asyncRun"
  policy = 0x0,    <---------
  sched_class = 0xffffffff97423700, 

crash-vl> sym 0xffffffff97423700
ffffffff97423700 (R) fair_sched_class
crash-vl> 


From the code, it looks to me like this timer should have been moved to one of the housekeeping CPUs.

crash-vl> p housekeeping_mask
housekeeping_mask = $7 = 
 {{
    bits = {0x5555, 


crash-vl> eval -b 0x5555 | grep bits
   bits set: 14 12 10 8 6 4 2 0   <- one of these cpus
crash-vl> 
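(For reference, a trivial C loop decodes the same mask — purely illustrative:)

    /* decode housekeeping_mask bits 0x5555 -> CPUs 0 2 4 6 8 10 12 14 */
    #include <stdio.h>

    int main(void)
    {
        unsigned long mask = 0x5555;

        for (int cpu = 0; cpu < 16; cpu++)
            if (mask & (1UL << cpu))
                printf("%d ", cpu);
        printf("\n");
        return 0;
    }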


    190 enum hrtimer_base_type {
    191         HRTIMER_BASE_MONOTONIC,
    192         HRTIMER_BASE_REALTIME,
    193         HRTIMER_BASE_BOOTTIME,
    194         HRTIMER_BASE_TAI,
    195         HRTIMER_BASE_MONOTONIC_SOFT,   <- clock 4
    196         HRTIMER_BASE_REALTIME_SOFT,
    197         HRTIMER_BASE_BOOTTIME_SOFT,
    198         HRTIMER_BASE_TAI_SOFT,
    199         HRTIMER_MAX_CLOCK_BASES,
    200 };
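As a cross-check, the clock index can be derived directly from the base pointer in the hrtimer struct: CPU 11's clock_base[] array starts at ffff9a775f95ee80 (CLOCK 0 in the timer -r -C 11 output above), and each hrtimer_clock_base in this dump is 0x80 bytes, so:

    (0xffff9a775f95f080 - 0xffff9a775f95ee80) / 0x80 = 4   (HRTIMER_BASE_MONOTONIC_SOFT)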

In the dump we can see an example of that working correctly for pid 138409: a task running on isolcpu 9 whose timer was migrated to housekeeping CPU 0.

crash> bt 138409
PID: 138409  TASK: ffff9a6e6d43be80  CPU: 9   COMMAND: "IFPGA_refill"
 #0 [ffffae588d50fd70] __schedule at ffffffff96f5711d
 #1 [ffffae588d50fe10] schedule at ffffffff96f576a6
 #2 [ffffae588d50fe20] do_nanosleep at ffffffff96f5a4d6
 #3 [ffffae588d50fe68] hrtimer_nanosleep at ffffffff96772212
 #4 [ffffae588d50ff10] __x64_sys_nanosleep at ffffffff96772586
 #5 [ffffae588d50ff38] do_syscall_64 at ffffffff96602bb7
 #6 [ffffae588d50ff50] entry_SYSCALL_64_after_hwframe at ffffffff970000ad
    RIP: 00007f531965cd98  RSP: 00007f52822eaf50  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 00007f52822eaf80  RCX: 00007f531965cd98
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 00007f52822eaf80
    RBP: 0000000000000000   R8: 0000000000000012   R9: 0000000000000011
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000000
    R13: 00007f52822eb000  R14: 000000000918a2f8  R15: 00007f52822eb380
    ORIG_RAX: 0000000000000023  CS: 0033  SS: 002b
crash> 

crash> hrtimer_sleeper.task,mode ffffae588d50fe78
  task = 0xffff9a6e6d43be80
  mode = HRTIMER_MODE_REL
crash> 


crash-vl> timer -r -C 0 | grep -e CLOCK: -e ffffae588d50fe78
  CLOCK: 0  HRTIMER_CLOCK_BASE: ffff9a6f5fc1ee80  [ktime_get]
  CLOCK: 1  HRTIMER_CLOCK_BASE: ffff9a6f5fc1ef00  [ktime_get_real]
  CLOCK: 2  HRTIMER_CLOCK_BASE: ffff9a6f5fc1ef80  [ktime_get_boottime]
  CLOCK: 3  HRTIMER_CLOCK_BASE: ffff9a6f5fc1f000  [ktime_get_clocktai]
  CLOCK: 4  HRTIMER_CLOCK_BASE: ffff9a6f5fc1f080  [ktime_get]

  48114131624069  48114131674069    84674069  ffffae588d50fe78  ffffffff96770fa0  <hrtimer_wakeup>   <--- timer for pid 138409 (task on CPU 9) queued here on CPU 0

  CLOCK: 5  HRTIMER_CLOCK_BASE: ffff9a6f5fc1f100  [ktime_get_real]
  CLOCK: 6  HRTIMER_CLOCK_BASE: ffff9a6f5fc1f180  [ktime_get_boottime]
  CLOCK: 7  HRTIMER_CLOCK_BASE: ffff9a6f5fc1f200  [ktime_get_clocktai]
crash-vl> 


Looking at the code of tick_program_event(), it appears that CPU 11's clock event device has been stopped and no timer interrupt has been programmed.

     27 int tick_program_event(ktime_t expires, int force)
     28 {
     29         struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
     30 
     31         if (unlikely(expires == KTIME_MAX)) {
     32                 /*
     33                  * We don't need the clock event device any more, stop it.
     34                  */
     35                 clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);   <---------
     36                 dev->next_event = KTIME_MAX;   <-------
     37                 return 0;   <---------------------------------------------------------
     38         }
     39 
     40         if (unlikely(clockevent_state_oneshot_stopped(dev))) {
     41                 /*
     42                  * We need the clock event again, configure it in ONESHOT mode
     43                  * before using it.
     44                  */
     45                 clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
     46         }
     47 
     48         return clockevents_program_event(dev, expires, force);
     49 }

crash-vl> p lapic_events:11 | grep -e state_use_accessors -e next_event | paste - - - | pr -Tn -N 0 | column -t 
0  set_next_event  =  0xffffffff96651b60,  next_event  =  0x7fffffffffffffff,  state_use_accessors  =  CLOCK_EVT_STATE_ONESHOT_STOPPED,
crash-vl> 

crash-vl> p hrtimer_bases:11 | grep -e softirq_activated -e softirq_expires_next -e softirq_next_timer -e expires_next | paste - - - - | pr -Tn -N 0 | column -t
0  softirq_activated  =  0x0,  expires_next  =  0x7fffffffffffffff,  softirq_expires_next  =  0x2ba5dc165ee1,  softirq_next_timer  =  0x0,
crash-vl> 


So no APIC timer interrupts are scheduled for CPU 11 (the LAPIC clock event device is in CLOCK_EVT_STATE_ONESHOT_STOPPED with next_event = KTIME_MAX), and this timer won't be serviced until something triggers hrtimer_run_softirq() to run from softirq_vec[HRTIMER_SOFTIRQ], such as another timer being started on CPU 11.



The occurrences of HRTIMER_SOFTIRQ in the source:

0 interrupt.h <global>            525 HRTIMER_SOFTIRQ,
1 hrtimer.c   hrtimer_interrupt  1653 raise_softirq_irqoff(HRTIMER_SOFTIRQ);
2 hrtimer.c   hrtimer_run_queues 1767 raise_softirq_irqoff(HRTIMER_SOFTIRQ);
3 hrtimer.c   hrtimers_init	 2146 open_softirq(HRTIMER_SOFTIRQ, hrtimer_run_softirq);


   1626 void hrtimer_interrupt(struct clock_event_device *dev)
   1627 {
. . .


   1650         if (!ktime_before(now, cpu_base->softirq_expires_next)) {
   1651                 cpu_base->softirq_expires_next = KTIME_MAX;
   1652                 cpu_base->softirq_activated = 1;

   1653                 raise_softirq_irqoff(HRTIMER_SOFTIRQ);   <-------
   1654         }




The interrupt path is apic_timer_interrupt()->smp_apic_timer_interrupt()->hrtimer_interrupt().


The customer has noted two possible workarounds:
1: Running taskset -c 11 bash will trigger the timer.
2: Running the app with a realtime policy (RR 1) avoids the issue (see the sketch below).
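A minimal sketch of workaround 2 (hypothetical code, not from the case; the app presumably does this via chrt or its own initialization):

    /* Hypothetical sketch of workaround 2: give the task an RT policy
     * (SCHED_RR, priority 1), which, per the PINNED_HARD patch listed
     * above, keeps its sleeper hrtimers off the softirq clock bases. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 1 };

        if (sched_setscheduler(0, SCHED_RR, &sp) == -1) {
            perror("sched_setscheduler");  /* needs CAP_SYS_NICE / root */
            return 1;
        }
        /* ... run the latency-sensitive work here ... */
        return 0;
    }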

Comment 3 vlovejoy 2022-01-06 20:56:49 UTC
As I see it, the path when setting the timer is:

hrtimer_nanosleep()->
  do_nanosleep()->
    hrtimer_sleeper_start_expires()->
      hrtimer_start_expires()->
        hrtimer_start_range_ns()->
           __hrtimer_start_range_ns()


   1095 static int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
   1096                                     u64 delta_ns, const enum hrtimer_mode mode,
   1097                                     struct hrtimer_clock_base *base)
   1098 {
   . . .
   1111         /* Switch the timer base, if necessary: */
   1112         new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED);  <--- base is decided here
   1113
   1114         return enqueue_hrtimer(timer, new_base, mode);  <---


    235 switch_hrtimer_base(struct hrtimer *timer, struct hrtimer_clock_base *base,
    236                     int pinned)
    237 {
    238         struct hrtimer_cpu_base *new_cpu_base, *this_cpu_base;
    239         struct hrtimer_clock_base *new_base;
    240         int basenum = base->index;
    241
    242         this_cpu_base = this_cpu_ptr(&hrtimer_bases);
    243         new_cpu_base = get_target_base(this_cpu_base, pinned);   <---- target base decided here


If sysctl kernel.timer_migration=0, get_target_base() returns this_cpu_base as new_cpu_base for both RT and non-RT tasks.



    211 static inline
    212 struct hrtimer_cpu_base *get_target_base(struct hrtimer_cpu_base *base,
    213                                          int pinned)
    214 {
    215 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
    216         if (static_branch_likely(&timers_migration_enabled) && !pinned)   <--
    217                 return &per_cpu(hrtimer_bases, get_nohz_timer_target());   <-- called if not pinned / not RT
    218 #endif
    219         return base;



If it were off, then we would see the hrtimer placed on the isolcpu's timer list, just like we see in the dumps.

But kernel.timer_migration is on in the dumps, so I'm not sure this applies.
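For what it's worth, on a live system the setting can be confirmed with sysctl kernel.timer_migration; a minimal C sketch doing the same check:

    /* check kernel.timer_migration on a live system (sketch) */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/kernel/timer_migration", "r");
        int val = -1;

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%d", &val) != 1)
            val = -1;
        fclose(f);
        printf("kernel.timer_migration = %d\n", val);  /* 1 = migration enabled */
        return 0;
    }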

Comment 62 errata-xmlrpc 2022-11-08 09:10:42 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:7444

