Bug 1798543 - Infinite looping when trying to acquire eventpoll->mtx during eventpoll_release_file
Summary: Infinite looping when trying to acquire eventpoll->mtx during eventpoll_release_file
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Luis Claudio R. Goncalves
QA Contact: Qiao Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1781949 1814310
 
Reported: 2020-02-05 15:01 UTC by Sterling Alexander
Modified: 2023-12-15 17:16 UTC
CC List: 14 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1814310 (view as bug list)
Environment:
Last Closed: 2023-01-06 19:33:40 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-144078 0 None None None 2023-01-06 01:03:04 UTC

Description Sterling Alexander 2020-02-05 15:01:41 UTC
Description of problem:

During teardown of an eventpoll file, a CPU loops infinitely while trying to acquire a mutex.  This mutex should be an eventpoll->mtx, but the address passed in lies inside the sock_inode_cache slab, suggesting some sort of race or use-after-free.

[INVESTIGATION]

[I:1]  This is the core that was analyzed:

  crash> sys
        KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-957.12.2.rt56.929.el7.x86_64/vmlinux
      DUMPFILE: /cores/retrace/tasks/806338892/crash/vmcore  [PARTIAL DUMP]
          CPUS: 80
          DATE: Thu Jan 30 09:05:34 2020
        UPTIME: 105 days, 02:10:46
  LOAD AVERAGE: 106.69, 106.04, 105.80
         TASKS: 3809
      NODENAME: xxxxxxxxxxxxxxxxxxxx
       RELEASE: 3.10.0-957.12.2.rt56.929.el7.x86_64
       VERSION: #1 SMP PREEMPT RT Fri Apr 26 15:08:12 UTC 2019
       MACHINE: x86_64  (1600 Mhz)
        MEMORY: 382.7 GB
         PANIC: "SysRq : Trigger a crash"

We do note a few tainted modules loaded:

  crash> mod -t
  NAME     TAINTS
  i40e     OE
  overlay  T


[I:2]  Note the kernel ring buffer has wrapped with RCU stall messages for CPU 40 (see [I:3] and [I:6]), so we will pull the saved command line from memory:

  crash> rd -a ffff92ac3ff8a180
  ffff92ac3ff8a180:  BOOT_IMAGE=/vmlinuz-3.10.0-957.12.2.rt56.929.el7.x86_64 root
  ffff92ac3ff8a1bc:  =/dev/mapper/UHN6suzku1cvcm01_vg_root-lv_root ro audit=1 cra
  ffff92ac3ff8a1f8:  shkernel=auto rd.lvm.lv=UHN6suzku1cvcm01_vg_root/lv_root rd.
  ffff92ac3ff8a234:  lvm.lv=UHN6suzku1cvcm01_vg_root/lv_swap nomodeset console=tt
  ffff92ac3ff8a270:  y0 elevator=cfq hugepagesz=1G hugepages=263 pci-stub.ids=113
  ffff92ac3ff8a2ac:  7:0071,8086:10ed,8086:154c intel_idle.max_cstate=1 console=t
  ffff92ac3ff8a2e8:  tyS1,115200n8 console=tty0 nomodeset iommu=pt intel_iommu=on
  ffff92ac3ff8a324:   tsc=reliable idle=poll intel_pstate=disable rcu_nocb_poll=1
  ffff92ac3ff8a360:   clocksource=tsc pcie_aspm.policy=performance skew_tick=1 is
  ffff92ac3ff8a39c:  olcpus=1-19,21-39,41-59,61-79 intel_pstate=disable nosoftloc
  ffff92ac3ff8a3d8:  kup nohz=on nohz_full=1-19,21-39,41-59,61-79 rcu_nocbs=1-19,
  ffff92ac3ff8a414:  21-39,41-59,61-79

Looking at the boot options, we note that nosoftlockup has been set (soft lockup detection disabled) and isolcpus has been set, leaving only CPUs 0, 20, 40, and 60 available for general scheduling.


[I:3]  Looking at those CPUs' runqueues, we note a significant backlog on CPU 40:

  crash> runq -c 0
  CPU 0 RUNQUEUE: ffff927b7da1bfc0
    CURRENT: PID: 4      TASK: ffff924e181d10c0  COMMAND: "ktimersoftd/0"
    RT PRIO_ARRAY: ffff927b7da1c158
       [no tasks queued]
    CFS RB_ROOT: ffff927b7da1c060
       [120] PID: 29     TASK: ffff924e1749c300  COMMAND: "rcuos/1"
       [120] PID: 107    TASK: ffff924e176fd3c0  COMMAND: "rcuop/9"
       [120] PID: 308    TASK: ffff924e1738c300  COMMAND: "rcuos/29"
       [120] PID: 596    TASK: ffff924e16af6480  COMMAND: "rcuob/58"
       [120] PID: 527    TASK: ffff924e168e8000  COMMAND: "rcuos/51"
       [120] PID: 348    TASK: ffff924e16c453c0  COMMAND: "rcuob/33"
       [120] PID: 15     TASK: ffff924e17ee53c0  COMMAND: "rcu_sched"

  crash> runq -c 20
  CPU 20 RUNQUEUE: ffff92ab7be1bfc0
    CURRENT: PID: 357044  TASK: ffff927e89af10c0  COMMAND: "bash" 
    RT PRIO_ARRAY: ffff92ab7be1c158
       [no tasks queued]
    CFS RB_ROOT: ffff92ab7be1c060
       [120] PID: 90087  TASK: ffff92ab6a46a180  COMMAND: "nova-conductor"

  crash> runq -c 40
  CPU 40 RUNQUEUE: ffff927b7df1bfc0
    CURRENT: PID: 72756  TASK: ffff92ab5582e480  COMMAND: "haproxy"
    RT PRIO_ARRAY: ffff927b7df1c158
       [ 49] PID: 26915  TASK: ffff92ab707710c0  COMMAND: "irq/418-i40e-sa"
       [ 49] PID: 26901  TASK: ffff92ab706f90c0  COMMAND: "irq/404-i40e-sa"
       [ 49] PID: 26914  TASK: ffff92ab70770000  COMMAND: "irq/417-i40e-sa"
       [ 49] PID: 26786  TASK: ffff92ab72ea6480  COMMAND: "irq/321-i40e-sa"
       [ 49] PID: 26910  TASK: ffff92ab70743240  COMMAND: "irq/413-i40e-sa"
       [ 49] PID: 26908  TASK: ffff92ab707410c0  COMMAND: "irq/411-i40e-sa"
       [ 49] PID: 26909  TASK: ffff92ab70742180  COMMAND: "irq/412-i40e-sa"
       [ 49] PID: 26949  TASK: ffff927b7269a180  COMMAND: "irq/452-i40e-sa"
       [ 49] PID: 26781  TASK: ffff92ab7059e480  COMMAND: "irq/316-i40e-sa"
       [ 49] PID: 26779  TASK: ffff92ab7059b240  COMMAND: "irq/314-i40e-sa"
    CFS RB_ROOT: ffff927b7df1c060
       [120] PID: 54102  TASK: ffff929074038000  COMMAND: "td-agent"
       [120] PID: 92186  TASK: ffff92879278d3c0  COMMAND: "httpd"
       [120] PID: 83155  TASK: ffff9288d8376480  COMMAND: "handler482"
       [120] PID: 83193  TASK: ffff9288867bc300  COMMAND: "handler520"
       [120] PID: 83230  TASK: ffff9288b87db240  COMMAND: "revalidator554"
       [120] PID: 83228  TASK: ffff9288b87d90c0  COMMAND: "revalidator552"
       [120] PID: 83223  TASK: ffff9288b87c3240  COMMAND: "revalidator547"
       [120] PID: 83229  TASK: ffff9288b87da180  COMMAND: "revalidator553"
       [120] PID: 83226  TASK: ffff9288b87c6480  COMMAND: "revalidator550"
       [120] PID: 83219  TASK: ffff9288b8706480  COMMAND: "revalidator543"
       [120] PID: 83233  TASK: ffff9288b87de480  COMMAND: "revalidator557"
       [120] PID: 83236  TASK: ffff9288a02ea180  COMMAND: "revalidator560"
       [120] PID: 83235  TASK: ffff9288a02e90c0  COMMAND: "revalidator559"
       [120] PID: 83227  TASK: ffff9288b87d8000  COMMAND: "revalidator551"
       [120] PID: 83220  TASK: ffff9288b87c0000  COMMAND: "revalidator544"
       [120] PID: 83222  TASK: ffff9288b87c2180  COMMAND: "revalidator546"
       [120] PID: 83237  TASK: ffff9288a02eb240  COMMAND: "revalidator561"
       [120] PID: 83231  TASK: ffff9288b87dc300  COMMAND: "revalidator555"
       [120] PID: 83225  TASK: ffff9288b87c53c0  COMMAND: "revalidator549"
       [120] PID: 83221  TASK: ffff9288b87c10c0  COMMAND: "revalidator545"
       [120] PID: 83232  TASK: ffff9288b87dd3c0  COMMAND: "revalidator556"
       [120] PID: 83234  TASK: ffff9288a02e8000  COMMAND: "revalidator558"
       [120] PID: 83218  TASK: ffff9288b87053c0  COMMAND: "revalidator542"
       [120] PID: 83224  TASK: ffff9288b87c4300  COMMAND: "revalidator548"
       [120] PID: 75299  TASK: ffff9289ac654300  COMMAND: "18_scheduler"
       [120] PID: 75300  TASK: ffff9289ac6553c0  COMMAND: "19_scheduler"
       [120] PID: 75304  TASK: ffff92ab6abdc300  COMMAND: "23_scheduler"
       [120] PID: 75305  TASK: ffff92ab6abdd3c0  COMMAND: "24_scheduler"
       [120] PID: 75316  TASK: ffff92ab55b1d3c0  COMMAND: "35_scheduler"
       [120] PID: 75331  TASK: ffff9289bd7f10c0  COMMAND: "50_scheduler"
       [120] PID: 75337  TASK: ffff9289bd680000  COMMAND: "56_scheduler"
       [120] PID: 75338  TASK: ffff9289bd6810c0  COMMAND: "57_scheduler"
       [120] PID: 75342  TASK: ffff9289bd7f53c0  COMMAND: "61_scheduler"
       [120] PID: 75346  TASK: ffff9289bd762180  COMMAND: "65_scheduler"
       [120] PID: 75358  TASK: ffff92ab558c8000  COMMAND: "77_scheduler"
       [120] PID: 75284  TASK: ffff9289b1f72180  COMMAND: "3_scheduler"
       [120] PID: 88485  TASK: ffff928834632180  COMMAND: "libvirtd"
       [120] PID: 88483  TASK: ffff928834633240  COMMAND: "libvirtd"
       [120] PID: 88490  TASK: ffff92ab51f9c300  COMMAND: "libvirtd"
       [120] PID: 90511  TASK: ffff9287eb64a180  COMMAND: "nova-scheduler"
       [120] PID: 90474  TASK: ffff92ab56fe6480  COMMAND: "nova-scheduler"
       [120] PID: 90648  TASK: ffff9287f0bfc300  COMMAND: "nova-scheduler"
       [120] PID: 90712  TASK: ffff92590779d3c0  COMMAND: "nova-scheduler"
       [120] PID: 90830  TASK: ffff92ab5bfb0000  COMMAND: "nova-scheduler"
       [120] PID: 42161  TASK: ffff927b45b38000  COMMAND: "tuned"
       [120] PID: 38897  TASK: ffff9290740410c0  COMMAND: "rhel-push-plugi"
       [120] PID: 31027  TASK: ffff927b756da180  COMMAND: "in:imjournal"
       [120] PID: 81532  TASK: ffff92ab653ba180  COMMAND: "docker-current"
       [120] PID: 254569  TASK: ffff9255e9ac0000  COMMAND: "kworker/40:2"
       [100] PID: 138241  TASK: ffff92589a6cc300  COMMAND: "kworker/40:0H"
       [120] PID: 121768  TASK: ffff92ab6ef23240  COMMAND: "dockerd-current"
       [120] PID: 396533  TASK: ffff9258b3656480  COMMAND: "docker-current"

  crash> runq -c 60
  CPU 60 RUNQUEUE: ffff92ab7c31bfc0
    CURRENT: PID: 0      TASK: ffff927d56856480  COMMAND: "swapper/60"
    RT PRIO_ARRAY: ffff92ab7c31c158
       [no tasks queued]
    CFS RB_ROOT: ffff92ab7c31c060
       [no tasks queued]

In fact, CPU 40 has been attempting to run the same task for over 26 hours (95669 seconds):

crash> runq -T

  [...snip...]

  CPU 40: 95669.37 secs


[I:4]  For completeness, looking at the longest blocked tasks on the system:

  crash> ps -m | grep UN

  [...snip...]

  [  1 02:34:17.332] [UN]  PID: 75283  TASK: ffff9289b1f710c0  CPU: 20  COMMAND: "2_scheduler"
  [  1 02:34:20.568] [UN]  PID: 73929  TASK: ffff9289d1f790c0  CPU: 0   COMMAND: "mysqld"
  [  1 02:34:26.631] [UN]  PID: 257990  TASK: ffff9258ba3b2180  CPU: 20  COMMAND: "ipmitool"
  [  1 02:34:29.062] [UN]  PID: 75282  TASK: ffff9289b1f70000  CPU: 0   COMMAND: "1_scheduler"

The longest blocked task has been waiting for ~26 hours, which is consistent with the run time we just saw for CPU 40.  Looking at the backtrace of 1_scheduler (PID 75282) to see what it is waiting on:

  crash> bt 75282
  PID: 75282  TASK: ffff9289b1f70000  CPU: 0   COMMAND: "1_scheduler"
   #0 [ffff928a1df3bc38] __schedule at ffffffffa615a914
   #1 [ffff928a1df3bcc8] schedule at ffffffffa615ae30
   #2 [ffff928a1df3bce0] __rt_mutex_slowlock at ffffffffa615b9cd
   #3 [ffff928a1df3bd40] rt_mutex_slowlock_locked at ffffffffa615be73
   #4 [ffff928a1df3bd90] rt_mutex_slowlock at ffffffffa615bfec
   #5 [ffff928a1df3be28] rt_mutex_lock at ffffffffa615c0df
   #6 [ffff928a1df3be40] _mutex_lock at ffffffffa615d74e
   #7 [ffff928a1df3be50] eventpoll_release_file at ffffffffa5c7ede9
   #8 [ffff928a1df3be90] __fput at ffffffffa5c30a76
   #9 [ffff928a1df3bed8] ____fput at ffffffffa5c30b6e
  #10 [ffff928a1df3bee8] task_work_run at ffffffffa5ab16fb
  #11 [ffff928a1df3bf28] do_notify_resume at ffffffffa5a1ad5e
  #12 [ffff928a1df3bf50] int_signal at ffffffffa61673b8
      RIP: 00007f27254e975d  RSP: 00007f27080fe980  RFLAGS: 00000293
      RAX: 0000000000000000  RBX: 00000000000000f1  RCX: ffffffffffffffff
      RDX: 0000000000912240  RSI: 0000000000000000  RDI: 00000000000000f1
      RBP: 0000000000000000   R8: 0000000000000000   R9: 00007f27080fe660
      R10: 0000000000000004  R11: 0000000000000293  R12: 00000000005bc5b0
      R13: 0000000000000000  R14: 0000000000000004  R15: 00007f26df351b68
      ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b


[I:5]  Inspecting the source code, we find this task waiting on the epmutex:

  crash> whatis epmutex
  struct mutex epmutex;

  crash> sym epmutex
  ffffffffa6cd2c60 (b) epmutex

Dumping the mutex, we can find its current owner:

  crash> struct mutex ffffffffa6cd2c60
  struct mutex {
    lock = {
      wait_lock = {
        raw_lock = {
          val = {
            counter = 0x0
          }
        }
      }, 
      waiters = {
        rb_node = 0xffff9281b66f7b68
      }, 
      waiters_leftmost = 0xffff9281b167fd00, 
      owner = 0xffff92ab5582e481, 
      save_state = 0x0
    }
  }

Who owns it?

  crash> task_struct.comm,pid 0xffff92ab5582e480
    comm = "haproxy\000ord\000t.s"
    pid = 0x11c34
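
Note that the owner field above reads 0xffff92ab5582e481 while the task lookup used 0xffff92ab5582e480: on the RT kernel the low bit of the rt_mutex owner pointer is the "has waiters" flag, so it must be masked off before the value can be treated as a task_struct pointer.  A minimal sketch of that decoding, modeled on the upstream rt_mutex_owner() helper (names taken from rtmutex_common.h upstream, not from this vmcore):

  /* Bit 0 of rt_mutex.owner means "this lock has waiters".
   * 0xffff92ab5582e481 & ~1UL == 0xffff92ab5582e480, the haproxy task. */
  #define RT_MUTEX_HAS_WAITERS    1UL

  static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
  {
          return (struct task_struct *)
                  ((unsigned long)lock->owner & ~RT_MUTEX_HAS_WAITERS);
  }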


[I:6]  The task that owns the mutex is running on CPU 40:

  crash> bt 72756
  PID: 72756  TASK: ffff92ab5582e480  CPU: 40  COMMAND: "haproxy"
   #0 [ffff927b7df09e40] crash_nmi_callback at ffffffffa5a45e57
   #1 [ffff927b7df09e50] nmi_handle at ffffffffa615f8e4
   #2 [ffff927b7df09ea8] do_nmi at ffffffffa615fafd
   #3 [ffff927b7df09ef0] end_repeat_nmi at ffffffffa615ed39
      [exception RIP: native_queued_spin_lock_slowpath+0x122]
      RIP: ffffffffa5b07ed2  RSP: ffff9289cee03d60  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: 0000000000000246  RCX: 0000000001410000
      RDX: ffff927b7da1ce60  RSI: 000000000004c1ff  RDI: ffff9287f5d4ef30
      RBP: ffff9289cee03d60   R8: ffff927b7df1ce40   R9: 0000000000000000
      R10: ffff92592ce48c30  R11: ffff9258c0772410  R12: 0000000000000000
      R13: ffff9287f5d4ef00  R14: 0000000000000002  R15: 0000000000000000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #4 [ffff9289cee03d60] native_queued_spin_lock_slowpath at ffffffffa5b07ed2
   #5 [ffff9289cee03d68] queued_spin_lock_slowpath at ffffffffa615041b
   #6 [ffff9289cee03d78] _raw_spin_lock_irqsave at ffffffffa615db37
   #7 [ffff9289cee03d90] rt_mutex_slowlock at ffffffffa615bfcd
   #8 [ffff9289cee03e28] rt_mutex_lock at ffffffffa615c0df
   #9 [ffff9289cee03e40] _mutex_lock at ffffffffa615d74e
  #10 [ffff9289cee03e50] eventpoll_release_file at ffffffffa5c7ee10
  #11 [ffff9289cee03e90] __fput at ffffffffa5c30a76
  #12 [ffff9289cee03ed8] ____fput at ffffffffa5c30b6e
  #13 [ffff9289cee03ee8] task_work_run at ffffffffa5ab16fb
  #14 [ffff9289cee03f28] do_notify_resume at ffffffffa5a1ad5e
  #15 [ffff9289cee03f50] int_signal at ffffffffa61673b8
      RIP: 00007fe4d38f0620  RSP: 00007ffcdeb4ed08  RFLAGS: 00000246
      RAX: 0000000000000000  RBX: 0000000000007770  RCX: ffffffffffffffff
      RDX: 0000000000000000  RSI: 0000000000000001  RDI: 00000000000004fa
      RBP: 00000000000004fa   R8: 0000000000000008   R9: 0000000000000000
      R10: 000055f3759f0524  R11: 0000000000000246  R12: 00000000000004fa
      R13: 0000000000400000  R14: 0000000000000001  R15: 000055f375c22854
      ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b

So this task owns the epmutex and is waiting on another lock.  Looking at the source code, we see this function take two mutexes in quick succession:

   910 /*
   911  * This is called from eventpoll_release() to unlink files from the eventpoll
   912  * interface. We need to have this facility to cleanup correctly files that are
   913  * closed without being removed from the eventpoll interface.
   914  */
   915 void eventpoll_release_file(struct file *file)
   916 {
   917         struct eventpoll *ep;
   918         struct epitem *epi;
   919 
   920         /*
   921          * We don't want to get "file->f_lock" because it is not
   922          * necessary. It is not necessary because we're in the "struct file"
   923          * cleanup path, and this means that no one is using this file anymore.
   924          * So, for example, epoll_ctl() cannot hit here since if we reach this
   925          * point, the file counter already went to zero and fget() would fail.
   926          * The only hit might come from ep_free() but by holding the mutex
   927          * will correctly serialize the operation. We do need to acquire
   928          * "ep->mtx" after "epmutex" because ep_remove() requires it when called
   929          * from anywhere but ep_free().
   930          *
   931          * Besides, ep_remove() acquires the lock, so we can't hold it here.
   932          */
   933         mutex_lock(&epmutex);                                        <<<<< takes epmutex
   934         list_for_each_entry_rcu(epi, &file->f_ep_links, fllink) {
   935                 ep = epi->ep;
   936                 mutex_lock_nested(&ep->mtx, 0);                      <<<<< takes eventpoll->mtx
   937                 ep_remove(ep, epi);
   938                 mutex_unlock(&ep->mtx);
   939         }
   940         mutex_unlock(&epmutex);
   941 }

Looking at how the ep->mtx mutex inside the eventpoll struct is locked, we see the address passed via %r12:

  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/fs/eventpoll.c: 935
  0xffffffffa5c7ee00 <eventpoll_release_file+0x50>:       mov    0x50(%rbx),%r13
  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/fs/eventpoll.c: 936
  0xffffffffa5c7ee04 <eventpoll_release_file+0x54>:       lea    0x30(%r13),%r12
  0xffffffffa5c7ee08 <eventpoll_release_file+0x58>:       mov    %r12,%rdi     <<<<<
  0xffffffffa5c7ee0b <eventpoll_release_file+0x5b>:       callq  0xffffffffa615d740 <_mutex_lock>

%r12 is then preserved in rt_mutex_slowlock():

  0xffffffffa615bf70 <rt_mutex_slowlock>: nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffa615bf75 <rt_mutex_slowlock+0x5>:     push   %rbp
  0xffffffffa615bf76 <rt_mutex_slowlock+0x6>:     mov    %rsp,%rbp
  0xffffffffa615bf79 <rt_mutex_slowlock+0x9>:     push   %r15
  0xffffffffa615bf7b <rt_mutex_slowlock+0xb>:     mov    %ecx,%r15d
  0xffffffffa615bf7e <rt_mutex_slowlock+0xe>:     lea    -0x78(%rbp),%r9
  0xffffffffa615bf82 <rt_mutex_slowlock+0x12>:    push   %r14
  0xffffffffa615bf84 <rt_mutex_slowlock+0x14>:    mov    %esi,%r14d
  0xffffffffa615bf87 <rt_mutex_slowlock+0x17>:    push   %r13
  0xffffffffa615bf89 <rt_mutex_slowlock+0x19>:    push   %r12                  <<<<<
  0xffffffffa615bf8e <rt_mutex_slowlock+0x1e>:    push   %rbx
  0xffffffffa615bf8f <rt_mutex_slowlock+0x1f>:    mov    %rdi,%rbx
  0xffffffffa615bf92 <rt_mutex_slowlock+0x22>:    sub    $0x60,%rsp

Annotating the stack:

 #7 [ffff9289cee03d90] rt_mutex_slowlock+0x5d at ffffffffa615bfcd
    /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/kernel/rtmutex.c: 1420
    ffff9289cee03d98: 0000000000000000 ffff9289cee03da8 
    ffff9289cee03da8: ffff9289cee03da8 ffffffffffffff10 
    ffff9289cee03db8: __call_rcu+0xba  ffff9289cee03dc0 
    ffff9289cee03dc8: 0000000000000246 ffff9289cee03de8 
    ffff9289cee03dd8: 0000000000000000 0000000000000246 
    ffff9289cee03de8: [ffff92ab5582e400:task_struct] 0000000014b88e60 
    ffff9289cee03df8: [ffff9287f5d4ef30:sock_inode_cache] [ffff9287f5d4ef30:sock_inode_cache] 
                        %rbx                                 %r12
    ffff9289cee03e08: [ffff9287f5d4ef00:sock_inode_cache] [ffff9258c0772508:filp] 
                        %r13                                 %r14
    ffff9289cee03e18: [ffff927d576b0720:mnt_cache] ffff9289cee03e38 
                        %r15                         %rbp
    ffff9289cee03e28: rt_mutex_lock+0x6f 
                        %rip
 #8 [ffff9289cee03e28] rt_mutex_lock+0x6f at ffffffffa615c0df

So our value for the mutex pointer is ffff9287f5d4ef30; however, this address is part of the sock_inode_cache.  The sock_inode_cache contains struct socket_alloc objects, which are made up of a struct socket followed by a struct inode.  The pointer that should point to the mutex actually points into the inode portion of a socket_alloc object.  In eventpoll_release_file() we use the file pointer and container_of() (via the f_ep_links list walk) to get the struct epitem, and then follow that to the struct eventpoll address, so this doesn't look right.
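
For reference, here is the pointer chain performed by the two instructions above, written out as C (a sketch; the register-to-field mapping is taken from the disassembly and the annotated stack, not re-derived from the headers):

  /* %rbx = epi                      (struct epitem *)
   * %r13 = epi->ep                  mov 0x50(%rbx),%r13
   * %r12 = &epi->ep->mtx            lea 0x30(%r13),%r12   (becomes %rdi)
   *
   * The saved %r12 in the annotated stack is ffff9287f5d4ef30, i.e. the
   * address being treated as &ep->mtx, 0x30 bytes past the saved %r13. */
  ep = epi->ep;
  mutex_lock(&ep->mtx);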


[I:7]  Going at this from another direction, let's see what was passed into eventpoll_release_file():

  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/include/linux/eventpoll.h: 61
  0xffffffffa5c30a6e <__fput+0x24e>:      mov    %rbx,%rdi                     <<<<<
  0xffffffffa5c30a71 <__fput+0x251>:      callq  0xffffffffa5c7edb0 <eventpoll_release_file>

  0xffffffffa5c7edb0 <eventpoll_release_file>:    nopl   0x0(%rax,%rax,1) [FTRACE NOP]
  0xffffffffa5c7edb5 <eventpoll_release_file+0x5>:        push   %rbp
  0xffffffffa5c7edb6 <eventpoll_release_file+0x6>:        mov    %rsp,%rbp
  0xffffffffa5c7edb9 <eventpoll_release_file+0x9>:        push   %r14
  0xffffffffa5c7edbb <eventpoll_release_file+0xb>:        mov    %rdi,%r14
  0xffffffffa5c7edbe <eventpoll_release_file+0xe>:        mov    $0xffffffffa6cd2c60,%rdi
  0xffffffffa5c7edc5 <eventpoll_release_file+0x15>:       add    $0x108,%r14
  0xffffffffa5c7edcc <eventpoll_release_file+0x1c>:       push   %r13
  0xffffffffa5c7edce <eventpoll_release_file+0x1e>:       push   %r12
  0xffffffffa5c7edd0 <eventpoll_release_file+0x20>:       push   %rbx          <<<<<
  0xffffffffa5c7edd1 <eventpoll_release_file+0x21>:       sub    $0x10,%rsp

We see %rbx has our file pointer, and it is preserved on the stack of eventpoll_release_file:

  #10 [ffff9289cee03e50] eventpoll_release_file+0x60 at ffffffffa5c7ee10
      /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/fs/eventpoll.c: 937
      ffff9289cee03e58: [filp]           0000000014b88e60 
      ffff9289cee03e68: [filp]           0000000000000008 
                          %rbx             %r13
      ffff9289cee03e78: [dentry]         [sock_inode_cache] 
                          %r13             %r14
      ffff9289cee03e88: ffff9289cee03ed0 __fput+0x256     
                          %rbp             %rip
  #11 [ffff9289cee03e90] __fput+0x256 at ffffffffa5c30a76

  crash> rd ffff9289cee03e68
  ffff9289cee03e68:  ffff9258c0772400                    .$w.X...

So our file pointer is ffff9258c0772400.  From here we can get the f_ep_links list head (at file + 0x108, matching the "add $0x108,%r14" in the prologue above):

  crash> struct file.f_ep_links ffff9258c0772400     (+0x108)
    f_ep_links = {
      next = 0xffff9258c0772508, 
      prev = 0xffff9258c0772508
    }

There is only one item on this list.  In order to get the epitem address we can subtract 0x58 from the embedded list_head:

  crash> epitem ffff9258c07724b0
  struct epitem {
    {
      rbn = {
        __rb_parent_color = 0x0, 
        rb_right = 0x0, 
        rb_left = 0x0
      }, 
      rcu = {
        next = 0x0, 
        func = 0x0
      }
    }, 
    rdllink = {
      next = 0xffff928a1930a180, 
      prev = 0x0
    }, 
    next = 0x0, 
    ffd = {
      file = 0x0, 
      fd = 0x0
    }, 
    nwait = 0x0, 
    pwqlist = {
      next = 0x0, 
      prev = 0xffff92512a3610b0
    }, 
    ep = 0xffff92592ce48c00,                               <<<<< eventpoll pointer
    fllink = {
      next = 0xffff9258c0772508, 
      prev = 0xffff9258c0772508
    }, 
    ws = 0xffff9258c0772518, 
    event = {
      events = 0xc0772518, 
      data = 0x2ce48da8ffff9258
    }
  }
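
The 0x58 above is the offset of the embedded fllink list_head within struct epitem in this build, i.e. the subtraction is the list_entry()/container_of() walk done by hand.  A sketch of the equivalent C, using the same list helpers the quoted eventpoll code relies on:

  /* Subtracting offsetof(struct epitem, fllink) (0x58 here, per the analysis
   * above) from the list node address gives the enclosing epitem, which is
   * what "crash> epitem ffff9258c07724b0" displays. */
  struct epitem *epi = list_entry(file->f_ep_links.next, struct epitem, fllink);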

In the epitem above we see our ep pointer, which points to the eventpoll struct containing the 'mtx' mutex we are trying to acquire:

  crash> eventpoll ffff92592ce48c00
  struct eventpoll {
    lock = {
      lock = {
        wait_lock = {
          raw_lock = {
            val = {
              counter = 0x2
            }
          }
        }, 
        waiters = {
          rb_node = 0x0
        }, 
        waiters_leftmost = 0xffff92594fe4ed00, 
        owner = 0xffff9258c0772400, 
        save_state = 0xc42b6180
      }, 
      break_lock = 0xa62a0b00
    }, 
    mtx = {                                                <<<<< second mutex we are attempting to acquire
      lock = {
        wait_lock = {
          raw_lock = {
            val = {
              counter = 0x4c1ff
            }
          }
        }, 
        waiters = {
          rb_node = 0x0
        }, 
        waiters_leftmost = 0xffffffffffffffff, 
        owner = 0xffffffffffffffff, 
        save_state = 0xa629ac40
      }
    }, 
    wq = {
      lock = {
        lock = {
          wait_lock = {
            raw_lock = {
              val = {
                counter = 0x7f9e8800
              }
            }
          }, 
          waiters = {
            rb_node = 0xffff92592ce48da8
          }, 
          waiters_leftmost = 0xffff92588efd49a0, 
          owner = 0x37da79be, 
          save_state = 0x1
        }, 
        break_lock = 0x0
      }, 
      task_list = {
        next = 0x0, 
        prev = 0x0
      }
    }, 
    poll_wait = {
      lock = {
        lock = {
          wait_lock = {
            raw_lock = {
              val = {
                counter = 0x0
              }
            }
          }, 
          waiters = {
            rb_node = 0x0
          }, 
          waiters_leftmost = 0x0, 
          owner = 0x0, 
          save_state = 0x0
        }, 
        break_lock = 0x0
      }, 
      task_list = {
        next = 0x0, 
        prev = 0x0
      }
    }, 
    rdllist = {
      next = 0x0, 
      prev = 0x0
    }, 
    rbr = {
      rb_node = 0xc00000000
    }, 
    ovflist = 0x0, 
    ws = 0x0, 
    user = 0x0, 
    file = 0x0, 
    visited = 0x0, 
    visited_list_link = {
      next = 0x0, 
      prev = 0x0
    }
  }


[I:8]  Displaying only the mutex being acquired as ep->mtx here for clarity:

  crash> struct mutex ffff9287f5d4ef30  
  struct mutex {
    lock = {
      wait_lock = {
        raw_lock = {
          val = {
            counter = 0x141c1ff
          }
        }
      }, 
      waiters = {
        rb_node = 0x7dc
      }, 
      waiters_leftmost = 0xffffffffffffffff, 
      owner = 0xffffffffffffffff, 
      save_state = 0xa629ac40
    }
  }

This doesn't look valid.  Further, the SLUB allocator believes this address belongs to a completely different structure:

  crash> kmem ffff9287f5d4ef30
  CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
  ffff924effc04600      760      10597     12222    291    32k  sock_inode_cache
    SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
    ffffdb94afd75200  ffff9287f5d48000     1     42         39     3
    FREE / [ALLOCATED]
    [ffff9287f5d4ef00]

        PAGE         PHYSICAL      MAPPING       INDEX CNT FLAGS
  ffffdb94afd75380 3bf5d4e000                0        1  0 60000000008000 tail
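
As noted in [I:6], sock_inode_cache objects are struct socket_alloc.  The address we are spinning on sits 0x30 bytes into the allocated object at ffff9287f5d4ef00, which lines up with the start of the embedded inode (treating 0x30 as the size of struct socket in this build is an assumption on my part):

  /* From include/net/sock.h: what a sock_inode_cache object contains.
   * ffff9287f5d4ef00 + 0x30 = ffff9287f5d4ef30, the "mutex" being locked. */
  struct socket_alloc {
          struct socket socket;           /* offset 0x00 */
          struct inode  vfs_inode;        /* offset 0x30 (assumed) */
  };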


[I:9]  Based on the backtrace, we are spinning here in the queued spinlock slowpath while trying to take the wait_lock inside the bogus mutex:

  455         /*
  456          * if there was a previous node; link it and wait until reaching the
  457          * head of the waitqueue.
  458          */
  459         if (old & _Q_TAIL_MASK) {
  460                 prev = decode_tail(old);
  461                 WRITE_ONCE(prev->next, node);
  462 
  463                 pv_wait_node(node, prev);
  464                 arch_mcs_spin_lock_contended(&node->locked);        <<<<<
  465 
  466                 /*
  467                  * While waiting for the MCS lock, the next pointer may have
  468                  * been set by another lock waiter. We optimistically load
  469                  * the next pointer & prefetch the cacheline for writing
  470                  * to reduce latency in the upcoming MCS unlock operation.
  471                  */
  472                 next = READ_ONCE(node->next);
  473                 if (next)
  474                         prefetchw(next);
  475         }

Showing the disassembly:

  crash> bt
  PID: 72756  TASK: ffff92ab5582e480  CPU: 40  COMMAND: "haproxy"
   #0 [ffff927b7df09e40] crash_nmi_callback at ffffffffa5a45e57
   #1 [ffff927b7df09e50] nmi_handle at ffffffffa615f8e4
   #2 [ffff927b7df09ea8] do_nmi at ffffffffa615fafd
   #3 [ffff927b7df09ef0] end_repeat_nmi at ffffffffa615ed39
      [exception RIP: native_queued_spin_lock_slowpath+0x122]        <<<<<
      RIP: ffffffffa5b07ed2  RSP: ffff9289cee03d60  RFLAGS: 00000046
      RAX: 0000000000000000  RBX: 0000000000000246  RCX: 0000000001410000
      RDX: ffff927b7da1ce60  RSI: 000000000004c1ff  RDI: ffff9287f5d4ef30
      RBP: ffff9289cee03d60   R8: ffff927b7df1ce40   R9: 0000000000000000
      R10: ffff92592ce48c30  R11: ffff9258c0772410  R12: 0000000000000000
      R13: ffff9287f5d4ef00  R14: 0000000000000002  R15: 0000000000000000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #4 [ffff9289cee03d60] native_queued_spin_lock_slowpath at ffffffffa5b07ed2
   #5 [ffff9289cee03d68] queued_spin_lock_slowpath at ffffffffa615041b
   #6 [ffff9289cee03d78] _raw_spin_lock_irqsave at ffffffffa615db37
   #7 [ffff9289cee03d90] rt_mutex_slowlock at ffffffffa615bfcd
   #8 [ffff9289cee03e28] rt_mutex_lock at ffffffffa615c0df
   #9 [ffff9289cee03e40] _mutex_lock at ffffffffa615d74e
  #10 [ffff9289cee03e50] eventpoll_release_file at ffffffffa5c7ee10
  #11 [ffff9289cee03e90] __fput at ffffffffa5c30a76
  #12 [ffff9289cee03ed8] ____fput at ffffffffa5c30b6e
  #13 [ffff9289cee03ee8] task_work_run at ffffffffa5ab16fb
  #14 [ffff9289cee03f28] do_notify_resume at ffffffffa5a1ad5e
  #15 [ffff9289cee03f50] int_signal at ffffffffa61673b8
      RIP: 00007fe4d38f0620  RSP: 00007ffcdeb4ed08  RFLAGS: 00000246
      RAX: 0000000000000000  RBX: 0000000000007770  RCX: ffffffffffffffff
      RDX: 0000000000000000  RSI: 0000000000000001  RDI: 00000000000004fa
      RBP: 00000000000004fa   R8: 0000000000000008   R9: 0000000000000000
      R10: 000055f3759f0524  R11: 0000000000000246  R12: 00000000000004fa
      R13: 0000000000400000  R14: 0000000000000001  R15: 000055f375c22854
      ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b


The CPU is executing in this tight section of code:

  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/arch/x86/include/asm/processor.h: 780
  0xffffffffa5b07ed0 <native_queued_spin_lock_slowpath+0x120>:    pause         <<<<--------------+
  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/kernel/qspinlock.c: 464                        |
  0xffffffffa5b07ed2 <native_queued_spin_lock_slowpath+0x122>:    mov    0x8(%r8),%eax            |
  0xffffffffa5b07ed6 <native_queued_spin_lock_slowpath+0x126>:    test   %eax,%eax                |
  0xffffffffa5b07ed8 <native_queued_spin_lock_slowpath+0x128>:    je     0xffffffffa5b07ed0  >>>--+
  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/include/linux/compiler.h: 205
  0xffffffffa5b07eda <native_queued_spin_lock_slowpath+0x12a>:    mov    (%r8),%r9
  /usr/src/debug/kernel-3.10.0/linux-3.10.0.x86_64/kernel/qspinlock.c: 473
  0xffffffffa5b07edd <native_queued_spin_lock_slowpath+0x12d>:    test   %r9,%r9
  0xffffffffa5b07ee0 <native_queued_spin_lock_slowpath+0x130>:    je     0xffffffffa5b07ee6 <native_queued_spin_lock_slowpath+0x136>

Reading the value at %r8 + 0x8 (node->locked), we see it is zero, so the "je" is taken and we continue looping:

crash> rd -32 ffff927b7df1ce48
ffff927b7df1ce48:  00000000  

No other CPU is executing with this lock held, so nothing will be changing this memory location.  We will likely loop forever.  
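
For reference, the loop above is the arch_mcs_spin_lock_contended() wait from the qspinlock code quoted in [I:9]: the CPU spins until the waiter ahead of it in the queue hands over the lock by setting node->locked.  A minimal sketch of what that wait boils down to (the MCS node layout, with locked at offset 0x8 of the node held in %r8, is inferred from the disassembly):

  /* Spin until our MCS node is signalled by the previous queue member.
   * Here node->locked (%r8 + 0x8) is 0 and nothing will ever set it, so the
   * "pause / mov 0x8(%r8),%eax / test / je" loop above never exits. */
  while (!READ_ONCE(node->locked))
          cpu_relax();            /* the "pause" instruction */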


Version-Release number of selected component (if applicable):

This occurred on 3.10.0-957.12.2.rt56.929.el7.x86_64


How reproducible:  Unknown (the same customer sent us 2 cores from different systems)


Steps to Reproduce:  Unknown


Actual results:  A task (haproxy in both cores from this customer) gets stuck trying to acquire a lock that is not actually a lock.


Expected results:  The lock being acquired should be valid, and eventpoll file teardown should complete without spinning forever.

