Bug 2193167 - "Attempted to kill init! exitcode=0x0000000b" caused by user stack overflow in systemd pid 1.
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Tardon
QA Contact: Frantisek Sumsal
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-05-04 14:27 UTC by vlovejoy
Modified: 2023-08-08 12:26 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHELPLAN-156428 0 None None None 2023-05-04 14:29:10 UTC

Description vlovejoy 2023-05-04 14:27:03 UTC
Description of problem:

This panic was caused by systemd running off the top of the user stack. Unfortunately, by the time the OS kills PID 1, many of its resources have already been released and variables overwritten.

But the segfault message for systemd logged to the ring buffer shows that the faulting address was just below the saved RSP value, and that the error was caused by a write from user mode out of libc-2.17.so.

Version-Release number of selected component (if applicable):

RELEASE: 3.10.0-1160.62.1.el7.x86_64
systemd-219-78.el7_9.5.x86_64


Additional info:

systemd[1]: segfault at 7ffd0a6a9fc8 ip 00007f563537f68c sp 00007ffd0a6a9fd0 error 6 in libc-2.17.so[7f5635337000+1c4000]
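
The "error 6" is the x86 page-fault error code: bit 0 = protection violation (vs. not-present page), bit 1 = write, bit 2 = user mode, so 6 is a user-mode write to a not-present page. A minimal decoder sketch (an illustrative helper, not part of the dump analysis):

#include <stdio.h>

/* Hypothetical helper: decode the x86 page-fault error code.
 * bit 0 = protection violation (vs. not-present page)
 * bit 1 = write access
 * bit 2 = fault taken in user mode */
static void decode_pf_error(unsigned long err)
{
        printf("error %lx: %s-mode %s to a %s page\n", err,
               (err & 4) ? "user" : "kernel",
               (err & 2) ? "write" : "read",
               (err & 1) ? "protected" : "not-present");
}

int main(void)
{
        decode_pf_error(0x6);   /* "error 6: user-mode write to a not-present page" */
        return 0;
}

Note also that ip minus the libc-2.17.so base (0x7f563537f68c - 0x7f5635337000) is 0x4868c, so the faulting instruction sits inside libc itself.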



And from the dump we can see the base of the stack (bos) was 0x7ffd0aeaa000 and the stack limit was 8388608 bytes (8 MB).

crash> vm
PID: 1        TASK: ffff91e199b40000  CPU: 41   COMMAND: "systemd"
       MM               PGD          RSS    TOTAL_VM
       0                 0            0k       0k   
crash> task -R active_mm
PID: 1        TASK: ffff91e199b40000  CPU: 41   COMMAND: "systemd"
  active_mm = 0xffff91fc5f3addc0,

crash> vm  -M 0xffff91fc5f3addc0 | grep 7ffd0
ffff92007a48e360          0 7ffd0aeaa000 100173     <-------
ffff91f0800aeaf8 ffff91f0800aea20 7ffd0aee9000 8040075 
crash> 

crash> ps 1 -r | grep STACK
       STACK     8388608     (unlimited)
crash> 


Subtracting the 8 MB from the bos gives us a tos limit of 0x7ffd0a6aa000.
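
(Checking the subtraction in gdb; session reconstructed for illustration:)

(gdb) p/x 0x7ffd0aeaa000 - 8*1024*1024
$1 = 0x7ffd0a6aa000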

Mapping:
  7ffd0a6a9fc8  <- fault address
  7ffd0a6a9fd0  <- current rsp
  7ffd0a6aa000  <- tos limit
  7ffd0aeaa000  <- bos

And you can see that the faulting address was 1 qword / 8 bytes below the current rsp.

crash> pd 0x7ffd0a6a9fd0-0x7ffd0a6a9fc8
$27 = 8

So the most likely thing that happened: we called into a routine while still inside the tos limit; its prologue allocated stack space, which moved RSP past the tos limit; and it then made a call, which tried to push the return address one qword below the current RSP (a call writes its return address at RSP - 8), tripping the page fault. Chasing the object code for the library shows exactly that scenario.
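
This failure mode is easy to reproduce in isolation; a minimal sketch (a hypothetical test program, not systemd code) that marches RSP down the stack until a call's push faults:

#include <stdio.h>

/* Each frame consumes a large chunk of stack, mimicking the 0x5c8-byte
 * frame _IO_vfprintf_internal allocates.  Eventually a recursive call
 * pushes its return address past RLIMIT_STACK and the kernel delivers
 * SIGSEGV -- and when the victim is PID 1, that SIGSEGV becomes the
 * "Attempted to kill init!" panic seen here. */
static unsigned long overflow(unsigned long depth)
{
        volatile char pad[0x5c8];       /* large stack frame */
        pad[0] = (char)depth;
        return pad[0] + overflow(depth + 1);
}

int main(void)
{
        printf("%lu\n", overflow(0));   /* never returns normally */
        return 0;
}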

From a test system using gdb:

gdb) disass  0x7fab4bb9e600
Dump of assembler code for function _IO_vfprintf_internal:

 The pushes happen while RSP is still inside the limit.

   0x00007fab4bb9e600 <+0>:     push   %rbp
   0x00007fab4bb9e601 <+1>:     mov    %rsp,%rbp
   0x00007fab4bb9e604 <+4>:     push   %r15
   0x00007fab4bb9e606 <+6>:     mov    %rdi,%r15
   0x00007fab4bb9e609 <+9>:     push   %r14
   0x00007fab4bb9e60b <+11>:    mov    %rsi,%r14
   0x00007fab4bb9e60e <+14>:    push   %r13
   0x00007fab4bb9e610 <+16>:    mov    %rdx,%r13
   0x00007fab4bb9e613 <+19>:    push   %r12
   0x00007fab4bb9e615 <+21>:    push   %rbx 

   0x00007fab4bb9e616 <+22>:    sub    $0x5c8,%rsp   <-- moves RSP 0x30 bytes
                                                         past the tos limit

So all the stack accesses below land within the 0x598 bytes of the frame that are still inside the tos limit.
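
(0x5c8 allocated minus the 0x30 overshoot leaves 0x598 bytes inside the limit:)

(gdb) p/x 0x5c8 - 0x30
$2 = 0x598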


   0x00007fab4bb9e61d <+29>:    mov    0x37e82c(%rip),%rax        # 0x7fab4bf1ce50
   0x00007fab4bb9e624 <+36>:    mov    %fs:(%rax),%eax
   0x00007fab4bb9e627 <+39>:    mov    %eax,-0x4d8(%rbp)
   0x00007fab4bb9e62d <+45>:    mov    0xc0(%rdi),%eax
   0x00007fab4bb9e633 <+51>:    test   %eax,%eax
   0x00007fab4bb9e635 <+53>:    jne    0x7fab4bb9e780 <_IO_vfprintf_internal+384>
   0x00007fab4bb9e63b <+59>:    movl   $0xffffffff,0xc0(%rdi)
   0x00007fab4bb9e645 <+69>:    mov    (%r15),%ebx
   0x00007fab4bb9e648 <+72>:    test   $0x8,%bl
   0x00007fab4bb9e64b <+75>:    jne    0x7fab4bb9ea60 <_IO_vfprintf_internal+1120>
   0x00007fab4bb9e651 <+81>:    test   %r14,%r14
   0x00007fab4bb9e654 <+84>:    je     0x7fab4bb9ea7e <_IO_vfprintf_internal+1150>
   0x00007fab4bb9e65a <+90>:    test   $0x2,%bl
   0x00007fab4bb9e65d <+93>:    jne    0x7fab4bb9e810 <_IO_vfprintf_internal+528>
   0x00007fab4bb9e663 <+99>:    mov    0x0(%r13),%rax
   0x00007fab4bb9e667 <+103>:   mov    $0x25,%esi
   0x00007fab4bb9e66c <+108>:   mov    %r14,%rdi
   0x00007fab4bb9e66f <+111>:   mov    %rax,-0x460(%rbp)
   0x00007fab4bb9e676 <+118>:   mov    0x8(%r13),%rax
   0x00007fab4bb9e67a <+122>:   mov    %rax,-0x458(%rbp)
   0x00007fab4bb9e681 <+129>:   mov    0x10(%r13),%rax
   0x00007fab4bb9e685 <+133>:   mov    %rax,-0x450(%rbp)
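
Cross-checking against the registers saved in the backtrace below: with RBP = 0x7ffd0a6aa5c0, the deepest of these accesses, -0x4d8(%rbp), lands at 0x7ffd0a6aa0e8, still inside the 0x7ffd0a6aa000 tos limit:

(gdb) p/x 0x7ffd0a6aa5c0 - 0x4d8
$3 = 0x7ffd0a6aa0e8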

This call makes the first stack access outside the tos limit, which triggers the page fault.

   0x00007fab4bb9e68c <+140>:   callq  0x7fab4bbec4b0 <strchrnul>   <--------
   0x00007fab4bb9e691 <+145>:   and    $0x80,%bh

crash> bt
PID: 1        TASK: ffff91e199b40000  CPU: 41   COMMAND: "systemd"
 #0 [ffff91e199b4bb30] machine_kexec at ffffffffba8662f4
 #1 [ffff91e199b4bb90] __crash_kexec at ffffffffba922b82
 #2 [ffff91e199b4bc60] panic at ffffffffbaf802bc
 #3 [ffff91e199b4bce0] do_exit at ffffffffba8a223f
 #4 [ffff91e199b4bd78] do_group_exit at ffffffffba8a22bf
 #5 [ffff91e199b4bda8] get_signal_to_deliver at ffffffffba8b341e
 #6 [ffff91e199b4be40] do_signal at ffffffffba82c527
 #7 [ffff91e199b4bf30] do_notify_resume at ffffffffba82cc32
 #8 [ffff91e199b4bf50] retint_signal at ffffffffbaf9057c
    RIP: 00007f563537f68c  RSP: 00007ffd0a6a9fd0  RFLAGS: 00010246
    RAX: 00007ffd0a6aa760  RBX: 00000000fbad8001  RCX: 0000000000000000
    RDX: 00007ffd0a6aa748  RSI: 0000000000000025  RDI: 000055d811decd88
    RBP: 00007ffd0a6aa5c0   R8: 0000000000000000   R9: 00007ffd0a6aa748
    R10: 000055d811dee910  R11: 000055d811e0461b  R12: 000055d811decd88
    R13: 00007ffd0a6aa748  R14: 000055d811decd88  R15: 00007ffd0a6aa5d0
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b
crash> 
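
The saved registers are also consistent with the prologue above: RBP - RSP should be the five register pushes (0x28) plus the 0x5c8 allocation, i.e. 0x5f0 = 1520 bytes (session reconstructed):

crash> pd 0x7ffd0a6aa5c0-0x7ffd0a6a9fd0
$28 = 1520
crash>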




We can see a leftover page-fault marker on the stack.

crash> bt -T | grep -e panic -e page -C 2
bt: get_cpus_online: online: 48
  [ffff91e199b4bb90] __crash_kexec at ffffffffba922b82
  [ffff91e199b4bc18] __crash_kexec at ffffffffba922c1f
  [ffff91e199b4bc60] panic at ffffffffbaf802bc
  [ffff91e199b4bce0] do_exit at ffffffffba8a223f
  [ffff91e199b4bd78] do_group_exit at ffffffffba8a22bf
--
  [ffff91e199b4be40] do_signal at ffffffffba82c527
  [ffff91e199b4be98] vprintk_default at ffffffffba89e669
  [ffff91e199b4beb0] __do_page_fault at ffffffffbaf948b7    <--
  [ffff91e199b4bf30] do_notify_resume at ffffffffba82cc32
  [ffff91e199b4bf50] retint_signal at ffffffffbaf9057c
crash> 

Granted, that could just be really old stack data, but it is the page-fault code path that leads to the segfault message being logged. Walking that path through the 3.10 source:




   1362 do_page_fault(struct pt_regs *regs, unsigned long error_code)
   1363 {
   1364         enum ctx_state prev_state;
   1365         /* Get the faulting address: */
   1366         unsigned long address = read_cr2();
   1367 
   1368         prev_state = exception_enter();
   1369         __do_page_fault(regs, error_code, address);   <--  

   1141 static void __kprobes
   1142 __do_page_fault(struct pt_regs *regs, unsigned long error_code,
   1143                 unsigned long address)
   1144 {
. . .

   1294        if (unlikely(expand_stack(vma, address))) {   <-----
   1295                 bad_area(regs, error_code, address);  <---
   1296                 return;
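
expand_stack() fails here because growing the stack VMA down to the faulting address would push it past RLIMIT_STACK, the 8 MB we saw above, so bad_area() delivers the SIGSEGV. Condensed from the stack-limit check in acct_stack_growth() in mm/mmap.c (paraphrased, not verbatim 3.10 source):

        /* acct_stack_growth(), abridged: refuse growth past RLIMIT_STACK */
        if (size > rlim[RLIMIT_STACK].rlim_cur)
                return -ENOMEM;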


 
    876 static noinline void
    877 bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
    878 {
    879         __bad_area(regs, error_code, address, NULL, SEGV_MAPERR);  <--
    880 }



    862 __bad_area(struct pt_regs *regs, unsigned long error_code,
    863            unsigned long address,  struct vm_area_struct *vma, int si_code)
    864 {
    865         struct mm_struct *mm = current->mm;
    866 
    867         /*
    868          * Something tried to access memory that isn't in our memory map..
    869          * Fix it, but check if it's kernel or user first..
    870          */
    871         up_read(&mm->mmap_sem);
    872 
    873         __bad_area_nosemaphore(regs, error_code, address, vma, si_code); <---




  
    798 __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
    799                        unsigned long address, struct vm_area_struct *vma,
    800                        int si_code)
    801 {
. . .
    836                 if (likely(show_unhandled_signals))
    837                         show_signal_msg(regs, error_code, address, tsk);   <---
    838 
    839                 tsk->thread.cr2         = address;
    840                 tsk->thread.error_code  = error_code;
    841                 tsk->thread.trap_nr     = X86_TRAP_PF;
    842 




    778 show_signal_msg(struct pt_regs *regs, unsigned long error_code,
    779                 unsigned long address, struct task_struct *tsk)
    780 {
    781         if (!unhandled_signal(tsk, SIGSEGV))
    782                 return;
    783 
    784         if (!printk_ratelimit())
    785                 return;
    786 
    787         printk("%s%s[%d]: segfault at %lx ip %p sp %p error %lx",  <----
    788                 task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
    789                 tsk->comm, task_pid_nr(tsk), address,
    790                 (void *)regs->ip, (void *)regs->sp, error_code);

That printk is exactly what produced the ring-buffer line quoted at the top. And to confirm it from the saved fault state in the task struct:

crash> task | grep cr2 -A 2
    cr2 = 0x7ffd0a6a9fc8,  <- matches the faulting address in the message
    trap_nr = 0xe,         <- X86_TRAP_PF (14)
    error_code = 0x6,      <- matches the error
crash>

As user memory was not dumped, we cannot see the user stack itself.

Comment 3 vlovejoy 2023-05-04 14:33:37 UTC
The dump is located on 
galvatron-x86.cee.ral3.lab.eng.rdu2.redhat.com

retrace-server-interact 547813400 crash


crash> sys
      KERNEL: /cores/retrace/repos/kernel/x86_64/usr/lib/debug/lib/modules/3.10.0-1160.62.1.el7.x86_64/vmlinux  [TAINTED]
    DUMPFILE: /cores/retrace/tasks/547813400/crash/vmcore  [PARTIAL DUMP]
        CPUS: 48
        DATE: Sun Apr 30 15:54:51 EDT 2023
      UPTIME: 80 days, 20:40:47
LOAD AVERAGE: 13.76, 11.69, 11.62
       TASKS: 16977
    NODENAME:
     RELEASE: 3.10.0-1160.62.1.el7.x86_64
     VERSION: #1 SMP Wed Mar 23 09:04:02 UTC 2022
     MACHINE: x86_64  (2294 Mhz)
      MEMORY: 127.7 GB
       PANIC: "Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b"   <- 0xb = 11 = SIGSEGV


crash> sys -i 
        DMI_BIOS_VENDOR: Cisco Systems, Inc.
       DMI_BIOS_VERSION: B200M4.3.1.3k.0.022620181855
          DMI_BIOS_DATE: 02/26/2018
         DMI_SYS_VENDOR: Cisco Systems Inc
       DMI_PRODUCT_NAME: UCSB-B200-M4
    DMI_PRODUCT_VERSION: 1
     DMI_PRODUCT_SERIAL: 
       DMI_PRODUCT_UUID: 
       DMI_BOARD_VENDOR: Cisco Systems Inc
         DMI_BOARD_NAME: UCSB-B200-M4
      DMI_BOARD_VERSION: 73-15862-03
       DMI_BOARD_SERIAL: 
    DMI_BOARD_ASSET_TAG:  
     DMI_CHASSIS_VENDOR: Cisco Systems Inc
       DMI_CHASSIS_TYPE: 18
    DMI_CHASSIS_VERSION: 68-5091-04
     DMI_CHASSIS_SERIAL: 
  DMI_CHASSIS_ASSET_TAG:  
crash>

