1618792 – Kernel 4.17.14-202.fc28.x86_64 crashed

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	28
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance

Reported:	2018-08-17 15:22 UTC by H.J. Lu
Modified:	2018-09-02 15:26 UTC
CC List:	19 users
Last Closed:	2018-09-02 15:26:37 UTC
Type:	Bug

Attachments
Kernel oops (4.17 MB, image/jpeg) 2018-08-17 15:22 UTC, H.J. Lu

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	200867	0	None	None	None	2018-08-20 18:15:48 UTC

Description H.J. Lu 2018-08-17 15:22:28 UTC

Created attachment 1476643
Kernel oops

Kernel 4.17.14-202.fc28.x86_64 crashed under heavy load.
4.17.14-200.fc28.x86_64 is OK.  I am enclosing a kernel oops.

Comment 1 Laura Abbott 2018-08-17 16:55:50 UTC

How repeatable is this?

Comment 2 H.J. Lu 2018-08-17 17:50:06 UTC

(In reply to Laura Abbott from comment #1)
> How repeatable is this?

My machine has an Intel i7-4770K CPU and 32 GB of RAM.  I can reproduce it:

[  899.780323] general protection fault: 0000 [#1] SMP PTI
[  899.780362] Modules linked in: netconsole devlink ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi mei_wdt snd_hda_codec_generic iTCO_wdt gpio_ich ppdev iTCO_vendor_support snd_hda_intel irqbypass snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core intel_cstate intel_uncore snd_hwdep intel_rapl_perf snd_seq snd_seq_device snd_pcm joydev snd_timer mei_me snd mei lpc_ich shpchp parport_pc soundcore i2c_i801 parport pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper drm crc32c_intel r8169 mii video
[  899.780597] CPU: 6 PID: 6216 Comm: cc1plus Not tainted 4.17.14-202.0.fc25.x86_64 #1
[  899.780627] Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
[  899.780665] RIP: 0010:free_pages_and_swap_cache+0x29/0xb0
[  899.780689] RSP: 0018:ffff9d3803883c80 EFLAGS: 00010202
[  899.780710] RAX: 0017fffe00040068 RBX: ffff91ea6597fa80 RCX: 0000000000000000
[  899.780739] RDX: 0017fffe00040068 RSI: 00000000000001fe RDI: ffff91eade39d2a0
[  899.780766] RBP: 00000000000001fe R08: ffffeda21ebc3a20 R09: ffff91eade5d5000
[  899.780793] R10: ffff91eade5d5e20 R11: ffff91eade5d5dc0 R12: ffff91ea6597f010
[  899.780821] R13: fffbeda21e1b8400 R14: ffff91ea65980000 R15: 00007f0b8c0ea000
[  899.780849] FS:  0000000000000000(0000) GS:ffff91eade380000(0000) knlGS:0000000000000000
[  899.780880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  899.780903] CR2: 00007f0b9dbb753c CR3: 000000047c20a002 CR4: 00000000001606e0
[  899.780930] Call Trace:
[  899.780949]  tlb_flush_mmu_free+0x31/0x50
[  899.780967]  unmap_page_range+0xa32/0xc40
[  899.780987]  unmap_vmas+0x7a/0xb0
[  899.781003]  exit_mmap+0xaa/0x190
[  899.781021]  mmput+0x5f/0x130
[  899.781037]  do_exit+0x280/0xae0
[  899.781054]  ? __do_page_fault+0x263/0x4e0
[  899.781073]  do_group_exit+0x3a/0xa0
[  899.781091]  __x64_sys_exit_group+0x14/0x20
[  899.781111]  do_syscall_64+0x65/0x160
[  899.781130]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  899.781152] RIP: 0033:0x7f0b9d9fd3a6
[  899.781168] RSP: 002b:00007ffe108dab28 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[  899.781198] RAX: ffffffffffffffda RBX: 00007f0b9daee740 RCX: 00007f0b9d9fd3a6
[  899.781225] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[  899.781253] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffe70
[  899.781281] R10: 00007ffe108da9c0 R11: 0000000000000246 R12: 00007f0b9daee740
[  899.781310] R13: 0000000000000038 R14: 00007f0b9daf7708 R15: 0000000000000000
[  899.781338] Code: 40 00 0f 1f 44 00 00 41 56 41 55 41 54 49 89 fc 55 89 f5 53 e8 59 91 fb ff 85 ed 7e 6b 8d 45 ff 4c 89 e3 4d 8d 74 c4 08 4c 8b 2b <49> 8b 55 20 48 8d 42 ff 83 e2 01 49 0f 44 c5 48 8b 48 20 48 8d 
[  899.781434] RIP: free_pages_and_swap_cache+0x29/0xb0 RSP: ffff9d3803883c80
[  899.781473] ---[ end trace 817e490010d352e3 ]---
[  899.781493] Fixing recursive fault but reboot is needed!

It happens within 3 minutes of starting a GCC 9 build:

/export/gnu/import/git/sources/gcc/configure --enable-cet --with-demangler-in-ld  --prefix=/usr/gcc-9.0.0-x86-64 --with-local-prefix=/usr/local --enable-gnu-indirect-function --enable-clocale=gnu --with-system-zlib --enable-libmpx --with-fpmath=sse --enable-languages=c,c++,fortran,lto,objc,ada,obj-c++,go

make -j 8 bootstrap

Sometimes I got an internal compiler error:

In file included from /export/gnu/import/git/sources/gcc/libgcc/libgcc2.c:56:
/export/gnu/import/git/sources/gcc/libgcc/libgcc2.h:29:9: internal compiler error: Segmentation fault
29 | #pragma GCC visibility push(default)
   |         ^~~
0x98203e lookup_page_table_entry
        /export/gnu/import/git/sources/gcc/gcc/ggc-page.c:632
0x983107 ggc_set_mark(void const*)
        /export/gnu/import/git/sources/gcc/gcc/ggc-page.c:1531
0x85bb89 gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:49
0x85cf72 gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:278
0x85da3a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:416
0xcac270 gt_ggc_mx_tree_statement_list_node(void*)
        /export/build/gnu/tools-build/gcc/build-x86_64-linux/gcc/gtype-desc.c:1888
0x85ddf2 gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:464
0x85da3a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:416
0x85d18f gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:295
0xcaaf48 gt_ggc_mx_symtab_node(void*)
        /export/build/gnu/tools-build/gcc/build-x86_64-linux/gcc/gtype-desc.c:1413
0x85d09a gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:288
0x85e49b gt_ggc_mx_c_binding(void*)
        ./gt-c-c-decl.h:577
0x85e4f2 gt_ggc_mx_c_binding(void*)
        ./gt-c-c-decl.h:580
0x85e1f5 gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:520
0x85d7e3 gt_ggc_mx_lang_tree_node(void*)
        ./gt-c-c-decl.h:381
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

Reproducing it takes no more than 5 minutes.
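
One way to triage a register dump like the one in the oops above is to scan it for values that are not canonical x86-64 addresses. The following is a minimal sketch, not part of the original report: the helper names `is_canonical` and `non_canonical_registers` are hypothetical, and it assumes 48-bit virtual addresses (4-level paging). Registers that hold non-pointer data, such as flags, will be flagged too, so the result is only a hint about where to look.

```python
import re

# Matches general-purpose register dumps such as "R13: fffbeda21e1b8400".
# Segment-prefixed values ("RSP: 0018:...") and CRn registers are skipped.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}):\s*([0-9a-f]{16})\b")


def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    Assumption: 48-bit virtual addresses (4-level paging).
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def non_canonical_registers(oops_text: str) -> dict:
    """Map register name -> value for every non-canonical value found."""
    found = {}
    for name, hex_value in REG_RE.findall(oops_text):
        value = int(hex_value, 16)
        if not is_canonical(value):
            found[name] = value
    return found


# Two lines copied from the register dump in the oops above.
SAMPLE = """\
[  899.780793] R10: ffff91eade5d5e20 R11: ffff91eade5d5dc0 R12: ffff91ea6597f010
[  899.780821] R13: fffbeda21e1b8400 R14: ffff91ea65980000 R15: 00007f0b8c0ea000
"""

for name, value in non_canonical_registers(SAMPLE).items():
    print(f"{name}: {value:#018x}")  # prints "R13: 0xfffbeda21e1b8400"
```

A dereference of a non-canonical address raises a general protection fault rather than a page fault on x86-64, which is consistent with the first line of the oops.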

Comment 3 H.J. Lu 2018-08-17 20:00:35 UTC

Linus tree:

commit 5c60a7389d795e001c8748b458eb76e3a5b6008c
Merge: b6d6a3076ac4 e1b437691a62
Author: Linus Torvalds <torvalds>
Date:   Thu Aug 16 10:53:45 2018 -0700

    Merge tag 'for-linus-4.19-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
    
is OK.

Comment 4 Clifford Perry 2018-08-20 16:13:48 UTC

(In reply to H.J. Lu from comment #0)
> Created attachment 1476643
> Kernel oops
> 
> Kernel 4.17.14-202.fc28.x86_64 crashed under heavy load.
> 4.17.14-200.fc28.x86_64 is OK.  I am enclosing a kernel oops.

If I read the image correctly, it shows an fc25 kernel, not fc28 as written up. Was this a custom-built fc25 kernel based on the fc28 kernel source RPM?

I'd like to confirm that this is repeatable with the stock Fedora fc28 kernel rather than a rebuild of it.

Comment 5 H.J. Lu 2018-08-20 18:15:49 UTC

It is caused by the L1 Terminal Fault patches. I opened:

https://bugzilla.kernel.org/show_bug.cgi?id=200867

Only the 4.17 tree is affected.

Comment 6 Andi Kleen 2018-08-20 18:26:38 UTC

We tried to reproduce it by building GCC, without success so far.

We may need a better reproducer.

Comment 7 H.J. Lu 2018-08-20 19:55:13 UTC

(In reply to Andi Kleen from comment #6)
> We tried to reproduce it by building GCC, without success so far.
> 
> We may need a better reproducer.

So far I can only reproduce it on a Haswell desktop processor.

Comment 8 Andi Kleen 2018-08-20 21:25:45 UTC

If I read the crash correctly, it looks like a TLB batch got corrupted.

It's this loop in free_pages_and_swap_cache:

	for (i = 0; i < nr; i++)
		free_swap_cache(pagep[i]);

and eventually, at 0x1f0, the page it references in pagep is bogus:

      fffbeda21e1b8400


A normal kernel address would be something like

     ffff9...

but the bogus address has somehow lost bit 50. Very odd.
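
The "lost bit 50" observation can be checked with a quick XOR. This is a minimal sketch: the `expected` value is an assumption (the same pointer with all of the upper 16 bits set), and `differing_bits` is a hypothetical helper name.

```python
def differing_bits(observed: int, expected: int) -> list:
    """Return the positions (0 = least significant) where the values differ."""
    diff = observed ^ expected
    return [bit for bit in range(64) if (diff >> bit) & 1]


observed = 0xFFFBEDA21E1B8400  # R13 from the oops
expected = 0xFFFFEDA21E1B8400  # assumption: same value with the upper 16 bits all set

print(differing_bits(observed, expected))  # [50]
```

A single differing bit is the pattern one would expect from a one-bit corruption rather than from a wrong pointer computation, which would normally change many bits.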

So one of the callers of __tlb_remove_page / tlb_remove_page_size must have computed an invalid page. I looked at the callers and didn't see anything suspicious so far, and nothing that should differ from 4.18.

Can you attach some more crashes if you have them? I want to see if the pattern is always the same.

Can you also check if it makes a difference if you disable transparent huge pages?

(echo never > /sys/kernel/mm/transparent_hugepage/enabled)
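
To confirm that the setting took effect before re-running the reproducer, the same sysfs file can be read back. It lists all modes and marks the active one with brackets, for example `always madvise [never]`. A minimal sketch, with `active_thp_mode` as a hypothetical helper name:

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(contents: str) -> str:
    """Return the bracketed (active) mode, e.g. 'never' for 'always madvise [never]'."""
    for token in contents.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {contents!r}")


if __name__ == "__main__":
    # The file only exists on kernels built with transparent huge page support.
    if THP_ENABLED.exists():
        print(active_thp_mode(THP_ENABLED.read_text()))
```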

Comment 9 Andi Kleen 2018-08-20 22:40:53 UTC

Also, when you reproduce it, please upload the vmlinux somewhere.

Comment 10 H.J. Lu 2018-09-02 15:26:37 UTC

It looks like a hardware issue.
