Created attachment 1476643
Kernel oops

Kernel 4.17.14-202.fc28.x86_64 crashed under heavy load. 4.17.14-200.fc28.x86_64 is OK. I am enclosing a kernel oops.
How repeatable is this?
(In reply to Laura Abbott from comment #1)
> How repeatable is this?

My machine has Intel i7-4770K CPU and 32 GB RAM. I can reproduce:

[ 899.780323] general protection fault: 0000 [#1] SMP PTI
[ 899.780362] Modules linked in: netconsole devlink ebtable_filter ebtables ip6table_filter ip6_tables intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi mei_wdt snd_hda_codec_generic iTCO_wdt gpio_ich ppdev iTCO_vendor_support snd_hda_intel irqbypass snd_hda_codec crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_core intel_cstate intel_uncore snd_hwdep intel_rapl_perf snd_seq snd_seq_device snd_pcm joydev snd_timer mei_me snd mei lpc_ich shpchp parport_pc soundcore i2c_i801 parport pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc i915 i2c_algo_bit drm_kms_helper drm crc32c_intel r8169 mii video
[ 899.780597] CPU: 6 PID: 6216 Comm: cc1plus Not tainted 4.17.14-202.0.fc25.x86_64 #1
[ 899.780627] Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
[ 899.780665] RIP: 0010:free_pages_and_swap_cache+0x29/0xb0
[ 899.780689] RSP: 0018:ffff9d3803883c80 EFLAGS: 00010202
[ 899.780710] RAX: 0017fffe00040068 RBX: ffff91ea6597fa80 RCX: 0000000000000000
[ 899.780739] RDX: 0017fffe00040068 RSI: 00000000000001fe RDI: ffff91eade39d2a0
[ 899.780766] RBP: 00000000000001fe R08: ffffeda21ebc3a20 R09: ffff91eade5d5000
[ 899.780793] R10: ffff91eade5d5e20 R11: ffff91eade5d5dc0 R12: ffff91ea6597f010
[ 899.780821] R13: fffbeda21e1b8400 R14: ffff91ea65980000 R15: 00007f0b8c0ea000
[ 899.780849] FS: 0000000000000000(0000) GS:ffff91eade380000(0000) knlGS:0000000000000000
[ 899.780880] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 899.780903] CR2: 00007f0b9dbb753c CR3: 000000047c20a002 CR4: 00000000001606e0
[ 899.780930] Call Trace:
[ 899.780949]  tlb_flush_mmu_free+0x31/0x50
[ 899.780967]  unmap_page_range+0xa32/0xc40
[ 899.780987]  unmap_vmas+0x7a/0xb0
[ 899.781003]  exit_mmap+0xaa/0x190
[ 899.781021]  mmput+0x5f/0x130
[ 899.781037]  do_exit+0x280/0xae0
[ 899.781054]  ? __do_page_fault+0x263/0x4e0
[ 899.781073]  do_group_exit+0x3a/0xa0
[ 899.781091]  __x64_sys_exit_group+0x14/0x20
[ 899.781111]  do_syscall_64+0x65/0x160
[ 899.781130]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 899.781152] RIP: 0033:0x7f0b9d9fd3a6
[ 899.781168] RSP: 002b:00007ffe108dab28 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 899.781198] RAX: ffffffffffffffda RBX: 00007f0b9daee740 RCX: 00007f0b9d9fd3a6
[ 899.781225] RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
[ 899.781253] RBP: 0000000000000000 R08: 00000000000000e7 R09: fffffffffffffe70
[ 899.781281] R10: 00007ffe108da9c0 R11: 0000000000000246 R12: 00007f0b9daee740
[ 899.781310] R13: 0000000000000038 R14: 00007f0b9daf7708 R15: 0000000000000000
[ 899.781338] Code: 40 00 0f 1f 44 00 00 41 56 41 55 41 54 49 89 fc 55 89 f5 53 e8 59 91 fb ff 85 ed 7e 6b 8d 45 ff 4c 89 e3 4d 8d 74 c4 08 4c 8b 2b <49> 8b 55 20 48 8d 42 ff 83 e2 01 49 0f 44 c5 48 8b 48 20 48 8d
[ 899.781434] RIP: free_pages_and_swap_cache+0x29/0xb0 RSP: ffff9d3803883c80
[ 899.781473] ---[ end trace 817e490010d352e3 ]---
[ 899.781493] Fixing recursive fault but reboot is needed!

within 3 minutes with GCC 9 build:

/export/gnu/import/git/sources/gcc/configure --enable-cet --with-demangler-in-ld --prefix=/usr/gcc-9.0.0-x86-64 --with-local-prefix=/usr/local --enable-gnu-indirect-function --enable-clocale=gnu --with-system-zlib --enable-libmpx --with-fpmath=sse --enable-languages=c,c++,fortran,lto,objc,ada,obj-c++,go
make -j 8 bootstrap

Sometimes I got

In file included from /export/gnu/import/git/sources/gcc/libgcc/libgcc2.c:56:
/export/gnu/import/git/sources/gcc/libgcc/libgcc2.h:29:9: internal compiler error: Segmentation fault
29 | #pragma GCC visibility push(default)
   | ^~~
0x98203e lookup_page_table_entry
	/export/gnu/import/git/sources/gcc/gcc/ggc-page.c:632
0x983107 ggc_set_mark(void const*)
	/export/gnu/import/git/sources/gcc/gcc/ggc-page.c:1531
0x85bb89 gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:49
0x85cf72 gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:278
0x85da3a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:416
0x85da3a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:416
0xcac270 gt_ggc_mx_tree_statement_list_node(void*)
	/export/build/gnu/tools-build/gcc/build-x86_64-linux/gcc/gtype-desc.c:1888
0x85ddf2 gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:464
0x85da3a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:416
0x85d18f gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:295
0xcaaf48 gt_ggc_mx_symtab_node(void*)
	/export/build/gnu/tools-build/gcc/build-x86_64-linux/gcc/gtype-desc.c:1413
0x85d09a gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:288
0x85e49b gt_ggc_mx_c_binding(void*)
	./gt-c-c-decl.h:577
0x85e4f2 gt_ggc_mx_c_binding(void*)
	./gt-c-c-decl.h:580
0x85e1f5 gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:520
0x85d7e3 gt_ggc_mx_lang_tree_node(void*)
	./gt-c-c-decl.h:381
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

It won't take more than 5 minutes.
Linus tree:

commit 5c60a7389d795e001c8748b458eb76e3a5b6008c
Merge: b6d6a3076ac4 e1b437691a62
Author: Linus Torvalds <torvalds>
Date: Thu Aug 16 10:53:45 2018 -0700

    Merge tag 'for-linus-4.19-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

is OK.
(In reply to H.J. Lu from comment #0)
> Created attachment 1476643
> Kernel oops
>
> Kernel 4.17.14-202.fc28.x86_64 crashed under heavy load.
> 4.17.14-200.fc28.x86_64 is OK. I am enclosing a kernel oops.

If I read the image correctly, it shows an fc25 kernel, not fc28 as written up. Was this a custom-built fc25 kernel based on the fc28 kernel source RPM? I'd like to confirm that this is repeatable with the Fedora fc28 kernel itself rather than a rebuild of it.
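For reference, the mismatch can be checked mechanically against the oops text pasted in the earlier comment. The following is only a rough sketch: the helper names `release_from_oops` and `dist_tag` are made up for illustration, and the regular expressions assume a Fedora-style release string with a `.fcNN.` dist tag.

```python
import re

# CPU line copied from the oops pasted earlier in this report.
OOPS_LINE = ("[ 899.780597] CPU: 6 PID: 6216 Comm: cc1plus "
             "Not tainted 4.17.14-202.0.fc25.x86_64 #1")
# Kernel release named in the original description.
REPORTED = "4.17.14-202.fc28.x86_64"


def release_from_oops(line):
    """Return the first token that looks like X.Y.Z-<rest> (hypothetical helper)."""
    m = re.search(r"\b(\d+\.\d+\.\d+-\S+)", line)
    return m.group(1) if m else None


def dist_tag(release):
    """Return the Fedora dist tag such as 'fc28', assuming a '.fcNN.' component."""
    m = re.search(r"\.(fc\d+)\.", release)
    return m.group(1) if m else None


running = release_from_oops(OOPS_LINE)
print(running, dist_tag(running), dist_tag(REPORTED))
```

If the two dist tags differ, as they do here, the crashing kernel is not the build named in the description.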
It is caused by the L1 Terminal Fault patches. I opened:

https://bugzilla.kernel.org/show_bug.cgi?id=200867

Only the 4.17 tree is affected.
We tried to reproduce it by building gcc, but have been unsuccessful so far.

We may need a better reproducer.
(In reply to Andi Kleen from comment #6)
> We tried to reproduce it by building gcc, unsuccessful so far.
>
> May need a better reproducer

I can only reproduce it on a Haswell desktop processor so far.
So if I read the crash correctly, it looks like a TLB batch got corrupted.

It's this function:

	for (i = 0; i < nr; i++)
		free_swap_cache(pagep[i]);

and eventually, at 0x1f0, the page it references in pagep is bogus:

	fffbeda21e1b8400

A normal kernel address would be something like ffff9..., but the bogus address somehow lost bit 50. Very odd.

So one of the callers of __tlb_remove_page / tlb_remove_page_size has computed an invalid page. I looked at the callers and didn't see anything suspicious so far, and nothing that should differ from 4.18.

Can you attach some more crashes if you have them? I want to see if the pattern is always the same.

Can you also check whether it makes a difference if you disable transparent huge pages?

	echo never > /sys/kernel/mm/transparent_hugepage/enabled
Also, when you reproduce it, please put the vmlinux somewhere.
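The "lost bit 50" observation can be double-checked with a few lines of arithmetic. This is only a sketch under two assumptions that are not stated in the report: that the machine uses 48-bit virtual addresses (4-level paging), and that the intended pointer is the bogus one with its top 16 bits restored to ffff. The names `is_canonical`, `flipped_bits` and `GUESS` are hypothetical.

```python
# R13 from the oops: the pointer loaded from pagep[i] just before the fault.
BOGUS = 0xFFFBEDA21E1B8400

# Assumption: the intended pointer had all of bits 63..48 set,
# like the other ffff... values in the register dump.
GUESS = BOGUS | (0xFFFF << 48)


def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits-1) are all equal (assumes va_bits-bit addressing)."""
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


def flipped_bits(observed, expected):
    """List the bit positions where the two 64-bit values differ."""
    diff = observed ^ expected
    return [bit for bit in range(64) if (diff >> bit) & 1]


print(hex(GUESS))
print(flipped_bits(BOGUS, GUESS))
print(is_canonical(BOGUS), is_canonical(GUESS))
```

Under these assumptions the two values differ in exactly one bit, and the bogus value is non-canonical, which would be consistent with the access raising a general protection fault rather than a page fault.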
It looks like a hardware issue.