Bug 2282287
Summary: | 6.8.10-300 regression: general protection fault in nfsd_show when running sosreport with running NFS server | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Martin Pitt <mpitt> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | 40 | CC: | acaringi, adscvr, airlied, alciregi, anthony, bskeggs, edgar.hoch, fedoraproject, gilboad, hdegoede, hpa, idonaldson0, jonny, josef, jsullivan3, kernel-maint, kjell.m.randa, linville, masami256, mchehab, ngaywood, ptalbert, rg4redhat, steved, uwe.menges, vaibhav |
Target Milestone: | --- | Keywords: | Regression |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
URL: | https://cockpit-logs.us-east-1.linodeobjects.com/pull-0-f0d0c718-20240520-012823-fedora-40-updates-testing/TestSOS-testVerbose-fedora-40-127.0.0.2-2201-FAIL-1.log.gz | ||
Whiteboard: | CockpitTest | ||
Last Closed: | 2024-07-14 05:00:46 UTC | Type: | --- |
Description
Martin Pitt
2024-05-22 06:20:42 UTC
This issue is causing sysstat-collect.service to fail with SIGSEGV:

× sysstat-collect.service - system activity accounting tool
     Loaded: loaded (/usr/lib/systemd/system/sysstat-collect.service; static)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: signal) since Thu 2024-05-23 20:30:05 EDT; 7min ago
TriggeredBy: ● sysstat-collect.timer
       Docs: man:sa1(8)
    Process: 140780 ExecStart=/usr/lib64/sa/sa1 1 1 (code=killed, signal=SEGV)
   Main PID: 140780 (code=killed, signal=SEGV)
        CPU: 34ms

May 23 20:30:05 myhost systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
May 23 20:30:05 myhost systemd[1]: sysstat-collect.service: Main process exited, code=killed, status=11/SEGV
May 23 20:30:05 myhost systemd[1]: sysstat-collect.service: Failed with result 'signal'.
May 23 20:30:05 myhost systemd[1]: Failed to start sysstat-collect.service - system activity accounting tool.

The "journalctl -k" output shows the nfsd_show call trace listed above at the same time as this service failure.

I'm seeing a similar issue on NFS servers with 8.6.10 kernel, but the system isn't crashing; it just generates periodic backtraces similar to the above.

May 27 23:50:02 star kernel: RSP: 0018:ffff98434a423a30 EFLAGS: 00010046
May 27 23:50:02 star kernel: RAX: 0000000000000000 RBX: 0000000000000282 RCX: 000000000000001d
May 27 23:50:02 star kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
May 27 23:50:02 star kernel: RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
May 27 23:50:02 star kernel: R10: ffff98434a423ac0 R11: 0000000000000000 R12: ffff8d1849e1bb40
May 27 23:50:02 star kernel: R13: 7325203a53464e00 R14: 0000000000000001 R15: ffff98434a423c48
May 27 23:50:02 star kernel: FS: 00007f8ce93c0740(0000) GS:ffff8d196fc80000(0000) knlGS:0000000000000000
May 27 23:50:02 star kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 23:50:02 star kernel: CR2: 00007ffe5f054ff0 CR3: 0000000172a58000 CR4: 00000000000406f0
May 27 23:50:02 star kernel: Call Trace:
May 27 23:50:02 star kernel: <TASK>
May 27 23:50:02 star kernel: ? die_addr+0x36/0x90
May 27 23:50:02 star kernel: ? exc_general_protection+0x1dd/0x450
May 27 23:50:02 star kernel: ? asm_exc_general_protection+0x26/0x30
May 27 23:50:02 star kernel: ? _raw_spin_lock_irqsave+0x27/0x50
May 27 23:50:02 star kernel: __percpu_counter_sum+0x18/0xb0
May 27 23:50:02 star kernel: ? __kmalloc_node+0x48c/0x4f0
May 27 23:50:02 star kernel: nfsd_show+0x53/0x1f0 [nfsd]
May 27 23:50:02 star kernel: seq_read_iter+0x123/0x480
May 27 23:50:02 star kernel: seq_read+0x12f/0x170
May 27 23:50:02 star kernel: proc_reg_read+0x5d/0xa0
May 27 23:50:02 star kernel: vfs_read+0xaf/0x380
May 27 23:50:02 star kernel: ? _copy_to_user+0x24/0x40
May 27 23:50:02 star kernel: ? cp_new_stat+0x135/0x170
May 27 23:50:02 star kernel: ksys_read+0x6f/0xf0
May 27 23:50:02 star kernel: do_syscall_64+0x83/0x170
May 27 23:50:02 star kernel: ? __do_sys_newfstatat+0x4e/0x80
May 27 23:50:02 star kernel: ? syscall_exit_to_user_mode+0x83/0x230
May 27 23:50:02 star kernel: ? do_syscall_64+0x90/0x170
May 27 23:50:02 star kernel: ? do_filp_open+0xb3/0x160
May 27 23:50:02 star kernel: ? __pfx_proc_put_link+0x10/0x10
May 27 23:50:02 star kernel: ? __pfx_kfree_link+0x10/0x10
May 27 23:50:02 star kernel: ? do_sys_openat2+0x97/0xe0
May 27 23:50:02 star kernel: ? syscall_exit_to_user_mode+0x83/0x230
May 27 23:50:02 star kernel: ? do_syscall_64+0x90/0x170
May 27 23:50:02 star kernel: ? __irq_exit_rcu+0x4b/0xc0

That should read 6.8.10 kernel ... For now I've just reverted to the previous kernel I had handy, 6.8.4.

The problem still exists on kernel 6.8.11 (on Fedora 39). The crash is triggered on systems running nfs-server by sysstat-collect.service, which is called by sysstat-collect.timer every ten minutes. I have stopped the timer temporarily. The crash is also triggered by /usr/libexec/pcp/pmdas/linux/pmdalinux, which is called by some services of the pcp package.

I have a similar problem in my FC39 install. I also only noticed it through a regular general protection fault in my syslog. I asked about it here, with no response: https://forums.fedoraforum.org/showthread.php?332724-Kernel-general-protection-warning-every-10-minutes&p=1883838#post1883838
Today I uninstalled sysstat and I no longer get the warnings in syslog.

A local GitLab installation also triggers this, in addition to sysstat. Currently running 6.8.11.

[Mon Jun 3 11:23:46 2024] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[Mon Jun 3 11:23:46 2024] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 28 2a ee 4a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[Mon Jun 3 11:23:46 2024] RSP: 0018:ffffb880cd8a7950 EFLAGS: 00010046
[Mon Jun 3 11:23:46 2024] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 000000000000003d
[Mon Jun 3 11:23:46 2024] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
[Mon Jun 3 11:23:46 2024] RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
[Mon Jun 3 11:23:46 2024] R10: ffffb880cd8a79e0 R11: 0000000000000000 R12: ffff99cb86e3bca8
[Mon Jun 3 11:23:46 2024] R13: 7325203a53464e00 R14: 0000000000000001 R15: ffffb880cd8a7b68
[Mon Jun 3 11:23:46 2024] FS: 000000c000100090(0000) GS:ffff99ceaf380000(0000) knlGS:0000000000000000
[Mon Jun 3 11:23:46 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun 3 11:23:46 2024] CR2: 000000c0003d7000 CR3: 0000000388ad2003 CR4: 00000000000606f0
[Mon Jun 3 11:23:46 2024] note: node_exporter[173614] exited with irqs disabled
[Mon Jun 3 11:23:46 2024] note: node_exporter[173614] exited with preempt_count 1
[Mon Jun 3 11:24:01 2024] general protection fault, probably for non-canonical address 0x7325203a53465100: 0000 [#13] PREEMPT SMP PTI
[Mon Jun 3 11:24:01 2024] CPU: 1 PID: 173602 Comm: node_exporter Tainted: P D OE 6.8.11-200.fc39.x86_64 #1
[Mon Jun 3 11:24:01 2024] Hardware name: System manufacturer System Product Name/P8P67 DELUXE, BIOS 1502 03/02/2011
[Mon Jun 3 11:24:01 2024] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[Mon Jun 3 11:24:01 2024] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 28 2a ee 4a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[Mon Jun 3 11:24:01 2024] RSP: 0018:ffffb880d5d3fa38 EFLAGS: 00010046
[Mon Jun 3 11:24:01 2024] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 000000000000000c
[Mon Jun 3 11:24:01 2024] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
[Mon Jun 3 11:24:01 2024] RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
[Mon Jun 3 11:24:01 2024] R10: ffffb880d5d3fac8 R11: 0000000000000000 R12: ffff99cb8c830690
[Mon Jun 3 11:24:01 2024] R13: 7325203a53464e00 R14: 0000000000000001 R15: ffffb880d5d3fc50
[Mon Jun 3 11:24:01 2024] FS: 000000000112e250(0000) GS:ffff99ceaf280000(0000) knlGS:0000000000000000
[Mon Jun 3 11:24:01 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun 3 11:24:01 2024] CR2: 000000c0007a9008 CR3: 0000000388ad2005 CR4: 00000000000606f0
[Mon Jun 3 11:24:01 2024] Call Trace:
[Mon Jun 3 11:24:01 2024] <TASK>
[Mon Jun 3 11:24:01 2024] ? die_addr+0x36/0x90
[Mon Jun 3 11:24:01 2024] ? exc_general_protection+0x1dd/0x450
[Mon Jun 3 11:24:01 2024] ? asm_exc_general_protection+0x26/0x30
[Mon Jun 3 11:24:01 2024] ? _raw_spin_lock_irqsave+0x27/0x50
[Mon Jun 3 11:24:01 2024] __percpu_counter_sum+0x18/0xb0
[Mon Jun 3 11:24:01 2024] nfsd_show+0x53/0x1f0 [nfsd]
[Mon Jun 3 11:24:01 2024] seq_read_iter+0x123/0x480
[Mon Jun 3 11:24:01 2024] seq_read+0x12f/0x170
[Mon Jun 3 11:24:01 2024] proc_reg_read+0x5d/0xa0
[Mon Jun 3 11:24:01 2024] vfs_read+0xaf/0x380
[Mon Jun 3 11:24:01 2024] ? do_syscall_64+0x90/0x170
[Mon Jun 3 11:24:01 2024] ksys_read+0x6f/0xf0
[Mon Jun 3 11:24:01 2024] do_syscall_64+0x83/0x170
[Mon Jun 3 11:24:01 2024] ? __x64_sys_fcntl+0x81/0xc0
[Mon Jun 3 11:24:01 2024] ? syscall_exit_to_user_mode+0x83/0x230
[Mon Jun 3 11:24:01 2024] ? __memcg_slab_post_alloc_hook+0x17d/0x210
[Mon Jun 3 11:24:01 2024] ? kmem_cache_alloc+0x326/0x330
[Mon Jun 3 11:24:01 2024] ? syscall_exit_to_user_mode+0x83/0x230
[Mon Jun 3 11:24:01 2024] ? do_epoll_ctl+0x756/0x1000
[Mon Jun 3 11:24:01 2024] ? do_syscall_64+0x90/0x170
[Mon Jun 3 11:24:01 2024] ? ep_item_poll.isra.0+0x30/0x50
[Mon Jun 3 11:24:01 2024] ? do_epoll_ctl+0x1ce/0x1000
[Mon Jun 3 11:24:01 2024] ? __pfx_ep_ptable_queue_proc+0x10/0x10
[Mon Jun 3 11:24:01 2024] ? __x64_sys_epoll_ctl+0x70/0xa0
[Mon Jun 3 11:24:01 2024] ? syscall_exit_to_user_mode+0x83/0x230
[Mon Jun 3 11:24:01 2024] ? do_syscall_64+0x90/0x170
[Mon Jun 3 11:24:01 2024] ? do_syscall_64+0x90/0x170
[Mon Jun 3 11:24:01 2024] ? exc_page_fault+0x7f/0x180
[Mon Jun 3 11:24:01 2024] entry_SYSCALL_64_after_hwframe+0x78/0x80
[Mon Jun 3 11:24:01 2024] RIP: 0033:0x40720e
[Mon Jun 3 11:24:01 2024] Code: 48 83 ec 38 e8 13 00 00 00 48 83 c4 38 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48
[Mon Jun 3 11:24:01 2024] RSP: 002b:000000c0005291d0 EFLAGS: 00000216 ORIG_RAX: 0000000000000000
[Mon Jun 3 11:24:01 2024] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 000000000040720e
[Mon Jun 3 11:24:01 2024] RDX: 0000000000001000 RSI: 000000c00066a000 RDI: 000000000000000b
[Mon Jun 3 11:24:01 2024] RBP: 000000c000529210 R08: 0000000000000000 R09: 0000000000000000
[Mon Jun 3 11:24:01 2024] R10: 0000000000000000 R11: 0000000000000216 R12: 000000c000529350
[Mon Jun 3 11:24:01 2024] R13: 000000000112e1c0 R14: 000000c0002f56c0 R15: 0000000000000002
[Mon Jun 3 11:24:01 2024] </TASK>
[Mon Jun 3 11:24:01 2024] Modules linked in: 8021q garp mrp overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat ip6table_filter iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bridge stp llc qrtr rpcrdma rdma_cm iw_cm ib_cm ib_core nct6775 nct6775_core hwmon_vid nfsd auth_rpcgss nfs_acl lockd grace sunrpc tls bnep nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) binfmt_misc btusb btrtl btintel btbcm btmtk snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep pktcdvd xfs snd_seq iTCO_wdt intel_pmc_bxt raid456 async_raid6_recov intel_rapl_msr async_memcpy async_pq snd_seq_device iTCO_vendor_support async_xor async_tx snd_pcm at24 mei_me mei snd_timer snd soundcore intel_rapl_common i2c_i801 eeepc_wmi asus_wmi lpc_ich x86_pkg_temp_thermal intel_powerclamp i2c_smbus ledtrig_audio coretemp sparse_keymap platform_profile rapl
[Mon Jun 3 11:24:01 2024] intel_cstate rfkill intel_uncore video wmi_bmof loop zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic firewire_ohci ghash_clmulni_intel sha512_ssse3 mxm_wmi raid1 sha256_ssse3 sha1_ssse3 r8169 realtek firewire_core crc_itu_t sata_mv e1000e wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse i2c_dev
[Mon Jun 3 11:24:01 2024] ---[ end trace 0000000000000000 ]---
[Mon Jun 3 11:24:01 2024] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[Mon Jun 3 11:24:01 2024] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 28 2a ee 4a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[Mon Jun 3 11:24:01 2024] RSP: 0018:ffffb880cd8a7950 EFLAGS: 00010046
[Mon Jun 3 11:24:01 2024] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 000000000000003d
[Mon Jun 3 11:24:01 2024] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
[Mon Jun 3 11:24:01 2024] RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
[Mon Jun 3 11:24:01 2024] R10: ffffb880cd8a79e0 R11: 0000000000000000 R12: ffff99cb86e3bca8
[Mon Jun 3 11:24:01 2024] R13: 7325203a53464e00 R14: 0000000000000001 R15: ffffb880cd8a7b68
[Mon Jun 3 11:24:01 2024] FS: 000000000112e250(0000) GS:ffff99ceaf280000(0000) knlGS:0000000000000000
[Mon Jun 3 11:24:01 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun 3 11:24:01 2024] CR2: 000000c0007a9008 CR3: 0000000388ad2005 CR4: 00000000000606f0
[Mon Jun 3 11:24:01 2024] note: node_exporter[173602] exited with irqs disabled
[Mon Jun 3 11:24:01 2024] note: node_exporter[173602] exited with preempt_count 1

I don't see this crash on kernel 6.8.12.

Edgar, is nfsd running error-free for you on 6.8.12 or do you see different crashes? There is a discussion on https://bodhi.fedoraproject.org/updates/FEDORA-2024-2c08de9311 of a different nfsd issue apparently introduced in 6.8.11 and continuing.

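A side note on the fault address that appears in every trace in this thread (RDI/RBP 7325203a53465100, R13 7325203a53464e00). The Python sketch below checks why the kernel calls it non-canonical and shows what the raw bytes look like in memory order. The helper names `is_canonical` and `as_ascii` are made up for illustration, and 48-bit virtual addresses (4-level paging) are assumed. That the bytes decode to printable text may hint that the pointer was overwritten with string data, but that is only an observation from the register dump; nothing in this thread establishes the root cause.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits-1) of a 64-bit address are all equal.

    Assumption: va_bits=48, i.e. x86-64 with 4-level paging.
    """
    top = addr >> (va_bits - 1)               # the bits that must all match
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits for va_bits=48
    return top == 0 or top == all_ones


def as_ascii(value: int) -> bytes:
    """Return the 8 bytes of a register value in little-endian (memory) order."""
    return value.to_bytes(8, "little")


# Register values copied from the traces in this report.
rdi = 0x7325203A53465100
r13 = 0x7325203A53464E00
rsp = 0xFFFF98434A423A30  # an ordinary kernel stack pointer, for comparison

print(is_canonical(rdi))  # False
print(is_canonical(rsp))  # True
print(as_ascii(rdi))      # b'\x00QFS: %s'
print(as_ascii(r13))      # b'\x00NFS: %s'
print(hex(rdi - r13))     # 0x300
```

RDI is R13 plus 0x300, so the faulting lock address looks like a fixed offset from a base value that is itself not a valid pointer.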
Thomas, I don't see an NFS crash with kernel 6.8.12, neither on Fedora 39 nor on Fedora 40. But I don't use auth_rpcgss, which is mentioned in bug 2284279 and https://bodhi.fedoraproject.org/updates/FEDORA-2024-2c08de9311 . sysstat-collect.timer is running without causing a crash.

kernel-6.8.12-200.fc39.x86_64
systemd-254.13-1.fc39.x86_64
sysstat-12.7.4-2.fc39.x86_64

kernel-6.8.12-300.fc40.x86_64
systemd-255.7-1.fc40.x86_64
sysstat-12.7.5-2.fc40.x86_64

Seeing the same across ~10 machines with F40, with both 6.8.10 and 6.8.11. Reverting to 6.8.5 (the release kernel) solves the problem. Possibly the easiest way to reproduce this bug is to read /proc/net/rpc/nfsd, which is where /usr/lib64/sa/sa1 (as called by sysstat-collect.service) dies.

[root@opus ~]# cat /proc/net/rpc/nfsd
Segmentation fault
[root@opus ~]#

Reproduced on Fedora 39 running 6.8.11-200.fc39.x86_64. The upcoming kernel https://bodhi.fedoraproject.org/updates/FEDORA-2024-f0bbf1af25 fixed that for me.

# uname -r
6.8.12-200.fc39.x86_64
# cat /proc/net/rpc/nfsd
rc 0 0 0
fh 0 0 0 0 0
io 0 0
th 8 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
ra 0 0 0 0 0 0 0 0 0 0 0 0
net 0 0 0 0
rpc 0 0 0 0 0
proc3 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc4 2 0 0
proc4ops 76 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
wdeleg_getattr 0

Uwe's referenced bodhi update was for Fedora 39. But indeed our automatic tracker [1] confirms that this has been fixed since June 21.

[1] https://github.com/cockpit-project/bots/issues/6411
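
For anyone who wants to script the reproducer described in the comments above (reading /proc/net/rpc/nfsd), here is a minimal Python sketch. It does the read in a child process, so that if the reader is killed the way `cat` was, the parent can still report which signal ended it. The function names `read_in_child` and `describe` are hypothetical, chosen for this example; only the path comes from this report. Note that on an affected kernel each run would presumably log another general protection fault, so this is best tried on a test machine.

```python
import signal
import subprocess
import sys

NFSD_STATS = "/proc/net/rpc/nfsd"  # path taken from this bug report


def read_in_child(path: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Read `path` in a separate Python process and capture its output."""
    reader = "import sys; sys.stdout.write(open(sys.argv[1]).read())"
    return subprocess.run(
        [sys.executable, "-c", reader, path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result: subprocess.CompletedProcess) -> str:
    """Summarise how the child reader ended."""
    if result.returncode < 0:
        # subprocess reports death by signal N as returncode -N.
        return f"reader killed by {signal.Signals(-result.returncode).name}"
    if result.returncode != 0:
        return f"reader failed with exit status {result.returncode}"
    return f"read OK, {len(result.stdout.splitlines())} lines"


if __name__ == "__main__":
    print(describe(read_in_child(NFSD_STATS)))
```

On a fixed kernel with nfsd running this should report a successful read; on a machine without nfsd the file may not exist and the reader simply fails with a non-zero exit status.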