Bug 2282287

Summary:	6.8.10-300 regression: general protection fault in nfsd_show when running sosreport with running NFS server
Product:	[Fedora] Fedora	Reporter:	Martin Pitt <mpitt>
Component:	kernel	Assignee:	Kernel Maintainer List <kernel-maint>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	40	CC:	acaringi, adscvr, airlied, alciregi, anthony, bskeggs, edgar.hoch, fedoraproject, gilboad, hdegoede, hpa, idonaldson0, jonny, josef, jsullivan3, kernel-maint, kjell.m.randa, linville, masami256, mchehab, ngaywood, ptalbert, rg4redhat, steved, uwe.menges, vaibhav
Target Milestone:	---	Keywords:	Regression
Target Release:	---
Hardware:	x86_64
OS:	Linux
URL:	https://cockpit-logs.us-east-1.linodeobjects.com/pull-0-f0d0c718-20240520-012823-fedora-40-updates-testing/TestSOS-testVerbose-fedora-40-127.0.0.2-2201-FAIL-1.log.gz
Whiteboard:	CockpitTest
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2024-07-14 05:00:46 UTC	Type:	---

Description Martin Pitt 2024-05-22 06:20:42 UTC

1. Please describe the problem: Our Cockpit integration tests found [1] a kernel regression in 6.8.10-300 [2]. Running `sos report` while the NFS server is running triggers a kernel crash, and the command hangs.

[1] https://github.com/cockpit-project/cockpit/issues/20488
[2] https://bodhi.fedoraproject.org/updates/FEDORA-2024-92664ae6fe


2. What is the Version-Release number of the kernel:

kernel-6.8.10-300.fc40


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It still worked up to 6.8.9-300; the regression was introduced in 6.8.10-300.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below: 

systemctl start nfs-server
sos report --batch

This hangs at

  Starting 50/101 multipath       [Running: dnf logs memory multipath]
  Starting 51/101 networking      [Running: dnf logs multipath networking]
  Starting 52/101 networkmanager  [Running: dnf logs networking networkmanager]

 Plugin dnf timed out


 Plugin logs timed out


 Plugin networking timed out


 Plugin networkmanager timed out

and dmesg/journal show a kernel crash:

[   70.663153] general protection fault, probably for non-canonical address 0x207325000a646c74: 0000 [#1] PREEMPT SMP NOPTI
[   70.664352] CPU: 0 PID: 5630 Comm: sos Not tainted 6.8.10-300.fc40.x86_64 #1
[   70.665163] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
[   70.666123] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[   70.666668] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 48 c5 ec 7d 31 c0 ba 01 00 00 00 <3e> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[   70.668754] RSP: 0018:ffffa42044247a30 EFLAGS: 00010046
[   70.669346] RAX: 0000000000000000 RBX: 0000000000000282 RCX: 000000000000001d
[   70.670083] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 207325000a646c74
[   70.670818] RBP: 207325000a646c74 R08: 0000000000000001 R09: 0000000000000000
[   70.671522] R10: ffffa42044247ac0 R11: 0000000000000000 R12: ffff91c5c94e4bb8
[   70.672260] R13: 207325000a646974 R14: 0000000000000001 R15: 0000000000000001
[   70.672966] FS:  00007eff74c006c0(0000) GS:ffff91c606a00000(0000) knlGS:0000000000000000
[   70.673814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.674444] CR2: 00007eff70003270 CR3: 0000000007dc2006 CR4: 0000000000370ef0
[   70.675135] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   70.675869] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   70.676596] Call Trace:
[   70.676851]  <TASK>
[   70.677072]  ? die_addr+0x36/0x90
[   70.677428]  ? exc_general_protection+0x17c/0x450
[   70.677918]  ? asm_exc_general_protection+0x26/0x30
[   70.678442]  ? _raw_spin_lock_irqsave+0x27/0x50
[   70.678903]  __percpu_counter_sum+0x18/0xb0
[   70.679336]  nfsd_show+0x53/0x1f0 [nfsd]
[   70.679814]  seq_read_iter+0x11f/0x480
[   70.680214]  seq_read+0x12f/0x170
[   70.680554]  proc_reg_read+0x5a/0xa0
[   70.681182]  vfs_read+0xac/0x380
[   70.681711]  ? do_syscall_64+0x8f/0x170
[   70.682323]  ksys_read+0x6d/0xf0
[   70.682856]  do_syscall_64+0x83/0x170
[   70.683483]  ? syscall_exit_to_user_mode+0x83/0x230
[   70.684264]  ? do_syscall_64+0x8f/0x170
[   70.684909]  ? current_time+0x3e/0xf0
[   70.685537]  ? atime_needs_update+0x9c/0x110
[   70.686229]  ? touch_atime+0x1e/0x120
[   70.686848]  ? splice_direct_to_actor+0x1e4/0x260
[   70.687585]  ? __pfx_direct_splice_actor+0x10/0x10
[   70.688349]  ? do_splice_direct+0x77/0xc0
[   70.689012]  ? __pfx_direct_file_splice_eof+0x10/0x10
[   70.689817]  ? do_sendfile+0x211/0x440
[   70.690460]  ? __x64_sys_sendfile64+0x78/0xd0
[   70.691182]  ? syscall_exit_to_user_mode+0x83/0x230
[   70.691969]  ? do_syscall_64+0x8f/0x170
[   70.692592]  ? syscall_exit_to_user_mode+0x83/0x230
[   70.693365]  ? do_syscall_64+0x8f/0x170
[   70.694020]  ? do_syscall_64+0x8f/0x170
[   70.694654]  ? switch_fpu_return+0x4f/0xe0
[   70.695302]  ? clear_bhb_loop+0x55/0xb0
[   70.695916]  ? clear_bhb_loop+0x55/0xb0
[   70.696540]  ? clear_bhb_loop+0x55/0xb0
[   70.697166]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[   70.697918] RIP: 0033:0x7eff8351dcfa
[   70.698504] Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
[   70.701071] RSP: 002b:00007eff74bff710 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   70.702129] RAX: ffffffffffffffda RBX: 00007eff74c00638 RCX: 00007eff8351dcfa
[   70.703156] RDX: 0000000000010000 RSI: 00007eff4c00ad70 RDI: 0000000000000007
[   70.704162] RBP: 00007eff74bff730 R08: 0000000000000000 R09: 0000000000000000
[   70.705191] R10: 00007eff7f7ad780 R11: 0000000000000246 R12: 0000000000010000
[   70.706207] R13: 00007eff4c00ad70 R14: 0000000000000007 R15: 00007eff78002120
[   70.707210]  </TASK>
[   70.707625] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nf_conntrack_tftp bridge stp llc overlay nfsd auth_rpcgss nfs_acl lockd grace nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables binfmt_misc intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass rapl virtio_balloon i2c_piix4 pktcdvd cirrus joydev vfat fat loop nfnetlink zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel virtio_net sha512_ssse3 sha256_ssse3 sha1_ssse3 net_failover virtio_blk virtio_scsi failover serio_raw ata_generic pata_acpi sunrpc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse dm_multipath qemu_fw_cfg
[   70.718141] ---[ end trace 0000000000000000 ]---
[   70.718911] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[   70.719751] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 48 c5 ec 7d 31 c0 ba 01 00 00 00 <3e> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[   70.722431] RSP: 0018:ffffa42044247a30 EFLAGS: 00010046
[   70.723273] RAX: 0000000000000000 RBX: 0000000000000282 RCX: 000000000000001d
[   70.724377] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 207325000a646c74
[   70.725468] RBP: 207325000a646c74 R08: 0000000000000001 R09: 0000000000000000
[   70.726576] R10: ffffa42044247ac0 R11: 0000000000000000 R12: ffff91c5c94e4bb8
[   70.727693] R13: 207325000a646974 R14: 0000000000000001 R15: 0000000000000001
[   70.728802] FS:  00007eff74c006c0(0000) GS:ffff91c606a00000(0000) knlGS:0000000000000000
[   70.730057] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.730986] CR2: 00007eff70003270 CR3: 0000000007dc2006 CR4: 0000000000370ef0
[   70.732090] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   70.733194] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   70.734314] note: sos[5630] exited with irqs disabled
[   70.735214] note: sos[5630] exited with preempt_count 1
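
One way to read the faulting address, offered as a sketch rather than a diagnosis: the value in RDI/RBP is reported as non-canonical, and several of its bytes are printable ASCII when laid out in little-endian (memory) order. The helper names below are illustrative only, and the canonical check assumes 48-bit virtual addresses (4-level paging).

```python
def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def as_ascii(addr: int) -> str:
    """Render the 8 bytes of a 64-bit value in little-endian (memory) order."""
    raw = addr.to_bytes(8, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else f"\\x{b:02x}" for b in raw)


# Faulting addresses copied from the traces in this report.
for addr in (0x207325000A646C74, 0x7325203A53465100):
    print(hex(addr), is_canonical(addr), repr(as_ascii(addr)))
```

Both values fail the canonical check, and both decode to text containing `%s`, which looks like a fragment of a format string. That would be consistent with the pointer passed to the lock routine having been loaded from memory that holds string data, but this report does not establish the root cause.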




5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Done that (with ``kernel-core`` though), and with 6.9.0-64.fc41 it does not crash.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No, standard Fedora cloud image, no additional repos.


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Crash excerpt is above, full journal is here:
https://cockpit-logs.us-east-1.linodeobjects.com/pull-0-f0d0c718-20240520-012823-fedora-40-updates-testing/TestSOS-testVerbose-fedora-40-127.0.0.2-2201-FAIL-1.log.gz

Reproducible: Always

Comment 1 John F Sullivan 2024-05-24 00:42:15 UTC

This issue is causing sysstat-collect.service to fail with SIGSEGV:

× sysstat-collect.service - system activity accounting tool
     Loaded: loaded (/usr/lib/systemd/system/sysstat-collect.service; static)
    Drop-In: /usr/lib/systemd/system/service.d
             └─10-timeout-abort.conf
     Active: failed (Result: signal) since Thu 2024-05-23 20:30:05 EDT; 7min ago
TriggeredBy: ● sysstat-collect.timer
       Docs: man:sa1(8)
    Process: 140780 ExecStart=/usr/lib64/sa/sa1 1 1 (code=killed, signal=SEGV)
   Main PID: 140780 (code=killed, signal=SEGV)
        CPU: 34ms

May 23 20:30:05 myhost systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
May 23 20:30:05 myhost systemd[1]: sysstat-collect.service: Main process exited, code=killed, status=11/SEGV
May 23 20:30:05 myhost systemd[1]: sysstat-collect.service: Failed with result 'signal'.
May 23 20:30:05 myhost systemd[1]: Failed to start sysstat-collect.service - system activity accounting tool.

The "journalctl -k" output shows the nfsd_show call trace listed above at the same time as this service failure.

Comment 2 Ian Donaldson 2024-05-28 06:10:20 UTC

I'm seeing a similar issue on NFS servers with the 8.6.10 kernel, but the system isn't crashing; it just generates periodic
backtraces similar to the above.

May 27 23:50:02 star kernel: RSP: 0018:ffff98434a423a30 EFLAGS: 00010046
May 27 23:50:02 star kernel: RAX: 0000000000000000 RBX: 0000000000000282 RCX: 000000000000001d
May 27 23:50:02 star kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
May 27 23:50:02 star kernel: RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
May 27 23:50:02 star kernel: R10: ffff98434a423ac0 R11: 0000000000000000 R12: ffff8d1849e1bb40
May 27 23:50:02 star kernel: R13: 7325203a53464e00 R14: 0000000000000001 R15: ffff98434a423c48
May 27 23:50:02 star kernel: FS:  00007f8ce93c0740(0000) GS:ffff8d196fc80000(0000) knlGS:0000000000000000
May 27 23:50:02 star kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 23:50:02 star kernel: CR2: 00007ffe5f054ff0 CR3: 0000000172a58000 CR4: 00000000000406f0
May 27 23:50:02 star kernel: Call Trace:
May 27 23:50:02 star kernel: <TASK>
May 27 23:50:02 star kernel: ? die_addr+0x36/0x90
May 27 23:50:02 star kernel: ? exc_general_protection+0x1dd/0x450
May 27 23:50:02 star kernel: ? asm_exc_general_protection+0x26/0x30
May 27 23:50:02 star kernel: ? _raw_spin_lock_irqsave+0x27/0x50
May 27 23:50:02 star kernel: __percpu_counter_sum+0x18/0xb0
May 27 23:50:02 star kernel: ? __kmalloc_node+0x48c/0x4f0
May 27 23:50:02 star kernel: nfsd_show+0x53/0x1f0 [nfsd]
May 27 23:50:02 star kernel: seq_read_iter+0x123/0x480
May 27 23:50:02 star kernel: seq_read+0x12f/0x170
May 27 23:50:02 star kernel: proc_reg_read+0x5d/0xa0
May 27 23:50:02 star kernel: vfs_read+0xaf/0x380
May 27 23:50:02 star kernel: ? _copy_to_user+0x24/0x40
May 27 23:50:02 star kernel: ? cp_new_stat+0x135/0x170
May 27 23:50:02 star kernel: ksys_read+0x6f/0xf0
May 27 23:50:02 star kernel: do_syscall_64+0x83/0x170
May 27 23:50:02 star kernel: ? __do_sys_newfstatat+0x4e/0x80
May 27 23:50:02 star kernel: ? syscall_exit_to_user_mode+0x83/0x230
May 27 23:50:02 star kernel: ? do_syscall_64+0x90/0x170
May 27 23:50:02 star kernel: ? do_filp_open+0xb3/0x160
May 27 23:50:02 star kernel: ? __pfx_proc_put_link+0x10/0x10
May 27 23:50:02 star kernel: ? __pfx_kfree_link+0x10/0x10
May 27 23:50:02 star kernel: ? do_sys_openat2+0x97/0xe0
May 27 23:50:02 star kernel: ? syscall_exit_to_user_mode+0x83/0x230
May 27 23:50:02 star kernel: ? do_syscall_64+0x90/0x170
May 27 23:50:02 star kernel: ? __irq_exit_rcu+0x4b/0xc0
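
To compare traces across reports, the frames can be extracted mechanically. This is a minimal sketch, assuming the usual `symbol+0xoffset/0xsize [module]` frame layout seen in the logs above; `parse_frames` is a hypothetical helper name. Frames prefixed with `?` are ones the unwinder could not verify, so they are flagged separately.

```python
import re

FRAME_RE = re.compile(
    r"(?P<unreliable>\? )?(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def parse_frames(lines):
    """Return (symbol, module, reliable) tuples for each stack frame line."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m["symbol"], m["module"], m["unreliable"] is None))
    return frames


# A few lines copied from the trace above.
sample = """\
May 27 23:50:02 star kernel: ? _raw_spin_lock_irqsave+0x27/0x50
May 27 23:50:02 star kernel: __percpu_counter_sum+0x18/0xb0
May 27 23:50:02 star kernel: ? __kmalloc_node+0x48c/0x4f0
May 27 23:50:02 star kernel: nfsd_show+0x53/0x1f0 [nfsd]
May 27 23:50:02 star kernel: seq_read_iter+0x123/0x480
""".splitlines()

reliable = [sym for sym, _mod, ok in parse_frames(sample) if ok]
print(reliable)
```

Filtering to the reliable frames makes it easier to see that the reports in this bug share the same path through `nfsd_show` and `__percpu_counter_sum`.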

Comment 3 Ian Donaldson 2024-05-28 06:13:06 UTC

That should read 6.8.10 kernel ...

For now I've just reverted to the previous kernel I had handy, 6.8.4

Comment 4 Edgar Hoch 2024-05-28 12:17:25 UTC

The problem still exists on kernel 6.8.11 (on Fedora 39).

The crash is triggered on systems running nfs-server by sysstat-collect.service, which is called by sysstat-collect.timer every ten minutes. I have stopped the timer temporarily.

The crash is also triggered by /usr/libexec/pcp/pmdas/linux/pmdalinux which is called by some services of package pcp.

Comment 5 Anthony 2024-05-30 15:20:27 UTC

I have a similar problem on my FC39 install; I only noticed it because of a regular general protection fault in my syslog. I asked about it here, with no response: https://forums.fedoraforum.org/showthread.php?332724-Kernel-general-protection-warning-every-10-minutes&p=1883838#post1883838

Today I uninstalled sysstat and I no longer get the warnings in syslog

Comment 6 Kjell Randa 2024-06-03 10:11:16 UTC

A local GitLab installation also triggers this, in addition to sysstat.
Currently running 6.8.11

[Mon Jun  3 11:23:46 2024] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[Mon Jun  3 11:23:46 2024] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 28 2a ee 4a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[Mon Jun  3 11:23:46 2024] RSP: 0018:ffffb880cd8a7950 EFLAGS: 00010046
[Mon Jun  3 11:23:46 2024] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 000000000000003d
[Mon Jun  3 11:23:46 2024] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
[Mon Jun  3 11:23:46 2024] RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
[Mon Jun  3 11:23:46 2024] R10: ffffb880cd8a79e0 R11: 0000000000000000 R12: ffff99cb86e3bca8
[Mon Jun  3 11:23:46 2024] R13: 7325203a53464e00 R14: 0000000000000001 R15: ffffb880cd8a7b68
[Mon Jun  3 11:23:46 2024] FS:  000000c000100090(0000) GS:ffff99ceaf380000(0000) knlGS:0000000000000000
[Mon Jun  3 11:23:46 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun  3 11:23:46 2024] CR2: 000000c0003d7000 CR3: 0000000388ad2003 CR4: 00000000000606f0
[Mon Jun  3 11:23:46 2024] note: node_exporter[173614] exited with irqs disabled
[Mon Jun  3 11:23:46 2024] note: node_exporter[173614] exited with preempt_count 1
[Mon Jun  3 11:24:01 2024] general protection fault, probably for non-canonical address 0x7325203a53465100: 0000 [#13] PREEMPT SMP PTI
[Mon Jun  3 11:24:01 2024] CPU: 1 PID: 173602 Comm: node_exporter Tainted: P      D    OE      6.8.11-200.fc39.x86_64 #1
[Mon Jun  3 11:24:01 2024] Hardware name: System manufacturer System Product Name/P8P67 DELUXE, BIOS 1502 03/02/2011
[Mon Jun  3 11:24:01 2024] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[Mon Jun  3 11:24:01 2024] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 28 2a ee 4a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[Mon Jun  3 11:24:01 2024] RSP: 0018:ffffb880d5d3fa38 EFLAGS: 00010046
[Mon Jun  3 11:24:01 2024] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 000000000000000c
[Mon Jun  3 11:24:01 2024] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
[Mon Jun  3 11:24:01 2024] RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
[Mon Jun  3 11:24:01 2024] R10: ffffb880d5d3fac8 R11: 0000000000000000 R12: ffff99cb8c830690
[Mon Jun  3 11:24:01 2024] R13: 7325203a53464e00 R14: 0000000000000001 R15: ffffb880d5d3fc50
[Mon Jun  3 11:24:01 2024] FS:  000000000112e250(0000) GS:ffff99ceaf280000(0000) knlGS:0000000000000000
[Mon Jun  3 11:24:01 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun  3 11:24:01 2024] CR2: 000000c0007a9008 CR3: 0000000388ad2005 CR4: 00000000000606f0
[Mon Jun  3 11:24:01 2024] Call Trace:
[Mon Jun  3 11:24:01 2024]  <TASK>
[Mon Jun  3 11:24:01 2024]  ? die_addr+0x36/0x90
[Mon Jun  3 11:24:01 2024]  ? exc_general_protection+0x1dd/0x450
[Mon Jun  3 11:24:01 2024]  ? asm_exc_general_protection+0x26/0x30
[Mon Jun  3 11:24:01 2024]  ? _raw_spin_lock_irqsave+0x27/0x50
[Mon Jun  3 11:24:01 2024]  __percpu_counter_sum+0x18/0xb0
[Mon Jun  3 11:24:01 2024]  nfsd_show+0x53/0x1f0 [nfsd]
[Mon Jun  3 11:24:01 2024]  seq_read_iter+0x123/0x480
[Mon Jun  3 11:24:01 2024]  seq_read+0x12f/0x170
[Mon Jun  3 11:24:01 2024]  proc_reg_read+0x5d/0xa0
[Mon Jun  3 11:24:01 2024]  vfs_read+0xaf/0x380
[Mon Jun  3 11:24:01 2024]  ? do_syscall_64+0x90/0x170
[Mon Jun  3 11:24:01 2024]  ksys_read+0x6f/0xf0
[Mon Jun  3 11:24:01 2024]  do_syscall_64+0x83/0x170
[Mon Jun  3 11:24:01 2024]  ? __x64_sys_fcntl+0x81/0xc0
[Mon Jun  3 11:24:01 2024]  ? syscall_exit_to_user_mode+0x83/0x230
[Mon Jun  3 11:24:01 2024]  ? __memcg_slab_post_alloc_hook+0x17d/0x210
[Mon Jun  3 11:24:01 2024]  ? kmem_cache_alloc+0x326/0x330
[Mon Jun  3 11:24:01 2024]  ? syscall_exit_to_user_mode+0x83/0x230
[Mon Jun  3 11:24:01 2024]  ? do_epoll_ctl+0x756/0x1000
[Mon Jun  3 11:24:01 2024]  ? do_syscall_64+0x90/0x170
[Mon Jun  3 11:24:01 2024]  ? ep_item_poll.isra.0+0x30/0x50
[Mon Jun  3 11:24:01 2024]  ? do_epoll_ctl+0x1ce/0x1000
[Mon Jun  3 11:24:01 2024]  ? __pfx_ep_ptable_queue_proc+0x10/0x10
[Mon Jun  3 11:24:01 2024]  ? __x64_sys_epoll_ctl+0x70/0xa0
[Mon Jun  3 11:24:01 2024]  ? syscall_exit_to_user_mode+0x83/0x230
[Mon Jun  3 11:24:01 2024]  ? do_syscall_64+0x90/0x170
[Mon Jun  3 11:24:01 2024]  ? do_syscall_64+0x90/0x170
[Mon Jun  3 11:24:01 2024]  ? exc_page_fault+0x7f/0x180
[Mon Jun  3 11:24:01 2024]  entry_SYSCALL_64_after_hwframe+0x78/0x80
[Mon Jun  3 11:24:01 2024] RIP: 0033:0x40720e
[Mon Jun  3 11:24:01 2024] Code: 48 83 ec 38 e8 13 00 00 00 48 83 c4 38 5d c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 49 89 f2 48 89 fa 48 89 ce 48 89 df 0f 05 <48> 3d 01 f0 ff ff 76 15 48 f7 d8 48 89 c1 48 c7 c0 ff ff ff ff 48
[Mon Jun  3 11:24:01 2024] RSP: 002b:000000c0005291d0 EFLAGS: 00000216 ORIG_RAX: 0000000000000000
[Mon Jun  3 11:24:01 2024] RAX: ffffffffffffffda RBX: 000000000000000b RCX: 000000000040720e
[Mon Jun  3 11:24:01 2024] RDX: 0000000000001000 RSI: 000000c00066a000 RDI: 000000000000000b
[Mon Jun  3 11:24:01 2024] RBP: 000000c000529210 R08: 0000000000000000 R09: 0000000000000000
[Mon Jun  3 11:24:01 2024] R10: 0000000000000000 R11: 0000000000000216 R12: 000000c000529350
[Mon Jun  3 11:24:01 2024] R13: 000000000112e1c0 R14: 000000c0002f56c0 R15: 0000000000000002
[Mon Jun  3 11:24:01 2024]  </TASK>
[Mon Jun  3 11:24:01 2024] Modules linked in: 8021q garp mrp overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat ip6table_filter iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bridge stp llc qrtr rpcrdma rdma_cm iw_cm ib_cm ib_core nct6775 nct6775_core hwmon_vid nfsd auth_rpcgss nfs_acl lockd grace sunrpc tls bnep nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) binfmt_misc btusb btrtl btintel btbcm btmtk snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep pktcdvd xfs snd_seq iTCO_wdt intel_pmc_bxt raid456 async_raid6_recov intel_rapl_msr async_memcpy async_pq snd_seq_device iTCO_vendor_support async_xor async_tx snd_pcm at24 mei_me mei snd_timer snd soundcore intel_rapl_common i2c_i801 eeepc_wmi asus_wmi lpc_ich x86_pkg_temp_thermal intel_powerclamp i2c_smbus ledtrig_audio coretemp sparse_keymap platform_profile rapl
[Mon Jun  3 11:24:01 2024]  intel_cstate rfkill intel_uncore video wmi_bmof loop zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic firewire_ohci ghash_clmulni_intel sha512_ssse3 mxm_wmi raid1 sha256_ssse3 sha1_ssse3 r8169 realtek firewire_core crc_itu_t sata_mv e1000e wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse i2c_dev
[Mon Jun  3 11:24:01 2024] ---[ end trace 0000000000000000 ]---
[Mon Jun  3 11:24:01 2024] RIP: 0010:_raw_spin_lock_irqsave+0x27/0x50
[Mon Jun  3 11:24:01 2024] Code: 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 53 9c 58 0f 1f 40 00 48 89 c3 fa 0f 1f 44 00 00 65 ff 05 28 2a ee 4a 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 09 48 89 d8 5b c3 cc cc cc cc 89 c6 e8 93 08 00 00
[Mon Jun  3 11:24:01 2024] RSP: 0018:ffffb880cd8a7950 EFLAGS: 00010046
[Mon Jun  3 11:24:01 2024] RAX: 0000000000000000 RBX: 0000000000000286 RCX: 000000000000003d
[Mon Jun  3 11:24:01 2024] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 7325203a53465100
[Mon Jun  3 11:24:01 2024] RBP: 7325203a53465100 R08: 0000000000000001 R09: 0000000000000000
[Mon Jun  3 11:24:01 2024] R10: ffffb880cd8a79e0 R11: 0000000000000000 R12: ffff99cb86e3bca8
[Mon Jun  3 11:24:01 2024] R13: 7325203a53464e00 R14: 0000000000000001 R15: ffffb880cd8a7b68
[Mon Jun  3 11:24:01 2024] FS:  000000000112e250(0000) GS:ffff99ceaf280000(0000) knlGS:0000000000000000
[Mon Jun  3 11:24:01 2024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Jun  3 11:24:01 2024] CR2: 000000c0007a9008 CR3: 0000000388ad2005 CR4: 00000000000606f0
[Mon Jun  3 11:24:01 2024] note: node_exporter[173602] exited with irqs disabled
[Mon Jun  3 11:24:01 2024] note: node_exporter[173602] exited with preempt_count 1

Comment 7 Edgar Hoch 2024-06-03 10:21:44 UTC

I don't see this crash on kernel 6.8.12.

Comment 8 Thomas Clark 2024-06-05 20:16:23 UTC

Edgar, is nfsd running error-free for you on 6.8.12 or do you see different crashes? There is a discussion on https://bodhi.fedoraproject.org/updates/FEDORA-2024-2c08de9311 of a different nfsd issue apparently introduced in 6.8.11 and continuing.

Comment 9 Edgar Hoch 2024-06-06 00:08:15 UTC

Thomas, I don't see an NFS crash with kernel 6.8.12, neither on Fedora 39 nor on Fedora 40.
But I don't use auth_rpcgss, which is mentioned in bug 2284279 and https://bodhi.fedoraproject.org/updates/FEDORA-2024-2c08de9311 .

sysstat-collect.timer is running without causing a crash.

kernel-6.8.12-200.fc39.x86_64
systemd-254.13-1.fc39.x86_64
sysstat-12.7.4-2.fc39.x86_64

kernel-6.8.12-300.fc40.x86_64
systemd-255.7-1.fc40.x86_64
sysstat-12.7.5-2.fc40.x86_64
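
Going only by the reports collected in this bug so far (6.8.9 worked, 6.8.10 and 6.8.11 crashed, 6.8.12 did not), a kernel release string can be checked against the reported range. This is a sketch: the bounds are an assumption drawn from these comments, not from an upstream changelog, and `kernel_version` and `is_affected` are hypothetical helper names.

```python
import re

# Assumption: bounds taken from the comments in this report only.
FIRST_BAD = (6, 8, 10)
LAST_BAD = (6, 8, 11)


def kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from e.g. '6.8.10-300.fc40.x86_64'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in m.groups())


def is_affected(release: str) -> bool:
    """True if the release falls inside the range reported as crashing."""
    return FIRST_BAD <= kernel_version(release) <= LAST_BAD


for rel in ("6.8.10-300.fc40.x86_64", "6.8.12-200.fc39.x86_64"):
    print(rel, "affected" if is_affected(rel) else "not in the reported range")
```

Passing `platform.release()` from the standard library would check the running kernel instead of the sample strings.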

Comment 10 Gilboa Davara 2024-06-08 08:35:11 UTC

Seeing the same across ~10 machines running F40, with both 6.8.10 and 6.8.11.
Reverting to 6.8.5 (the release kernel) solves the problem.

Comment 11 Richard G 2024-06-09 02:35:20 UTC

Possibly the easiest way to reproduce this bug is to read /proc/net/rpc/nfsd, which is where /usr/lib64/sa/sa1 (as called by sysstat-collect.service) dies.

[root@opus ~]# cat /proc/net/rpc/nfsd
Segmentation fault
[root@opus ~]#

Reproduced on Fedora 39 running 6.8.11-200.fc39.x86_64.
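
The same check can be scripted so that the read happens in a child process and the parent only inspects how the child ended. This is a sketch, assuming `cat` is on the PATH; `probe_read` is a hypothetical name. On an affected kernel it triggers the fault described in this report, so it is only suitable for a test machine.

```python
import signal
import subprocess


def probe_read(path: str) -> str:
    """Read `path` with cat in a child process and describe how the child exited."""
    proc = subprocess.run(["cat", path], capture_output=True)
    if proc.returncode < 0:
        # A negative return code means the child was killed by that signal number.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    if proc.returncode != 0:
        return f"failed with exit status {proc.returncode}"
    return f"ok, read {len(proc.stdout)} bytes"


if __name__ == "__main__":
    print(probe_read("/proc/net/rpc/nfsd"))
```

A result of `killed by SIGSEGV` would correspond to the "Segmentation fault" shown in the shell transcript above.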

Comment 12 Uwe Menges 2024-06-12 12:19:52 UTC

The upcoming kernel https://bodhi.fedoraproject.org/updates/FEDORA-2024-f0bbf1af25 fixed that for me.
# uname -r
6.8.12-200.fc39.x86_64
# cat /proc/net/rpc/nfsd
rc 0 0 0
fh 0 0 0 0 0
io 0 0
th 8 0 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
ra 0 0 0 0 0 0 0 0 0 0 0 0
net 0 0 0 0
rpc 0 0 0 0 0
proc3 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc4 2 0 0
proc4ops 76 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
wdeleg_getattr 0
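
For tools that consume this file, each line in the output above is a label followed by whitespace-separated numbers. A minimal parsing sketch, assuming exactly that layout; `parse_nfsd_stats` is a hypothetical name, and the meaning of the individual fields is not interpreted here.

```python
def parse_nfsd_stats(text: str) -> dict:
    """Map each line's leading label to its list of numeric fields."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        label, values = fields[0], fields[1:]
        stats[label] = [float(v) if "." in v else int(v) for v in values]
    return stats


# Abbreviated from the output above (the "th" line is shortened).
sample = """\
rc 0 0 0
io 0 0
th 8 0 0.000 0.000
proc4 2 0 0
wdeleg_getattr 0
"""
stats = parse_nfsd_stats(sample)
print(stats["th"][0], stats["proc4"])
```

On a live system the text would come from reading /proc/net/rpc/nfsd, which is safe only on a kernel that does not show this fault.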

Comment 13 Martin Pitt 2024-07-14 05:00:46 UTC

Uwe's referenced bodhi update was for Fedora 39. But indeed our automatic tracker [1] confirms that this has been fixed since June 21.

[1] https://github.com/cockpit-project/bots/issues/6411