Bug 1776572 - Frequent kernel freeze with key_garbage_collector
Summary: Frequent kernel freeze with key_garbage_collector
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 31
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-11-26 02:55 UTC by Dmitrij S. Kryzhevich
Modified: 2020-03-09 05:11 UTC (History)
23 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-09 05:11:59 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Linux Kernel 205671 0 None None None 2019-11-27 01:15:04 UTC

Description Dmitrij S. Kryzhevich 2019-11-26 02:55:23 UTC
1. Please describe the problem:
System freeze every 1-4 hours. Logs contain the following:

> Workqueue: events key_garbage_collector


2. What is the Version-Release number of the kernel:
5.3.xx


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It worked in f30 and stopped after upgrade to f31.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Just start the system and wait, it will freeze in 1-4 hours.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

It works with kernel-5.2.17-200.fc30.x86_64

6. Are you running any modules that not shipped with directly Fedora's kernel?:

Yes and no. I use nvidia.ko but did try without it with no difference.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

nov 19 10:53:55 bear.pdc.lkkm kernel: BUG: kernel NULL pointer dereference, address: 0000000000000014
nov 19 10:53:55 bear.pdc.lkkm kernel: #PF: supervisor read access in kernel mode
nov 19 10:53:55 bear.pdc.lkkm kernel: #PF: error_code(0x0000) - not-present page
nov 19 10:53:55 bear.pdc.lkkm kernel: PGD 0 P4D 0
nov 19 10:53:56 bear.pdc.lkkm kernel: Oops: 0000 [#1] SMP PTI
nov 19 10:53:56 bear.pdc.lkkm kernel: CPU: 14 PID: 216 Comm: kworker/14:1 Tainted: P           OE     5.3.11-300.fc31.x86_64 #1
nov 19 10:53:56 bear.pdc.lkkm kernel: Hardware name: ASUS All Series/X99-A, BIOS 3801 08/10/2017
nov 19 10:53:56 bear.pdc.lkkm kernel: Workqueue: events key_garbage_collector
nov 19 10:53:56 bear.pdc.lkkm kernel: RIP: 0010:keyring_gc_check_iterator+0x2c/0x40
nov 19 10:53:56 bear.pdc.lkkm kernel: Code: 44 00 00 48 83 e7 fc b8 01 00 00 00 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f
nov 19 10:53:56 bear.pdc.lkkm kernel: RSP: 0018:ffffbbc3016e7dd8 EFLAGS: 00010246
nov 19 10:53:56 bear.pdc.lkkm kernel: RAX: 0000000000000000 RBX: ffff9636b239c1c8 RCX: ffffbbc3016e7e20
nov 19 10:53:56 bear.pdc.lkkm kernel: RDX: 0000000000000000 RSI: ffffbbc3016e7e20 RDI: ffff963895da2800
nov 19 10:53:56 bear.pdc.lkkm kernel: RBP: ffffbbc3016e7e20 R08: ffff96387dd04a08 R09: 000000000000000f
nov 19 10:53:56 bear.pdc.lkkm kernel: R10: ffff96389ba3472c R11: 0000000000000018 R12: ffffffff91427f10
nov 19 10:53:56 bear.pdc.lkkm kernel: R13: ffff9636b239c210 R14: ffff963effffff41 R15: ffff9636b239c180
nov 19 10:53:56 bear.pdc.lkkm kernel: FS:  0000000000000000(0000) GS:ffff96389fb80000(0000) knlGS:0000000000000000
nov 19 10:53:56 bear.pdc.lkkm kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov 19 10:53:56 bear.pdc.lkkm kernel: CR2: 0000000000000014 CR3: 00000007ee026006 CR4: 00000000001626e0
nov 19 10:53:56 bear.pdc.lkkm kernel: Call Trace:
nov 19 10:53:56 bear.pdc.lkkm kernel:  assoc_array_subtree_iterate+0x57/0xd0
nov 19 10:53:56 bear.pdc.lkkm kernel:  keyring_gc+0x3d/0x80
nov 19 10:53:56 bear.pdc.lkkm kernel:  key_garbage_collector+0x363/0x3d0
nov 19 10:53:56 bear.pdc.lkkm kernel:  ? __schedule+0x2a7/0x680
nov 19 10:53:56 bear.pdc.lkkm kernel:  process_one_work+0x19d/0x340
nov 19 10:53:56 bear.pdc.lkkm kernel:  worker_thread+0x50/0x3b0
nov 19 10:53:56 bear.pdc.lkkm kernel:  kthread+0xfb/0x130
nov 19 10:53:56 bear.pdc.lkkm kernel:  ? process_one_work+0x340/0x340
nov 19 10:53:56 bear.pdc.lkkm kernel:  ? kthread_park+0x80/0x80
nov 19 10:53:56 bear.pdc.lkkm kernel:  ret_from_fork+0x35/0x40
nov 19 10:53:56 bear.pdc.lkkm kernel: Modules linked in: 8021q garp mrp stp llc nfsv4 dns_resolver nfs fscache vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) rp csec_gss_krb5 vfat fat nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nvidia(POE) snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel intel_rapl_msr intel_rapl_common snd_hda_codec joydev x86_pkg_temp_thermal intel_powerclamp snd_hda_core coretemp kvm_intel snd_hwdep snd_seq snd_seq_device snd_pcm kvm snd_timer drm_kms_helper drm ipmi_devintf ipmi_msghandler irqbypass crct10dif_pclmul crc32_pclmul snd raid1 ghash_clmulni_intel intel_cstate mei_me mei eee pc_wmi asus_wmi sparse_keymap rfkill video intel_wmi_thunderbolt wmi_bmof intel_uncore iTCO_wdt iTCO_vendor_support intel_rapl_perf i2c_i801 soundcore e1000e lpc_ich nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables uas usb_storage mxm_wmi crc32c_intel wmi fuse [last unloaded: vboxdrv]
nov 19 10:53:56 bear.pdc.lkkm kernel: CR2: 0000000000000014
nov 19 10:53:56 bear.pdc.lkkm kernel: ---[ end trace 86d8799f1807a8d4 ]---
nov 19 10:53:56 bear.pdc.lkkm kernel: RIP: 0010:keyring_gc_check_iterator+0x2c/0x40
nov 19 10:53:56 bear.pdc.lkkm kernel: Code: 44 00 00 48 83 e7 fc b8 01 00 00 00 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f
nov 19 10:53:56 bear.pdc.lkkm kernel: RSP: 0018:ffffbbc3016e7dd8 EFLAGS: 00010246
nov 19 10:53:56 bear.pdc.lkkm kernel: RAX: 0000000000000000 RBX: ffff9636b239c1c8 RCX: ffffbbc3016e7e20
nov 19 10:53:56 bear.pdc.lkkm kernel: RDX: 0000000000000000 RSI: ffffbbc3016e7e20 RDI: ffff963895da2800
nov 19 10:53:56 bear.pdc.lkkm kernel: RBP: ffffbbc3016e7e20 R08: ffff96387dd04a08 R09: 000000000000000f
nov 19 10:53:56 bear.pdc.lkkm kernel: R10: ffff96389ba3472c R11: 0000000000000018 R12: ffffffff91427f10
nov 19 10:53:56 bear.pdc.lkkm kernel: R13: ffff9636b239c210 R14: ffff963effffff41 R15: ffff9636b239c180
nov 19 10:53:56 bear.pdc.lkkm kernel: FS:  0000000000000000(0000) GS:ffff96389fb80000(0000) knlGS:0000000000000000
nov 19 10:53:56 bear.pdc.lkkm kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
nov 19 10:53:56 bear.pdc.lkkm kernel: CR2: 0000000000000014 CR3: 00000007ee026006 CR4: 00000000001626e0

Comment 1 Greg Oster 2019-11-26 14:56:44 UTC
We are seeing this issue on four different hosts.  Trace is about the same:

Nov 23 19:24:21 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000014
Nov 23 19:24:21 kernel: #PF: supervisor read access in kernel mode
Nov 23 19:24:21 kernel: #PF: error_code(0x0000) - not-present page
Nov 23 19:24:21 kernel: PGD 0 P4D 0 
Nov 23 19:24:21 kernel: Oops: 0000 [#1] SMP PTI
Nov 23 19:24:21 kernel: CPU: 41 PID: 27467 Comm: kworker/41:2 Tainted: G           OE     5.3.11-200.fc30.x86_64 #1
Nov 23 19:24:21 kernel: Hardware name: Dell Inc. PowerEdge R430/0HFG24, BIOS 1.6.2 01/08/2016
Nov 23 19:24:21 kernel: Workqueue: events key_garbage_collector
Nov 23 19:24:21 kernel: RIP: 0010:keyring_gc_check_iterator+0x2c/0x40
Nov 23 19:24:21 kernel: Code: 44 00 00 48 83 e7 fc b8 01 00 00 00 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85
 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f
Nov 23 19:24:21 kernel: RSP: 0018:ffffbb7d47717dd8 EFLAGS: 00010246
Nov 23 19:24:21 kernel: RAX: 0000000000000000 RBX: ffff9e0e363ff450 RCX: ffffbb7d47717e20
Nov 23 19:24:21 kernel: RDX: 0000000000000000 RSI: ffffbb7d47717e20 RDI: ffff9e1d7168cc00
Nov 23 19:24:21 kernel: RBP: ffffbb7d47717e20 R08: ffff9e0e32e1d208 R09: 8080808080808080
Nov 23 19:24:21 kernel: R10: ffff9e1d726790ec R11: 0000000000000018 R12: ffffffff97427f10
Nov 23 19:24:21 kernel: R13: ffff9e0e363ff4d0 R14: ffff9e1d7168cc00 R15: ffff9e0e363ff440
Nov 23 19:24:21 kernel: FS:  0000000000000000(0000) GS:ffff9e1e3fb00000(0000) knlGS:0000000000000000
Nov 23 19:24:21 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 19:24:21 kernel: CR2: 0000000000000014 CR3: 0000001ed2f2a001 CR4: 00000000001606e0
Nov 23 19:24:21 kernel: Call Trace:
Nov 23 19:24:21 kernel: assoc_array_subtree_iterate+0x57/0xd0
Nov 23 19:24:21 kernel: keyring_gc+0x3d/0x80
Nov 23 19:24:21 kernel: key_garbage_collector+0x363/0x3d0
Nov 23 19:24:21 kernel: ? __schedule+0x2a7/0x680
Nov 23 19:24:21 kernel: process_one_work+0x19d/0x340
Nov 23 19:24:21 kernel: worker_thread+0x50/0x3b0
Nov 23 19:24:21 kernel: kthread+0xfb/0x130
Nov 23 19:24:21 kernel: ? process_one_work+0x340/0x340
Nov 23 19:24:21 kernel: ? kthread_park+0x80/0x80
Nov 23 19:24:21 kernel: ret_from_fork+0x35/0x40
Nov 23 19:24:21 kernel: Modules linked in: xt_multiport vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) rpcsec_gss_krb5
 nfsv4 dns_resolver xt_CHECKSUM xt_MASQUERADE tun bridge stp llc nfsv3 nfs_acl nfs lockd grace fscache cfg80211 rfkill nf_conntrack_net
bios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtabl
e_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_
conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rp
crdma ib_isert iscsi_target_mod vfat ib_iser fat ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad iw_cxgb4 r
dma_cm iw_cm ib_cm iw_cxgb3 ib_uverbs ib_core intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp k
vm_intel kvm irqbypass intel_cstate pktcdvd intel_uncore ipmi_ssif
Nov 23 19:24:21 kernel: intel_rapl_perf iTCO_wdt iTCO_vendor_support dcdbas mei_me ipmi_si ipmi_devintf mei lpc_ich ipmi_msgh
andler acpi_power_meter auth_rpcgss ip_tables mgag200 drm_vram_helper i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul ttm crc
32c_intel drm ghash_clmulni_intel tg3 megaraid_sas wmi dm_multipath sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcx
gbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi fuse
Nov 23 19:24:21 kernel: CR2: 0000000000000014
Nov 23 19:24:21 kernel: ---[ end trace f0ff9fb68506d008 ]---
Nov 23 19:24:21 kernel: RIP: 0010:keyring_gc_check_iterator+0x2c/0x40
Nov 23 19:24:21 kernel: Code: 44 00 00 48 83 e7 fc b8 01 00 00 00 f6 87 80 00 00 00 21 75 19 48 8b 57 58 48 39 16 7c 05 48 85
 d2 7f 0b 48 8b 87 a0 00 00 00 <0f> b6 40 14 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f
Nov 23 19:24:21 kernel: RSP: 0018:ffffbb7d47717dd8 EFLAGS: 00010246
Nov 23 19:24:21 kernel: RAX: 0000000000000000 RBX: ffff9e0e363ff450 RCX: ffffbb7d47717e20
Nov 23 19:24:21 kernel: RDX: 0000000000000000 RSI: ffffbb7d47717e20 RDI: ffff9e1d7168cc00
Nov 23 19:24:21 kernel: RBP: ffffbb7d47717e20 R08: ffff9e0e32e1d208 R09: 8080808080808080
Nov 23 19:24:21 kernel: R10: ffff9e1d726790ec R11: 0000000000000018 R12: ffffffff97427f10
Nov 23 19:24:21 kernel: R13: ffff9e0e363ff4d0 R14: ffff9e1d7168cc00 R15: ffff9e0e363ff440
Nov 23 19:24:21 kernel: FS:  0000000000000000(0000) GS:ffff9e1e3fb00000(0000) knlGS:0000000000000000
Nov 23 19:24:21 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 23 19:24:21 kernel: CR2: 0000000000000014 CR3: 0000001ed2f2a001 CR4: 00000000001606e0


We started seeing this on an upgrade from Fedora 29 to Fedora 30, when we switched to the 5.3 kernel.  We're currently testing a 5.0.9-301.fc30.x86_64 kernel in the hopes
that works as a temporary stop-gap.

Time to hit the bug has varied between as little as 25 minutes to over 2 weeks.  The systems are multi-user (10-15 users) with NFSv4 home directories.

Once the above NULL reference is hit the system starts to degenerate.  Many processes go into IO wait, and eventually /usr/lib/systemd/systemd (PID 1) goes into IO Wait as well.  A reboot is required to recover (where the "reboot" doesn't actually do anything, and 'echo 1 > /proc/sys/kernel/sysrq ; echo b > /proc/sysrq-trigger' is needed to get things to reset).

If I had to bet money on the bug, my bet would be on security/keys/keyring.c keyring_gc(), where the comment above the function says:

 * Not called with any locks held.  The keyring's key struct will not be
 * deallocated under us as only our caller may deallocate it.

I'm betting that is no longer the case, since about the time that key_move() was introduced.

I'd be happy to provide any other info as I'm able to...

Thanks for looking at this!

Comment 2 Dmitrij S. Kryzhevich 2019-11-27 01:14:35 UTC
> The systems are multi-user (10-15 users) with NFSv4 home directories.

Yes, we have nfs mounted homes too.

Some google search shows https://lkml.org/lkml/2019/9/20/822 also.

Comment 3 Ben Cotton 2019-11-27 14:23:48 UTC
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 4 Dmitrij S. Kryzhevich 2019-12-04 01:31:50 UTC
kernel-5.4.0-2.fc32.x86_64 tested, still bad.

Comment 5 James 2019-12-15 21:12:04 UTC
Seeing this on kernel-5.3.15-300.fc31.x86_64 (and earlier), when hammering NFS4 sec=krb5i, sync doing a kernel compile.

Comment 6 Justin M. Forbes 2020-03-03 16:31:20 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 31 kernel bugs.

Fedora 31 has now been rebased to 5.5.7-200.fc31.  Please test this kernel update (or newer) and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 32, and are still experiencing this issue, please change the version to Fedora 32.

If you experience different issues, please open a new bug report for those.

Comment 7 Dmitrij S. Kryzhevich 2020-03-09 05:11:59 UTC
Latest kernel works fine for me.
If anyone has this issue please reopen.


Note You need to log in before you can comment on or make changes to this bug.