Bug 1851855

Summary: Kernel NULL pointer dereference in amdgpu on Radeon VII with kernel 5.7.*
Product: [Fedora] Fedora Reporter: Ivan Mironov <mironov.ivan>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED EOL QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 32 CC: acaringi, airlied, bskeggs, fredrik, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, steved
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: ---
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-25 17:44:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Ivan Mironov 2020-06-29 09:15:12 UTC
1. Please describe the problem:

[79260.489120] BUG: kernel NULL pointer dereference, address: 0000000000000128
[79260.489123] #PF: supervisor write access in kernel mode
[79260.489123] #PF: error_code(0x0002) - not-present page
[79260.489124] PGD 0 P4D 0 
[79260.489125] Oops: 0002 [#1] SMP NOPTI
[79260.489127] CPU: 0 PID: 17315 Comm: modprobe Tainted: G            E     5.7.5-200.fc32.x86_64 #1
[79260.489127] Hardware name: System manufacturer System Product Name/PRIME X570-P, BIOS 2204 06/17/2020
[79260.489173] RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
[79260.489174] Code: 53 be 01 00 00 00 48 8b 9f 70 bb 00 00 48 8b bf f8 f3 ff ff e8 df 8f 26 c8 85 c0 74 0d 48 c7 c7 48 bf eb c0 5b e9 3e 9e da ff <c6> 83 28 01 00 00 01 5b c3 48 c7 c7 48 bf eb c0 e9 29 9e da ff 66
[79260.489175] RSP: 0018:ffffb34445117c48 EFLAGS: 00010246
[79260.489176] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[79260.489177] RDX: ffff9d57092b2680 RSI: 000000000001629a RDI: ffff9d57981cb818
[79260.489177] RBP: ffff9d5789627058 R08: 00000000ffffffff R09: 0000000000000000
[79260.489177] R10: 0000000000000002 R11: 00000000000000f0 R12: 0000000000000000
[79260.489178] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[79260.489179] FS:  00007fe04f0b5740(0000) GS:ffff9d57aea00000(0000) knlGS:0000000000000000
[79260.489179] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[79260.489180] CR2: 0000000000000128 CR3: 000000123e312000 CR4: 0000000000340ef0
[79260.489180] Call Trace:
[79260.489186]  i2c_smbus_xfer+0x3d/0xf0
[79260.489187]  i2c_default_probe+0xf3/0x130
[79260.489189]  i2c_detect.isra.0+0xfe/0x2b0
[79260.489191]  ? kfree+0xa3/0x200
[79260.489193]  ? kobject_uevent_env+0x11f/0x6a0
[79260.489193]  ? i2c_detect.isra.0+0x2b0/0x2b0
[79260.489194]  __process_new_driver+0x1b/0x20
[79260.489196]  bus_for_each_dev+0x64/0x90
[79260.489197]  ? 0xffffffffc13ff000
[79260.489198]  i2c_register_driver+0x73/0xc0
[79260.489200]  do_one_initcall+0x46/0x200
[79260.489202]  ? _cond_resched+0x16/0x40
[79260.489203]  ? kmem_cache_alloc_trace+0x167/0x220
[79260.489205]  ? do_init_module+0x23/0x260
[79260.489206]  do_init_module+0x5c/0x260
[79260.489207]  __do_sys_init_module+0x14f/0x170
[79260.489208]  do_syscall_64+0x5b/0xf0
[79260.489209]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[79260.489210] RIP: 0033:0x7fe04f1e540e
[79260.489211] Code: 48 8b 0d 8d 0a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5a 0a 0c 00 f7 d8 64 89 01 48
[79260.489212] RSP: 002b:00007ffdb0489df8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[79260.489212] RAX: ffffffffffffffda RBX: 000055b7bf883ae0 RCX: 00007fe04f1e540e
[79260.489213] RDX: 000055b7bd962288 RSI: 000000000000385e RDI: 000055b7bf892810
[79260.489213] RBP: 000055b7bf892810 R08: 0000000000000000 R09: 000055b7bf892860
[79260.489214] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[79260.489214] R13: 000055b7bd962288 R14: 000055b7bf883c80 R15: 000055b7bf883c60
[79260.489216] Modules linked in: jc42(+) uinput rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun bridge nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep rpcrdma ib_isert iscsi_target_mod ib_iser ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad iw_cxgb4 ib_uverbs rdma_cm iw_cm ib_cm ib_core snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi uvcvideo snd_hda_intel snd_intel_dspcfg videobuf2_vmalloc snd_hda_codec videobuf2_memops videobuf2_v4l2 btusb snd_usb_audio raid1 btrtl videobuf2_common snd_hda_core btbcm amd64_edac_mod
[79260.489231]  snd_usbmidi_lib btintel edac_mce_amd videodev snd_hwdep snd_seq bluetooth kvm_amd snd_rawmidi eeepc_wmi mc asus_wmi xpad kvm snd_seq_device joydev sparse_keymap snd_pcm ecdh_generic ff_memless rfkill irqbypass ecc video snd_timer wmi_bmof snd pcspkr sp5100_tco soundcore i2c_piix4 k10temp acpi_cpufreq ip_tables isofs squashfs dm_multipath amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper 8021q garp mrp stp llc crct10dif_pclmul crc32_pclmul drm ghash_clmulni_intel ccp hpsa r8169 scsi_transport_sas wmi pinctrl_amd uas usb_storage btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi loop fuse scsi_transport_iscsi [last unloaded: minix]
[79260.489248] CR2: 0000000000000128
[79260.489249] ---[ end trace ebc75789e03eebf1 ]---
[79260.489282] RIP: 0010:lock_bus+0x42/0x60 [amdgpu]
[79260.489283] Code: 53 be 01 00 00 00 48 8b 9f 70 bb 00 00 48 8b bf f8 f3 ff ff e8 df 8f 26 c8 85 c0 74 0d 48 c7 c7 48 bf eb c0 5b e9 3e 9e da ff <c6> 83 28 01 00 00 01 5b c3 48 c7 c7 48 bf eb c0 e9 29 9e da ff 66
[79260.489283] RSP: 0018:ffffb34445117c48 EFLAGS: 00010246
[79260.489284] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[79260.489284] RDX: ffff9d57092b2680 RSI: 000000000001629a RDI: ffff9d57981cb818
[79260.489285] RBP: ffff9d5789627058 R08: 00000000ffffffff R09: 0000000000000000
[79260.489285] R10: 0000000000000002 R11: 00000000000000f0 R12: 0000000000000000
[79260.489286] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[79260.489286] FS:  00007fe04f0b5740(0000) GS:ffff9d57aea00000(0000) knlGS:0000000000000000
[79260.489287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[79260.489287] CR2: 0000000000000128 CR3: 000000123e312000 CR4: 0000000000340ef0

This happens when an I2C device driver scans for devices on the I2C bus registered by amdgpu (the oops is in `lock_bus` from the amdgpu module). In my case it is triggered by `modprobe jc42`.
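
The `Code:` line of the oops above can be decoded by hand to see what the faulting instruction was doing. The following is a minimal sketch, not a general disassembler: it assumes the marked instruction is the `c6 /0` byte store with a 32-bit displacement and no SIB byte, which is the form that appears in this trace. The names `decode_faulting_store` and `REGS` are hypothetical helpers made up for this illustration.

```python
import re
import struct

# Register names for the 3-bit ModRM r/m field (no REX prefix assumed).
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_faulting_store(code_line):
    """Decode `movb $imm8, disp32(%reg)` at the <..> marker of an oops Code: line.

    Returns (register, displacement, immediate). Raises ValueError for any
    other instruction form, since only this one encoding is handled.
    """
    m = re.search(r"<([0-9a-f]{2})>((?: [0-9a-f]{2})+)", code_line)
    if m is None:
        raise ValueError("no <..> marker found in Code: line")
    raw = bytes.fromhex(m.group(1) + m.group(2).replace(" ", ""))
    if len(raw) < 7:
        raise ValueError("not enough bytes after the marker")
    opcode, modrm = raw[0], raw[1]
    mod, reg_field, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    # c6 /0 with mod=10 and rm != 100 is: opcode, ModRM, disp32, imm8.
    if opcode != 0xC6 or mod != 0b10 or reg_field != 0 or rm == 0b100:
        raise ValueError("unsupported instruction form")
    disp = struct.unpack_from("<i", raw, 2)[0]  # little-endian disp32
    imm = raw[6]
    return REGS[rm], disp, imm


# Tail of the Code: line from the trace above.
code = "e9 3e 9e da ff <c6> 83 28 01 00 00 01 5b c3 48 c7 c7 48 bf eb c0"
reg, disp, imm = decode_faulting_store(code)
print(f"movb ${imm:#x}, {disp:#x}(%{reg})")  # prints: movb $0x1, 0x128(%rbx)
```

The decoded store writes one byte at offset 0x128 from RBX. The register dump shows RBX as 0, so the write lands at address 0x128, which matches both CR2 and the address in the first line of the oops.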

Here is the fix: https://lkml.org/lkml/2020/6/25/624


2. What is the Version-Release number of the kernel:

5.7.5-200.fc32.x86_64 (from Test Day:2020-06-22 Kernel 5.7 Test Week)


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This first appeared in the 5.7 mainline kernel.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

1. Boot kernel 5.7.* on system with Radeon VII.
2. `modprobe jc42`.
3. See `dmesg`.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

I have not tried the Rawhide kernel, but I built and ran torvalds/linux master, and the problem still occurs there.


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Fredrik Chabot 2020-07-09 19:18:02 UTC
The GUI just hangs the moment the AMDGPU module is loaded on 5.7.7-200.fc32.x86_64.

jul 09 20:48:34 localhost.localdomain boltd[840]: [0008ea78-ae9b-Core X                     ] authorize: finished: ok (status: authorized, flags: 2)
jul 09 20:48:34 localhost.localdomain boltd[840]: [0008ea78-ae9b-Core X                     ] auto-auth: authorization successful
jul 09 20:48:35 localhost.localdomain boltd[840]: [0008ea78-ae9b-Core X                     ] udev: device changed: authorized -> authorized
jul 09 20:48:35 localhost.localdomain kernel: [drm] amdgpu kernel modesetting enabled.
jul 09 20:48:35 localhost.localdomain kernel: CRAT table not found
jul 09 20:48:35 localhost.localdomain kernel: Virtual CRAT table created for CPU
jul 09 20:48:35 localhost.localdomain kernel: Parsing CRAT table with 1 nodes
jul 09 20:48:35 localhost.localdomain kernel: Creating topology SYSFS entries
jul 09 20:48:35 localhost.localdomain kernel: Topology: Add CPU node
jul 09 20:48:35 localhost.localdomain kernel: Finished initializing topology
jul 09 20:48:35 localhost.localdomain kernel: amdgpu 0000:08:00.0: enabling device (0000 -> 0003)
jul 09 20:48:35 localhost.localdomain kernel: [drm] initializing kernel modesetting (NAVI14 0x1002:0x7340 0x1462:0x3822 0xC5).
jul 09 20:48:35 localhost.localdomain kernel: [drm] register mmio base: 0x80000000
jul 09 20:48:35 localhost.localdomain kernel: [drm] register mmio size: 524288
jul 09 20:48:35 localhost.localdomain kernel: [drm] PCIE atomic ops is not supported
jul 09 20:48:36 localhost.localdomain kernel: hrtimer: interrupt took 250582589 ns
jul 09 20:48:36 localhost.localdomain kernel: [drm:amdgpu_discovery_init [amdgpu]] *ERROR* invalid ip discovery binary signature
jul 09 20:48:36 localhost.localdomain kernel: amdgpu 0000:08:00.0: amdgpu_discovery_init failed
jul 09 20:48:36 localhost.localdomain kernel: amdgpu 0000:08:00.0: Fatal error during GPU init
jul 09 20:48:36 localhost.localdomain kernel: [drm] amdgpu: finishing device.
jul 09 20:48:36 localhost.localdomain kernel: BUG: kernel NULL pointer dereference, address: 00000000000000b0
jul 09 20:48:36 localhost.localdomain kernel: #PF: supervisor read access in kernel mode
jul 09 20:48:36 localhost.localdomain kernel: #PF: error_code(0x0000) - not-present page
jul 09 20:48:36 localhost.localdomain kernel: PGD 0 P4D 0 
jul 09 20:48:36 localhost.localdomain kernel: Oops: 0000 [#1] SMP NOPTI
jul 09 20:48:36 localhost.localdomain kernel: CPU: 7 PID: 3472 Comm: systemd-udevd Not tainted 5.7.7-200.fc32.x86_64 #1
jul 09 20:48:36 localhost.localdomain kernel: Hardware name: Notebook                         N150CU                          /N150CU                          , BIOS 1.>
jul 09 20:48:36 localhost.localdomain kernel: RIP: 0010:drm_plane_register_all+0x2d/0x60 [drm]
jul 09 20:48:36 localhost.localdomain kernel: Code: 00 00 55 48 8d af d0 02 00 00 53 48 8b 87 d0 02 00 00 48 39 c5 74 32 48 8d 58 f8 eb 0d 48 8b 43 08 48 8d 58 f8 48 39>
jul 09 20:48:36 localhost.localdomain kernel: RSP: 0018:ffffabe58338bbc8 EFLAGS: 00010282
jul 09 20:48:36 localhost.localdomain kernel: RAX: 0000000000000000 RBX: fffffffffffffff8 RCX: 0000000000001a73
jul 09 20:48:36 localhost.localdomain kernel: RDX: ffffffffc19b0120 RSI: fbf9c06674c9dbb4 RDI: ffff9cb17ee99800
jul 09 20:48:36 localhost.localdomain kernel: RBP: ffff9cb17ee99ad0 R08: 0000000000000000 R09: ffff9cb19d066c10
jul 09 20:48:36 localhost.localdomain kernel: R10: ffff9cb160f75f70 R11: 0000000000000000 R12: 0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: R13: 000000000000001a R14: ffff9cb160f75f70 R15: 0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: FS:  00007f3c03d0cb80(0000) GS:ffff9cb1a07c0000(0000) knlGS:0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jul 09 20:48:36 localhost.localdomain kernel: CR2: 00000000000000b0 CR3: 000000081dab0002 CR4: 00000000003606e0
jul 09 20:48:36 localhost.localdomain kernel: Call Trace:
jul 09 20:48:36 localhost.localdomain kernel:  drm_modeset_register_all+0x10/0x70 [drm]
jul 09 20:48:36 localhost.localdomain kernel:  drm_dev_register+0x15d/0x180 [drm]
jul 09 20:48:36 localhost.localdomain kernel:  amdgpu_pci_probe+0x100/0x180 [amdgpu]
jul 09 20:48:36 localhost.localdomain kernel:  local_pci_probe+0x42/0x80
jul 09 20:48:36 localhost.localdomain kernel:  ? _cond_resched+0x16/0x40
jul 09 20:48:36 localhost.localdomain kernel:  pci_device_probe+0xd9/0x190
jul 09 20:48:36 localhost.localdomain kernel:  really_probe+0x167/0x410
jul 09 20:48:36 localhost.localdomain kernel:  driver_probe_device+0xb6/0x100
jul 09 20:48:36 localhost.localdomain kernel:  device_driver_attach+0xa1/0xb0
jul 09 20:48:36 localhost.localdomain kernel:  __driver_attach+0x8a/0x150
jul 09 20:48:36 localhost.localdomain kernel:  ? device_driver_attach+0xb0/0xb0
jul 09 20:48:36 localhost.localdomain kernel:  ? device_driver_attach+0xb0/0xb0
jul 09 20:48:36 localhost.localdomain kernel:  bus_for_each_dev+0x64/0x90
jul 09 20:48:36 localhost.localdomain kernel:  bus_add_driver+0x12b/0x1e0
jul 09 20:48:36 localhost.localdomain kernel:  driver_register+0x8b/0xe0
jul 09 20:48:36 localhost.localdomain kernel:  ? 0xffffffffc1ab2000
jul 09 20:48:36 localhost.localdomain kernel:  do_one_initcall+0x46/0x200
jul 09 20:48:36 localhost.localdomain kernel:  ? _cond_resched+0x16/0x40
jul 09 20:48:36 localhost.localdomain kernel:  ? kmem_cache_alloc_trace+0x167/0x220
jul 09 20:48:36 localhost.localdomain kernel:  ? do_init_module+0x23/0x260
jul 09 20:48:36 localhost.localdomain kernel:  do_init_module+0x5c/0x260
jul 09 20:48:36 localhost.localdomain kernel:  __do_sys_init_module+0x14f/0x170
jul 09 20:48:36 localhost.localdomain kernel:  do_syscall_64+0x5b/0xf0
jul 09 20:48:36 localhost.localdomain kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
jul 09 20:48:36 localhost.localdomain kernel: RIP: 0033:0x7f3c04e5e40e
jul 09 20:48:36 localhost.localdomain kernel: Code: 48 8b 0d 8d 0a 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00>
jul 09 20:48:36 localhost.localdomain kernel: RSP: 002b:00007ffd39eb6948 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
jul 09 20:48:36 localhost.localdomain kernel: RAX: ffffffffffffffda RBX: 0000564ee6754490 RCX: 00007f3c04e5e40e
jul 09 20:48:36 localhost.localdomain kernel: RDX: 00007f3c04ab895d RSI: 00000000009e4006 RDI: 00007f3c01105010
jul 09 20:48:36 localhost.localdomain kernel: RBP: 00007f3c01105010 R08: 0000564ee66c40c0 R09: 00000000009e4010
jul 09 20:48:36 localhost.localdomain kernel: R10: 0000000000000006 R11: 0000000000000246 R12: 0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: R13: 00007f3c04ab895d R14: 0000564ee6750590 R15: 0000564ee66280a0
jul 09 20:48:36 localhost.localdomain kernel: Modules linked in: amdgpu(+) amd_iommu_v2 gpu_sched ttm uinput rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf>
jul 09 20:48:36 localhost.localdomain kernel:  videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc ecdh_generic rfkill ecc snd_hda_codec_hdmi>
jul 09 20:48:36 localhost.localdomain kernel: CR2: 00000000000000b0
jul 09 20:48:36 localhost.localdomain kernel: ---[ end trace ed01d9c9e912db76 ]---
jul 09 20:48:36 localhost.localdomain kernel: RIP: 0010:drm_plane_register_all+0x2d/0x60 [drm]
jul 09 20:48:36 localhost.localdomain kernel: Code: 00 00 55 48 8d af d0 02 00 00 53 48 8b 87 d0 02 00 00 48 39 c5 74 32 48 8d 58 f8 eb 0d 48 8b 43 08 48 8d 58 f8 48 39>
jul 09 20:48:36 localhost.localdomain kernel: RSP: 0018:ffffabe58338bbc8 EFLAGS: 00010282
jul 09 20:48:36 localhost.localdomain kernel: RAX: 0000000000000000 RBX: fffffffffffffff8 RCX: 0000000000001a73
jul 09 20:48:36 localhost.localdomain kernel: RDX: ffffffffc19b0120 RSI: fbf9c06674c9dbb4 RDI: ffff9cb17ee99800
jul 09 20:48:36 localhost.localdomain kernel: RBP: ffff9cb17ee99ad0 R08: 0000000000000000 R09: ffff9cb19d066c10
jul 09 20:48:36 localhost.localdomain kernel: R10: ffff9cb160f75f70 R11: 0000000000000000 R12: 0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: R13: 000000000000001a R14: ffff9cb160f75f70 R15: 0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: FS:  00007f3c03d0cb80(0000) GS:ffff9cb1a07c0000(0000) knlGS:0000000000000000
jul 09 20:48:36 localhost.localdomain kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
jul 09 20:48:36 localhost.localdomain kernel: CR2: 00000000000000b0 CR3: 000000081dab0002 CR4: 00000000003606e0
jul 09 20:48:36 localhost.localdomain systemd-udevd[613]: Worker [3472] terminated by signal 9 (KILL)
jul 09 20:48:36 localhost.localdomain systemd-udevd[613]: 0000:08:00.0: Worker [3472] failed
jul 09 20:48:36 localhost.localdomain gnome-shell[1894]: Failed to hotplug secondary gpu '/dev/dri/renderD129': GDBus.Error:System.Error.ENODEV: No


It used to kind of work on 5.6.18-300.fc32.

Comment 2 Fedora Program Management 2021-04-29 17:04:13 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 3 Ben Cotton 2021-05-25 17:44:06 UTC
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.