Bug 2012882

Summary: WARNING: CPU: 1 PID: 407 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2d2/0x300 [ttm] [amdgpu]
Product: [Fedora] Fedora
Reporter: Dominik 'Rathann' Mierzejewski <dominik>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 36
CC: acaringi, adscvr, airlied, alciregi, auxsvr, bskeggs, hdegoede, jarodwilson, jeremy, jglisse, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
URL: https://retrace.fedoraproject.org/faf/reports/261251/
Whiteboard:
Fixed In Version: kernel-6.0.5-200.fc36
Last Closed: 2022-11-07 13:03:06 UTC
Type: Bug
Attachments:
Description Flags
kernel-5.14.9 dmesg (journalctl -b0 --no-hostname --output=short-monotonic -k) none

Description Dominik 'Rathann' Mierzejewski 2021-10-11 14:17:07 UTC
Created attachment 1831891
kernel-5.14.9 dmesg (journalctl -b0 --no-hostname --output=short-monotonic -k)

1. Please describe the problem:
Since upgrading to 5.14.9-200.fc34, I'm getting this WARNING on every boot.

2. What is the Version-Release number of the kernel:
5.14.9-200.fc34

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Yes. No WARNING with 5.13.x and earlier kernels. I haven't tried earlier 5.14.x koji kernels yet.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Yes, it happens on every boot.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Unknown, I haven't tried yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
Attached.

Additional info:
This looks similar to bug 1985880, but the stack trace differs after task_work_run:

WARNING: CPU: 1 PID: 407 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2d2/0x300 [ttm]
Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip_set_hash_net ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables rfkill nfnetlink ip6table_filter ip6_tables iptable_filter drivetemp f71882fg sunrpc intel_rapl_msr intel_rapl_common vfat fat x86_pkg_temp_thermal pktcdvd intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio mei_hdcp snd_usb_audio snd_hda_intel at24 snd_intel_dspcfg iTCO_wdt intel_pmc_bxt snd_intel_sdw_acpi iTCO_vendor_support snd_hda_codec snd_usbmidi_lib irqbypass rapl intel_cstate snd_hda_core snd_hwdep snd_rawmidi snd_seq intel_uncore uvcvideo snd_seq_device snd_pcm videobuf2_vmalloc videobuf2_memops mxm_wmi videobuf2_v4l2 videobuf2_common snd_timer mei_me videodev snd
 mc joydev i2c_i801 mei soundcore i2c_smbus lpc_ich binfmt_misc zram ip_tables hid_logitech_hidpp hid_jabra hid_logitech_dj r8152 mii amdgpu i915 iommu_v2 gpu_sched drm_ttm_helper i2c_algo_bit ttm crct10dif_pclmul crc32_pclmul crc32c_intel drm_kms_helper ghash_clmulni_intel uas e1000e usb_storage cec drm wmi video i2c_dev fuse
CPU: 1 PID: 407 Comm: plymouthd Tainted: G        W         5.14.9-200.fc34.x86_64 #1
Hardware name: MSI MS-7751/Z77A-GD65 (MS-7751), BIOS V10.11 10/09/2013
RIP: 0010:ttm_bo_release+0x2d2/0x300 [ttm]
Code: 8d b6 b8 fe ff ff e8 dd ea dd ff 49 8b 76 08 48 89 ef e8 91 21 00 00 49 8b 7e 98 e9 6f fd ff ff e8 93 82 19 cc e9 a4 fd ff ff <0f> 0b e9 4f fd ff ff e8 c2 80 19 cc e9 f6 fe ff ff be 03 00 00 00
RSP: 0018:ffffa95540453d10 EFLAGS: 00010202
RAX: 0000000000000001 RBX: ffffa95540453d58 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8a09cb4b79b8
RBP: ffff8a09cb545288 R08: ffff8a09cb4b79b8 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a09c9682000
R13: ffff8a09cb4b7858 R14: ffff8a09cb4b79b8 R15: ffff8a09c05484b8
FS:  0000000000000000(0000) GS:ffff8a0cdf680000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4840aeb000 CR3: 00000003a7c10005 CR4: 00000000001706e0
Call Trace:
 amdgpu_bo_unref+0x1a/0x30 [amdgpu]
 amdgpu_gem_object_free+0x20/0x30 [amdgpu]
 drm_gem_object_release_handle+0x6b/0x80 [drm]
 ? drm_gem_object_handle_put_unlocked+0xd0/0xd0 [drm]
 idr_for_each+0x4e/0xc0
 drm_gem_release+0x1c/0x30 [drm]
 drm_file_free.part.0+0x1e3/0x250 [drm]
 drm_release+0x65/0x110 [drm]
 __fput+0x94/0x240
 task_work_run+0x65/0xa0
 do_exit+0x33d/0xa90
 ? __audit_syscall_entry+0x100/0x130
 do_group_exit+0x33/0xa0
 __x64_sys_exit_group+0x14/0x20
 do_syscall_64+0x3b/0x90
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4841954021
Code: Unable to access opcode bytes at RIP 0x7f4841953ff7.
RSP: 002b:00007ffd554cc208 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f4841a4c470 RCX: 00007f4841954021
RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffffffffff88 R09: 0000000000000001
R10: 00007f484189a468 R11: 0000000000000246 R12: 00007f4841a4c470
R13: 0000000000000001 R14: 00007f4841a4c948 R15: 0000000000000000
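
To compare a trace like the one above against the trace in another report, the call chain can be reduced to a plain list of symbols and diffed. This is a minimal Python sketch, assuming the oops has been saved as plain text (with or without dmesg timestamps); `call_trace_symbols` is a hypothetical helper name, not an existing tool.

```python
import re

# Optional "[   10.559158] " style timestamp at the start of a log line.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")

# A stack frame such as "amdgpu_bo_unref+0x1a/0x30 [amdgpu]".
# A leading "? " marks an entry the unwinder considers unreliable.
FRAME = re.compile(
    r"^(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?$"
)


def call_trace_symbols(log_text, include_unreliable=False):
    """Return the function names of the stack frames found in log_text."""
    symbols = []
    for line in log_text.splitlines():
        line = TIMESTAMP.sub("", line).strip()
        match = FRAME.match(line)
        if match is None:
            continue  # register dumps, RIP:, WARNING:, <TASK>, etc.
        if match.group(1) and not include_unreliable:
            continue  # skip "? symbol" entries by default
        symbols.append(match.group(2))
    return symbols
```

Running it on two saved traces and comparing the resulting lists shows at which frame the call chains diverge.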

Comment 1 Dominik 'Rathann' Mierzejewski 2021-10-21 21:40:31 UTC
Still reproducible on F35 kernel 5.14.14-300.fc35

Comment 2 Dominik 'Rathann' Mierzejewski 2021-10-27 11:59:43 UTC
Note that the kernel is "tainted" only because I'm getting hit by bug 1985090 on every boot as well.

Comment 3 Dominik 'Rathann' Mierzejewski 2021-12-02 15:47:23 UTC
5.15.4 is still showing the issue, but it is no longer reproducible with 5.15.6 (I haven't tested 5.15.5).

Comment 4 Dominik 'Rathann' Mierzejewski 2022-05-31 09:24:02 UTC
Still reproducible on F36 with kernel 5.17.11-300.fc36.x86_64.

Comment 5 Dominik 'Rathann' Mierzejewski 2022-06-13 09:11:11 UTC
I think I forgot to mention that this is on a TAHITI Pro GPU, which is driven by the radeon module by default, but I enabled si_support in the amdgpu module instead:
$ cat /etc/modprobe.d/amdgpu.conf 
blacklist radeon
options amdgpu si_support=1
options amdgpu cik_support=1
options amdgpu hw_i2c=1

Still reproducible with 5.17.14-300.fc36.x86_64:

[   10.559066] ------------[ cut here ]------------
[   10.559069] WARNING: CPU: 2 PID: 412 at drivers/gpu/drm/ttm/ttm_bo.c:411 ttm_bo_release+0x34d/0x370 [ttm]
[   10.559078] Modules linked in: nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip_set_hash_net ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill ip_set nf_tables nfnetlink ip6table_filter iptable_filter drivetemp f71882fg sunrpc binfmt_misc vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp at24 kvm_intel mei_hdcp iTCO_wdt intel_pmc_bxt mei_pxp iTCO_vendor_support pktcdvd kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi irqbypass rapl snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi snd_hda_codec snd_usbmidi_lib intel_cstate snd_hda_core snd_rawmidi snd_hwdep uvcvideo intel_uncore videobuf2_vmalloc snd_seq videobuf2_memops mxm_wmi videobuf2_v4l2 snd_seq_device videobuf2_common snd_pcm joydev videodev snd_timer mei_me snd mc i2c_i801 soundcore lpc_ich
[   10.559112]  i2c_smbus mei zram hid_logitech_hidpp amdgpu hid_logitech_dj hid_jabra i915 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel e1000e uas usb_storage iommu_v2 gpu_sched drm_ttm_helper ttm wmi video r8152 mii ip6_tables ip_tables ipmi_devintf ipmi_msghandler fuse i2c_dev
[   10.559126] CPU: 2 PID: 412 Comm: plymouthd Not tainted 5.17.14-300.fc36.x86_64 #1
[   10.559128] Hardware name: MSI MS-7751/Z77A-GD65 (MS-7751), BIOS V10.11 10/09/2013
[   10.559129] RIP: 0010:ttm_bo_release+0x34d/0x370 [ttm]
[   10.559134] Code: 00 e8 97 47 49 d3 48 8b 43 e8 eb a8 be 03 00 00 00 e8 e7 eb 21 d3 e9 96 fd ff ff e8 2d 26 49 d3 e9 8c fd ff ff 48 89 e8 eb 8a <0f> 0b e9 d6 fc ff ff e8 17 26 49 d3 e9 dd fe ff ff be 03 00 00 00
[   10.559136] RSP: 0018:ffffb735c18e3cf8 EFLAGS: 00010202
[   10.559137] RAX: 0000000000000001 RBX: ffff9beb6dc211b8 RCX: 0000000000000000
[   10.559139] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9beb6dc211b8
[   10.559139] RBP: ffff9beb6ed25280 R08: 0000000000000000 R09: 000000008040003f
[   10.559140] R10: ffff9beb6d08cfc0 R11: 0000000000000000 R12: ffff9beb6dc21058
[   10.559141] R13: 0000000000000001 R14: ffff9be840d0de40 R15: ffff9be843eec700
[   10.559142] FS:  0000000000000000(0000) GS:ffff9beb5c500000(0000) knlGS:0000000000000000
[   10.559143] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.559145] CR2: 00007ff6c4f0f000 CR3: 00000003f6e10005 CR4: 00000000001706e0
[   10.559146] Call Trace:
[   10.559148]  <TASK>
[   10.559149]  ? drm_vma_node_revoke+0x63/0x70
[   10.559154]  ? kfree+0x1eb/0x220
[   10.559158]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[   10.559318]  amdgpu_gem_object_free+0x20/0x30 [amdgpu]
[   10.559458]  drm_gem_object_release_handle+0x69/0x80
[   10.559463]  ? drm_gem_object_handle_put_unlocked+0xe0/0xe0
[   10.559465]  idr_for_each+0x4e/0xb0
[   10.559468]  drm_gem_release+0x1c/0x30
[   10.559470]  drm_file_free.part.0+0x1e1/0x250
[   10.559473]  drm_release+0x65/0x110
[   10.559475]  __fput+0x91/0x250
[   10.559479]  task_work_run+0x5c/0x90
[   10.559483]  do_exit+0x31d/0xad0
[   10.559486]  ? __audit_syscall_entry+0xec/0x130
[   10.559490]  do_group_exit+0x2d/0x90
[   10.559491]  __x64_sys_exit_group+0x14/0x20
[   10.559493]  do_syscall_64+0x3a/0x80
[   10.559496]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   10.559500] RIP: 0033:0x7ff6c648a711
[   10.559516] Code: Unable to access opcode bytes at RIP 0x7ff6c648a6e7.
[   10.559517] RSP: 002b:00007ffc043845d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[   10.559518] RAX: ffffffffffffffda RBX: 00007ff6c65a09e0 RCX: 00007ff6c648a711
[   10.559520] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 0000000000000000
[   10.559520] RBP: 0000000000000000 R08: ffffffffffffff80 R09: 00007ff6c65abb20
[   10.559521] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ff6c65a09e0
[   10.559522] R13: 0000000000000000 R14: 00007ff6c65a5ee8 R15: 00007ff6c65a5f00
[   10.559525]  </TASK>
[   10.559525] ---[ end trace 0000000000000000 ]---
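
The WARNING header moved from ttm_bo.c:409 on 5.14.9 to ttm_bo.c:411 on 5.17.14 while staying in ttm_bo_release. A small Python sketch for pulling those fields out of the header line, so that reports from different kernels can be compared, might look like this; `parse_warning` and the dictionary keys are hypothetical names chosen for illustration.

```python
import re

# Matches e.g.
# "WARNING: CPU: 2 PID: 412 at drivers/gpu/drm/ttm/ttm_bo.c:411 ttm_bo_release+0x34d/0x370 [ttm]"
# with or without a leading dmesg timestamp.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<function>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def parse_warning(line):
    """Return the fields of a kernel WARNING header as a dict, or None."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("cpu", "pid", "line"):
        info[key] = int(info[key])
    return info
```

If `function` and `file` agree between two reports and only `line` differs, the warning is most likely the same check shifted by unrelated source changes, though confirming that requires looking at the source of both kernel versions.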

Comment 6 Dominik 'Rathann' Mierzejewski 2022-11-07 13:03:06 UTC
FWIW, this seems to be gone in the 6.0.5 and 6.0.7 F36 kernels. I haven't tested any other versions. 5.19.16 seems to be the last version where this occurred, so I'm closing this.