Bug 1715873 - booting with kernel version 5.1.5 on RX 580 hangs
Summary: booting with kernel version 5.1.5 on RX 580 hangs
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-amdgpu
Version: 30
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-31 13:17 UTC by Gobinda Joy
Modified: 2020-05-26 15:44 UTC
CC List: 19 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-05-26 15:44:31 UTC
Type: Bug
Embargoed:


Attachments
Linux version 5.1.5-300.fc30.x86_64 (85.71 KB, text/plain)
2019-05-31 13:17 UTC, Gobinda Joy
Linux version 5.1.6-350.vanilla.knurd.1.fc30.x86_64 (91.54 KB, text/plain)
2019-06-03 03:29 UTC, Gobinda Joy

Description Gobinda Joy 2019-05-31 13:17:19 UTC
Created attachment 1575695
Linux version 5.1.5-300.fc30.x86_64

Description of the problem:
The kernel stalls during boot: no TTY is available and Ctrl+Alt+Del gets no response.

Problematic kernel version:
Linux version 5.1.5-300.fc30.x86_64

Last working version:
Kernel version 5.0.17 is the last working version so far.

Version where the problem started:
All kernels from 5.1.0 onward have this issue.

Steps to reproduce the problem:
Install kernel version 5.1 or later.
Boot on a system with an RX 580 8GB GPU, a Z77 chipset, and an i7 3770 processor.


The latest rawhide kernel (kernel-5.2.0-0.rc1.git2.2.fc31.x86_64) also exhibits this problem.

No external modules are in use.

The kernel log for version 5.1.5 is attached.

Comment 1 Gobinda Joy 2019-05-31 13:36:47 UTC
With amdgpu.dpm=0 on the kernel command line, the kernel boots.
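
A minimal sketch for confirming that the workaround is actually active on a booted system: it looks up a parameter on a kernel command line string such as the contents of /proc/cmdline. The helper name `get_module_param` and the sample command line are made up for illustration.

```python
import shlex


def get_module_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent."""
    value = None
    for token in shlex.split(cmdline):
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # keep the last occurrence if the parameter is repeated
    return value


# Hypothetical command line; on a real system, read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz-5.1.5-300.fc30.x86_64 ro quiet amdgpu.dpm=0"
print(get_module_param(sample, "amdgpu.dpm"))
```

On a live system the same function can be fed `open("/proc/cmdline").read()`; a result of None means the parameter was not passed at boot.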

On the rawhide kernel, however, booting with amdgpu.dpm=0 produces this error:
kernel: [drm] amdgpu kernel modesetting enabled.
kernel: CRAT table not found
kernel: Virtual CRAT table created for CPU
kernel: Parsing CRAT table with 1 nodes
kernel: Creating topology SYSFS entries
kernel: Topology: Add CPU node
kernel: Finished initializing topology
kernel: amdgpu 0000:04:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
kernel: amdgpu 0000:04:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
kernel: amdgpu 0000:04:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf7800000 -> 0xf783ffff
kernel: checking generic (e0000000 300000) vs hw (e0000000 10000000)
kernel: fb0: switching to amdgpudrmfb from EFI VGA
kernel: Console: switching to colour dummy device 80x25
kernel: amdgpu 0000:04:00.0: vgaarb: deactivate vga console
kernel: [drm] initializing kernel modesetting (POLARIS10 0x1002:0x67DF 0x1DA2:0xE387 0xE7).
kernel: [drm] register mmio base: 0xF7800000
kernel: [drm] register mmio size: 262144
kernel: [drm] add ip block number 0 <vi_common>
kernel: [drm] add ip block number 1 <gmc_v8_0>
kernel: [drm] add ip block number 2 <tonga_ih>
kernel: [drm] add ip block number 3 <gfx_v8_0>
kernel: [drm] add ip block number 4 <sdma_v3_0>
kernel: [drm] add ip block number 5 <powerplay>
kernel: [drm] add ip block number 6 <dm>
kernel: [drm] add ip block number 7 <uvd_v6_0>
kernel: [drm] add ip block number 8 <vce_v3_0>
kernel: kfd kfd: skipped device 1002:67df, PCI rejects atomics
kernel: [drm] UVD is enabled in VM mode
kernel: [drm] UVD ENC is enabled in VM mode
kernel: [drm] VCE enabled in VM mode
kernel: resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
kernel: caller pci_map_rom+0x6a/0x17d mapping multiple BARs
kernel: amdgpu 0000:04:00.0: No more image in the PCI ROM
kernel: ATOM BIOS: 113-1E3870U-O45
kernel: [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
kernel: [drm] vm size is 128 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
kernel: amdgpu 0000:04:00.0: VRAM: 8192M 0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
kernel: amdgpu 0000:04:00.0: GART: 256M 0x000000FF00000000 - 0x000000FF0FFFFFFF
kernel: [drm] Detected VRAM RAM=8192M, BAR=256M
kernel: [drm] RAM width 256bits GDDR5
kernel: [TTM] Zone  kernel: Available graphics memory: 12350340 KiB
kernel: [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
kernel: [TTM] Initializing pool allocator
kernel: [TTM] Initializing DMA pool allocator
kernel: [drm] amdgpu: 8192M of VRAM memory ready
kernel: [drm] amdgpu: 8192M of GTT memory ready.
kernel: [drm] GART: num cpu pages 65536, num gpu pages 65536
kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400300000).
kernel: [drm] Chained IB support enabled!
kernel: [drm] Found UVD firmware Version: 1.130 Family ID: 16
kernel: [drm] Found VCE firmware Version: 53.26 Binary ID: 3
kernel: BUG: unable to handle page fault for address: ffffa5bd8394f650
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 606549067 P4D 606549067 PUD 0 
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 6 PID: 461 Comm: systemd-udevd Not tainted 5.2.0-0.rc1.git1.1.vanilla.knurd.1.fc30.x86_64 #1
kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./G1.Sniper 3, BIOS F8k 04/29/2013
kernel: RIP: 0010:bw_calcs_data_update_from_pplib.isra.0+0x378/0x4d0 [amdgpu]
kernel: Code: 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 7d 00 4c 89 f2 be 02 00 00 00 e8 26 bf f9 ff 8b 04 24 4c 8b 23 be e8 03 00 00 83 e8 01 <8b> 7c 84 04 e8 6f 4d fb ff be e8 03 00 00 49 89 44 24 60 8b 04 24
kernel: RSP: 0018:ffffa5b98394f650 EFLAGS: 00010297
kernel: RAX: 00000000ffffffff RBX: ffff928b34cb92d8 RCX: 0000000000000000
kernel: RDX: ffffa5b98394f58c RSI: 00000000000003e8 RDI: ffff928b39c12800
kernel: RBP: ffff928b34cb9208 R08: 0000000000000020 R09: 000000032a000000
kernel: R10: 00000003ce000000 R11: 0000001770000000 R12: ffff928b3ac0b300
kernel: R13: ffffa5b98394f76c R14: ffffa5b98394f650 R15: ffffffffc0839d60
kernel: FS:  00007f1133ad1940(0000) GS:ffff928b46b80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffa5bd8394f650 CR3: 00000005faf54004 CR4: 00000000001606e0
kernel: Call Trace:
kernel:  dce112_create_resource_pool+0x6de/0x700 [amdgpu]
kernel:  dc_create_resource_pool+0x16c/0x220 [amdgpu]
kernel:  ? dal_gpio_service_create+0x92/0x110 [amdgpu]
kernel:  dc_create+0x219/0x620 [amdgpu]
kernel:  ? amdgpu_cgs_create_device+0x23/0x50 [amdgpu]
kernel:  amdgpu_dm_init+0xeb/0x160 [amdgpu]
kernel:  dm_hw_init+0xe/0x20 [amdgpu]
kernel:  amdgpu_device_init.cold+0x128d/0x161f [amdgpu]
kernel:  ? kmalloc_order+0x14/0x30
kernel:  amdgpu_driver_load_kms+0x88/0x270 [amdgpu]
kernel:  drm_dev_register+0x111/0x150 [drm]
kernel:  amdgpu_pci_probe+0xbd/0x120 [amdgpu]
kernel:  ? __pm_runtime_resume+0x58/0x80
kernel:  local_pci_probe+0x42/0x80
kernel:  pci_device_probe+0x115/0x190
kernel:  really_probe+0xf0/0x390
kernel:  driver_probe_device+0xb6/0x100
kernel:  device_driver_attach+0x53/0x60
kernel:  __driver_attach+0x8a/0x150
kernel:  ? device_driver_attach+0x60/0x60
kernel:  bus_for_each_dev+0x78/0xc0
kernel:  bus_add_driver+0x14a/0x1e0
kernel:  driver_register+0x6c/0xb0
kernel:  ? 0xffffffffc09b9000
kernel:  do_one_initcall+0x46/0x1f4
kernel:  ? _cond_resched+0x15/0x30
kernel:  ? kmem_cache_alloc_trace+0x154/0x1c0
kernel:  ? do_init_module+0x23/0x230
kernel:  do_init_module+0x5c/0x230
kernel:  load_module+0x22eb/0x28e0
kernel:  ? __do_sys_init_module+0x16e/0x1a0
kernel:  __do_sys_init_module+0x16e/0x1a0
kernel:  do_syscall_64+0x5b/0x180
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
kernel: RIP: 0033:0x7f1134ad1bae
kernel: Code: 48 8b 0d dd 42 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d aa 42 0c 00 f7 d8 64 89 01 48
kernel: RSP: 002b:00007ffe9cb83118 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
kernel: RAX: ffffffffffffffda RBX: 0000563b364ce650 RCX: 00007f1134ad1bae
kernel: RDX: 0000563b364b50a0 RSI: 00000000006dfa2e RDI: 0000563b36d998b0
kernel: RBP: 0000563b36d998b0 R08: 0000563b364ba730 R09: 0000000000000001
kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 0000563b364b50a0
kernel: R13: 0000000000000006 R14: 0000563b364c9fa0 R15: 0000000000000000
kernel: Modules linked in: amdgpu(+) amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper crc32c_intel serio_raw drm e1000e(+) alx mdio video wmi vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
kernel: CR2: ffffa5bd8394f650
kernel: ---[ end trace e14f412d43dd70ae ]---
kernel: RIP: 0010:bw_calcs_data_update_from_pplib.isra.0+0x378/0x4d0 [amdgpu]
kernel: Code: 00 00 5b 5d 41 5c 41 5d 41 5e c3 48 8b 7d 00 4c 89 f2 be 02 00 00 00 e8 26 bf f9 ff 8b 04 24 4c 8b 23 be e8 03 00 00 83 e8 01 <8b> 7c 84 04 e8 6f 4d fb ff be e8 03 00 00 49 89 44 24 60 8b 04 24
kernel: RSP: 0018:ffffa5b98394f650 EFLAGS: 00010297
kernel: RAX: 00000000ffffffff RBX: ffff928b34cb92d8 RCX: 0000000000000000
kernel: RDX: ffffa5b98394f58c RSI: 00000000000003e8 RDI: ffff928b39c12800
kernel: RBP: ffff928b34cb9208 R08: 0000000000000020 R09: 000000032a000000
kernel: R10: 00000003ce000000 R11: 0000001770000000 R12: ffff928b3ac0b300
kernel: R13: ffffa5b98394f76c R14: ffffa5b98394f650 R15: ffffffffc0839d60
kernel: FS:  00007f1133ad1940(0000) GS:ffff928b46b80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: ffffa5bd8394f650 CR3: 00000005faf54004 CR4: 00000000001606e0
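
As a rough cross-check of the oops above, the faulting address can be reproduced from the register dump. The bytes at the faulting RIP (marked `<8b>` in the Code: line) are `8b 7c 84 04`, which decodes by hand to `mov edi, [rsp + rax*4 + 4]`. That decoding is an assumption of this sketch, not output from a disassembler. The register values are copied from the log.

```python
# Register values copied from the oops in this comment.
RSP = 0xffffa5b98394f650
RAX = 0x00000000ffffffff
CR2 = 0xffffa5bd8394f650  # faulting address reported by the kernel


def effective_address(base, index, scale, disp):
    """x86-64 effective address: base + index*scale + disp, wrapped to 64 bits."""
    return (base + index * scale + disp) & 0xFFFFFFFFFFFFFFFF


# Assumed decoding of `8b 7c 84 04`: mov edi, [rsp + rax*4 + 4]
fault = effective_address(RSP, RAX, 4, 4)
print(hex(fault))    # 0xffffa5bd8394f650
print(fault == CR2)  # True

# The preceding bytes `83 e8 01` decode to `sub eax, 1`;
# a 32-bit value of 0 minus 1 wraps to the RAX value seen in the dump.
print((0 - 1) & 0xFFFFFFFF == RAX)  # True
```

The match is consistent with a zero being read from the stack, decremented, and then used as an array index. Whether that zero is an empty clock-level count caused by amdgpu.dpm=0 is a guess that the log alone does not confirm.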

Comment 2 Gobinda Joy 2019-06-03 03:28:01 UTC
My hardware is as follows:
CPU: i7 3770 at stock clock
Motherboard: Gigabyte G1.Sniper 3 latest BIOS available
RAM: 24 GB DDR3 at 1600 MHz
GPU: RX 580 8GB (Sapphire) latest VBIOS

I tried mainline stable version 5.1.6; the results are the same.
The display hangs when the amdgpu driver loads. I am unable to determine whether boot continues or hangs as well. Disk activity stops after a couple of seconds, and it is not possible to switch TTYs.

Ctrl+Alt+Del is unresponsive as well.

This problem goes away when amdgpu.dpm=0 is used, but in that case dynamic power scaling is not available and the GPU is stuck at a low clock, so graphics performance is abysmal. GPU temperature and fan speed utilities also don't work.

Here is an excerpt of the problematic log lines:

Jun 02 09:54:05 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:06 kernel: amdgpu: [powerplay] 
                         failed to send message 15b ret is 65535 
Jun 02 09:54:06 kernel: hrtimer: interrupt took 287743313 ns
Jun 02 09:54:06 kernel: clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
Jun 02 09:54:06 kernel: clocksource:                       'hpet' wd_now: 628dd7b wd_last: 5fef431 mask: ffffffff
Jun 02 09:54:06 kernel: clocksource:                       'tsc' cs_now: 254aa24747 cs_last: 25104a5bfd mask: ffffffffffffffff
Jun 02 09:54:06 kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         failed to send message 148 ret is 65535 
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:07 kernel: amdgpu: [powerplay] 
                         failed to send message 145 ret is 65535 
Jun 02 09:54:08 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:08 kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jun 02 09:54:08 kernel: sched_clock: Marking unstable (8791691311, 362291)<-(8817904668, -25851212)
Jun 02 09:54:08 kernel: amdgpu: [powerplay] 
                         failed to send message 146 ret is 65535 
Jun 02 09:54:08 kernel: hid-generic 0003:09DA:FC7C.0003: input,hidraw2: USB HID v1.11 Mouse [COMPANY USB Device] on usb-0000:00:1a.0-1.5.3/input0
Jun 02 09:54:09 kernel: hid-generic 0003:09DA:FC7C.0004: hiddev97,hidraw3: USB HID v1.11 Device [COMPANY USB Device] on usb-0000:00:1a.0-1.5.3/input1
Jun 02 09:54:11 kernel: clocksource: Switched to clocksource hpet
Jun 02 09:54:13 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:13 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:14 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:15 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         last message was failed ret is 65535
Jun 02 09:54:16 kernel: amdgpu: [powerplay] 
                         failed to send message 260 ret is 65535 
Jun 02 09:54:17 kernel: [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0
Jun 02 09:54:17 kernel: EXT4-fs (sda3): mounted filesystem with ordered data mode. Opts: (null)
Jun 02 09:54:20 kernel: amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
Jun 02 09:54:21 kernel: [drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test failed (-110).
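
A small sketch for summarizing an excerpt like the one above: it tallies the failed powerplay message IDs so the repeated ones stand out. The function name `tally_failed_messages` and the regular expression are illustrative assumptions, written against the line format shown in this excerpt only.

```python
import re
from collections import Counter

# Matches lines such as "failed to send message 260 ret is 65535".
FAILED_MSG = re.compile(r"failed to send message (\w+) ret is (\d+)")


def tally_failed_messages(log_text):
    """Count how often each message ID appears in 'failed to send message' lines."""
    return Counter(m.group(1) for m in FAILED_MSG.finditer(log_text))


# Three entries in the same two-line layout as the excerpt.
sample = (
    "Jun 02 09:54:06 kernel: amdgpu: [powerplay] \n"
    "                         failed to send message 15b ret is 65535 \n"
    "Jun 02 09:54:13 kernel: amdgpu: [powerplay] \n"
    "                         failed to send message 260 ret is 65535 \n"
    "Jun 02 09:54:14 kernel: amdgpu: [powerplay] \n"
    "                         failed to send message 260 ret is 65535 \n"
)
for msg_id, count in tally_failed_messages(sample).most_common():
    print(msg_id, count)
```

For reading the values themselves: 65535 is 0xFFFF, and the -110 in the IB test lines matches ETIMEDOUT on Linux.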

Any help is appreciated. Also let me know if I can help in any way.

Comment 3 Gobinda Joy 2019-06-03 03:29:09 UTC
Created attachment 1576462
Linux version 5.1.6-350.vanilla.knurd.1.fc30.x86_64

Comment 4 Gobinda Joy 2019-06-05 11:51:48 UTC
Bug report progress is here: https://bugs.freedesktop.org/show_bug.cgi?id=110822

Comment 5 Gobinda Joy 2019-06-10 13:45:13 UTC
The problem still exists in 5.1.7 and 5.1.8 from the updates-testing repo.

It also occurs in 5.1.8 and 5.2.0-0.rc3.git3.1 from the vanilla Fedora repo.

Comment 6 Ben Cotton 2020-04-30 21:47:14 UTC
This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 30 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 7 Ben Cotton 2020-05-26 15:44:31 UTC
Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

