Bug 1548961 - Boot hangs or desktop freezes on Ryzen 5 2400G + B350 mobo
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-02-26 06:09 UTC by Suvayu
Modified: 2018-08-01 03:22 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-02 07:11:01 UTC
Type: Bug


Attachments:
- lshw output (30.36 KB, text/plain), 2018-02-26 06:09 UTC, Suvayu
- lspci output (14.06 KB, text/plain), 2018-02-26 06:10 UTC, Suvayu
- screenshot of when the boot hangs (1.38 MB, image/jpeg), 2018-02-26 06:13 UTC, Suvayu
- dmesg of 4.15.7 (155.70 KB, text/plain), 2018-03-12 21:10 UTC, Jorge Martínez López
- screenshot after booting showing screen corruption (1.16 MB, image/jpeg), 2018-03-14 20:06 UTC, Suvayu
- screenshot after booting showing screen corruption (improved) (1.29 MB, image/jpeg), 2018-03-22 15:00 UTC, Suvayu
- traceback from amdgpu w/ logging turned on (27.08 KB, text/plain), 2018-03-22 15:03 UTC, Suvayu
- traceback from crashed boot (81.36 KB, text/plain), 2018-04-17 02:29 UTC, Suvayu
- crashed boot screenshot 1 (2.30 MB, image/jpeg), 2018-04-17 02:33 UTC, Suvayu
- crashed boot screenshot 2 (1.82 MB, image/jpeg), 2018-04-17 02:34 UTC, Suvayu
- crashed boot screenshot 3 (2.00 MB, image/jpeg), 2018-04-17 02:34 UTC, Suvayu
- crashed boot screenshot 4 (1.72 MB, image/jpeg), 2018-04-17 02:35 UTC, Suvayu

Description Suvayu 2018-02-26 06:09:06 UTC
Created attachment 1400738 [details]
lshw output

Description of problem:
I can't boot with any of the 4.15+ kernels on my Ryzen system.  The
boot process hangs after the initial stage.  I have attached a
screenshot, but I doubt these errors are relevant, as I also see them
when I "successfully" boot with the 4.13.9 kernel.

Please note, even when I can boot, there are all manner of things that
do not work.  Here's a quick list:
1. modesetting fails
2. graphics does not work properly (e.g. mplayer does not go
   fullscreen, although VLC does)
3. lm-sensors does not detect any sensors (this is apparently fixed in
   4.15+: https://github.com/groeck/lm-sensors/issues/16)
4. random hangs, resolved only by a hard shutdown.  When it hangs, I
   also can't login remotely with ssh.

Version-Release number of selected component (if applicable):
4.15+ is unbootable, earlier versions are very buggy.

How reproducible:
Always

Steps to Reproduce:
1. try to boot on the mentioned hardware

Actual results:
- Boot process hangs for 4.15+ kernels
- Booted system is very buggy (mostly graphics issues) for older
  kernels

Expected results:
Everything should just work

Additional info:
I have checked, amdgpu is loaded, so I guess my hardware is still not
supported entirely. My hardware:
- Ryzen 5 2400G
- Gigabyte B350M Gaming 3 motherboard

Comment 1 Suvayu 2018-02-26 06:10:12 UTC
Created attachment 1400739 [details]
lspci output

Comment 2 Suvayu 2018-02-26 06:13:11 UTC
Created attachment 1400740 [details]
screenshot of when the boot hangs

Comment 3 Jorge Martínez López 2018-03-05 19:36:13 UTC
Same thing here. 

4.14+ works for me but there are some screen artifacts when the resolution changes.

4.15 boots if I remove the "quiet" option; however, the screen is split. Wayland is unusable because you can't see the login screen, and the console login is split in four, with the two upper quarters showing the same output.

I can attach a dmesg if it helps.

Comment 4 Jorge Martínez López 2018-03-09 12:23:03 UTC
Some additional testing, 4.15 boots with "rhgb quiet nomodeset".
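
To confirm which of these options a kernel was actually booted with, the kernel command line can be inspected. The following is a minimal Python sketch, not part of the original report: the helper names `kernel_params` and `modesetting_disabled` are hypothetical, and the sample string is only an illustration built from the options quoted above.

```python
import shlex


def kernel_params(cmdline: str) -> dict:
    """Split a kernel command line into {name: value or None}."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        # Flags such as "quiet" or "nomodeset" carry no value.
        params[name] = value if sep else None
    return params


def modesetting_disabled(cmdline: str) -> bool:
    """True if "nomodeset" appears as a standalone parameter."""
    return "nomodeset" in kernel_params(cmdline)


# Illustrative command line (assumed), using the options from this comment.
sample = "ro rhgb quiet nomodeset"
print(modesetting_disabled(sample))           # True
print(modesetting_disabled("ro rhgb quiet"))  # False
```

On a running Linux system the same function could be given the contents of /proc/cmdline instead of the sample string.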

Comment 5 Suvayu 2018-03-10 00:10:33 UTC
I can confirm the above, but then that stands to reason since amdgpu is no longer loaded (confirmed with lsmod).  I tried loading the module after boot, with modprobe, but that fails too.

# modprobe -v amdgpu
insmod /lib/modules/4.15.6-300.fc27.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz 
modprobe: ERROR: could not insert 'amdgpu': Invalid argument
# modprobe -fv amdgpu
insmod /lib/modules/4.15.6-300.fc27.x86_64/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz 
modprobe: ERROR: could not insert 'amdgpu': Key was rejected by service

As to the second problem, lm_sensors still do not detect anything.  When I run sensors-detect, it goes through the list of AMD chipsets and stops at 16H; as I understand, B350 belongs to the 17H family.

Maybe I'll run it like this and see if it's any more stable.

Comment 6 Jorge Martínez López 2018-03-12 21:09:16 UTC
4.15.7 boots with quiet but it shows the split screen so it's still unusable. Using "nomodeset" fixes it.

Attaching a dmesg.

Comment 7 Jorge Martínez López 2018-03-12 21:10:28 UTC
Created attachment 1407385 [details]
dmesg of 4.15.7

This dmesg was taken with 4.15.7 booting to a corrupted display.

Comment 8 Suvayu 2018-03-13 07:35:06 UTC
(In reply to Jorge Martínez López from comment #7)
> Created attachment 1407385 [details]
> dmesg of 4.15.7
> 
> This dmesg was taken with 4.15.7 booting to a corrupted display.

Hi Jorge, if I understand correctly, you can still login on the console, but graphical login is corrupted, is that right?  And I guess you generated the dmesg from the console?  For me, the boot freezes, I can't even login to a console so that I can dump the dmesg output.  Are you using any special kernel arguments?

Comment 9 Jorge Martínez López 2018-03-13 07:43:21 UTC
(In reply to Suvayu from comment #8)
> (In reply to Jorge Martínez López from comment #7)
> > Created attachment 1407385 [details]
> > dmesg of 4.15.7
> > 
> > This dmesg was taken with 4.15.7 booting to a corrupted display.
> 
> Hi Jorge, if I understand correctly, you can still login on the console, but
> graphical login is corrupted, is that right?  And I guess you generated the
> dmesg from the console?  For me, the boot freezes, I can't even login to a
> console so that I can dump the dmesg output.  Are you using any special
> kernel arguments?

That's correct, the console is also corrupted but it's readable. To make it work I'm using "nomodeset", to get the corrupted screen I don't use any special parameters.

Comment 10 Suvayu 2018-03-14 20:06:48 UTC
Created attachment 1408136 [details]
screenshot after booting showing screen corruption

@Jorge, thanks for the clarification.

In the meantime, with today's update (4.15.8) I can boot but with a corrupted display like Jorge.  Accordingly, I'm updating the screenshots I had attached.  You can see the display is split in two, and the lower halves are corrupted by screen tears.  I managed to login to a desktop by blindly typing in my password.

Comment 11 Jorge Martínez López 2018-03-15 13:20:15 UTC
Good news(?)

In an act of desperation I did a reinstall and all my problems are gone.

I used "nomodeset" during the install so that setting was kept. Once rebooted into the new install I upgraded all the packages including the kernel and removed "nomodeset" from the default grub.

I'm not sure if it's relevant but I also moved from BIOS compatibility (CSM) to UEFI boot.

I'm using 4.15.8-300.fc27.x86_64 now with KMS, AMDGPU, Wayland.

On the not-so-good side of things, I have had a kernel core, which I think is unrelated, as I have had some random hangs for a while.

Comment 12 Suvayu 2018-03-18 12:10:01 UTC
Unfortunately I don't have the time/energy to reinstall.  But I did try booting with UEFI, no luck.  I get the same corrupted screen.

lm_sensors are still not working, do they work for you?

That said, I might have found the bug causing my random hangs (bug #1514734), maybe you are also affected by that?  It seems to be triggered by ata errors.  In my case that comes from a rather old HDD.

I tried 4.15.9 for this latest round of testing.

Comment 13 Jorge Martínez López 2018-03-18 12:33:23 UTC
(In reply to Suvayu from comment #12)
> Unfortunately I don't have the time/energy to reinstall.  But I did try
> booting with UEFI, no luck.  I get the same corrupted screen.
> 
> lm_sensors are still not working, do they work for you?

Partially, it detects some sensors in the bus but I don't get any reading from CPU or fan speed.

> 
> That said, I might have found the bug causing my random hangs (bug
> #1514734), maybe you are also affected by that?  It seems to be triggered by
> ata errors.  In my case that comes from a rather old HDD.
> 
> I tried 4.15.9 for this latest round of testing.

That bug seems related to PAE kernels which you shouldn't be using on a 64 bit machine.

My computer is still running stable (I have since then removed the "nopti" config option). I had a scare earlier when it became unresponsive for a few seconds, but it recovered by itself.

Comment 14 Suvayu 2018-03-19 01:30:49 UTC
(In reply to Jorge Martínez López from comment #13)
> 
> That bug seems related to PAE kernels which you shouldn't be using on a 64
> bit machine.
> 
> My computer is still running stable (I have since then removed the "nopti"
> config option). I had a scare earlier when it became unresponsive for a few
> seconds but it recover itself.

Well I didn't notice it was a PAE bug until you mentioned it.  The traceback however is identical.

Comment 15 Suvayu 2018-03-22 15:00:33 UTC
Created attachment 1411756 [details]
screenshot after booting showing screen corruption (improved)

Now the left half of the screen is rendered correctly, instead of only the top-left quarter as before.

Comment 16 Suvayu 2018-03-22 15:03:32 UTC
Created attachment 1411757 [details]
traceback from amdgpu w/ logging turned on

Comment 17 Suvayu 2018-03-22 15:05:39 UTC
In search of a solution, I have now moved on to 4.16.0-0.rc6.git0.2.fc28.x86_64 on F28 (beta), still no go.  The screen corruption has "improved" somewhat; instead of rendering a quarter of the screen correctly, it's now rendering half of the screen correctly.  I have updated the screenshot accordingly. 

I also turned on AMDGPU DC logging, and I notice (multiple) tracebacks from amdgpu.  I have attached a trimmed dump of the journal.

Since this is the newest kernel available on Fedora, I sought help on phoronix forums ("https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1014199-getting-amdgpu-to-work-on-ryzen-5-2400g-b350").  An AMD dev there pointed me to the amd-staging-drm-next branch ("https://cgit.freedesktop.org/~agd5f/linux/?h=amd-staging-drm-next"). Next, I'll try to build this and see if that works.

Comment 18 Suvayu 2018-03-29 05:04:59 UTC
I can finally boot without nomodeset with the latest kernel in F28 updates-testing (4.16.0-0.rc7.git0.1.fc28.x86_64).  However, it's still not stable.  The mouse pointer is sluggish and jittery, and my desktop still freezes randomly, leaving no logs in the journal.

Comment 19 Jerry 2018-03-31 18:58:45 UTC
See also bug 1562530, which may be the same freeze issue. I have not had any boot issues, only random complete screen freezes. The bugs may be related.

Comment 20 Suvayu 2018-04-17 02:26:22 UTC
@Jerry: it does seem somewhat similar.

In the meantime, here are some developments.  I'm running the latest kernel available on Fedora: 4.16.2-300.fc28.x86_64.

1. I still have random boot issues: either there are crashes at boot, or the screen goes blank after modesetting and stays that way indefinitely. I will attach logs and screenshots from a particularly severe example of a crash at boot.  From the logs you can see the first backtrace is recorded (matches with screenshot 1), but the others don't show up in the log.

2. Random hangs: sometimes the desktop freezes, but I can still log in remotely (and shut down).  Typically though, remote login also fails with "No route to host".  When I inspect the logs later, typically there is nothing to be found, but on some of the freezes I found the following:

  [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=2018070, last emitted seq=2018072
  [drm] IP block:psp is hung!
  [drm] GPU recovery disabled.

Please note, so far I have never had a hang from the terminal (Ctrl+Alt+F2 etc).  They occur only when I'm logged into a graphical desktop (XFCE in my case).

3. I observe screen corruption when using tools like display or import (from ImageMagick). Import fails to capture what I'm seeing on the display, it's some hotchpotch of all the windows that are open.  Similarly, display shows a weird overlap of all the windows.  Strangely, XFCE's own screenshot tool works without a hitch.
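
The `amdgpu_job_timedout` message quoted in item 2 carries the ring name and two sequence numbers. As a rough Python sketch (the helper `parse_ring_timeout` is hypothetical, and reading the difference between the two numbers as the count of submitted-but-unfinished jobs is an assumption, not something confirmed in this report), the line can be picked apart like this:

```python
import re

# Matches the timeout message format quoted in item 2.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line: str):
    """Return ring name and sequence numbers, or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m["signaled"])
    emitted = int(m["emitted"])
    return {
        "ring": m["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Assumption: emitted minus signaled = work still outstanding.
        "pending": emitted - signaled,
    }


line = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, "
        "last signaled seq=2018070, last emitted seq=2018072")
info = parse_ring_timeout(line)
print(info["ring"], info["pending"])  # sdma0 2
```

Collecting the ring name across several freezes would show whether the hangs always involve the same ring.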

Comment 21 Suvayu 2018-04-17 02:29:45 UTC
Created attachment 1422866 [details]
traceback from crashed boot

Comment 22 Suvayu 2018-04-17 02:33:41 UTC
Created attachment 1422868 [details]
crashed boot screenshot 1

Comment 23 Suvayu 2018-04-17 02:34:24 UTC
Created attachment 1422869 [details]
crashed boot screenshot 2

Comment 24 Suvayu 2018-04-17 02:34:59 UTC
Created attachment 1422870 [details]
crashed boot screenshot 3

Comment 25 Suvayu 2018-04-17 02:35:37 UTC
Created attachment 1422871 [details]
crashed boot screenshot 4

Comment 26 Jeff Strehlow 2018-04-28 05:10:22 UTC
I'm using Fedora 28 and am experiencing occasional lockups. I just upgraded yesterday and am using the 4.16.3-300.fc28.x86_64 kernel. My system consists of:

Asus Prime B350M-A motherboard
Ryzen 3 2200G APU
CORSAIR Vengeance LPX 3000MHz DDR4 RAM, set for DOCP 2933MHz in the BIOS

Comment 27 Jeff Strehlow 2018-04-28 07:49:31 UTC
Here is some additional information. It freezes, on average, once every other day. In fact it just froze again. It usually freezes when I'm using firefox, scrolling through webpages with the mouse wheel; I do that a lot. But I've also seen it freeze when clicking on a file, while in the file manager. When it freezes a hard reset is required to unfreeze it.

I'm using the Fedora 28 GNOME workstation release, which I believe uses Wayland as the default.

Comment 28 Suvayu 2018-04-29 10:16:28 UTC
Hi Jeff, your issues sound just like mine.  Do you also get error messages like the one mentioned in comment 20 (#c20) in your journal?  Try:

  $ journalctl -k -b -N  # replace N with how many boots ago the crash happened, e.g. -b -1 for the previous boot

I get them intermittently.  I think when they don't show up in the journal, it's because the computer froze before the journal could record the message.

<rant>
That said, you are out of luck getting help here.  As you can see, no one has responded in two months, despite the bug being marked as severe.  I have highlighted the bug on bodhi kernel updates, phoronix forums, and on IRC #fedora-devel, to no avail.

Your best bet is to ask for help on IRC #radeon (as advised by a person on #fedora-devel).  I haven't done that since they will most likely ask you to test with the 4.17 dev kernel (like the phoronix post in #c17); unfortunately I have no experience building custom kernels, and I am somewhat intimidated.
</rant>

Good luck

Comment 29 Jeff Strehlow 2018-04-30 00:42:47 UTC
Hi Suvayu. Thanks for responding to my post and for your suggestions. I wasn't really expecting help here. I was just reporting a bug. I figure that if more people report it, it's more likely to get fixed in a timely manner. I tried using journalctl as you suggested and came up with similar errors. Here's the output showing the last log entries; it appears that the problem starts occurring at Apr 26 17:09:12:

Apr 26 15:40:01 localhost.localdomain kernel: logitech-hidpp-device 0003:046D:400A.0006: HID++ 2.0 device connected.
Apr 26 15:40:36 localhost.localdomain kernel: fuse init (API version 7.26)
Apr 26 15:40:41 localhost.localdomain kernel: rfkill: input handler disabled
Apr 26 17:09:12 localhost.localdomain kernel: gmc_v9_0_process_interrupt: 9 callbacks suppressed
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106243000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106245000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106241000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x000000010624a000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106207000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x000000010624c000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106247000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106209000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106249000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x000000010620e000 from 27
Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 26 17:09:22 localhost.localdomain kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=55595, last emitted seq=55597
Apr 26 17:09:22 localhost.localdomain kernel: [drm] IP block:psp is hung!
Apr 26 17:09:22 localhost.localdomain kernel: [drm] GPU recovery disabled.

I could try compiling and installing a newer kernel. I did it about 15 years ago, so I can probably do it now. The link for cloning the kernel at the amd-staging-drm-next branch didn't work for me, but Linux kernel 4.17-rc3 is available at kernel.org. Thanks again.
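
For comparing freezes across boots, a log excerpt like the one above can be reduced to two numbers: how many VMC page fault lines were recorded, and how long after the first one the job timeout was logged. This is only a Python sketch under stated assumptions: the function `summarize` is hypothetical, the sample is an abbreviated subset of the lines above, and the timestamp format is assumed to be the default short journal format shown in this comment.

```python
import re
from datetime import datetime

# "Apr 26 17:09:12 host kernel: message" as in the excerpt above.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: (?P<msg>.*)$"
)


def summarize(lines):
    """Return (page_fault_count, seconds from first fault to job timeout)."""
    faults = 0
    first_fault = None
    timeout_at = None
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        ts = datetime.strptime(m["ts"], "%b %d %H:%M:%S")
        msg = m["msg"]
        if "VMC page fault" in msg:
            faults += 1
            if first_fault is None:
                first_fault = ts
        elif "amdgpu_job_timedout" in msg and timeout_at is None:
            timeout_at = ts
    if first_fault is None or timeout_at is None:
        return faults, None
    return faults, (timeout_at - first_fault).total_seconds()


# Abbreviated sample taken from the log above (only two of the fault lines).
SAMPLE = [
    "Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)",
    "Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106243000 from 27",
    "Apr 26 17:09:12 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)",
    "Apr 26 17:09:22 localhost.localdomain kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=55595, last emitted seq=55597",
]
print(summarize(SAMPLE))  # (2, 10.0)
```

In this excerpt the timeout is logged 10 seconds after the first page fault; whether that gap is a fixed driver timeout is not established in this report.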

Comment 30 Jeff Strehlow 2018-04-30 07:58:36 UTC
Earlier today I did a "dnf update" and got kernel 4.16.4-300.fc28.x86_64. It also freezes up with that kernel. The freeze occurred at Apr 30 00:18:57 and here's the log output. There isn't a normal powerdown sequence because I pushed the hardware reset button in order to recover, which must have terminated this boot session's log.  

Apr 29 22:35:32 localhost.localdomain kernel: wlp4s0: associated
Apr 29 22:35:32 localhost.localdomain kernel: IPv6: ADDRCONF(NETDEV_CHANGE): wlp4s0: link becomes ready
Apr 29 22:35:54 localhost.localdomain kernel: logitech-hidpp-device 0003:046D:400A.0006: HID++ 2.0 device connected.
Apr 29 22:36:25 localhost.localdomain kernel: fuse init (API version 7.26)
Apr 29 22:36:30 localhost.localdomain kernel: rfkill: input handler disabled
Apr 30 00:18:57 localhost.localdomain kernel: gmc_v9_0_process_interrupt: 8 callbacks suppressed
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106452000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106454000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x000000010644a000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x000000010644c000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106456000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106458000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x000000010644e000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106450000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106461000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:4 pas_id:0)
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0:   at page 0x0000000106463000 from 27
Apr 30 00:18:57 localhost.localdomain kernel: amdgpu 0000:08:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Apr 30 00:19:08 localhost.localdomain kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=44677, last emitted seq=44679
Apr 30 00:19:08 localhost.localdomain kernel: [drm] IP block:psp is hung!
Apr 30 00:19:08 localhost.localdomain kernel: [drm] GPU recovery disabled.

Comment 31 Suvayu 2018-05-01 03:44:28 UTC
Hi Jeff, the freezes are definitely correlated to the hung GPU.  Just like you, the timestamps coincided every time I could get an error in the journal.

That said, yesterday I upgraded my BIOS to the version based on AGESA 1.0.0.2a (Gigabyte version F23d), and now my workstation has been up 26 hrs without a crash.  I have done my usual compiling, browsing, video encoding, watched a TV series, and youtube videos.

Maybe Asus also has a BIOS update?  If you try that, make sure you are aware of the risks, and back up your current BIOS first.

Comment 32 Jeff Strehlow 2018-05-01 18:48:49 UTC
Thanks Suvayu. I upgraded my BIOS to version 4008 about 10 days ago. I just checked and that's still the latest version. It could be the BIOS, but I figure it's most likely the driver since they just included it in the kernel with version 4.15 and it's known to have bugs. Good luck with your system. I hope that fixed the freezes for you.

Comment 33 Suvayu 2018-05-08 05:23:04 UTC
A development: newer kernels have regressed, and I can't boot.  I have to blacklist amdgpu or disable modesetting to boot successfully.  Approximately 1 in 6-7 boots succeeds with modesetting, and eventually I see a desktop freeze with a hung GPU.  4.16.4 does not seem to have this problem.  I have been running on this kernel for almost two days again.

Comment 34 Jeff Strehlow 2018-05-08 17:51:32 UTC
In my case--with an ASUS Prime B350M-A motherboard--the newest kernel boots OK. I upgraded last night and got the 4.16.6-302.fc28.x86_64 kernel. This morning I booted with it without any issues. After seeing Suvayu's post I tried booting a couple additional times and there weren't any problems. So it seems that the newest kernels only fail to boot properly with certain hardware.

I've been using the 4.16.5 kernel for a few days without any lockups, but yesterday it did lock up with that kernel one time. Version 4.16.5 also boots for me without any issues.

Comment 35 Suvayu 2018-05-14 13:23:00 UTC
I have been using 4.17.0-0.rc4.git[n] kernels from the repo mentioned in the following wiki page: https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories

It's been running more or less smoothly.  I have had one or two failed boots, and one crash so far.  The crash was with an earlier build (rc4.git0, the latest is rc4.git4).  At the moment I sometimes experience screen tears with Chrome, but they resolve themselves in a few seconds on moving the browser window, changing the tab, or reloading the page.

Comment 36 Suvayu 2018-07-02 07:11:01 UTC
With 4.17 kernels now available in the repo, I do not have any freezes any more.

Comment 37 Jeff Strehlow 2018-08-01 03:22:24 UTC
Shortly after the 4.17 kernel became available in the FC28 repository, I did a "dnf update" and have been running with that kernel ever since. With 4.17 I haven't experienced any freezes, not even one. My system is now very stable. Thanks.
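
Anyone checking whether a given machine is already on a kernel at or past the version that worked for the reporters could compare release strings along these lines. This is a hedged Python sketch: the helpers `kernel_version` and `has_reported_fix` are hypothetical, the 4.17 threshold comes only from the two comments above (the bug's "Fixed In Version" field is empty), and the last release string in the example is made up for illustration.

```python
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a release string such as
    '4.16.2-300.fc28.x86_64'. A missing patch level counts as 0."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part or 0) for part in m.groups())


def has_reported_fix(release: str, fixed_in=(4, 17, 0)) -> bool:
    """True if the release is at or above the version reported as stable here.
    Note: release candidates (e.g. 4.17.0-0.rc4) also compare as 4.17.0."""
    return kernel_version(release) >= fixed_in


print(has_reported_fix("4.15.6-300.fc27.x86_64"))           # False
print(has_reported_fix("4.16.0-0.rc7.git0.1.fc28.x86_64"))  # False
print(has_reported_fix("4.17.2-200.fc28.x86_64"))           # True (hypothetical release string)
```

The release string of the running kernel is what `uname -r` prints, or `platform.release()` from Python.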

