Bug 1478219 - glmark2 and gnome session random blank or "grey background" screen freeze radeon rx550 AMD POLARIS12 due to dpm on amdgpu
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-amdgpu
Version: 28
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Adam Jackson
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Reported: 2017-08-04 02:27 UTC by Pablo Estigarribia
Modified: 2018-08-17 23:43 UTC
Last Closed: 2018-08-17 23:43:14 UTC


External Trackers
Tracker ID Priority Status Summary Last Updated
FreeDesktop.org 101976 None None None Never

Description Pablo Estigarribia 2017-08-04 02:27:01 UTC
Description of problem:

Details and attachments are reported in: 

https://bugs.freedesktop.org/show_bug.cgi?id=101976


Version-Release number of selected component (if applicable): I have tried with the stock Fedora Mesa 17.1.5 (kernel 4.12.4-300.fc26.x86_64) and also with the che/mesa git version 17.3


How reproducible:

Use a POLARIS12 AMD card and you will get random screen freezes, or run glmark2 and you will get the freeze at:

glmark2
=======================================================
    glmark2 2014.03
=======================================================
    OpenGL Information
    GL_VENDOR:     X.Org
    GL_RENDERER:   Gallium 0.4 on AMD POLARIS12 (DRM 3.15.0 / 4.12.4-300.fc26.x86_64, LLVM 4.0.0)
    GL_VERSION:    2.1 Mesa 17.1.5
=======================================================
[build] use-vbo=false: FPS: 3337 FrameTime: 0.300 ms
[build] use-vbo=true: FPS: 5185 FrameTime: 0.193 ms
[texture] texture-filter=nearest: FPS: 5145 FrameTime: 0.194 ms
[texture] texture-filter=linear: FPS: 5123 FrameTime: 0.195 ms
[texture] texture-filter=mipmap: FPS: 5071 FrameTime: 0.197 ms
[shading] shading=gouraud: FPS: 5113 FrameTime: 0.196 ms
[shading] shading=blinn-phong-inf: FPS: 5145 FrameTime: 0.194 ms
[shading] shading=phong: FPS: 4941 FrameTime: 0.202 ms
[shading] shading=cel:


Actual results:

screen freeze

Expected results:

screen never freezes!

Additional info:

Setting the environment variable

MESA_GL_VERSION_OVERRIDE=2.1

works as a workaround for glmark2.

Setting MESA_GL_VERSION_OVERRIDE=2.1 in /etc/environment also works as a workaround for the GNOME session (I also got random screen freezes when playing videos or on some logins).
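
A minimal sketch of applying the override to a single program instead of the whole session. The helper name run_with_gl_override is hypothetical; only MESA_GL_VERSION_OVERRIDE and glmark2 come from this report.

```shell
# Hypothetical helper: run one command with the Mesa GL version override set
# only in that command's environment, leaving the rest of the session alone.
run_with_gl_override() {
    # $1 is the version string, the remaining arguments are the command to run.
    gl_version="$1"
    shift
    env MESA_GL_VERSION_OVERRIDE="$gl_version" "$@"
}
```

For example, `run_with_gl_override 2.1 glmark2` runs the benchmark with the override without touching /etc/environment.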

Comment 1 Pablo Estigarribia 2017-08-25 02:43:48 UTC
In the latest test glmark2 passed, but only after changing the dpm performance level from auto to high with:

echo "high" > /sys/class/drm/card0/device/power_dpm_force_performance_level

as root. 

glmark2
=======================================================
    glmark2 2014.03
=======================================================
    OpenGL Information
    GL_VENDOR:     X.Org
    GL_RENDERER:   Gallium 0.4 on AMD POLARIS12 (DRM 3.15.0 / 4.12.8-300.fc26.x86_64, LLVM 4.0.0)
    GL_VERSION:    3.0 Mesa 17.1.7
=======================================================
[build] use-vbo=false: FPS: 3486 FrameTime: 0.287 ms
[build] use-vbo=true: FPS: 5095 FrameTime: 0.196 ms
[texture] texture-filter=nearest: FPS: 4927 FrameTime: 0.203 ms
[texture] texture-filter=linear: FPS: 5014 FrameTime: 0.199 ms
[texture] texture-filter=mipmap: FPS: 4899 FrameTime: 0.204 ms
[shading] shading=gouraud: FPS: 4960 FrameTime: 0.202 ms
[shading] shading=blinn-phong-inf: FPS: 4979 FrameTime: 0.201 ms
[shading] shading=phong: FPS: 4945 FrameTime: 0.202 ms
[shading] shading=cel: FPS: 4880 FrameTime: 0.205 ms
[bump] bump-render=high-poly: FPS: 5064 FrameTime: 0.197 ms
[bump] bump-render=normals: FPS: 4824 FrameTime: 0.207 ms
[bump] bump-render=height: FPS: 4688 FrameTime: 0.213 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 5140 FrameTime: 0.195 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 3688 FrameTime: 0.271 ms
[pulsar] light=false:quads=5:texture=false: FPS: 4306 FrameTime: 0.232 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 2557 FrameTime: 0.391 ms
[desktop] effect=shadow:windows=4: FPS: 2273 FrameTime: 0.440 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 557 FrameTime: 1.795 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 779 FrameTime: 1.284 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 636 FrameTime: 1.572 ms
[ideas] speed=duration: FPS: 1468 FrameTime: 0.681 ms
[jellyfish] <default>: FPS: 3937 FrameTime: 0.254 ms
[terrain] <default>: FPS: 636 FrameTime: 1.572 ms
[shadow] <default>: FPS: 3587 FrameTime: 0.279 ms
[refract] <default>: FPS: 1310 FrameTime: 0.763 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 4893 FrameTime: 0.204 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 4779 FrameTime: 0.209 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 4807 FrameTime: 0.208 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 4830 FrameTime: 0.207 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 4475 FrameTime: 0.223 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 4424 FrameTime: 0.226 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 4343 FrameTime: 0.230 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 4441 FrameTime: 0.225 ms
=======================================================
                                  glmark2 Score: 3806 
=======================================================


OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD POLARIS12 (DRM 3.15.0 / 4.12.8-300.fc26.x86_64, LLVM 4.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.1.7
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
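
A minimal sketch of the level change used at the top of this comment, wrapped in a function with basic checks. The function name and the optional device-directory argument are assumptions for illustration; only the values auto, low and high are accepted here because those are the ones mentioned in this report.

```shell
# Sketch: set the amdgpu performance level through sysfs.
# The device directory is an optional argument (an assumption for
# illustration) so the function can target a card other than card0.
set_dpm_level() {
    level="$1"
    device_dir="${2:-/sys/class/drm/card0/device}"
    # Accept only the levels discussed in this report.
    case "$level" in
        auto|low|high) ;;
        *) echo "unsupported level: $level" >&2; return 1 ;;
    esac
    target="$device_dir/power_dpm_force_performance_level"
    # The file is missing when dpm is disabled and read-only for non-root users.
    if [ ! -w "$target" ]; then
        echo "cannot write $target (not root, or dpm disabled?)" >&2
        return 1
    fi
    printf '%s\n' "$level" > "$target"
}
```

Run it as root, for example `set_dpm_level high`. Note that `sudo echo high > file` does not work, because the redirection is performed by the unprivileged shell; piping into `sudo tee file` is the usual alternative.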

Comment 2 Pablo Estigarribia 2017-08-25 02:48:19 UTC
I also tested the game Insurgency, which had the same problem as glmark2, and it works now!

It seems that dpm is buggy on this Radeon card; it would probably be better to disable it by default or force it to high performance so that users get a more stable system.

https://www.x.org/wiki/RadeonFeature/#note_15 says:

"high" forces the gpu to be in the "high" power state all the time. The "low" power state is selected when the monitors are in the dpms off state. The "profile" method is not as agressive as "dynpm," but is currently much more stable and flicker free and works with multiple heads active.

Comment 3 Pablo Estigarribia 2017-08-27 13:04:27 UTC
Everything works fine disabling dpm on amdgpu. 

Workaround steps: 

edit /etc/default/grub

add amdgpu.dpm=0 to GRUB_CMDLINE_LINUX=

generate new config: 

sudo grub2-mkconfig -o /boot/grub2/grub.cfg

reboot (to check that the new permanent config has been applied).

You will notice that /sys/class/drm/card0/device/power_dpm_force_performance_level no longer exists.
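
A minimal sketch for confirming after the reboot that the parameter actually reached the kernel. The function name is hypothetical, and the file argument exists only so the check can be tried against a sample file; by default it reads /proc/cmdline.

```shell
# Sketch: check whether a kernel parameter is present on the command line.
# The second argument defaults to /proc/cmdline.
has_kernel_param() {
    param="$1"
    cmdline_file="${2:-/proc/cmdline}"
    # Split the command line into one word per line and look for an exact match.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}
```

For example, `has_kernel_param amdgpu.dpm=0 && echo "dpm disabled on the kernel command line"`.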

For months I thought the problem was somewhere else and tried many combinations without success. Without this workaround I had a very annoying experience working, playing videos, whatever, because the display froze randomly and I had to restart. I couldn't play any game before; now I have tested many games and everything is perfect.

Would it be good to apply this workaround by default, so users don't experience this annoying bug?

Comment 4 Pablo Estigarribia 2017-09-12 10:39:17 UTC
The latest update sent in https://bodhi.fedoraproject.org/updates/FEDORA-2017-2873fa1fb6 still doesn't fix this bug.

Could a good workaround be to set the default performance level to high for all users until the bug is definitively fixed?

Comment 5 Pablo Estigarribia 2017-09-12 23:14:04 UTC
I found a better workaround to use while waiting for a permanent fix:

1) Download the radcard script:

https://raw.githubusercontent.com/superjamie/snippets/master/radcard

2) chmod +x radcard

3) sudo cp radcard /usr/local/bin/radcard

4) Create a systemd unit to change the value at boot. Unit file:

sudo gedit /etc/systemd/system/amdgpu-dpm.service


    [Unit]
    Description=Change dpm to performance high

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/radcard set high
    ExecStop=/usr/local/bin/radcard set bal

    [Install]
    WantedBy=multi-user.target

5) Start and see status

sudo systemctl start amdgpu-dpm.service
sudo systemctl status amdgpu-dpm.service

Should see something like:

[root@192 ~]# systemctl status amdgpu-dpm.service 
● amdgpu-dpm.service - Change dpm to performance high
   Loaded: loaded (/etc/systemd/system/amdgpu-dpm.service; enabled; vendor pres
   Active: inactive (dead) since Tue 2017-09-12 20:08:08 -03; 4min 29s ago
  Process: 903 ExecStop=/usr/local/bin/radcard set bal (code=exited, status=0/S
  Process: 884 ExecStart=/usr/local/bin/radcard set high (code=exited, status=0
 Main PID: 884 (code=exited, status=0/SUCCESS)

sep 12 20:08:08 localhost.localdomain systemd[1]: Starting Change dpm to perfor
sep 12 20:08:08 localhost.localdomain radcard[884]: power_dpm_state: performanc
sep 12 20:08:08 localhost.localdomain radcard[884]: power_dpm_force_performance
sep 12 20:08:08 localhost.localdomain radcard[903]: power_dpm_state: performanc
sep 12 20:08:08 localhost.localdomain radcard[903]: power_dpm_force_performance
sep 12 20:08:08 localhost.localdomain systemd[1]: Started Change dpm to perform

6) Enable this

sudo systemctl enable amdgpu-dpm.service

---

You can adjust the commands to suit your own preferences.

Comment 6 Pablo Estigarribia 2017-10-11 13:21:33 UTC
I have noticed that the "oneshot" unit in systemd runs both the start and the stop command, so to keep the level always high I had to change the unit as follows:


4) Create a systemd unit to change the value at boot. Unit file:

sudo gedit /etc/systemd/system/amdgpu-dpm.service


    [Unit]
    Description=Change dpm to performance high

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/radcard set high
    ExecStop=/usr/local/bin/radcard set high

    [Install]
    WantedBy=multi-user.target

Comment 7 Chris Siebenmann 2018-02-23 15:42:36 UTC
I have experienced this hang or something similar enough to it that it
was mitigated by setting 'amdgpu.dpm=0' on the kernel command line. My
environment is Fedora 27, a Gigabyte Radeon RX 550 card, AMD Ryzen 1800X,
Asus Prime X370-Pro motherboard, and Kingston ECC RAM.

(This happened in early January, which is a few Fedora 27 kernels ago.
I haven't tried to reproduce in the latest ones because my interest in
having my primary machine lock up on me is low.)

FYI: your amdgpu-dpm.service ExecStart isn't working the way you expected
because you also need 'RemainAfterExit=True'. Without it, systemd thinks
that your script has failed/stopped after the script exits, so it runs
your ExecStop script.
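
A minimal sketch of the unit with that setting added, written out by a small shell function. The function name and the path argument are hypothetical; the unit contents otherwise follow Comment 5, and the radcard arguments are taken from there unchanged.

```shell
# Sketch: write a revised amdgpu-dpm.service that sets RemainAfterExit, so the
# unit stays "active" after ExecStart finishes and ExecStop only runs when the
# unit is really stopped. Writing under /etc/systemd/system needs root.
write_dpm_unit() {
    unit_path="$1"
    cat > "$unit_path" <<'EOF'
[Unit]
Description=Change dpm to performance high

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/radcard set high
ExecStop=/usr/local/bin/radcard set bal

[Install]
WantedBy=multi-user.target
EOF
}
```

After writing the file to /etc/systemd/system/amdgpu-dpm.service, `sudo systemctl daemon-reload` followed by `sudo systemctl enable --now amdgpu-dpm.service` should pick it up.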

Comment 8 Chris Siebenmann 2018-04-04 17:19:22 UTC
With all of the current updates applied, my system no longer appears to
lock up this way and the 'amdgpu.dpm=0' kernel command line parameter is
unnecessary. Various package versions:

   xorg-x11-drv-amdgpu-18.0.1-1.fc27.x86_64
   kernel-4.15.13-300.fc27.x86_64
   libdrm-2.4.91-1.fc27.x86_64

(and possibly other relevant packages, but I don't know.)

Comment 9 Pablo Estigarribia 2018-04-05 22:44:40 UTC
You are right, Chris.

I have removed the customizations on my system and found that the bug is gone; same with the live CD.

What I see:

Someone disabled dpm by default, probably upstream? I haven't seen any comment about it anywhere else.

After boot without customizations:

$ cat /sys/module/amdgpu/parameters/dpm
0

$ cat /etc/default/grub 
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet"
GRUB_DISABLE_RECOVERY="true"

Comment 10 Tim Cuthbertson 2018-04-06 00:14:51 UTC
My HP notebook system running Fedora 28 Beta (fully updated) still has this problem.

CPU: 3 PID: 800 Comm: systemd-logind Not tainted 4.16.0-300.fc28.x86_64 #1
Hardware name: HP HP ProBook 450 G3/8101, BIOS N78 Ver. 01.24 01/29/2018

amdgpu.dpm=0 does resolve it.

Is the upstream bug the one at freedesktop #101976?

Comment 11 Pablo Estigarribia 2018-04-06 01:51:44 UTC
OK, what is the default dpm status on your system?

Yes, I reported it at freedesktop; I had forgotten about it, as there has been no response or comment from anyone there.

Comment 12 Pablo Estigarribia 2018-04-06 01:56:49 UTC
As I see it, this could be "resolved" by disabling dpm for amdgpu, and it looks critical, since normal users will never understand why their systems freeze on a default install.

Has someone changed it for the Polaris RX 550, or not?

@Tim, which graphics card do you have?

In this link http://www8.hp.com/us/en/products/laptops/product-detail.html?oid=7834555#!tab=specs I only see Intel graphics.

A new bug report is probably required to address the bug on that hardware.

Comment 13 Chris Siebenmann 2018-04-06 19:08:41 UTC
On my hardware (Gigabyte Radeon RX 550), amdgpu.dpm=0 had clear effects on at least power usage; it appears to have restricted my card to a very low power mode, more or less its minimum power usage (and well under basic power usage in X normally). It also caused fan and temperature reporting to stop being visible in lm_sensors output (normally both show up as an 'amdgpu-pci-0900' sensor). I suspect that this power restriction caused lower graphics performance, but I don't have figures on that (and I don't have the interest to re-restrict my machine to make measurements).

All of this suggests that it would be undesirable to set amdgpu.dpm=0 by default. It seems clearly better to have AMDGPU DPM enabled on setups that can use it without locking up.

Comment 14 Pablo Estigarribia 2018-04-06 19:33:48 UTC
@Chris,

What you found looks like a good argument to keep it open.

You can also enable dpm and force the performance level to high at all times (see the comments above); that works for me.

Comment 15 Tim Cuthbertson 2018-04-07 15:31:04 UTC
My discrete card is a Topaz XT Radeon R7 M340. I am not sure what the default dpm setting is - How do I determine that?

My system does not even use the discrete card by default, but it makes it crash with a kernel oops, anyway. The default onboard graphics is Intel HD 520.

The oops is Apr 05 07:09:28 tim-fedora-hp kernel: Oops: 0002 [#1] SMP PTI

My kernel is 4.16.0-300.fc28.x86_64
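
On the question of how to determine the dpm setting: a minimal sketch, assuming the sysfs paths quoted in earlier comments also apply to this system. The function name is hypothetical, and on a hybrid laptop the discrete card may not be card0, so both paths can be overridden. To my understanding the module parameter reads 1 for enabled, 0 for disabled and -1 for the driver default.

```shell
# Sketch: print the dpm-related values for one card.
# Defaults are the paths quoted earlier in this report; both can be overridden.
dpm_status() {
    param_file="${1:-/sys/module/amdgpu/parameters/dpm}"
    level_file="${2:-/sys/class/drm/card0/device/power_dpm_force_performance_level}"
    if [ -r "$param_file" ]; then
        echo "amdgpu.dpm parameter: $(cat "$param_file")"
    else
        echo "amdgpu.dpm parameter: unavailable (amdgpu module not loaded?)"
    fi
    if [ -r "$level_file" ]; then
        echo "performance level: $(cat "$level_file")"
    else
        echo "performance level: unavailable (dpm disabled or different card)"
    fi
}
```

Calling `dpm_status` with no arguments reads the default paths; a missing performance-level file is reported rather than treated as an error, since Comment 3 notes that it disappears when dpm is disabled.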

Comment 16 Marc Ranolfi 2018-04-14 00:15:31 UTC
I'm affected by this.

For months now I've been running with DPM set to "low" using a udev rule, as a temporary workaround. I set it to "high" when I want to game, which is a rare event these days.

My solution is available at https://gitlab.com/snippets/1709853 and https://gitlab.com/snippets/1709854.
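
For readers who only want the general idea, here is a rough sketch of a udev rule that pins the performance level. It is not the content of the linked snippets; the function name, the card0 match and any file name used for the rule are assumptions.

```shell
# Sketch: write a one-line udev rule that sets the amdgpu performance level
# when the card appears. This is NOT the rule from the linked snippets;
# card0 is an assumption and may differ on multi-GPU systems.
write_dpm_udev_rule() {
    level="$1"
    rule_path="$2"
    cat > "$rule_path" <<EOF
KERNEL=="card0", SUBSYSTEM=="drm", DRIVERS=="amdgpu", ATTR{device/power_dpm_force_performance_level}="$level"
EOF
}
```

Written as root to a file under /etc/udev/rules.d/ (for example a hypothetical 30-amdgpu-pm.rules), the rule should take effect after `sudo udevadm control --reload` and `sudo udevadm trigger`, or on the next boot.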

I can't reproduce what Chris Siebenmann described - i.e. in my system lm_sensors output is correct with either DPM state - but I definitely agree that DPM should be working and in use if the card supports it, no question about it.

Bug is present in kernel 4.15.10. I'm going to test with kernel 4.15.15 later.

Comment 17 Pablo Estigarribia 2018-04-14 01:00:25 UTC
After looking again at the sysfs files for card0, I noticed that dpm is enabled by default! (I was confused because the file has moved.)


Now it is here:

cat /sys/class/drm/card0/power/control 
auto

uname -a
Linux powers11 4.16.1-300.fc28.x86_64 #1 SMP Mon Apr 9 15:29:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

The outputs also have the same files:

/sys/class/drm/card0/
card0-DP-1/     card0-HDMI-A-1/ device/         subsystem/
card0-DVI-D-1/  dev             power/          uevent

So it is really fixed for Fedora 28!! (kernel 4.16.x)

I have been using this for weeks and have not noticed this bug!

I also tested glmark2 and a game just to see if anything crashes.

Comment 18 Marc Ranolfi 2018-04-14 05:59:23 UTC
Strange. I'm now running kernel 4.16.2 as we speak and the bug is still present.

Even though I had seen some interesting commits to mainline kernel regarding DPM under '/drivers/gpu/drm/amd/amdgpu/'.

P.S. Can still read lm_sensors GPU temperature with the performance level fixed to either "low" or "high".

I've published a new version at https://gitlab.com/marc2377/rforcedpm which also handles the <auto> state.

Comment 19 Pablo Estigarribia 2018-05-18 02:59:53 UTC
Today I reinstalled Fedora 28 and did an upgrade.

I got the 'blank' screen 3 times while just working in the GNOME session, same behaviour as before.

So it looks like this bug is still open.

Comment 20 Marc Ranolfi 2018-05-18 04:20:10 UTC
Of course it is; I'm now on kernel 4.16.8 and still affected.

Comment 21 Pablo Estigarribia 2018-08-17 23:43:14 UTC
Looks good for me on kernel 4.17 on Fedora with Mesa 18 now.
This is Fedora 28 with all updates; I will close this as it has been many weeks without issues.

If something appears to be wrong again, I will reopen it.

cat /sys/class/drm/card0/power/control 
auto

    GL_VENDOR:     X.Org
    GL_RENDERER:   Radeon RX 550 Series (POLARIS12 / DRM 3.25.0 / 4.17.5-200.fc28.x86_64, LLVM 6.0.0)
    GL_VERSION:    3.0 Mesa 18.0.5

echo $XDG_SESSION_TYPE
wayland

[shadow] <default>: FPS: 4230 FrameTime: 0.236 ms
[refract] <default>: FPS: 1347 FrameTime: 0.742 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 5557 FrameTime: 0.180 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 5504 FrameTime: 0.182 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 5438 FrameTime: 0.184 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 5469 FrameTime: 0.183 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 5434 FrameTime: 0.184 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 5348 FrameTime: 0.187 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 5128 FrameTime: 0.195 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 5287 FrameTime: 0.189 ms
=======================================================
                                  glmark2 Score: 4200 
=======================================================

The glmark2 score is not fully representative, as I was switching between windows the whole time, installing packages, and doing other things.

