1692990 – Wrong 'primary' GPU selected at boot-time (amdgpu, eGPU, thunderbolt)

Keywords:
Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	xorg-x11-drv-ati
Sub Component:
Version:	30
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Assignee:	X/OpenGL Maintenance List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-03-26 20:55 UTC by Devon Maloney
Modified:	2023-09-18 00:15 UTC (History)
CC List:	28 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2020-05-26 16:59:36 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
archive containing dmesg outputs for 4.20.16 and 5.0.3 (52.76 KB, application/gzip) 2019-03-26 20:55 UTC, Devon Maloney

Description Devon Maloney 2019-03-26 20:55:18 UTC

Created attachment 1548242
archive containing dmesg outputs for 4.20.16 and 5.0.3

1. Please describe the problem:

I use an external GPU (eGPU, Vega 56) connected to my laptop (X1 Yoga 3rd Generation) over Thunderbolt on Fedora 29 (KDE Spin). On kernels 4.20.16.fc29 and lower, the eGPU would be selected as the primary if the Thunderbolt connection was present at boot time. This meant that the boot log output (if ESC were pressed to hide Plymouth) would be displayed on the display connected to the eGPU, SDDM would display on the eGPU display, and the eGPU would be used for rendering / output by default (i.e.: the desktop environment or OpenGL applications).

With the release of kernel 5.0.3-200.fc29 for Fedora 29, the boot log and SDDM are both displayed on the internal laptop display even if the Thunderbolt connection is present at boot (though the Plymouth splash is displayed on both displays). Additionally, the Intel integrated GPU is now used for rendering by default for both the desktop environment and OpenGL applications. Setting the environment variable DRI_PRIME=1 uses the correct GPU, but under 4.20.16-200.fc29 this was not required. The new default causes tearing and poor performance for even basic functionality (at least under KDE), presumably because copying the framebuffer from the iGPU to the eGPU saturates the limited Thunderbolt bandwidth.


2. What is the Version-Release number of the kernel:

5.0.3-200


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes, works on 4.20.16-200.fc29. Issue started occurring with the 5.0.3-200.fc29 update.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Yes. Boot the latest kernel.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes (5.1.0-0.rc2.git0.1.fc31).


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

Wireguard via dkms package.


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Attached an output of 4.20.16-200.fc29 and 5.0.3-200.fc29. A specific line of interest which is present in 4.20.16-200.fc29 and not 5.0.3-200.fc29 is:

Mar 26 14:59:39 kernel: fbcon: amdgpudrmfb (fb0) is primary device
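
To compare boots, the fbcon line above can be pulled out of a saved kernel log. The following is a minimal sketch, assuming a POSIX shell and a log in the journalctl format shown here; the function name fbcon_primary is hypothetical and not part of any tool.

```shell
# Print the name of the framebuffer driver that fbcon reports as primary.
# Reads a kernel log on stdin; prints nothing if no such line is present
# (as reported for the 5.0.3 boot).
fbcon_primary() {
  sed -n 's/.*fbcon: \(.*\) (fb[0-9]*) is primary device.*/\1/p'
}

# Example use on a live system:
#   journalctl --no-hostname -k | fbcon_primary
```

Running it against both attached logs should show whether fbcon picked a primary device at all on a given kernel.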

Comment 1 Justin M. Forbes 2019-08-20 17:46:19 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 2 Devon Maloney 2019-08-20 18:03:12 UTC

I have since switched to standard Fedora 30 (no spin) and can confirm this issue still occurs on 5.2.9-200.fc30.x86_64.

Comment 3 Devon Maloney 2019-08-22 18:02:13 UTC

I am currently back to running kernel 4.20.16-200.fc29.x86_64 on my Fedora 30 machine as this is the only way I can get a usable framerate out of GNOME shell (let alone trying to run a game with this setup).

On 4.20 (which correctly selects the AMD GPU as the 'primary' device) GNOME runs at a butter smooth 144fps never dropping a frame. I can play games at framerates comparable to Windows running on the same hardware, and everything is great.

On 5.2.9, the Intel iGPU is the 'primary' device. This causes the Intel iGPU to be used for rendering by default (and having its framebuffer copied to the AMD GPU for display). When DRI_PRIME=1 is used to cause the AMD GPU to render something complex, this presumably causes the rendering to be copied back to the Intel iGPU for some kind of compositing, then back to the AMD GPU for display. Now this may have been tested on a traditional Hybrid Graphics laptop and deemed equally performant, but on a Thunderbolt eGPU there is no bandwidth headroom for this kind of multiple copy back and forth. This results in GNOME running at anywhere from 20 - 60 fps (including the mouse cursor since this is Wayland!) and games of any kind being essentially unplayable.

I may be incorrect in my presumptions of exactly how the PRIME pipeline works, but regardless I would be extremely happy if anyone could find even a workaround (some kind of kernel flag?) to cause the AMD GPU to be selected as the primary device on a newer kernel.
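
A quick way to see which GPU renders by default, and which one DRI_PRIME=1 selects, is to compare the renderer string reported by glxinfo. This is only a sketch, assuming glxinfo is installed and an OpenGL-capable session is running; the helper name renderer_from_glxinfo is hypothetical.

```shell
# Extract the renderer name from `glxinfo -B` output read on stdin.
renderer_from_glxinfo() {
  sed -n 's/^OpenGL renderer string: *//p'
}

# Example use on a live system:
#   glxinfo -B | renderer_from_glxinfo               # default render GPU
#   DRI_PRIME=1 glxinfo -B | renderer_from_glxinfo   # PRIME offload GPU
```

If the first command names the Intel iGPU and the second names the AMD GPU, the session is in the offload configuration described in this comment.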

Comment 4 Dave Airlie 2019-08-23 04:24:41 UTC

This was a deliberate change upstream: the external GPU should not be picked as your primary GPU, because the primary GPU is the one the BIOS comes up on.

If you want to use the external GPU as your primary, you can use an xorg.conf to pick it as the main display.

Something like this might work.

Section "Device"
     Identifier "egpu"
     Driver "modesetting"
     BusID "PCI:13:0:0"
EndSection

(You may have to adjust the BusID.)
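
Finding the right BusID usually means translating the hexadecimal address printed by lspci into the decimal form Xorg expects. A minimal sketch, assuming a POSIX shell; the function name pci_to_busid is hypothetical, and the address 0d:00.0 is only an illustration, not necessarily the reporter's device.

```shell
# Convert an lspci-style address ("0d:00.0" or "0000:0d:00.0") into an
# Xorg BusID of the form "PCI:bus:device:function" with decimal numbers.
pci_to_busid() {
  addr=${1#????:}      # drop the optional 4-digit PCI domain prefix
  bus=${addr%%:*}      # hex bus number
  rest=${addr#*:}
  dev=${rest%%.*}      # hex device number
  fn=${rest#*.}        # function number
  printf 'PCI:%d:%d:%d\n' "0x$bus" "0x$dev" "0x$fn"
}

# Example use:
#   pci_to_busid 0d:00.0
```

The address to pass in can be read from the first column of lspci output for the eGPU.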

Comment 5 Dave Airlie 2019-08-23 04:25:45 UTC

Oh, you are using Wayland. I'm not sure there is a "use this device as the default device" option for Wayland; Jonas may know.

Comment 6 Devon Maloney 2019-08-23 04:58:09 UTC

When I refer to "primary GPU", I do not mean the primary display. I mean the primary GPU used as an OpenGL / Vulkan provider even when one has not set DRI_PRIME=1. I recall attempting to make this work on Xorg back when I opened this issue as I was using Kwin/Xorg at the time, but I am not entirely convinced setting what Xorg (or Mutter in the Wayland case) sees as the primary display driver provider will fix the underlying issue. More to the point, I believe I directly tried your method (though using device classes as setting a device directly for a removable GPU will cause significant problems when the device is removed haha), but it did not fix the framerate issues.

If I am understanding the issue correctly, this is because even if I configure the AMD eGPU's device as the primary display (to borrow Xorg terminology, though it applies to Mutter/Wayland as well), the kernel sees the amdgpu driver/device as a PRIME render offload provider, as it is not the primary GPU used for compositing outputs. The Intel iGPU is therefore used for rendering and compositing by default, and the AMD eGPU is used only for performance-intensive tasks. That means the AMD eGPU must do some rendering, copy the result over a highly limited Thunderbolt connection to the Intel iGPU for compositing, and then receive the complete framebuffer back over the same connection for display. This would all happen at a level below that of the Xorg/Wayland compositor, right?

While this is definitely the correct approach for upstream to take regarding 'Optimus' style hybrid-graphics laptops, this produces the entirely wrong effect for an eGPU. The crucial difference is that those hybrid-graphics laptops typically have a discrete GPU not even connected to any outputs at all so we must perform rendering in this roundabout fashion, but for an eGPU the intended display is already connected directly to the GPU (and even worse, the bandwidth between the iGPU and eGPU is significantly more constrained and effectively cannot be rendered that way).

One idea for an upstream fix to this would be to emulate the old behavior if the second GPU has display outputs of its own, which would be a determinant of when the slower/more bandwidth-intensive PRIME copy-back-and-forth method is required and when it is not (or maybe just have some way of directly detecting a bandwidth-constrained environment such as Thunderbolt?).

Comment 7 Devon Maloney 2019-09-02 19:53:27 UTC

I am planning on switching to one of the newer 5700 XTs, which will require the latest 5.3.0 kernel to work (meaning my workaround of using the 4.20.0 kernel for performance-requiring applications will no longer work). Is there some kind of other workaround to trigger the old behavior, such as a kernel flag or sysctl option perhaps?

Comment 8 Devon Maloney 2019-09-06 04:55:59 UTC

I have tracked down the exact change which caused this behavior to commit 3d42f1d (first committed in 5.0-rc1).

Additionally, I have tried every workaround I could think of, short of patching my kernel back to the previous behavior, with no success. Under Mutter (GNOME's compositor), the primary GPU is selected based on the boot_vga GPU (selected by vgaarb), with no configuration to change this behavior (and I am not certain Mutter is the correct place for this fix, as it is likely far from the only thing whose behavior depends on the boot_vga parameter). I tried manually changing the boot_vga flag with udev rules post-boot, but it seems to be read-only in some way, as none of my rules were able to adjust it. I also tried a modprobe install hook to skip loading the i915 module when an AMD vendor ID was detected, which did not work either: the i915 module seems to load regardless, even when I blacklist it completely.

Does anyone have ideas for a workaround I could use?

A relatively straightforward permanent solution to this issue would be to introduce a kernel config to allow affecting boot_vga selection as this is likely not the only situation in which a user may want that to not be the BIOS/UEFI display.
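
For anyone reproducing this, the boot_vga flag discussed in this comment can be inspected per PCI device through sysfs. A minimal sketch, assuming a POSIX shell; the function name list_boot_vga and its optional directory argument are hypothetical and exist only so the loop can be pointed at a different tree.

```shell
# Print each PCI device that exposes a boot_vga attribute, with its value.
# The optional first argument overrides the sysfs directory to scan.
list_boot_vga() {
  root="${1:-/sys/bus/pci/devices}"
  for f in "$root"/*/boot_vga; do
    [ -r "$f" ] || continue          # skip an unmatched glob or unreadable file
    read -r flag < "$f"
    dev=${f%/boot_vga}               # strip the attribute name
    printf '%s boot_vga=%s\n' "${dev##*/}" "$flag"
  done
}

# Example use:
#   list_boot_vga
```

A device reporting boot_vga=1 is the one treated as the boot display device.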

Comment 9 Michel Dänzer 2019-09-06 09:06:05 UTC

(In reply to Devon Maloney from comment #8)
> A relatively straightforward permanent solution to this issue would be to
> introduce a kernel config to allow affecting boot_vga selection as this is
> likely not the only situation in which a user may want that to not be the
> BIOS/UEFI display.

That would seem wrong, as boot_vga is supposed to correspond to the BIOS/UEFI display.

Sounds like what you want is something other than the boot_vga device to be used as the primary in your desktop session. This is possible with Xorg as described by Dave, but maybe not yet with GNOME Wayland.

Anyway, this is definitely not a xorg-x11-drv-ati issue. Where should it be reassigned to?

Comment 10 Jonas Ådahl 2019-09-06 09:09:34 UTC

The heuristic for selecting the primary GPU in a Wayland session is implemented in https://gitlab.gnome.org/GNOME/mutter/blob/998114791fbb91541e3ffb33892640f22b1ca3c9/src/backends/native/meta-renderer-native.c#L4011.

Comment 11 Devon Maloney 2019-09-06 14:32:49 UTC

(In reply to Michel Dänzer from comment #9)
> (In reply to Devon Maloney from comment #8)
> > A relatively straightforward permanent solution to this issue would be to
> > introduce a kernel config to allow affecting boot_vga selection as this is
> > likely not the only situation in which a user may want that to not be the
> > BIOS/UEFI display.
> 
> That would seem wrong, as boot_vga is supposed to correspond to the
> BIOS/UEFI display.
> 
> Sounds like what you want is something other than the boot_vga device to be
> used as the primary in your desktop session. This is possible with Xorg as
> described by Dave, but maybe not yet with GNOME Wayland.
> 
> Anyway, this is definitely not a xorg-x11-drv-ati issue. Where should it be
> reassigned to?

While I suppose that might be a better solution, options would need to be created in several places to allow for that as boot_vga is used for more than just the desktop session. For instance, there is no way to only sometimes map fbcon to the secondary GPU (setting fbcon=map:1 disables fbcon if there is no second framebuffer).

(In reply to Jonas Ådahl from comment #10)
> The heuristics for selecting the primary GPU on a Wayland session is
> implemented in
> https://gitlab.gnome.org/GNOME/mutter/blob/
> 998114791fbb91541e3ffb33892640f22b1ca3c9/src/backends/native/meta-renderer-native.c#L4011

Would affecting the selection here also affect the default renderer used by e.g.: Vulkan, OpenGL, VA-API, and XWayland applications? Or would this only affect the desktop environment itself?

Comment 12 Devon Maloney 2019-09-06 19:00:08 UTC

I have found a (not great) workaround that at least allows me to use my machine with the eGPU connected without running GNOME at 30fps. The only thing that I could come up with that works is to bind the Intel iGPU to the vfio-pci driver early in the boot process, which makes GDM (and most things?) consider the eGPU to be the primary (giving a nice 165fps in GNOME, among other things).

To make this automatically happen only when the eGPU is connected, I borrowed some ideas from this post: https://qubitrenegade.com/virtualization/kvm/vfio/2019/07/17/VFIO-Fedora-Notes.html and just bind the boot_vga device to vfio_pci if lspci finds a device with the AMD vendor ID (1002:). This is a bit of a gross solution (as fbcon still binds to the iGPU, which no longer displays anything), but with the current behavior it is possibly the only way to make an AMD eGPU even usable for driving a basic desktop.



/etc/default/grub

GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt rd.driver.pre=vfio-pci"



/etc/modprobe.d/vfio.conf

install vfio-pci /sbin/vfio-pci-override.sh



/etc/dracut.conf.d/vfio.conf

add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"
install_items+="/sbin/vfio-pci-override.sh /sbin/lspci /usr/bin/dirname"



/sbin/vfio-pci-override.sh

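#!/bin/sh
# Bind the boot_vga GPU (and its sibling PCI functions) to vfio-pci when an
# AMD device (vendor ID 1002) is present, so the eGPU is treated as primary.

set -u

lspci=$(/sbin/lspci -d 1002:)

if [ -n "${lspci}" ]; then
  echo "Found device(s) matching VID 1002: ${lspci}"
  for boot_vga in /sys/bus/pci/devices/*/boot_vga; do
    echo "Found vga device: ${boot_vga}"
    # POSIX read instead of the bash-only $(<file), since the shebang is /bin/sh.
    is_boot_vga=0
    read -r is_boot_vga < "${boot_vga}"
    if [ "${is_boot_vga}" -eq 1 ]; then
      echo "Found Boot VGA Device - true: ${boot_vga}"

      dir=$(/usr/bin/dirname -- "${boot_vga}")
      # Drop the trailing function digit so the glob matches every function
      # of the device (POSIX ${dir%?} replaces the bash-only ${dir::-1}).
      for dev in "${dir%?}"*; do
        echo "Registering Devices: ${dev}"
        echo 'vfio-pci' > "${dev}/driver_override"
      done
    else
      echo "Found Boot VGA Device - false: ${boot_vga}"
    fi
  done
else
  echo "Found no devices matching VID 1002"
fi

/sbin/modprobe -i vfio-pci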

Comment 13 Ben Cotton 2020-04-30 21:10:59 UTC

This message is a reminder that Fedora 30 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '30'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 30 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 14 Ben Cotton 2020-05-26 16:59:36 UTC

Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 15 Michel Dänzer 2020-05-27 15:27:11 UTC

FWIW, I've become aware of a potentially better workaround: Option "PrimaryGPU" in Section "OutputClass". If the eGPU matches that Section "OutputClass", Xorg will treat it as primary, otherwise it'll determine the primary as before.
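
A configuration along these lines might look as follows. This is only a sketch for Xorg sessions: the Identifier value is a made-up label, and matching on the amdgpu kernel driver is an assumption that fits the Vega 56 eGPU from this report but would need adjusting for other hardware. It would typically go in a file under /etc/X11/xorg.conf.d/.

```
Section "OutputClass"
     Identifier "egpu-primary"
     MatchDriver "amdgpu"
     Driver "amdgpu"
     Option "PrimaryGPU" "yes"
EndSection
```

Because the section only applies when a device bound to the matching driver is present, it should have no effect when the eGPU is disconnected.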

Comment 16 Red Hat Bugzilla 2023-09-18 00:15:46 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days
