Bug 1757891

Summary:

vga switcheroo won't turn off discrete graphics on 5.3.x kernel

Product:

[Fedora] Fedora

Reporter:

Michał <e.misiek>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED INSUFFICIENT_DATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

CC:

airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, masami256, mbrancaleoni, mchehab, mihai, mjg59, pasik, redhat, steved

Target Milestone:

---

Target Release:

---

Hardware:

x86_64

OS:

Linux

Last Closed:

2020-03-25 22:26:46 UTC

Type:

Bug

Attachments:

Description	Flags
dmesg	none
dmesg 5.3.1	none
dmesg 5.2.18	none
dmesg 5.3.5	none
dmesg	none
dmesg 5.4 w/o docking station	none
dmesg for 5.4.0 test kernel	none
Full 5.4.0 dmesg	none
dmesg linux-next-20191127	none
dmesg 5.4.0-2.rhbz1757891.fc32.x86_6	none

Description Michał 2019-10-02 16:12:48 UTC

Created attachment 1621924 [details]
dmesg

1. Please describe the problem:

vga_switcheroo doesn't turn off the discrete graphics card. The system boots with the card powered on and never powers it down.

2. What is the Version-Release number of the kernel:
5.3.2

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

5.3.1


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Every time. Just boot a laptop with Optimus graphics. Reproduced on Fedora 30 and with the Fedora kernel test-day ISO.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

On my daily system:
akmod-acpi_call-1.1.1-5.fc30.x86_64
akmod-tp_smapi-0.43-3.fc30.x86_64

On testiso for kernel test day nothing extra.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Michał 2019-10-02 16:13:54 UTC

Created attachment 1621925 [details]
dmesg 5.3.1

Comment 2 Michał 2019-10-02 16:14:19 UTC

Created attachment 1621926 [details]
dmesg 5.2.18

Comment 3 Michał 2019-10-08 17:16:50 UTC

Created attachment 1623531 [details]
dmesg 5.3.5

Comment 4 Michał 2019-10-15 08:14:23 UTC

Tested 5.3.6: vgaswitcheroo still keeps the Nvidia card powered on. Vanilla 5.4-rcX kernels have the same issue.

Comment 5 Michał 2019-10-17 09:21:17 UTC

Related: https://bbs.archlinux.org/viewtopic.php?id=249330

And probably THE CAUSE:
https://bugs.freedesktop.org/show_bug.cgi?id=75985

Comment 6 Michał 2019-10-17 14:03:02 UTC

If someone is looking for a temporary solution to this problem, here is a "fix". Without the DIS-Audio device, vga switching works fine.

# cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0
2:DIS-Audio: :DynPwr:0000:01:00.1

# echo 1 > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/remove

# cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynPwr:0000:01:00.0

# echo "1:Off" > /sys/kernel/debug/vgaswitcheroo/switch

# cat /sys/kernel/debug/vgaswitcheroo/switch
0:IGD:+:Pwr:0000:00:02.0
1:DIS: :DynOff:0000:01:00.0

# echo "1:DynOff" > /sys/kernel/debug/vgaswitcheroo/switch

---
Save this and run it as root on every boot...

saveTheWorld.sh:
#!/bin/bash
echo 1 > /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/remove
echo "1:Off" > /sys/kernel/debug/vgaswitcheroo/switch
echo "1:DynOff" > /sys/kernel/debug/vgaswitcheroo/switch
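
A slightly more defensive variant of the script above, as a minimal sketch. The function name power_off_dgpu is made up for illustration, the default paths are the ones from this report and will differ on other machines, and the two writes to the switch file are copied unchanged from the commands above (how the kernel interprets them is not verified here).

```shell
#!/bin/bash
# Sketch of the workaround with basic existence checks.
# Default paths are taken from this bug report and are machine-specific.

# power_off_dgpu [AUDIO_REMOVE_FILE] [SWITCH_FILE]
power_off_dgpu() {
    local audio_remove="${1:-/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.1/remove}"
    local switch="${2:-/sys/kernel/debug/vgaswitcheroo/switch}"

    # Without the switch file there is nothing to do (debugfs not mounted,
    # not root, or no vga_switcheroo on this machine).
    if [ ! -e "$switch" ]; then
        echo "vgaswitcheroo switch file not found: $switch" >&2
        return 1
    fi

    # Detach the dGPU's HDMI audio function, but only if it is still present.
    if [ -e "$audio_remove" ]; then
        echo 1 > "$audio_remove" || return 1
    fi

    # Same two writes as in saveTheWorld.sh.
    echo "1:Off" > "$switch" || return 1
    echo "1:DynOff" > "$switch" || return 1
}
```

Calling power_off_dgpu with no arguments (as root) uses the paths from this report; passing two arguments overrides them.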

Comment 7 Michał 2019-11-02 10:20:49 UTC

Most of this is fixed in kernel 5.3.8. Powering down still doesn't work with the docking station.

commit 3f5fa0ba267074fe39d4cd56f34d873064350911
Author: Lukas Wunner <lukas>
Date:   Thu Oct 17 17:04:11 2019 +0200

    ALSA: hda - Force runtime PM on Nvidia HDMI codecs
    
    commit 94989e318b2f11e217e86bee058088064fa9a2e9 upstream.

Comment 8 Hans de Goede 2019-11-02 13:35:21 UTC

*** Bug 1766198 has been marked as a duplicate of this bug. ***

Comment 9 Matteo Brancaleoni 2019-11-03 09:31:57 UTC

Still happening on 5.3.8-300.fc31.x86_64 from updates-testing.

Removing the device and powering it off (as reported) fixes it (the battery discharge rate goes from ~30 W to ~15 W).
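
One way to read the discharge rate without extra tools, as a minimal sketch. It assumes the battery is named BAT0 and exposes a power_now file in microwatts; some batteries only expose current_now and voltage_now, in which case this will not work. The function name discharge_watts is made up for illustration.

```shell
# discharge_watts [POWER_NOW_FILE]
# Prints the current battery power draw in watts with one decimal place.
# Assumption: power_now is in microwatts; BAT0 is an assumed battery name.
discharge_watts() {
    local f="${1:-/sys/class/power_supply/BAT0/power_now}"
    [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    # Force the C locale so the decimal separator is always a dot.
    LC_ALL=C awk '{ printf "%.1f\n", $1 / 1000000 }' "$f"
}
```

Run discharge_watts on battery power before and after the workaround to compare the two readings.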

Comment 10 Michał 2019-11-06 18:54:22 UTC

It only worked for a few seconds, long enough to make me think it was fixed and that the remaining problem was specific to the docking station. Apparently it isn't: the card powers down for a few seconds and then stays on.
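
A short-lived power-down like this is easy to miss with a single check. A minimal sketch for sampling the card's runtime PM state over time follows; it assumes the dGPU sits at PCI address 0000:01:00.0 as in this report, and the function names are made up for illustration.

```shell
# dgpu_runtime_status [PCI_DEVICE_DIR]
# Prints the runtime PM status of the device (for example "active" or "suspended").
dgpu_runtime_status() {
    local dev="${1:-/sys/bus/pci/devices/0000:01:00.0}"
    local f="$dev/power/runtime_status"
    [ -r "$f" ] || { echo "cannot read $f" >&2; return 1; }
    cat "$f"
}

# watch_dgpu SAMPLES INTERVAL_SECONDS [PCI_DEVICE_DIR]
# Prints one timestamped status line per sample.
watch_dgpu() {
    local samples="$1" interval="$2" dev="${3:-/sys/bus/pci/devices/0000:01:00.0}"
    local i=0 status
    while [ "$i" -lt "$samples" ]; do
        status=$(dgpu_runtime_status "$dev") || return 1
        printf '%s %s\n' "$(date +%T)" "$status"
        sleep "$interval"
        i=$((i + 1))
    done
}
```

For example, watch_dgpu 60 1 samples once per second for a minute, which should show whether the card suspends and then wakes up again.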

Sorry for the noise from closing the bug. My bad.

Comment 11 Hans de Goede 2019-11-20 15:18:08 UTC

The 5.4 kernel has some fixes for this and 1 extra fix is pending for 5.5. I've done a scratch-build of a 5.4 kernel with the extra fix here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=39132834

Here are some generic testing instructions for installing a kernel-build directly from koji:
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Please give this kernel a try and let us know if this fixes things. Note koji keeps scratch-builds only for a couple of days (about a week) before removing them to free up disk-space. So if you do not have time to test right now, at least download the rpms so that you can test later.

Comment 12 Matteo Brancaleoni 2019-11-21 21:26:14 UTC

Tried right now, unfortunately nothing changes for me.

Comment 13 Hans de Goede 2019-11-24 15:25:16 UTC

(In reply to Matteo Brancaleoni from comment #12)
> Tried right now, unfortunately nothing changes for me.

Hmm, that is unfortunate; it works on my test machine. Is there anything special about your setup? Do you perhaps have the nvidia binary driver installed?

Comment 14 Matteo Brancaleoni 2019-11-24 22:15:54 UTC

No, I don't have any binaries installed.

The only thing that differs from a "standard" setup is nouveau.modeset=0 in my kernel parameters (this is a PRIME laptop, where I don't need the Nvidia card at all). Leaving it out causes a lot of issues with resume from sleep (it has always been this way; this is not a new laptop).

5.2.18-200 still works fine, and power usage is the same whether nouveau.modeset=0 is set or not (apart from resume from sleep, but that is an old story, as I said).

Comment 15 Hans de Goede 2019-11-25 10:34:43 UTC

(In reply to Matteo Brancaleoni from comment #14)
> No, I don't have any binaries installed.
> 
> The only thing is that I have different from "standard" is nouveau.modeset=0
> on my kernel parameters (this is a prime laptop, where I don't need nvidia
> at all). Not putting it causes a lot of issues with resume from sleep (since
> ever, is not a new laptop). 
> 
> 5.2.18-200 is still working ok, and power usage is the same if I use
> nouveau.modeset is set or not (except from resume from sleep, but this is an
> old story, as said).

Ah, that might explain it, although the 2 extra patches should fix the case where there is no driver bound.

Anyways, can you try:

1) Removing nouveau.modeset=0, and see if that fixes the power-consumption issue?
I guess you may have your suspend/resume issues back then, but it is still a good
data point to have.

We have been working on some fixes for suspend/resume issues with nouveau, so
things might even just work this way. But I believe not all of these fixes have
landed yet (there is a nasty underlying hw issue with one model of Intel PCI
bridge involved).

2) If 1. gives you your suspend/resume issues back, can you try adding:
"modprobe.blacklist=snd_hda_intel" to your kernel commandline, that will likely
fix this issue, at the cost of also disabling audio, so again this is mainly a
good data point to have.
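
For anyone repeating these two tests, a minimal sketch for confirming after a reboot that a kernel parameter really is in effect. The function name cmdline_has_arg is made up for illustration; the grubby commands in the comments are one way to change the command line on Fedora and are not taken from this report.

```shell
# One way to add or remove the parameter on Fedora (run as root, then reboot):
#   grubby --update-kernel=ALL --args="modprobe.blacklist=snd_hda_intel"
#   grubby --update-kernel=ALL --remove-args="modprobe.blacklist=snd_hda_intel"

# cmdline_has_arg ARG [CMDLINE_FILE]
# Returns 0 if ARG appears as a whole word on the kernel command line.
cmdline_has_arg() {
    local arg="$1" file="${2:-/proc/cmdline}"
    # Split the command line into one word per line, then match the exact word.
    tr -s ' \t' '\n' < "$file" | grep -qxF -- "$arg"
}
```

After rebooting, cmdline_has_arg modprobe.blacklist=snd_hda_intel && echo present shows whether the blacklist was applied; the same check works for nouveau.modeset=0.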

Comment 16 Matteo Brancaleoni 2019-11-25 14:01:51 UTC

Ok, will redo the tests then.

I cannot do it right now (or today), but I will tomorrow evening (UTC+1) and report back. I assume the kernel scratch builds are still valid (I have already downloaded them).

Comment 17 Hans de Goede 2019-11-25 14:20:58 UTC

Yes you can re-use the already downloaded scratch-build for those 2 new tests, thanks.

Comment 18 Michał 2019-11-26 17:12:58 UTC

Created attachment 1639921 [details]
dmesg

This is with the docking station. Still not working, but there is some extra info in dmesg.

Comment 19 Michał 2019-11-26 17:17:37 UTC

Created attachment 1639922 [details]
dmesg 5.4 w/o docking station

Without the docking station: same result, and probably the same extra info in dmesg.

Sorry for the delay. Thanks for keeping an eye on this one!

Comment 20 Hans de Goede 2019-11-26 17:32:07 UTC

(In reply to Michał from comment #18)
> Created attachment 1639921 [details]
> dmesg
> 
> This is on docking station. Not working, but there is some extra info inside
> dmesg.

Ok, so I see an oops related to the new HDA audio handling for DP MST:

kernel: WARNING: CPU: 1 PID: 330 at sound/hda/hdac_component.c:290 snd_hdac_acomp_init+0xde/0x130 [snd_hda_core]

Which points to these lines in the kernel:

        if (WARN_ON(hdac_get_acomp(dev)))
                return -EBUSY;

It is probably best if you report this directly to the upstream developers of this part of the kernel by sending an email to "Takashi Iwai <tiwai>" with "alsa-devel" and me in the Cc.

Comment 21 Matteo Brancaleoni 2019-11-26 20:22:57 UTC

(In reply to Hans de Goede from comment #15)
> 1) Removing nouveau.modeset=0, and see if that fixes the power-consumption
> issue?
> I guess you may have your suspend/resume issues back then, but it is still a
> good
> data point to have.

Did it, same high power consumption, ~27W. 


> We have been working on some fixes wrt suspend/resume issues and nouveau, so
> things might even just work this way. But I believe not all of these fixes
> have
> landed yet (there is some nasty hw underlying issue with one model Intel PCI
> bridge there somewhere).

No fixes for that: there are some locking errors in dmesg during a normal boot into Xorg, and the display is very slow. I have not tried suspend/resume.


> 2) If 1. gives you your suspend/resume issues back, can you try adding:
> "modprobe.blacklist=snd_hda_intel" to your kernel commandline, that will
> likely
> fix this issue, at the cost of also disabling audio, so again this is mainly
> a
> good data point to have.

Did that also, confirmed no sound as expected, but no changes in high power usage.

Comment 22 Matteo Brancaleoni 2019-11-26 20:25:33 UTC

Created attachment 1639946 [details]
dmesg for 5.4.0 test kernel

dmesg from the 5.4.0 test kernel. It shows a lot of nouveau timeout errors, which are probably not related to this specific issue.

Comment 23 Matteo Brancaleoni 2019-11-26 20:36:15 UTC

Created attachment 1639949 [details]
Full 5.4.0 dmesg

Sorry, had to recreate the dmesg since the ring buffer was not big enough.

Comment 24 Hans de Goede 2019-11-26 20:49:53 UTC

Hmm, if blacklisting the hda codec does not help, then you might be seeing a different issue than other people.

What is the output of running the following command as root?

cat /sys/kernel/debug/vgaswitcheroo/switch

If that includes a third line listing an audio device, please try again with modprobe.blacklist=snd_hda_intel; the third line should then be gone.
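
The check for the audio entry can be scripted, as a minimal sketch. It assumes the colon-separated line format seen in the switch output quoted earlier in this bug (index, client ID, active marker, power state, PCI address); the function names are made up for illustration.

```shell
# has_dis_audio [SWITCH_FILE]
# Returns 0 if a DIS-Audio client is listed, 1 if not, 2 if the file is unreadable.
has_dis_audio() {
    local switch="${1:-/sys/kernel/debug/vgaswitcheroo/switch}"
    [ -r "$switch" ] || { echo "cannot read $switch (root and debugfs needed)" >&2; return 2; }
    awk -F: '$2 == "DIS-Audio" { found = 1 } END { exit !found }' "$switch"
}

# client_state CLIENT_ID [SWITCH_FILE]
# Prints the power state field (for example DynPwr or DynOff) for one client.
client_state() {
    local id="$1" switch="${2:-/sys/kernel/debug/vgaswitcheroo/switch}"
    awk -F: -v id="$id" '$2 == id { print $4 }' "$switch"
}
```

For example, has_dis_audio && echo "audio client present" followed by client_state DIS gives the two data points asked for here.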

Comment 25 Matteo Brancaleoni 2019-11-26 21:18:38 UTC

I probably misunderstood your 2nd point above and mixed up modprobe.blacklist with modeset=0, so I ran the tests again. To recap:

- standard cmd line (no additional params): high power usage (~27W), lots of nouveau timeout errors as in the posted dmesg, suspend/resume broken. vgaswitcheroo/switch contains a 3rd entry DIS-Audio: :DynOff:0000:01:00.1

- nouveau.modeset=0 *and* modprobe.blacklist=snd_hda_intel: high power usage, no nouveau errors (of course) and no vgaswitcheroo (expected). suspend/resume ok

- only modprobe.blacklist=snd_hda_intel: low power usage (hooray), no nouveau timeout errors, no 3rd entry on vgaswitcheroo/switch. Suspend/resume seems ok.

Comment 26 Hans de Goede 2019-11-27 10:14:48 UTC

(In reply to Matteo Brancaleoni from comment #25)
> Probably I've misunderstood your 2nd point above and mixed
> modprobe.blacklist with modeset=0, so did the tests again and let me recap:
> 
> - standard cmd line (no additional params): high power usage (~27W), lots of
> timeout errors as posted dmesg on nouveau, suspend/resume broken. 
> vgaswitcheroo/switch contains a 3rd entry DIS-Audio: :DynOff:0000:01:00.1
> 
> - nouveau.modeset=0 *and* modprobe.blacklist=snd_hda_intel: high power
> usage, no nouveau errors (of course) and no vgaswitcheroo (expected).
> suspend/resume ok
> 
> - only modprobe.blacklist=snd_hda_intel: low power usage (hooray), no
> nouveau timeout errors, no 3rd entry on vgaswitcheroo/switch. Suspend/resume
> seems ok.

Ok, so your dGPU suspend issues are also caused by the recent changes to support audio over HDMI/DP. The patches in the test kernel should fix this, but clearly they do not. Are you maybe also seeing an oops with an error like this with the new kernel?

kernel: WARNING: CPU: 1 PID: 330 at sound/hda/hdac_component.c:290 snd_hdac_acomp_init+0xde/0x130 [snd_hda_core]
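
A minimal sketch for checking a saved kernel log for this warning. It assumes the log was saved to a file, for example with the journalctl command from the bug template; the function name count_acomp_warnings is made up for illustration.

```shell
# count_acomp_warnings LOGFILE
# Prints how many lines in LOGFILE are kernel WARNINGs raised in snd_hdac_acomp_init.
count_acomp_warnings() {
    grep -c 'WARNING:.*snd_hdac_acomp_init' "$1"
}
```

For example, journalctl --no-hostname -k > dmesg.txt followed by count_acomp_warnings dmesg.txt prints 0 when the warning is absent.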

Comment 27 Ben Cotton 2019-11-27 14:19:29 UTC

Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 28 Hans de Goede 2019-11-27 15:00:08 UTC

Looks like the F29 EOL closing script got a bit too enthusiastic, re-opening.

Comment 29 Hans de Goede 2019-11-27 15:48:43 UTC

I've started a new scratch kernel-build which contains fixes from upstream for the oops. I'm not 100% sure if this will also fix the dGPU not suspending but please give it a try:
https://koji.fedoraproject.org/koji/taskinfo?taskID=39378056

Note this is still building atm, it may take a couple of hours to finish.

Note I forgot to set the Fedora version to 31 this time, so it looks like a F32 kernel, but that does not matter.

Comment 30 Michał 2019-11-27 17:48:56 UTC

Created attachment 1640200 [details]
dmesg linux-next-20191127

Comment 31 Michał 2019-11-27 17:51:57 UTC

On linux-next 20191127 the warning has magically disappeared, but the dGPU is still always on.
Next I'll try the kernel from koji when it's ready.

Comment 32 Michał 2019-11-27 18:52:55 UTC

For the record, I don't know if this is helpful at all, but since I'm on linux-next for a while now I can add this.
This is my output from alsa-info.sh: http://alsa-project.org/db/?f=91bb789a01f9eed92d0534fe8951619312b355da

Comment 33 Michał 2019-11-27 19:02:58 UTC

Created attachment 1640221 [details]
dmesg 5.4.0-2.rhbz1757891.fc32.x86_6

The warning is gone, but nothing else changed.

Comment 34 Michał 2019-11-27 20:03:56 UTC

It turns out that disabling TLP helped. Using Hans de Goede's kernel from koji with TLP disabled solves this issue.

Comment 35 Hans de Goede 2019-11-27 20:09:10 UTC

Ok, so Michal's case has been solved on the alsa-devel mailing list. Michal was using TLP, which was turning off the HDA power-save options. Since new kernels support audio over HDMI/DP for Nvidia cards, HDA power saving must now be on to allow the dGPU to suspend.

For other people still having issues, please run these 2 commands:

[hans@shalem ~]$ cat /sys/module/snd_hda_intel/parameters/power_save
1
[hans@shalem ~]$ cat /sys/module/snd_hda_intel/parameters/power_save_controller
Y

If the output is different than 1 / Y, that is probably why your dGPU is not suspending even with the fixed kernels. In this case you are probably using TLP or have a file in /etc/modprobe.d messing with the snd_hda_intel settings.

Note that running TLP is no longer necessary with recent Fedora versions, all worthwhile power savings are enabled by default, including the HDA power saving settings.
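
The two checks above can be combined into one script, as a minimal sketch. The function names are made up for illustration. It treats any positive power_save value as enabled, on the assumption that the parameter is a timeout in seconds with 0 meaning disabled, and it searches /etc/modprobe.d by default for files that mention snd_hda_intel.

```shell
# check_hda_power_save [PARAM_DIR]
# Returns 0 if HDA power saving looks enabled, 1 if not, 2 if the parameters are missing.
check_hda_power_save() {
    local dir="${1:-/sys/module/snd_hda_intel/parameters}"
    local ps psc
    ps=$(cat "$dir/power_save" 2>/dev/null) || { echo "no snd_hda_intel parameters in $dir" >&2; return 2; }
    psc=$(cat "$dir/power_save_controller" 2>/dev/null) || return 2
    echo "power_save=$ps power_save_controller=$psc"
    # Assumption: power_save is a timeout in seconds, so any value > 0 means enabled.
    if [ "$ps" -gt 0 ] 2>/dev/null && [ "$psc" = "Y" ]; then
        echo "OK: HDA power saving is enabled"
        return 0
    fi
    echo "WARN: HDA power saving is off; the dGPU may not be able to suspend"
    return 1
}

# find_hda_overrides [DIR...]
# Lists config lines that mention snd_hda_intel (default: /etc/modprobe.d).
find_hda_overrides() {
    [ "$#" -gt 0 ] || set -- /etc/modprobe.d
    grep -rsH 'snd_hda_intel' "$@"
}
```

If check_hda_power_save warns, find_hda_overrides may point at the file responsible; TLP keeps its own settings elsewhere, so it has to be checked separately.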

Comment 36 Hans de Goede 2019-11-27 20:13:19 UTC

Matteo, can you check your snd_hda_intel parameters please? (see comment 35)

Comment 37 Michał 2019-11-27 20:30:35 UTC

Don't want to send more noise... But there is an important thing I forgot to add in my last comment.

THANK YOU!

Comment 38 Matteo Brancaleoni 2019-11-27 22:27:17 UTC

(In reply to Hans de Goede from comment #36)
> Matteo, can you check your snd_hda_intel parameters please? (see comment 35)

Sure,

with both kernel 5.4.0-2.rhbz1757891.fc32.x86_64 and 5.2.18-200 I have:

[root@yoda ~]# cat /sys/module/snd_hda_intel/parameters/power_save
1
[root@yoda ~]# cat /sys/module/snd_hda_intel/parameters/power_save_controller
Y

Both tested without any kernel cmdline and with nouveau.modeset=0.

I also have TLP enabled and disabled it for these tests; nothing changes.

Unfortunately, the same nouveau timeout errors as in the posted dmesg occur with the latest test kernel when modeset=0 is not set, whether TLP is enabled or not.

Comment 39 Hans de Goede 2019-11-28 09:10:12 UTC

(In reply to Matteo Brancaleoni from comment #38)
> (In reply to Hans de Goede from comment #36)
> > Matteo, can you check your snd_hda_intel parameters please? (see comment 35)
> 
> Sure,
> 
> with both kernel 5.4.0-2.rhbz1757891.fc32.x86_64 and 5.2.18-200 I have:
> 
> [root@yoda ~]# cat /sys/module/snd_hda_intel/parameters/power_save
> 1
> [root@yoda ~]# cat /sys/module/snd_hda_intel/parameters/power_save_controller
> Y
> 
> Both tested without any kernel cmdline and with nouveau.modeset=0.
> 
> I have also tlp enabled and disabled it for these tests, nothing changes.
> 
> Unfortunately same nouveau timeout errors as reported dmesg occurs when not
> setting modeset=0 with latest test kernel, no matter if tlp is enabled or
> not.

So if I understand correctly then blacklisting snd_hda_intel, without modeset=0, does fix the high power-consumption, right? And this combo also fixes the nouveau time-out errors, correct?

If I understand that correctly, then I believe it is best if you do the same thing Michal did and contact upstream about this, see comment 20.

Comment 40 Matteo Brancaleoni 2019-11-28 09:20:39 UTC

(In reply to Hans de Goede from comment #39)
> So if I understand correctly then blacklisting snd_hda_intel, without
> modeset=0, does fix the high power-consumption, right? And this combo also
> fixes the nouveau time-out errors, correct?

yes, that's correct.


> If I understand that correctly, then I believe it is best if you do the same
> thing Michal did and contact upstream about this, see comment 20.

Ok, I'll do it as soon as possible, thanks!

Comment 41 Justin M. Forbes 2020-03-03 16:18:07 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 31 kernel bugs.

Fedora 31 has now been rebased to 5.5.7-200.fc31.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 32, and are still experiencing this issue, please change the version to Fedora 32.

If you experience different issues, please open a new bug report for those.

Comment 42 Justin M. Forbes 2020-03-25 22:26:46 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 43 Red Hat Bugzilla 2023-09-14 05:44:02 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days