Created attachment 1765428: dmesg.txt from 5.11.7 without it87

1. Please describe the problem:

After updating to 5.11.7-200.fc33.x86_64 on my idle office workstation that is sitting in framebuffer text mode with its screen blanked, hwmon reports that my Radeon RX 550 has gone from a typical fan RPM of 780-800 and a typical temperature of 28 C (under previous 5.10 and earlier kernels) to the fan being at 2100 RPM and a reported GPU temperature of 35 C (after about half an hour of rising temperatures). The only other reported hwmon difference is that the card's hwmon/hwmon2/pwm1 changed from 81 to 0. In particular, reported power and voltage remain unchanged.

2. What is the Version-Release number of the kernel:

5.11.7-200.fc33.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It worked as recently as 5.10.23-200.fc33.x86_64

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

I can reproduce this on demand by booting into 5.11.7 (and then stop it by booting back into 5.10.23).

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Have not tested. Sorry, I'm not running Rawhide kernels on a machine I need to work.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

Yes, Guenter Roeck's it87 module that supports my ASUS Prime X370-Pro motherboard and the OpenZFS ZFS modules (latest development versions). The issue reproduces without the it87 kernel module loaded, although I normally do have it active. (I sometimes use VMware Workstation's out-of-kernel modules, but I have reproduced this without them loaded; the dmesg attached is from such a reproduction.)

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
I had an opportunity to inspect the physical machine today and it turns out that the fans are not running at all, despite what appears in hwmon/hwmon2/fan1_input (and is reported by 'sensors' from lm_sensors). hwmon/hwmon2/fan1_enable is 0, but setting it to '1' does nothing.
It appears that the fan doesn't turn on at all, even under high load and high temperatures. I ran a GPU benchmark that raised GPU temperatures to over 80C and the fans were still not active. On 5.10.23, fan RPMs rise to a reported 1500 RPM by the time the GPU hits 64 C (and the listed GPU power consumption is only slightly under what lm_sensors lists as the cap).
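The mismatch described above (a non-zero fan1_input while pwm1 reads 0) can be checked from a script. The following is only a minimal sketch, assuming the standard hwmon sysfs attribute names; the function names and the default hwmon2 path are illustrative assumptions, and the hwmon index can differ between boots and machines.

```python
from pathlib import Path
import sys

# Attribute names follow the kernel's hwmon sysfs naming convention.
ATTRS = ("fan1_input", "fan1_enable", "pwm1", "pwm1_enable", "temp1_input")


def read_hwmon(hwmon_dir):
    """Return {attribute: int or None} for each attribute in ATTRS."""
    values = {}
    for name in ATTRS:
        try:
            values[name] = int((Path(hwmon_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None  # attribute missing, unreadable, or not a number
    return values


def looks_inconsistent(values):
    """True when an RPM is reported although the PWM duty cycle reads 0."""
    rpm = values.get("fan1_input")
    pwm = values.get("pwm1")
    return rpm is not None and pwm is not None and rpm > 0 and pwm == 0


if __name__ == "__main__":
    # Assumed default path, used only for illustration.
    hwmon_dir = sys.argv[1] if len(sys.argv) > 1 else "/sys/class/hwmon/hwmon2"
    snapshot = read_hwmon(hwmon_dir)
    for name, value in snapshot.items():
        print(f"{name}: {value}")
    if looks_inconsistent(snapshot):
        print("fan1_input reports an RPM but pwm1 is 0; check the fan physically")
```

Note that temp1_input is reported in millidegrees Celsius, so a reading of 35000 corresponds to 35 C.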
This issue is still present in the just-released 5.11.8-200.fc33.x86_64 kernel.
This issue is still present in the just-released 5.11.9-200.fc33.x86_64 kernel.
This issue is still present in the just-released 5.11.10-200.fc33.x86_64 kernel.
This issue is still present in the just-released 5.11.11-200.fc33.x86_64 kernel.
This issue is still present in the just-released 5.11.12-200.fc33.x86_64 kernel.
This issue is still present in the just-released 5.11.20-200.fc33.x86_64 kernel (and has been present in the few intermediate kernels I also checked).
This issue is still present in the just-released 5.12.6-200.fc33.x86_64 kernel.
Examining boot time messages between 5.10 (working) and 5.11 and 5.12 (not), the 5.11 and 5.12 kernels report:

amdgpu 0000:0a:00.0: amdgpu: Using BACO for runtime pm

The 5.10 kernel(s) also report values for clocks from DM PPLIB, while 5.12 and 5.11 don't:

hawkwind.cs kernel: [drm] DM_PPLIB: values for Engine clock
hawkwind.cs kernel: [drm] DM_PPLIB: 214000
hawkwind.cs kernel: [drm] DM_PPLIB: 551000
hawkwind.cs kernel: [drm] DM_PPLIB: 734000
hawkwind.cs kernel: [drm] DM_PPLIB: 980000
hawkwind.cs kernel: [drm] DM_PPLIB: 1046000
hawkwind.cs kernel: [drm] DM_PPLIB: 1098000
hawkwind.cs kernel: [drm] DM_PPLIB: 1124000
hawkwind.cs kernel: [drm] DM_PPLIB: 1206000
hawkwind.cs kernel: [drm] DM_PPLIB: Validation clocks:
hawkwind.cs kernel: [drm] DM_PPLIB: engine_max_clock: 120600
hawkwind.cs kernel: [drm] DM_PPLIB: memory_max_clock: 175000
hawkwind.cs kernel: [drm] DM_PPLIB: level : 8
hawkwind.cs kernel: [drm] DM_PPLIB: values for Memory clock
hawkwind.cs kernel: [drm] DM_PPLIB: 300000
hawkwind.cs kernel: [drm] DM_PPLIB: 625000
hawkwind.cs kernel: [drm] DM_PPLIB: 1750000
hawkwind.cs kernel: [drm] DM_PPLIB: Validation clocks:
hawkwind.cs kernel: [drm] DM_PPLIB: engine_max_clock: 120600
hawkwind.cs kernel: [drm] DM_PPLIB: memory_max_clock: 175000
hawkwind.cs kernel: [drm] DM_PPLIB: level : 8

5.12 and 5.10 report different DRM display core initialization versions:

[drm] Display Core initialized with v3.2.122!

5.10 reports v3.2.104.
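A comparison like this between two saved kernel logs can be partly automated. The sketch below is only one possible approach, not the method used for this report: it strips the host/"kernel:" prefix and any dmesg-style timestamp, then lists messages that appear in one log but not the other. The function names and the optional keyword filter are illustrative assumptions.

```python
import re

# Everything up to and including the first "kernel: " (timestamp, hostname).
PREFIX = re.compile(r"^.*?\bkernel: ")
# A leading dmesg-style "[   12.345678] " timestamp, if present.
DMESG_TS = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(line):
    """Reduce a log line to its message text so two boots can be compared."""
    line = line.strip()
    line = PREFIX.sub("", line, count=1)
    line = DMESG_TS.sub("", line, count=1)
    return line


def only_in_first(first_lines, second_lines, keyword=""):
    """Messages containing keyword that occur in first_lines but not second_lines."""
    second = {normalize(line) for line in second_lines}
    seen = set()
    result = []
    for line in first_lines:
        msg = normalize(line)
        if keyword in msg and msg not in second and msg not in seen:
            seen.add(msg)
            result.append(msg)
    return result


# Example use with two files saved via ``journalctl -k`` (file names are hypothetical):
#   old = open("dmesg-5.10.txt").read().splitlines()
#   new = open("dmesg-5.12.txt").read().splitlines()
#   print("\n".join(only_in_first(new, old, keyword="amdgpu")))
```

Messages that embed per-boot values (addresses, timings) will show up as differences even when nothing relevant changed, so a keyword filter such as "amdgpu" or "DM_PPLIB" keeps the output manageable.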
More poking in /sys and some (remote) experiments have revealed that setting pwm1_enable to 1 and writing a suitable non-zero value to pwm1 in /sys/devices/pci0000:00/0000:00:03.1/0000:0a:00.0/hwmon/hwmon2 apparently causes the fan to spin up and the card to cool down. On 5.12, pwm1_enable's normal value is 2, but pwm1 itself sticks at zero, instead of the '81' that it normally is on 5.10.

Changing pwm1_enable back to 2 after it was set to 1 (and pwm1 set to something) on 5.12.6 causes pwm1 to shift rapidly around in a range between 94 and 127 (so far) and the reported GPU temperature to hold steady around 30 C. This is somewhat cooler than 5.10 was holding the card; at the moment that was about 32 C, up from 28 C, presumably due to summer heat arriving here and the ambient office temperature going up.
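The sequence described above (pwm1_enable to 1, a non-zero pwm1, then pwm1_enable back to 2) can be sketched as a small script. This is a hedged illustration rather than a recommended fix: the function names and the default duty value of 100 are assumptions, the 0-255 range is the usual hwmon PWM convention, and writing these files on a real system requires root.

```python
from pathlib import Path

PWM_MANUAL = "1"  # pwm1_enable value used in this report for manual control
PWM_AUTO = "2"    # pwm1_enable value used in this report for automatic control


def set_manual_pwm(hwmon_dir, duty):
    """Switch to manual fan control and write a PWM duty value (0-255)."""
    if not 0 <= duty <= 255:
        raise ValueError("duty must be between 0 and 255")
    hwmon = Path(hwmon_dir)
    (hwmon / "pwm1_enable").write_text(PWM_MANUAL)
    (hwmon / "pwm1").write_text(str(duty))


def restore_auto(hwmon_dir):
    """Hand fan control back to the driver."""
    (Path(hwmon_dir) / "pwm1_enable").write_text(PWM_AUTO)


def kick_fan_control(hwmon_dir, duty=100):
    """Manual mode with a non-zero duty, then back to automatic mode."""
    set_manual_pwm(hwmon_dir, duty)
    restore_auto(hwmon_dir)


# Example (as root), using the device path from this report:
#   kick_fan_control(
#       "/sys/devices/pci0000:00/0000:00:03.1/0000:0a:00.0/hwmon/hwmon2")
```

Whether the fan keeps responding after returning to automatic mode should be confirmed by watching pwm1 and the reported temperature, since this only mirrors what was observed on one machine with 5.12.6.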
> 5. Does this problem occur with the latest Rawhide kernel? To install the
> Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
> ``sudo dnf update --enablerepo=rawhide kernel``:
>
> Have not tested. Sorry, I'm not running Rawhide kernels on a machine I need to work.

This makes things a bit more difficult, because right now, the 5.13 rc kernels are in Rawhide and knowing whether this is still broken in an RC kernel is valuable so that it can be looked at to be fixed during this kernel cycle and backported to stable kernels. And if it's fixed in 5.13, then at least there's that as an option too.
I tried to quickly test a Rawhide kernel, but discovered that OpenZFS isn't compatible with 5.13-rc at this point (its work for even 5.12 is still somewhat in progress in git tip). Since much of my data storage is in ZFS pools, I cannot even start to reboot my office machine remotely without ZFS available (at the moment and for the likely future we are not in the office).
This message is a reminder that Fedora 33 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '33'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 33 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
This continues to be the case on Fedora 34 with kernels up to 5.14.15-200.fc34.x86_64. I've updated this to be a Fedora 34 bug.
This message is a reminder that Fedora Linux 34 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '34'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 34 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07. Fedora Linux 34 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. Thank you for reporting this bug and we are sorry it could not be fixed.