Bug 1669564

Summary: nct6775: Kernel 4.20.3-200.fc29.x86_64 doesn't detect all fan sensors on an Asus PRIME Z370-A motherboard when prior kernels did
Product: [Fedora] Fedora
Reporter: Chris Siebenmann <cks-rhbugzilla>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 28
CC: airlied, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved, y9t7sypezp
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-05-28 22:25:57 UTC
Type: Bug
Attachments:
  Description                                        Flags
  4.20.3 boot dmesg                                  none
  4.19.15 lsmod                                      none
  4.19.15 /sys/devices hwmon and fan                 none
  4.20.3 lsmod                                       none
  4.20.3 /sys/devices hwmon and fan                  none
  4.19.15 grep-sys-devices-nct6775-fan-4.19.15.txt   none
  4.20.3 grep-sys-devices-nct6775-fan-4.20.3.txt     none

Description Chris Siebenmann 2019-01-25 16:43:12 UTC
Created attachment 1523544 [details]
4.20.3 boot dmesg

1. Please describe the problem:

I have an Asus PRIME Z370-A based desktop. In kernels before 4.20.3-200,
the nct6775 hardware sensors module correctly detected and reported all
six motherboard fan sensors (including in 4.19.15-300). In 4.20.3, only
the first five fan sensors are detected. In both kernels, the nct6775
reports detecting the same chip:

  nct6775: Enabling hardware monitor logical device mappings.
  nct6775: Found NCT6793D or compatible chip at 0x2e:0x290

(This is the correct chip for this motherboard.)

In 4.19.15, /sys/devices/platform/nct6775.656/hwmon/hwmon3 contains a
variety of files for fan6:

  fan6_alarm fan6_input fan6_min fan6_pulses fan6_target fan6_tolerance

In 4.20.3, this directory only has fan6_target and fan6_tolerance. This
is the only difference between the files in the directory in the two
versions.
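
A minimal Python sketch of this comparison, assuming the attribute file
names seen under each kernel have been saved as plain lists. The two
lists are typed in from this report, and the helper name fan_attrs is
hypothetical, not part of any existing tool:

```python
import re
from collections import defaultdict

# Matches hwmon fan attribute file names such as "fan6_input".
FAN_ATTR = re.compile(r"^fan(\d+)_(\w+)$")


def fan_attrs(names):
    """Group attribute suffixes by fan number: {6: {"input", "min", ...}}."""
    grouped = defaultdict(set)
    for name in names:
        match = FAN_ATTR.match(name)
        if match:
            grouped[int(match.group(1))].add(match.group(2))
    return dict(grouped)


# fan6 files as listed in this report for each kernel.
files_4_19_15 = ["fan6_alarm", "fan6_input", "fan6_min",
                 "fan6_pulses", "fan6_target", "fan6_tolerance"]
files_4_20_3 = ["fan6_target", "fan6_tolerance"]

# Attributes present in 4.19.15 but absent in 4.20.3.
missing = fan_attrs(files_4_19_15)[6] - fan_attrs(files_4_20_3)[6]
print(sorted(missing))  # ['alarm', 'input', 'min', 'pulses']
```

A fan whose attribute set lacks "input" is one that 'sensors' cannot
report an RPM for, which matches the symptom described here.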

To get results from the nct6775 driver on this motherboard requires
booting the kernel with 'acpi_enforce_resources=lax'. In both 4.19.15
and 4.20.3, booting the kernel and bringing up the driver (with or
without the option) produces some ACPI messages:

  ACPI Warning: SystemIO range 0x0000000000000295-0x0000000000000296 conflicts with OpRegion 0x0000000000000290-0x0000000000000299 (\_GPE.HWM) (20181003/utaddress-213)
  ACPI: This conflict may cause random problems and system instability
  ACPI: If an ACPI driver is available for this device, you should use it instead of the native driver

This issue is particularly visible to me because fan6 on this
motherboard is the second chassis fan.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

Yes in both 4.20.3 and 4.19.15; WireGuard from COPR jdoss/wireguard,
and ZFS on Linux (0.8.0-rc3). The two kernels have the same versions
of the modules installed (through DKMS) and I have used various versions
of both for a long time on previous kernel versions without problems.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

I've attached a dmesg dump from 4.20.3 from boot through me logging in,
verifying that sensors did not see the sixth fan, and capturing
dmesg output. I can provide a 4.19.15 dmesg dump from the same situation.
As far as I can tell they have no substantial differences.

Comment 1 Steve 2019-01-25 18:21:33 UTC
Thanks for your report. The attached log shows:

$ grep fans dmesg-4.20.3-200.fc29.x86_64 
[    3.755678] asus_wmi: Number of fans: 1

Could you attach the output from:

$ lsmod > lsmod-1.txt
$ find /sys/devices/ -name '*hwmon*' -o -name '*fan*' > sys-devices-hwmon-fan-1.txt

Also, have you had any problems with fan noise? Compare:

Bug 1665750 - ASUS ROG laptop: asus_wmi: Number of fans: 0 (Bug 1665750, Comment 10)
Bug 1663927 - I installed Fedora 29 to Asus X542U. Now fan is working always and I can not control it.

Comment 2 Chris Siebenmann 2019-01-25 18:41:05 UTC
I haven't had any problems with fan noise in either 4.19.15 or 4.20.3
(when I was running it), and 'sensors' reports that fans it can see have
plausible RPMs. All of my fans that are connected to the motherboard
sensor points at all are visible in 4.19.15 through the nct6775 driver
(and I have no other fans in this desktop, apart from one in the power
supply, which as far as I know cannot be read at all). Under both 4.19.15
and 4.20.3, the asus_wmi data reports a fan RPM of '0' (I have logged
sensors output from both). On at least 4.19.15, the actual /sys/devices
device directory for asus_wmi is 'platform/eeepc-wmi'.

I will get the lsmod and find output for 4.20.3 later, when I can
reboot the machine into that kernel (it's currently running 4.19.15).
I can get 4.19.15 /sys/devices output now if that would be helpful.

Comment 3 Steve 2019-01-25 19:17:54 UTC
(In reply to Chris Siebenmann from comment #2)
> I haven't had any problems with fan noise ...

OK.

> All of my fans that are connected to the motherboard
> sensor points at all are visible in 4.19.15 through the nct6775 driver
> (and I have no other fans in this desktop, apart from one in the power
> supply, which as far as I know cannot be read at all). Under both 4.19.15
> and 4.20.3, the asus_wmi data reports a fan RPM of '0' (I have logged
> sensors output from both). On at least 4.19.15, the actual /sys/devices
> device directory for asus_wmi is 'platform/eeepc-wmi'.

Thanks for pointing that out:

$ modinfo eeepc_wmi | egrep 'alias|description|depends'
alias:          wmi:ABBC0F72-8EA1-11D1-00A0-C90629100000
description:    Eee PC WMI Hotkey Driver
depends:        asus-wmi

> I will get the lsmod and find output for 4.20.3 later, when I can
> reboot the machine into that kernel (it's currently running 4.19.15).
> I can get 4.19.15 /sys/devices output now if that would be helpful.

Since you are reporting a regression, attaching both is a good idea:

sys-devices-hwmon-fan-4.19.15.txt
sys-devices-hwmon-fan-4.20.3.txt

Comment 4 Steve 2019-01-25 20:14:06 UTC
(In reply to Steve from comment #3)
...
> Since you are reporting a regression, attaching both is a good idea:
> 
> sys-devices-hwmon-fan-4.19.15.txt
> sys-devices-hwmon-fan-4.20.3.txt

There was an unrelated bug recently in which the loaded modules were different for different kernels, so could you also attach the lsmod output for both kernels?

$ lsmod > lsmod-4.19.15.txt
$ find /sys/devices/ -name '*hwmon*' -o -name '*fan*' > sys-devices-hwmon-fan-4.19.15.txt

$ lsmod > lsmod-4.20.3.txt
$ find /sys/devices/ -name '*hwmon*' -o -name '*fan*' > sys-devices-hwmon-fan-4.20.3.txt

Comment 5 Chris Siebenmann 2019-01-25 22:04:15 UTC
Created attachment 1523601 [details]
4.19.15 lsmod

Comment 6 Chris Siebenmann 2019-01-25 22:04:48 UTC
Created attachment 1523602 [details]
4.19.15 /sys/devices hwmon and fan

Comment 7 Chris Siebenmann 2019-01-25 22:05:44 UTC
Created attachment 1523603 [details]
4.20.3 lsmod

Comment 8 Chris Siebenmann 2019-01-25 22:06:21 UTC
Created attachment 1523604 [details]
4.20.3 /sys/devices hwmon and fan

Comment 9 Steve 2019-01-26 00:48:58 UTC
Thanks for the attachments. This confirms what you said:

$ diff -u --label '4.19.15' --label '4.20.3' sys-devices-hwmon-fan-4.19.15-sort.txt sys-devices-hwmon-fan-4.20.3-sort.txt
--- 4.19.15
+++ 4.20.3
@@ -41,10 +41,6 @@
 /sys/devices/platform/nct6775.656/hwmon/hwmon3/fan5_pulses
 /sys/devices/platform/nct6775.656/hwmon/hwmon3/fan5_target
 /sys/devices/platform/nct6775.656/hwmon/hwmon3/fan5_tolerance
-/sys/devices/platform/nct6775.656/hwmon/hwmon3/fan6_alarm
-/sys/devices/platform/nct6775.656/hwmon/hwmon3/fan6_input
-/sys/devices/platform/nct6775.656/hwmon/hwmon3/fan6_min
-/sys/devices/platform/nct6775.656/hwmon/hwmon3/fan6_pulses
 /sys/devices/platform/nct6775.656/hwmon/hwmon3/fan6_target
 /sys/devices/platform/nct6775.656/hwmon/hwmon3/fan6_tolerance
 /sys/devices/virtual/thermal/thermal_zone0/hwmon0

(NB: I sorted the files first.)

The lsmod output doesn't show any changes in the loaded modules other than sizes.
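
The sort-then-diff step above can also be sketched in Python with the
standard difflib module. This is only an illustration: the function name
sorted_diff is made up, and the sample paths are limited to the fan6
entries shown in the diff rather than the full attachments.

```python
import difflib


def sorted_diff(old_paths, new_paths, old_label, new_label):
    """Return unified-diff lines between two path listings, sorted first."""
    return list(difflib.unified_diff(
        sorted(old_paths), sorted(new_paths),
        fromfile=old_label, tofile=new_label, lineterm=""))


base = "/sys/devices/platform/nct6775.656/hwmon/hwmon3/"
# fan6 entries only, taken from the diff in this comment.
old = [base + n for n in ("fan6_alarm", "fan6_input", "fan6_min",
                          "fan6_pulses", "fan6_target", "fan6_tolerance")]
new = [base + n for n in ("fan6_target", "fan6_tolerance")]

for line in sorted_diff(old, new, "4.19.15", "4.20.3"):
    print(line)
```

Sorting both listings first matters because 'find' does not guarantee
any particular output order, so an unsorted diff can show spurious
changes.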

Comment 10 Steve 2019-01-26 02:28:25 UTC
This could be related:

hwmon: (nct6775) Only display fan speed tolerance conditionally
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/hwmon/nct6775.c?h=v4.20.3&id=61b6c66a8f740b5025ac49ddf1c2e29091a1274e

Could you attach the output for:

$ grep . /sys/devices/platform/nct6775.*/hwmon/hwmon*/fan* > grep-sys-devices-nct6775-fan-4.19.15.txt
$ grep . /sys/devices/platform/nct6775.*/hwmon/hwmon*/fan* > grep-sys-devices-nct6775-fan-4.20.3.txt

For reference, here is a list of commits for nct6775.c in 4.20.3:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/log/drivers/hwmon/nct6775.c?h=v4.20.3
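
For anyone who prefers a script, here is a rough Python equivalent of the
'grep .' commands above: it prints each non-empty line of every matching
file, prefixed with the file's path. The function name dump_attrs is
hypothetical, and the demo runs against a temporary directory with
made-up values instead of real sysfs files.

```python
import glob
import os
import tempfile


def dump_attrs(pattern):
    """Return 'path:line' strings for every non-empty line of matching files."""
    out = []
    for path in sorted(glob.glob(pattern)):
        if not os.path.isfile(path):
            continue
        try:
            with open(path, errors="replace") as fh:
                for line in fh:
                    line = line.rstrip("\n")
                    if line:
                        out.append(f"{path}:{line}")
        except OSError:
            # Some sysfs attributes can fail to read; skip them here.
            continue
    return out


# Demo on a throwaway directory with made-up values (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    for name, value in (("fan1_input", "1200"), ("fan1_min", "0")):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(value + "\n")
    demo = dump_attrs(os.path.join(tmp, "fan*"))
    for entry in demo:
        print(entry)
```

On the affected machine the pattern would be the same glob used in the
shell commands, /sys/devices/platform/nct6775.*/hwmon/hwmon*/fan*.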

Comment 11 Chris Siebenmann 2019-01-26 04:32:42 UTC
It looks like there is a potentially significant difference in the code
introduced here:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/hwmon/nct6775.c?h=v4.20.3&id=2d99925a15b639026b67bd96419df6f9d760b212

Here is the code before the change, and I think we care about fan6pin:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/hwmon/nct6775.c?h=v4.20.3&id=7dcdbdeb1b45b9071ad986bf20d8c2da6a057eb6#n3532

After the change:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/hwmon/nct6775.c?h=v4.20.3&id=2d99925a15b639026b67bd96419df6f9d760b212#n3532

We change from:
    fan6pin = !dsw_en && (cr2d & BIT(1));
    fan6pin |= creb & BIT(3);

To just:
    fan6pin = creb & BIT(3);

If dsw_en is false and cr2d has BIT(1) set, the old code sets fan6pin
even when creb has BIT(3) clear, while the new code does not. The 4.19.15
code is structured differently but, I believe, has the same calculation
for fan6pin:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/hwmon/nct6775.c?h=v4.19.15&id=e3185123541204ca4f715eeaaa1f9929c09ff3b4#n3514

The old code runs:
   if (!dsw_en) {
     fan6pin = regval & BIT(1);
     ...
   }
   ...
   if (!fan6pin)
     fan6pin = regval_eb & BIT(3);

regval is the current cr2d, regval_eb is the current creb. dsw_en
is 'bool dsw_en = cr2f & BIT(3);', but I don't know if there's any
way of figuring out its state from outside the driver.
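
To make the difference concrete, here is a small Python model of the two
calculations quoted above. It is only a sketch of the boolean logic, not
driver code: the function names are invented, and the register values are
synthetic rather than read from real hardware.

```python
from itertools import product


def BIT(n):
    return 1 << n


def fan6pin_before(dsw_en, cr2d, creb):
    """Calculation quoted above from before the refactoring."""
    fan6pin = (not dsw_en) and bool(cr2d & BIT(1))
    fan6pin |= bool(creb & BIT(3))
    return fan6pin


def fan6pin_after(creb):
    """Calculation quoted above from after the refactoring."""
    return bool(creb & BIT(3))


# Try every combination of the three inputs that matter.
# The register values are synthetic, not read from real hardware.
differs = []
for dsw_en, cr2d, creb in product((False, True), (0, BIT(1)), (0, BIT(3))):
    if fan6pin_before(dsw_en, cr2d, creb) != fan6pin_after(creb):
        differs.append((dsw_en, cr2d, creb))

print(differs)  # [(False, 2, 0)]
```

The only combination where the results differ is dsw_en false, cr2d
BIT(1) set, and creb BIT(3) clear. If this board is in that state, it
would explain fan6 disappearing, but the model alone cannot confirm what
the registers actually contain.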

Comment 12 Chris Siebenmann 2019-01-26 05:02:44 UTC
Created attachment 1523634 [details]
4.19.15 grep-sys-devices-nct6775-fan-4.19.15.txt

Comment 13 Chris Siebenmann 2019-01-26 05:03:11 UTC
Created attachment 1523635 [details]
4.20.3 grep-sys-devices-nct6775-fan-4.20.3.txt

Comment 14 Steve 2019-01-26 05:12:16 UTC
(In reply to Chris Siebenmann from comment #11)
...
> We change from:
>     fan6pin = !dsw_en && (cr2d & BIT(1));
>     fan6pin |= creb & BIT(3);
> 
> To just:
>     fan6pin = creb & BIT(3);
...

Good catch. It looks like the first fan6pin assignment didn't get included in the refactoring.

I tried to add the maintainer, Guenter Roeck, to the CC list, but BZ won't accept his email address, because he is not registered. If you want to, you could try emailing him:

$ modinfo nct6775 | grep author
author:         Guenter Roeck <linux>

Comment 15 Steve 2019-01-26 05:41:33 UTC
I suggest adding this to the beginning of the bug summary: "nct6775: ".

Comment 16 Chris Siebenmann 2019-01-27 23:43:14 UTC
I've now sent email to Guenter Roeck describing the issue and so on (and
giving the URL of this bug).

Comment 17 Steve 2019-01-29 06:53:10 UTC
(In reply to Chris Siebenmann from comment #16)
> I've now sent email to Guenter Roeck describing the issue and so on (and
> giving the URL of this bug).

Guenter has a fix in his git repo:

hwmon: (nct6775) Fix fan6 detection for NCT6793D
https://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git/commit/?h=hwmon&id=2a2ec4aa0577ec0b7df2d1bde5c84ed39a8637cb

Comment 18 Chris Siebenmann 2019-02-05 04:49:52 UTC
I've hand-built a version of the fixed module for 4.20.5-200.fc29.x86_64
and can verify that it works for me; it detects all six fans on my
motherboard.

Comment 19 Steve 2019-02-10 00:44:41 UTC
(In reply to Chris Siebenmann from comment #18)
> I've hand-built a version of the fixed module for 4.20.5-200.fc29.x86_64
> and can verify that it works for me; it detects all six fans on my
> motherboard.

Thanks for testing the patch. There appears to be a minor glitch in the commit:

Subject: linux-next: Fixes tag needs some work in the hwmon-fixes tree
https://lkml.org/lkml/2019/1/27/177

The commit ID in that message is different from the one in Comment 17, so here is a link:

hwmon: (nct6775) Fix fan6 detection for NCT6793D
https://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git/commit/?id=315fd42cc1f8837a134c3b671dde9b132c46ddcb

Comment 20 Steve 2019-02-10 04:00:43 UTC
(In reply to Steve from comment #19)
... 
> The commit ID in that message is different from the one in Comment 17, so here is a link:
...

OK, I figured it out -- commit 2a2ec4aa0577 has the "Fixes" tag on one line, and it is in the "linux-next" git repo:

hwmon: (nct6775) Fix fan6 detection for NCT6793D
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=2a2ec4aa0577ec0b7df2d1bde5c84ed39a8637cb

Comment 21 Steve 2019-02-19 02:35:16 UTC
The fix is in kernel 5.0-rc7:

Merge tag 'hwmon-for-v5.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.0-rc7&id=991b9eb4243b53e6dcaeda94e515d713ca7ddd2e

Comment 22 Ben Cotton 2019-05-02 19:53:19 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue. We are sorry that we were not
able to fix it before Fedora 28 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 23 Chris Siebenmann 2019-05-02 20:55:55 UTC
Two notes: first, this was accidentally filed or classified against Fedora 28, although I was running Fedora 29 at the time I found it. Second, this is fixed in at least the recent Fedora 29 kernels (I can't speak for Fedora 28 ones). It's certainly fixed in 5.0.7-200.fc29.x86_64, and I believe it was also fixed in several prior Fedora 29 5.0.x kernels.

So I think that this can be closed as 'fixed in errata'.

Comment 24 Ben Cotton 2019-05-28 22:25:57 UTC
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.