Bug 1045807 - dell-laptop module causes lockup on F20 3.12.5 kernel
Summary: dell-laptop module causes lockup on F20 3.12.5 kernel
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-12-22 06:48 UTC by Bradley
Modified: 2014-03-01 12:23 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-03-01 12:23:39 UTC
Type: Bug
Embargoed:


Attachments
dmidecode output (22.24 KB, text/plain)
2013-12-25 01:49 UTC, Bradley

Description Bradley 2013-12-22 06:48:11 UTC
Description of problem:

When the dell-laptop module is loaded on my Dell laptop AND F20 udev is running, the system locks up. This is a complete lockup: Caps Lock doesn't work, I can't use magic SysRq, etc.

This breaks with either the F19 or F20 3.12.5 kernel, but I had successfully booted once under F19 with the F19 3.12.5 kernel, with no issues.

Version-Release number of selected component (if applicable):

systemd-208-9.fc20.x86_64

BREAKS:

kernel 3.12.5 (both the F19 and F20 builds)

WORKS:

kernel-3.11.9-200.fc19.x86_64
kernel-3.11.10-200.fc19.x86_64

How reproducible:

Always, on F20

Steps to Reproduce:
1. Run F20
2. Boot into emergency mode (to test the ordering manually)
3. systemctl enable debug-shell.service
4. systemctl start debug-shell.service
5. modprobe dell-laptop
6. Observe no issues
7. Change to VT9; run udevadm monitor; change back to VT1
8. systemctl start systemd-udevd.service
9. systemctl start systemd-udev-trigger.service
10. Quickly change back to VT9

Actual results:

System lockup; the last line from the udev monitor is the load of dell-laptop.

Expected results:

No lockup

Additional info:

This is 100% reproducible. There is presumably some udev trigger/interaction involved, because the system doesn't lock up until udev starts triggering, AND it worked on F19.

I only booted the 3.12.5 kernel once on F19, though (it was updated yesterday, and I booted into it to run fedup to F20).

The workaround is to blacklist dell-laptop in /etc/modprobe.d/; that lets me boot up normally. Any modprobe of the driver causes the lockup again.

This is a Dell Vostro 3560.

Comment 1 Michele Baldessari 2013-12-22 07:29:33 UTC
$ git lg v3.11.9..v3.12.5 drivers/platform/x86/dell-laptop.c
$

So it's likely that udev is somehow triggering a loading race. If you boot without the
'rhgb quiet' options and trigger the issue, are you able to see any messages on
screen when it locks up? Could you attach a screenshot in that case?

Comment 2 Bradley 2013-12-22 07:35:33 UTC
There's nothing logged - the machine just locks up.

As long as udev is running and it's the F20 kernel, it hangs.

It could be a race, but loading the module manually doesn't cause the issue until after udev loads. How can I tell what udev is trying to do?

Comment 3 Christian V R Lopes 2013-12-23 22:23:15 UTC
Hi, I am also suffering from this issue. It's a Dell Vostro 3560; the computer totally freezes, and no keys work except the power button, which I have to keep pressed for about 5 seconds to turn it off.

I removed 'rhgb quiet' but the problem persists.

I'm using F20 x86_64, upgraded from F19.

Comment 4 Bradley 2013-12-23 23:51:52 UTC
If you're hitting this, the workaround is:

 - boot with 'rhgb quiet' removed, AND add 'emergency'
 - when prompted put in your root password
 - mount -o remount,rw /
 - create a file /etc/modprobe.d/dell.conf, containing:

blacklist dell-laptop
blacklist dell_laptop

 - reboot. Probably best to remove 'rhgb quiet' for the first time, just in case there are any other issues

Comment 5 Josh Boyer 2013-12-24 14:27:43 UTC
We're carrying a backported patch series to fix the dell-laptop module on Latitude machines. Apparently this is causing your Vostros to fail. Hans?

Comment 6 Hans de Goede 2013-12-24 15:44:04 UTC
Weird: the new code paths in dell-laptop should not be triggered on Vostros. Because the rfkill functionality caused issues on various models in the past, it is currently only enabled on Latitudes:

        /*
         * rfkill causes trouble on various non Latitudes, according to Dell
         * actually testing the rfkill functionality is only done on Latitudes.
         */
        product = dmi_get_system_info(DMI_PRODUCT_NAME);
        if (!force_rfkill && (!product || strncmp(product, "Latitude", 8)))
                return 0;

Bradley, can you try changing your /etc/modprobe.d/dell.conf from:
blacklist dell_laptop

To:

options dell_laptop force_rfkill=1

That should actually enable the rfkill functionality. Please check whether it makes a difference.

Comment 7 Hans de Goede 2013-12-24 16:10:20 UTC
I've started a scratchbuild of the F-20 kernel with the dell-laptop patches removed here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=6331521

Note this will take a significant amount of time to finish!

Once this is finished, can you give this kernel a try (without blacklisting the dell-laptop module)? If this kernel does work, that confirms that the dell-laptop rfkill patches are the cause.

Thanks,

Hans

Comment 8 Hans de Goede 2013-12-24 20:00:49 UTC
Hi Again,

The scratchbuild from comment #7 is done and can be downloaded now.

In the meantime, I've been looking at the code from the point of view of what my changes mean for non-whitelisted devices like Vostros, and there is one changed code path for them as a result of my patches.

I've written another patch so that this one functional change for non-whitelisted devices is avoided; with this patch, dell-laptop.c should work 100% as before on non-whitelisted devices.

I've started another scratchbuild with the dell-laptop patches added back in including the new patch, you can find it here:
http://koji.fedoraproject.org/koji/taskinfo?taskID=6332009

Again, please allow some time for it to finish building. Please try my previous scratchbuild first; if that does not help, then my dell-laptop patches are definitively not the culprit. If it does help, give this new build a try, as it should fix things.

Also can you please do:
sudo dmidecode > dmi.log

And attach the generated dmi.log file here ?

Thanks,

Hans

Comment 9 Hans de Goede 2013-12-24 21:56:59 UTC
In case people don't get around to testing this with Christmas and everything, and the buildsys deletes the scratch builds during its automated cleanups, I've put the kernel rpms here:
http://people.fedoraproject.org/~jwrdegoede/rhbz1045807/

Please test the kernel-3.12.6-300.hgd_bz1045807.fc20.x86_64.rpm build first; if that does not help, then testing the _2 build will be of little use.

Regards,

Hans

Comment 10 Bradley 2013-12-25 01:48:33 UTC
Neither kernel helps.

However, what does help is removing 'acpi_backlight=vendor'; I need that to get the backlight working (see bug 986653).

With that removed, I can boot with the dell-laptop driver loaded, using the standard F20 kernel. (The backlight doesn't work, of course.) I'll attach the dmidecode output.

Comment 11 Bradley 2013-12-25 01:49:03 UTC
Created attachment 841371 [details]
dmidecode output

Comment 12 Hans de Goede 2013-12-25 09:01:22 UTC
Hi,

Merry Christmas :)

(In reply to Bradley from comment #10)
> Neither kernel helps.

Given that the first kernel removes all my dell-laptop rfkill patches, it is unsurprising that the second kernel, which re-adds them plus a small fix, does not fix things either.

This does rule out my dell-laptop rfkill patches being the cause of your issues.

> However, what does help is removing 'acpi_backlight=vendor' - I need that to
> get the backlight working (see bug 986653)

Ah, OK, so somehow we have a conflict between various drivers here, which has gotten worse with the latest kernel. Note that dell-laptop also registers a brightness device.

This is probably best discussed upstream, where people more knowledgeable about this are involved. Can you please send a mail about this to the platform-driver-x86 mailing list, with me (hdegoede) in the CC?

Thanks,

Hans

Comment 13 Bradley 2013-12-25 10:54:22 UTC
I previously said that I could reproduce this in single-user mode, but I can't any more, so I must have done it wrong. Possibly it breaks when starting up X?

I'll try a git bisect against upstream first.

Comment 14 Bradley 2013-12-26 14:23:18 UTC
I've bisected this down to 81c0a2bb515fd4daae8cab64352877480792b515

That's an mm patch to the zone allocator. I thought I'd done the bisection wrong, but I double-checked by rebuilding the commits on either side again, and it's definitely this one. (I also double-checked a few times along the way, once the bisection hit the -mm tree's diffs, which seemed completely unrelated.)

I confirmed by reverting that patch against 3.12.0 (breaks with the patch, works with it reverted). Same results against master (where part of this patch has already been reverted, so I also reverted fff4068cba484e6b0abe334ed6b15d5a215a3b25) - master is broken, but master with this patch (and fff4068cba484e6b0abe334ed6b15d5a215a3b25) reverted works fine.

This issue isn't intermittent, and while it could be a timing issue, I would have expected other changes between 3.11 and master to have affected the timing as well. So I'm confused.

I'm going to rebuild the fedora RPM overnight with this reversion for one more test case (under mock), and also try the working 3.11 kernel with the patch applied to confirm that it breaks.

Comment 15 Bradley 2013-12-27 03:26:39 UTC
Yep, definitely that patch. Mail sent - http://marc.info/?l=linux-mm&m=138811453914848&w=2

Comment 16 Justin M. Forbes 2014-02-24 13:58:17 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.13.4-200.fc20. Please test this kernel update and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 17 Bradley 2014-03-01 12:23:39 UTC
This was 'resolved' by 3.13 changing the backlight logic so that the kernel param in question wasn't needed.

(It looks like the kernel param was using some memory that ACPI/EFI/SMI/something was reserving)

