Bug 1411034 - radeon causes OOPS
Summary: radeon causes OOPS
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 25
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-01-07 17:39 UTC by Danny Baumann
Modified: 2019-01-09 12:54 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-12-12 10:11:49 UTC
Type: Bug
Embargoed:


Attachments
dmesg | grep iwl (deleted)
2017-01-07 17:39 UTC, Danny Baumann
dmesg | grep iwl (950 bytes, text/plain)
2017-01-07 17:40 UTC, Danny Baumann
dmesg with debug package (99.74 KB, text/plain)
2017-02-19 14:19 UTC, Danny Baumann

Description Danny Baumann 2017-01-07 17:39:14 UTC
Description of problem:
I have a Dell Inspiron 15R SE (7520) laptop with an Intel 7260-AC wifi card. With the newest kernel release (4.8.15-300), the system doesn't boot up completely because NetworkManager hangs. When I blacklist the iwlwifi module via the kernel command line, the system boots up completely; when I insmod iwlwifi later, all programs that access the kernel's networking code (specifically NetworkManager, but also e.g. ifconfig) hang again. Trying to strace those processes makes strace hang as well. With an older kernel (4.8.14-300, 4.8.12-300), the wifi card works just fine.

Version-Release number of selected component (if applicable):
kernel-4.8.15-300.fc25.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Use hardware described above with kernel release mentioned above
2. Boot system

Actual results:
Systemd boot is blocked for several minutes by NetworkManager; logging into GDM isn't possible after the NetworkManager task times out.

Expected results:
Login should be possible

Comment 1 Danny Baumann 2017-01-07 17:40:33 UTC
Created attachment 1238267
dmesg | grep iwl

Comment 2 Stanislaw Gruszka 2017-01-11 15:41:37 UTC
Could you install kernel-debug 4.8.15, boot it, and provide the dmesg output?

Comment 3 Danny Baumann 2017-01-11 19:30:15 UTC
In the meantime it seems 4.8.16-300 was released, which seems to fix the problem for me. Were there any fixes in that area, or is this rather unexpected? If the former, I guess we can close this; if the latter, I guess I need to fetch the debug packages for 4.8.15-300 from koji?

Comment 4 Stanislaw Gruszka 2017-01-16 21:37:03 UTC
If the new version fixes the problem, we can close the bug.

Comment 5 Danny Baumann 2017-02-19 14:18:14 UTC
Reopening, as the issue has reappeared in the 4.9.6 and 4.9.8 kernels. I'll attach dmesg with the debug package installed, although there's nothing suspicious and wifi-related in there ... I wonder whether the KMS backtraces are related?

Comment 6 Danny Baumann 2017-02-19 14:19:34 UTC
Created attachment 1255437
dmesg with debug package

Comment 7 Danny Baumann 2017-02-19 14:20:53 UTC
What I forgot to mention: 4.9.4 seems to be fine.

Comment 8 Justin M. Forbes 2017-04-11 14:47:35 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 25 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-200.fc25.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.

Comment 9 Justin M. Forbes 2017-04-28 17:04:41 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

Comment 10 Danny Baumann 2017-05-01 17:14:18 UTC
The issue still persists, but it's dependent on the kernel version. Of the last 4 kernel releases, I get these results:

- kernel-4.10.12-200.fc25.x86_64 -> doesn't work
- kernel-4.10.9-200.fc25.x86_64 -> works
- kernel-4.10.8-200.fc25.x86_64 -> doesn't work
- kernel-4.10.6-200.fc25.x86_64 -> works

For each of the versions, the state (either working or not working) is 100% reproducible.
Please let me know if I should provide any further information.

Comment 11 Stanislaw Gruszka 2017-05-09 12:47:20 UTC
Are the working and non-working kernels different variants, i.e. are some of them -debug kernels and others standard kernels, or does all of this happen on standard kernels?

Please attach dmesg from the latest working and the latest non-working kernel.

Comment 12 Danny Baumann 2017-05-10 06:25:40 UTC
All tests mentioned in comment #10 were done using the standard kernel.

In the meantime I've found an additional piece of relevant information: the hangs don't seem to be caused by iwlwifi, but by the radeon driver. My laptop has both Intel graphics integrated into the CPU and an additional Radeon graphics chip, and I turn off the latter because I don't really need it. I do that by writing OFF to /sys/kernel/debug/vgaswitcheroo/switch using systemd's tmpfiles.d mechanism. In the 'broken' cases, doing so yields a kernel stack trace (see the dmesg in comment #6 at around 6.5 seconds), which seems to trigger the network hangs (maybe a broken failure path that doesn't properly release a mutex or something?)
When I stop writing to that node, network is working fine even with the kernels listed as broken in comment #10 - but that's of course only a workaround.
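
For reference, the trigger described above is nothing more than a write of the string OFF to the vgaswitcheroo switch node. A minimal C sketch of that write is shown below. The helper name switcheroo_write is hypothetical (it is not a kernel or systemd API), and the actual setup in this report uses tmpfiles.d rather than a custom program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a command string such as "OFF" to a vga_switcheroo switch node.
 * switcheroo_write is a hypothetical helper name, not a kernel or systemd API.
 * Returns 0 on success and -1 if the node cannot be opened or written.
 */
int switcheroo_write(const char *path, const char *cmd)
{
    assert(path != NULL && cmd != NULL);

    FILE *node = fopen(path, "w");
    if (node == NULL)
        return -1;

    size_t len = strlen(cmd);
    int ok = fwrite(cmd, 1, len, node) == len;

    /* fclose flushes the buffer, so a failed write can surface here. */
    if (fclose(node) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

On the affected laptop the call would presumably be switcheroo_write("/sys/kernel/debug/vgaswitcheroo/switch", "OFF"), run as root with debugfs mounted.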

Comment 13 Stanislaw Gruszka 2017-05-10 06:53:47 UTC
I forgot to look at the dmesg from comment 6. Yes, indeed, this looks like a radeon issue.

Seems locking is fine:

[    6.453923] random: crng init done
[    6.522323] radeon: switched off
[    6.522337] INFO: trying to register non-static key.
[    6.522363] the code is fine but needs lockdep annotation.
[    6.522379] turning off the locking correctness validator.

but later we have an oops:

[    6.523256] BUG: unable to handle kernel NULL pointer dereference at           (null)
[    6.523299] IP: [<ffffffffbb49863b>] __list_add+0x1b/0xb0
[    6.523334] PGD 0 
<snip>
[    6.523362] Oops: 0000 [#1] SMP
[    6.524029] CPU: 2 PID: 771 Comm: systemd-tmpfile Not tainted 4.9.8-201.fc25.x86_64+debug #1
[    6.524074] Hardware name: Dell Inc. Inspiron 7520/0PXH02, BIOS A11 02/20/2014
[    6.524113] task: ffff970cba2f8000 task.stack: ffffb279c2aec000
[    6.524147] RIP: 0010:[<ffffffffbb49863b>]  [<ffffffffbb49863b>] __list_add+0x1b/0xb0
[    6.524196] RSP: 0018:ffffb279c2aefbe0  EFLAGS: 00010046
[    6.524227] RAX: ffff970cc78535b8 RBX: ffffb279c2aefc30 RCX: 0000000000000000
[    6.524267] RDX: ffff970cc78535b8 RSI: 0000000000000000 RDI: ffffb279c2aefc30
[    6.524306] RBP: ffffb279c2aefbf8 R08: 0000000000000000 R09: 0000000000000000
[    6.524345] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    6.524383] R13: ffff970cc78535b8 R14: ffff970cba2f8000 R15: ffffffffbb91431b
[    6.524423] FS:  00007f86b9f37280(0000) GS:ffff970cce000000(0000) knlGS:0000000000000000
[    6.524467] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.524500] CR2: 0000000000000000 CR3: 000000043a5a5000 CR4: 00000000001406e0
[    6.524539] Stack:
[    6.524554]  ffff970cc7853568 0000000000000246 ffff970cc7853570 ffffb279c2aefc90
[    6.524597]  ffffffffbb9142f1 ffffffffc019f5e0 0000000000000080 ffff970cc78535b8
[    6.524639]  ffffffffc019f5e0 ffff970cc78535d8 ffffb279c2aefc30 ffffb279c2aefc30
[    6.524681] Call Trace:
[    6.525973]  [<ffffffffbb9142f1>] mutex_lock_nested+0x131/0x3f0
[    6.527077]  [<ffffffffc019f5e0>] ? drm_modeset_lock_all+0x40/0x120 [drm]
[    6.528180]  [<ffffffffc019f5e0>] ? drm_modeset_lock_all+0x40/0x120 [drm]
[    6.529262]  [<ffffffffc019f5c5>] ? drm_modeset_lock_all+0x25/0x120 [drm]
[    6.530332]  [<ffffffffc019f5e0>] drm_modeset_lock_all+0x40/0x120 [drm]
[    6.531407]  [<ffffffffc05ff6fd>] radeon_suspend_kms+0x5d/0x3f0 [radeon]

Comment 14 Stanislaw Gruszka 2017-05-10 07:32:05 UTC
Looks like we use dev->mode_config before it is initialized. Perhaps we should check the radeon-specific mode_config_initialized boolean to prevent the oops:

diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index 621af06..723508e 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1246,6 +1246,9 @@ static void radeon_switcheroo_set_state(struct pci_dev *pdev, enum vga_switchero
        if (radeon_is_px(dev) && state == VGA_SWITCHEROO_OFF)
                return;
 
+       if (!rdev->mode_info.mode_config_initialized)
+               return;
+
        if (state == VGA_SWITCHEROO_ON) {
                unsigned d3_delay = dev->pdev->d3_delay;
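
To illustrate the idea behind the check, here is a small user-space C model, not kernel code. It assumes one plausible reading of the trace: a list head that was never initialized is still zeroed, so inserting into it dereferences NULL, and an "initialized" flag lets the caller bail out first. All names (model_dev, model_list_add, model_set_state_off) are hypothetical and only mirror the shape of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list node, shaped like the kernel's list_head. */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical stand-in for the device state touched on switch-off. */
struct model_dev {
    bool mode_config_initialized;
    struct list_node waiters;   /* stays zeroed until model_init() runs */
};

enum model_result { MODEL_SKIPPED, MODEL_SUSPENDED };

/* An initialized empty list points at itself in both directions. */
void model_init(struct model_dev *dev)
{
    dev->waiters.next = &dev->waiters;
    dev->waiters.prev = &dev->waiters;
    dev->mode_config_initialized = true;
}

/* Insert node right after head. Reads head->next, so a zeroed head faults. */
void model_list_add(struct list_node *node, struct list_node *head)
{
    struct list_node *next = head->next;

    next->prev = node;          /* NULL dereference if head was never set up */
    node->next = next;
    node->prev = head;
    head->next = node;
}

/* Guarded switch-off path: return early while the state is not set up. */
enum model_result model_set_state_off(struct model_dev *dev,
                                      struct list_node *waiter)
{
    assert(dev != NULL && waiter != NULL);

    if (!dev->mode_config_initialized)
        return MODEL_SKIPPED;

    model_list_add(waiter, &dev->waiters);
    return MODEL_SUSPENDED;
}
```

In this model, calling model_set_state_off on a zeroed model_dev returns MODEL_SKIPPED instead of faulting inside model_list_add. Whether an early return is the right behaviour for the real driver is a separate question.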

Comment 15 Rob Clark 2017-05-10 15:06:49 UTC
(In reply to Stanislaw Gruszka from comment #14)
> Looks like we use dev->mode_config before it is initialized. Perhaps we
> should check radeon specific mode_config_initialized boolean to prevent oops:
> 
> diff --git a/drivers/gpu/drm/radeon/radeon_device.c
> b/drivers/gpu/drm/radeon/radeon_device.c
> index 621af06..723508e 100644
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -1246,6 +1246,9 @@ static void radeon_switcheroo_set_state(struct pci_dev
> *pdev, enum vga_switchero
>         if (radeon_is_px(dev) && state == VGA_SWITCHEROO_OFF)
>                 return;
>  
> +       if (!rdev->mode_info.mode_config_initialized)
> +               return;
> +
>         if (state == VGA_SWITCHEROO_ON) {
>                 unsigned d3_delay = dev->pdev->d3_delay;

Could you try: https://patchwork.freedesktop.org/patch/155277/

Comment 16 Danny Baumann 2017-05-10 15:09:19 UTC
(In reply to Rob Clark from comment #15)
> Could you try: https://patchwork.freedesktop.org/patch/155277/

If by 'you' you mean me: sure ... would it be possible to get an RPM built with this patch, though?

Comment 17 Fedora End Of Life 2017-11-16 19:11:31 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 18 Fedora End Of Life 2017-12-12 10:11:49 UTC
Fedora 25 changed to end-of-life (EOL) status on 2017-12-12. Fedora 25 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

