Bug 1507173 - list_add corruption. next->prev should be prev (kernel panic) [NEEDINFO]
Summary: list_add corruption. next->prev should be prev (kernel panic)
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 27
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-10-28 01:10 UTC by Reartes Guillermo
Modified: 2018-08-29 15:05 UTC
CC List: 19 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2018-08-29 15:05:58 UTC
Type: Bug
Embargoed:
jforbes: needinfo?


Attachments
kernel panic, captured via serial port (4.87 KB, text/plain)
2017-10-28 01:10 UTC, Reartes Guillermo
lspci (10.35 KB, text/plain)
2017-10-28 01:12 UTC, Reartes Guillermo
messages logfile (607.44 KB, text/plain)
2017-10-28 01:26 UTC, Reartes Guillermo
kernel log from boot to panic, showing list_del corruption (105.58 KB, text/plain)
2017-11-20 17:39 UTC, Björn Persson
kernel log from boot to hang, showing list_add corruption (106.50 KB, text/plain)
2017-11-20 17:44 UTC, Björn Persson
kernel log from boot to panic, showing list_del corruption (123.72 KB, text/plain)
2017-12-12 11:54 UTC, Björn Persson


Links
System        ID      Private  Priority  Status  Summary  Last Updated
Linux Kernel  196683  0        None      None    None     2019-06-12 20:34:25 UTC

Description Reartes Guillermo 2017-10-28 01:10:53 UTC
Created attachment 1344559 [details]
kernel panic, captured via serial port

Description of problem:

My new Ryzen 7 virtualization lab is randomly crashing with a kernel panic.

It took several weeks until I was able to get a serial port and capture the kernel panic.

Without the serial port it was impossible to see the panic.
There is always a black screen when I plug in the DVI connector.

The panic happens after about a day, more or less.
Sometimes it happens sooner. Because of this I am not using the system much.

Yesterday I got the serial port.

I do not know the nature of the "list_add corruption" class of issues.
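
For context: the "list_add corruption" message comes from the kernel's debug checks on its doubly linked lists (enabled with CONFIG_DEBUG_LIST). Before inserting a node, the kernel verifies that the two neighbours still point at each other; if they do not, something has overwritten or freed a list node. The sketch below is a small Python model of that idea, not the kernel's C code. The names Node, list_add_valid and list_add, and the exact message strings, are illustrative assumptions.

```python
class Node:
    """A list node; a fresh node points at itself, like an empty list head."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def list_add_valid(new, prev, next_):
    """Return None if inserting between prev and next_ is safe, else a message."""
    if next_.prev is not prev:
        return (f"list_add corruption. next->prev should be prev "
                f"({prev.name}), but was {next_.prev.name}")
    if prev.next is not next_:
        return (f"list_add corruption. prev->next should be next "
                f"({next_.name}), but was {prev.next.name}")
    if new is prev or new is next_:
        return "list_add double add"
    return None


def list_add(new, head):
    """Insert new right after head, refusing if the neighbours disagree."""
    prev, next_ = head, head.next
    problem = list_add_valid(new, prev, next_)
    if problem:
        raise RuntimeError(problem)
    next_.prev = new
    new.next = next_
    new.prev = prev
    prev.next = new


head = Node("head")
a = Node("a")
list_add(a, head)        # the list is now head <-> a
a.prev = Node("stray")   # simulate a stray write into a list node
try:
    list_add(Node("b"), head)
except RuntimeError as err:
    print(err)
```

In this model the check only detects that the links are inconsistent at the moment of the next insertion; it does not say what damaged them. That is presumably why such reports are hard to diagnose from the panic text alone.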

Thanks in advance.


Version-Release number of selected component (if applicable):
F26 Server

How reproducible:
always (but random)

Steps to Reproduce:
1. boot
2. use system (libvirt)
3. wait
4. panic

Actual results:
kernel panic

Expected results:
no panic

Additional info #1:

CPU  : Ryzen7 1700X
RAM  : 64 GB [Corsair LXP CMK16GX4M1B30000C15 x]
MOBO : Asus PRIME X370-PRO [BIOS = 0902]

28:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Redwood XT [Radeon HD 5670/5690/5730]
        Subsystem: XFX Pine Group Inc. Device 3061
        Kernel driver in use: radeon
        Kernel modules: radeon

New: CPU, MOBO, RAM, some HDDs.
Old: VGA, one HDD, 1 SSD.

Let's call this "system-2"

I have 1 monitor and 2 systems. (system-1 is my primary, system-2 is the virtualization lab).


Additional info #2:

Getting the panic without configuring the serial console (grub2/OS) was almost impossible.

There is another overlapping issue (IMHO, unrelated to the panic issue).

The monitor button to switch displays does not work very well, so the DVI cable is only connected to system-1.

Before replacing the system-2 hardware (old mobo, old CPU, etc.), I noticed (some months ago, but I could not report it) that when I connected the DVI cable, there was no display. That used to work in the past but broke some unknown number of kernels ago.

There was a display only after a boot, with no DVI port sensing. That had been working for ages.

So I don't think this is related, but it made getting the panic very difficult.


Additional info #3:

Without configuring the serial port, most boots end up in a black screen.
Now, it always boots.

Comment 1 Reartes Guillermo 2017-10-28 01:11:32 UTC
F26 Server Kernel is: 4.13.9-200.fc26.x86_64

Comment 2 Reartes Guillermo 2017-10-28 01:12:01 UTC
Created attachment 1344560 [details]
lspci

Comment 3 Reartes Guillermo 2017-10-28 01:26:26 UTC
Created attachment 1344573 [details]
messages logfile

Comment 4 Björn Persson 2017-11-20 17:23:50 UTC
This seems similar to what I'm seeing.

Symptoms:
About a day or two after boot the computer appears to lock up. So far it has always happened while I was away and the console was locked. When I return and turn on the monitor the system does not respond to keypresses or mouse movements. The screen remains black. In some cases the system has responded to ping but not to SSH. In other cases it hasn't responded even to ping.

To get more data I've configured a serial console and attached a null-modem cable. The messages I get there vary. In two cases so far there has been list corruption like in this bug report.

This may or may not be related to bug 1450769, which seems similar to other cases I've seen.

Linux 4.13.11-300.fc27.x86_64

Motherboard: Asus Prime X370-pro
Processor: AMD Ryzen
Memory: Kingston KVR24E17D8/16MA with ECC support
Graphics card: AMD/XFX Radeon RX 460

Comment 5 Björn Persson 2017-11-20 17:39:16 UTC
Created attachment 1355999 [details]
kernel log from boot to panic, showing list_del corruption

Comment 6 Björn Persson 2017-11-20 17:44:25 UTC
Created attachment 1356001 [details]
kernel log from boot to hang, showing list_add corruption

In this case no kernel panic message was printed.

Comment 7 Björn Persson 2017-12-12 11:54:08 UTC
Created attachment 1366582 [details]
kernel log from boot to panic, showing list_del corruption

Another kernel panic, this time in Linux 4.13.16-302.fc27.x86_64.

Comment 8 Björn Persson 2018-02-07 17:01:58 UTC
The rcu_nocbs workaround that is discussed at https://bugzilla.kernel.org/show_bug.cgi?id=196683 seems to prevent both these kernel panics and the soft lockups of bug 1450769.
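
For anyone testing the workaround: rcu_nocbs is a kernel boot parameter that takes a CPU list, so it is worth confirming that the running kernel was actually booted with it. A minimal sketch, assuming a Linux system that exposes /proc/cmdline; the function name rcu_nocbs_value is hypothetical, and the CPU range shown in the comment is only an example, not a value taken from this report.

```python
from pathlib import Path


def rcu_nocbs_value(cmdline):
    """Return the value of the first rcu_nocbs= token, or None if absent."""
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            return token.split("=", 1)[1]
    return None


if __name__ == "__main__":
    proc = Path("/proc/cmdline")
    # On systems without /proc/cmdline, fall back to an empty command line.
    cmdline = proc.read_text() if proc.exists() else ""
    value = rcu_nocbs_value(cmdline)
    if value is None:
        print("rcu_nocbs is not set on the kernel command line")
    else:
        # Example of what a value might look like: "0-15" (hypothetical)
        print(f"rcu_nocbs={value}")
```

This only inspects the command line; it does not confirm that the kernel was built with support for the option.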

Comment 9 Laura Abbott 2018-02-28 03:46:07 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale. The kernel moves very fast so bugs may get fixed as part of a kernel update. Due to this, we are doing a mass bug update across all of the Fedora 26 kernel bugs.
 
Fedora 26 has now been rebased to 4.15.4-200.fc26.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 27, and are still experiencing this issue, please change the version to Fedora 27.
 
If you experience different issues, please open a new bug report for those.

Comment 10 Björn Persson 2018-03-18 16:47:46 UTC
It still happens in Linux 4.15.9-300.fc27.x86_64 without rcu_nocbs.

Comment 11 Justin M. Forbes 2018-07-23 15:10:24 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.

Fedora 27 has now been rebased to 4.17.7-100.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28.

If you experience different issues, please open a new bug report for those.

Comment 12 Justin M. Forbes 2018-08-29 15:05:58 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

