704462 – AMANDA backup causing kernel crash

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	15
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-05-13 08:44 UTC by Trever Adams
Modified:	2011-10-14 14:13 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2011-10-14 14:13:54 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
screen shot (1.81 MB, image/jpeg) 2011-05-13 08:44 UTC, Trever Adams
A freeze backtrace (7.77 KB, application/octet-stream) 2011-06-20 13:32 UTC, Trever Adams
This one looks much different (11.13 KB, application/octet-stream) 2011-06-20 13:53 UTC, Trever Adams
a few more backtraces (21.73 KB, application/octet-stream) 2011-06-20 14:13 UTC, Trever Adams

Description Trever Adams 2011-05-13 08:44:28 UTC

Created attachment 498725
screen shot

Description of problem:
I have an ASUS E35M1-I PRO board with a Zacate processor. I have AMANDA set up to back up 4 machines, 4 disksets each. /home on 4 is quite large. Linux consistently crashes (usually a reboot with no messages). I managed to capture one message; it is attached as a picture, since that was the only way I could record it.

Version-Release number of selected component (if applicable):
kernel-2.6.38.5-24.fc15.x86_64

How reproducible:
Every time I do amdump.

Comment 1 Trever Adams 2011-05-25 17:26:35 UTC

This may or may not be caused by a bad board. However, I am filing another bug report, which may or may not be related, on a new, good board: Bug #707686.

Comment 2 Trever Adams 2011-05-26 16:14:35 UTC

This is NOT due to faulty hardware, and it exists in kernel-2.6.38.6-27.fc15.x86_64. I tried to take another picture, but it was too blurry and I had to reboot the machine. It looked almost identical to the screen shot already taken. This is on a board that doesn't exhibit the hardware problems mentioned above.

Comment 3 Trever Adams 2011-05-26 16:15:16 UTC

This happens with C6 on or off.

Comment 4 Trever Adams 2011-05-28 14:55:11 UTC

OK, I think I am seeing three things that may or may not be causing this.

The one I know for certain is at least causing reboots and a few crashes:
RTL8168b/8111b (built in): when I switch to an RTL8169sb/8110sb (add-on card), they go away. I see the following types of messages on the 8168b, but not on the 8169sb:
May 28 01:07:51 FC kernel: [ 2015.591351] NOHZ: local_softirq_pending 08
May 28 01:07:52 FC kernel: [ 2016.029662] NOHZ: local_softirq_pending 08
May 28 01:07:52 FC kernel: [ 2016.029817] NOHZ: local_softirq_pending 08
May 28 01:07:52 FC kernel: [ 2016.096437] NOHZ: local_softirq_pending 08
May 28 01:07:56 FC kernel: [ 2019.945279] net_ratelimit: 50 callbacks suppressed
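
The value after local_softirq_pending is printed by the kernel as a hexadecimal bitmask of pending softirqs. As a rough sketch for triaging logs like the ones above, the mask can be decoded and the messages counted. The bit order used here is an assumption taken from the softirq enum in 2.6.38-era kernel sources (include/linux/interrupt.h), and the helper names decode_pending and summarize are hypothetical, not part of any tool mentioned in this report.

```python
import re

# ASSUMPTION: softirq bit order as in 2.6.38-era include/linux/interrupt.h.
# Other kernel versions may order or name these differently.
SOFTIRQ_NAMES = [
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
    "BLOCK_IOPOLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
]

# The kernel prints the pending mask in hexadecimal.
PATTERN = re.compile(r"NOHZ: local_softirq_pending ([0-9a-fA-F]+)")


def decode_pending(mask):
    """Return the names of the softirqs whose bits are set in mask."""
    return [name for bit, name in enumerate(SOFTIRQ_NAMES) if mask & (1 << bit)]


def summarize(lines):
    """Count NOHZ messages, grouped by the decoded set of pending softirqs."""
    counts = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = tuple(decode_pending(int(match.group(1), 16)))
            counts[key] = counts.get(key, 0) + 1
    return counts


# Log lines copied from this comment.
sample = [
    "May 28 01:07:51 FC kernel: [ 2015.591351] NOHZ: local_softirq_pending 08",
    "May 28 01:07:52 FC kernel: [ 2016.029662] NOHZ: local_softirq_pending 08",
    "May 28 01:07:52 FC kernel: [ 2016.029817] NOHZ: local_softirq_pending 08",
    "May 28 01:07:52 FC kernel: [ 2016.096437] NOHZ: local_softirq_pending 08",
    "May 28 01:07:56 FC kernel: [ 2019.945279] net_ratelimit: 50 callbacks suppressed",
]

print(summarize(sample))  # {('NET_RX',): 4}
```

Under that assumed bit order, 08 decodes to the network receive softirq, which would be consistent with the messages following the network card. This is only a reading aid, not a diagnosis.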

The closest I have to this with the 8169 is that I occasionally get messages like this: BUG: soft lockup - CPU#1 stuck for 64s! [kworker/1:0:9]. However, I think that happens during a RAID 5 rebuild triggered by the previous crashes.

The other possible causes, which may account for some of the crashes with messages (a few of which I was able to post here), are C6 and EPU Power Saving being enabled. I will test those next week to see whether they are red herrings.

I know the R8168b bug is a long-standing one (finding it accidentally is what led me to test all this). It seems it may be time to fix it, if possible, since it is now showing up on new motherboards.

Comment 5 Trever Adams 2011-06-20 13:32:47 UTC

Created attachment 505611
A freeze backtrace

I do not know whether this shows the bug from the screenshot, since it locked the screen.

Comment 6 Trever Adams 2011-06-20 13:42:01 UTC

I should mention that any backtraces after June 16 at 6:16 AM MDT are from kernel-2.6.38.8-32.fc15.x86_64.

Comment 7 Trever Adams 2011-06-20 13:53:56 UTC

Created attachment 505618
This one looks much different

Comment 8 Trever Adams 2011-06-20 14:13:58 UTC

Created attachment 505621
a few more backtraces

I do not think I will do any more. While there are some unique parts, there appears to be a core that is repeated over and over. I imagine the trouble is there.

Comment 9 Trever Adams 2011-07-05 19:46:33 UTC

I switched from the Realtek 8169 to an Intel e1000e PCIe card. I have not been able to duplicate any of these problems since, even under very heavy load. The processor is also much more idle (nearly fully used with the 8169, versus about 30-70% idle most of the time, and quite often more than 50%, with the latter card).

I do not know if the 8169 chipset is just broken or if the driver is, but the problem lies with one of the two.

Comment 10 Josh Boyer 2011-10-11 17:52:32 UTC

Have you happened to test this issue with the 2.6.40.6 kernels?  I realize you have switched to an Intel card at this point, but thought it might be worth asking.

If the issues are resolved for you then we might close the bug out unless you're willing to recreate it with the latest F15 kernel.

Comment 11 Trever Adams 2011-10-14 14:08:57 UTC

I cannot test this. I am sorry. For me, the issue is resolved. I understand that a kernel fix may have fixed this (something to do with DMA if I remember right).

Comment 12 Josh Boyer 2011-10-14 14:13:54 UTC

OK, thank you for letting us know.
