Bug 704462

Summary: AMANDA backup causing kernel crash
Product: [Fedora] Fedora
Reporter: Trever Adams <trever>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 15
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-10-14 14:13:54 UTC
Type: ---
Attachments:
Description                      Flags
screen shot                      none
A freeze backtrace               none
This one looks much different    none
a few more backtraces            none

Description Trever Adams 2011-05-13 08:44:28 UTC
Created attachment 498725 [details]
screen shot

Description of problem:
I have an ASUS E35M1-I PRO board with a Zacate processor. I have AMANDA set up to back up 4 machines, 4 disksets each. /home on 4 is quite large. Linux consistently crashes (usually a reboot with no messages). I got one message. It is attached as a picture, as that was the only way I could capture it.

Version-Release number of selected component (if applicable):
kernel-2.6.38.5-24.fc15.x86_64

How reproducible:
Every time I do amdump.

Comment 1 Trever Adams 2011-05-25 17:26:35 UTC
This may or may not be caused by a bad board. However, I have another bug report I am filing which may or may not be related, on a new good board. Bug #707686

Comment 2 Trever Adams 2011-05-26 16:14:35 UTC
This is NOT due to faulty hardware and exists in kernel-2.6.38.6-27.fc15.x86_64. I tried to take another picture, but it was too blurry and I had to reboot the machine. It looked almost identical to the screen shot already taken. This is on a board that doesn't exhibit the hardware problems mentioned above.

Comment 3 Trever Adams 2011-05-26 16:15:16 UTC
This happens with C6 on or off.

Comment 4 Trever Adams 2011-05-28 14:55:11 UTC
Ok, I think I am seeing three things that may or may not be causing this.

The one I know for certain is causing at least the reboots and a few crashes:
RTL8168b/8111b (built-in) -- when I switch to RTL8169sb/8110sb (add-on card), the reboots and crashes go away. I see the following types of messages on the 8168b, but not on the 8169sb:
May 28 01:07:51 FC kernel: [ 2015.591351] NOHZ: local_softirq_pending 08
May 28 01:07:52 FC kernel: [ 2016.029662] NOHZ: local_softirq_pending 08
May 28 01:07:52 FC kernel: [ 2016.029817] NOHZ: local_softirq_pending 08
May 28 01:07:52 FC kernel: [ 2016.096437] NOHZ: local_softirq_pending 08
May 28 01:07:56 FC kernel: [ 2019.945279] net_ratelimit: 50 callbacks suppressed
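
For anyone reading these lines later: the value after local_softirq_pending is a hexadecimal bitmask of pending softirqs. The sketch below is illustrative only, not part of the original report. It assumes the softirq bit order of 2.6.38-era kernels (include/linux/interrupt.h); the helper names decode_mask and summarize are hypothetical. It decodes the mask and tallies the messages quoted above.

```python
import re
from collections import Counter

# Assumed softirq bit order for 2.6.38-era kernels (bit 0 first).
SOFTIRQ_NAMES = [
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
    "BLOCK_IOPOLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
]

# The kernel prints the pending mask as hexadecimal digits.
PENDING_RE = re.compile(r"NOHZ: local_softirq_pending ([0-9a-fA-F]+)")


def decode_mask(mask):
    """Return the names of the softirqs whose bits are set in mask."""
    return [name for bit, name in enumerate(SOFTIRQ_NAMES) if mask & (1 << bit)]


def summarize(lines):
    """Count how often each softirq appears as pending in the given log lines."""
    counts = Counter()
    for line in lines:
        match = PENDING_RE.search(line)
        if match:
            counts.update(decode_mask(int(match.group(1), 16)))
    return counts


# The log lines quoted in this comment.
sample = [
    "May 28 01:07:51 FC kernel: [ 2015.591351] NOHZ: local_softirq_pending 08",
    "May 28 01:07:52 FC kernel: [ 2016.029662] NOHZ: local_softirq_pending 08",
    "May 28 01:07:52 FC kernel: [ 2016.029817] NOHZ: local_softirq_pending 08",
    "May 28 01:07:52 FC kernel: [ 2016.096437] NOHZ: local_softirq_pending 08",
    "May 28 01:07:56 FC kernel: [ 2019.945279] net_ratelimit: 50 callbacks suppressed",
]

if __name__ == "__main__":
    print(decode_mask(0x08))
    print(summarize(sample))
```

Under that assumption, mask 08 has only bit 3 set, which corresponds to the network receive softirq (NET_RX). That would be consistent with the messages appearing only when the built-in network controller is in use, though it does not by itself identify the root cause.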

The closest I have to this with the 8169 is that I occasionally get messages like this: BUG: soft lockup - CPU#1 stuck for 64s! [kworker/1:0:9]. But I think that happens during a RAID 5 rebuild triggered by the previous crashes.

The other possible causes of some of the crashes with messages (a few of which I was able to post here) are C6 and EPU Power Saving being enabled. I will be testing those next week to see if they are red herrings.

I know the RTL8168b bug is a long-standing one (finding it accidentally is what led me to test all this). It seems it may be time to fix it, if possible, since it is showing up in new motherboards.

Comment 5 Trever Adams 2011-06-20 13:32:47 UTC
Created attachment 505611 [details]
A freeze backtrace

I do not know whether this shows the same bug as the screenshot, since it locked the screen.

Comment 6 Trever Adams 2011-06-20 13:42:01 UTC
I should mention that any backtraces after June 16 at 6:16 AM MDT are from kernel-2.6.38.8-32.fc15.x86_64.

Comment 7 Trever Adams 2011-06-20 13:53:56 UTC
Created attachment 505618 [details]
This one looks much different

Comment 8 Trever Adams 2011-06-20 14:13:58 UTC
Created attachment 505621 [details]
a few more backtraces

I do not think I will post any more. While there are some unique parts, there appears to be a core that is repeated over and over. I imagine the trouble is there.

Comment 9 Trever Adams 2011-07-05 19:46:33 UTC
I switched from the Realtek 8169 to an Intel PCIe card (e1000e driver). I have not been able to duplicate any of these problems since, even under very heavy load. The processor is also much more idle (nearly completely used with the 8169, versus about 30-70% idle most of the time, and quite often more than 50%, with the latter card).

I do not know if the 8169 chipset is just broken or if the driver is, but the problem lies with one of the two.

Comment 10 Josh Boyer 2011-10-11 17:52:32 UTC
Have you happened to test this issue with the 2.6.40.6 kernels?  I realize you have switched to an Intel card at this point, but thought it might be worth asking.

If the issues are resolved for you then we might close the bug out unless you're willing to recreate it with the latest F15 kernel.

Comment 11 Trever Adams 2011-10-14 14:08:57 UTC
I cannot test this. I am sorry. For me, the issue is resolved. I understand that a kernel fix may have fixed this (something to do with DMA if I remember right).

Comment 12 Josh Boyer 2011-10-14 14:13:54 UTC
OK, thank you for letting us know.