| Summary: | [abrt] kernel: BUG: soft lockup - CPU#1 stuck for 68s! [watchdog/1:12] |
|---|---|
| Product: | [Fedora] Fedora |
| Component: | kernel |
| Version: | 15 |
| Hardware: | x86_64 |
| OS: | Unspecified |
| Status: | CLOSED DUPLICATE |
| Severity: | unspecified |
| Priority: | unspecified |
| Reporter: | Trever Adams <trever> |
| Assignee: | Kernel Maintainer List <kernel-maint> |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| CC: | aquini, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Whiteboard: | abrt_hash:81495c6ef202c98211566d9770e333cf1dfd9763 |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-07-19 12:47:15 UTC |
Description
Trever Adams
2011-06-07 09:42:57 UTC
This and bug 711353, bug 711355 and bug 711356 are all symptoms of the same problem, so I'm going to close those as duplicates. The question is why this lockup occurred. Can you give some more information about the type of backup amanda was doing? Anything involving nfs perhaps?

*** Bug 711353 has been marked as a duplicate of this bug. ***

*** Bug 711355 has been marked as a duplicate of this bug. ***

*** Bug 711356 has been marked as a duplicate of this bug. ***

Sure. I am sorry for the duplicates. As I know very little about kernel internals, I have to trust that abrt can dedup. These were all level 0 backups. (My RAID set, md RAID 5, suddenly started to eat itself on crashes that are mentioned in my other bug reports, so these are all fresh starts or part of a fresh-start backup.)

I have four VMs (qemu) on another box, each with four disk sets: /etc, /var, /home, /. One of the homes is quite large. It is almost always during that one (with others running in parallel) that it starts having problems. I have the maximum number of dumpers set at four. Often things run smoothly until taper runs, then it starts to hit these bugs and the crash mentioned elsewhere. So chunker and dumper run fine; it pretty much goes fine until taper shows up.

If I use the built-in network card, these problems happen faster and more often end in a hard lock. Also, I keep getting "disabling interrupt 18" during backups, or every several hours. I think there must be a bug somewhere, because I use these cards on other machines with no problem. I have never used a Zacate processor before, so maybe the problem is there or in the firmware on this board. Sorry if this paragraph is not relevant; I am just trying to make sure you know everything I do.

18: 2235 298041 IO-APIC-fasteoi ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, p4p1

Oh, and NO nfs! It happens whether amanda is using udp or tcp.

Thank you,
Trever

Just some added information: it is actually 6 VMs. At 4 dumpers, it tends toward 0% idle in top. I tried 2 dumpers; it seems to average about 1/3 idle, falling between 20-68% idle. It still would freeze for 30-45 seconds here and there and still eventually hit the same hard lock.

I am wondering if this is in the RAID code. If I do several concurrent dds to a non-RAID area and an iperf, or several, at once, there is no problem. I can even often do this to the RAID area, but when I get massive long writes going in parallel, things go weird and start crashing. So I am thinking that maybe it is in RAID locking. I tried turning off ext4 barriers; that didn't change anything (no, the RAID was eating itself before that and after I turned it back on). This is md RAID 5 on four 2 TB WDC drives (SATA) with a 64k RAID chunk size, with the file system set up with --stride=16 and --stripe-width=48 (these numbers are based on things I read and nothing else... I simply used what I read was recommended). Again, ext4 file systems. It goes wdc -> md -> lvm -> ext4. LVM was created with only the required options, nothing else.

Bugs that may or may not be related which I have reported:
https://bugzilla.redhat.com/show_bug.cgi?id=707686
https://bugzilla.redhat.com/show_bug.cgi?id=704462

Oh, about the LVM: the VG has all of the RAID, and the LV was created to fill the VG. (This is a machine that does backups, after all.) Oh, and this was all IPv4. The machine is IPv6, but other backups don't use the RAID set (rsync to the main drive) and are running well, so I haven't had much time to try rebuilding the RAID set and running with IPv6 now that my servers have all gone dual stack.
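For reference, the storage layout described above (four SATA disks -> md RAID 5 with a 64k chunk -> LVM -> ext4 with stride=16 / stripe-width=48) corresponds roughly to the following commands. This is only a sketch for orientation; the device names (/dev/sdb through /dev/sde, /dev/md0) and the vg_backup/lv_backup names are placeholders, not taken from the report.

```sh
# Assemble md RAID 5 from four whole disks with a 64 KiB chunk size.
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# LVM on top: one PV covering the array, one VG, one LV filling the VG.
pvcreate /dev/md0
vgcreate vg_backup /dev/md0
lvcreate -l 100%FREE -n lv_backup vg_backup

# ext4 with RAID-aware layout hints:
#   stride       = chunk / block size  = 64 KiB / 4 KiB = 16
#   stripe-width = stride * data disks = 16 * (4 - 1)   = 48
mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/vg_backup/lv_backup
```

With a 64 KiB chunk, 4 KiB ext4 blocks, and three data disks in a four-disk RAID 5, stride = 64/4 = 16 and stripe-width = 16 * 3 = 48, so the numbers given in the report are at least internally consistent.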
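The parallel-write-plus-network test mentioned in the same comment can be approximated along these lines; the /raid mount point, the iperf server address, and the sizes are illustrative assumptions, not values from the report.

```sh
# Several large sequential writers hitting the RAID-backed filesystem at once,
# using O_DIRECT so the writes actually reach md rather than just the page cache.
for i in 1 2 3 4; do
    dd if=/dev/zero of=/raid/stress.$i bs=1M count=4096 oflag=direct &
done

# Network load in parallel (classic iperf client syntax): 4 streams for 60 seconds.
iperf -c 192.168.1.10 -P 4 -t 60 &

wait
```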
Is this possibly a nohz bug? As I have upgraded all of my systems to F15, I am starting to see stalls (20-40 second freezes) on most if not all of my systems. This didn't happen under very similar loads in F14.
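One low-risk way to test the nohz hypothesis above is a single boot with the dynamic tick disabled; `nohz=off` is a standard kernel boot parameter in this kernel series. The bootloader path below assumes the stock Fedora 15 GRUB legacy setup, which may differ on the reporter's machine.

```sh
# Was the running kernel booted with any nohz= setting?
cat /proc/cmdline

# For a one-off test, append nohz=off to the kernel line in the GRUB legacy
# config and reboot, e.g.:
#   kernel /vmlinuz-2.6.38.8-32.fc15.x86_64 ro root=... nohz=off
grep -n kernel /boot/grub/grub.conf
```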
Created attachment 505395 [details]
Further backtraces on freezes
I just realized one change between the old box and the new box (this bug was all on the new box and may not include the details of what worked and what didn't). That change is nf_conntrack_amanda. The old box never had this installed, as I wasn't bothering with higher-level security yet. This may or may not be part of the issue, as the hardware changed too. (The motherboard is about 8 years newer in this box than in the old one.)
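To check the nf_conntrack_amanda hypothesis, one could confirm the helper is actually loaded during a backup and temporarily remove it for a test run. This is only a sketch of such a test, not something done in the report.

```sh
# Is the amanda connection-tracking helper (and its NAT companion) loaded?
lsmod | grep -E 'nf_(nat|conntrack)_amanda'

# Unload it for a test backup (modprobe -r simply reports an error if the
# module is still in use; it may be reloaded by the firewall setup on restart).
modprobe -r nf_nat_amanda nf_conntrack_amanda

# Watch for new soft-lockup messages while the test backup runs.
dmesg | grep -i 'soft lockup'
```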
Created attachment 505613 [details]
Another freeze backtrace
This one has a lot of similar things but some different ones. I hope this will provide more useful information.
I should mention that any backtraces after June 16 at 6:16 AM MDT are from kernel-2.6.38.8-32.fc15.x86_64.

I see r8169 in the traces. Try a better network adapter.

If you can, any suggestions so I don't make any mistakes?

(In reply to comment #16)
> If you can, any suggestions so I don't make any mistakes?

e1000 and tg3 seem to be the most popular for heavy duty use.

I switched the Realtek 8169 for an Intel e1000e PCIe card. I have not been able to duplicate any of these problems since, even under very heavy load. The processor is also much more idle (nearly completely used with the 8169, and about 30-70% idle most of the time, more than 50% quite often, with the latter card). I do not know whether the 8169 chipset is broken or the driver is, but the problem lies with one of the two.

Somewhere in all of these bugs I mentioned that the md driver was starting to eat the RAID set. This was caused by problems on reboot. I fixed those even before switching the card, and the RAID setup stopped being destroyed.

*** This bug has been marked as a duplicate of bug 710841 ***
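For completeness, the r8169-versus-Intel driver question discussed above can be confirmed from the running system. The interface name p4p1 is the one shown in the /proc/interrupts line earlier in this report; the rest is a generic sketch.

```sh
# Which kernel driver is bound to each Ethernet controller?
lspci -nnk | grep -i -A 3 ethernet

# Driver name and version for the interface seen in the traces.
ethtool -i p4p1
```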