abrt version: 2.0.1
architecture:   x86_64
cmdline:        ro root=/dev/mapper/vg_forbiddencity-lv_root rd_LVM_LV=vg_forbiddencity/lv_root rd_LVM_LV=vg_forbiddencity/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=us rhgb quiet
comment:        Running amanda backup
component:      kernel
kernel:         2.6.38.6-27.fc15.x86_64
os_release:     Fedora release 15 (Lovelock)
package:        kernel
reason:         BUG: soft lockup - CPU#1 stuck for 68s! [watchdog/1:12]
reported_to:    kerneloops: URL=http://submit.kerneloops.org/submitoops.php
time:           Tue Jun 7 03:37:16 2011

backtrace:
:BUG: soft lockup - CPU#1 stuck for 68s! [watchdog/1:12]
:Modules linked in: cpufreq_ondemand powernow_k8 freq_table mperf ts_kmp ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_amanda ip6table_filter ip6_tables raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx snd_hda_codec_realtek eeepc_wmi sparse_keymap snd_hda_codec_hdmi rfkill snd_hda_intel snd_hda_codec snd_seq snd_hwdep sp5100_tco snd_seq_device xhci_hcd r8169 i2c_piix4 mii k10temp snd_pcm snd_timer snd wmi soundcore snd_page_alloc ipv6 radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
:CPU 1
:Modules linked in: cpufreq_ondemand powernow_k8 freq_table mperf ts_kmp ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_amanda ip6table_filter ip6_tables raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx snd_hda_codec_realtek eeepc_wmi sparse_keymap snd_hda_codec_hdmi rfkill snd_hda_intel snd_hda_codec snd_seq snd_hwdep sp5100_tco snd_seq_device xhci_hcd r8169 i2c_piix4 mii k10temp snd_pcm snd_timer snd wmi soundcore snd_page_alloc ipv6 radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: scsi_wait_scan]
:Pid: 12, comm: watchdog/1 Not tainted 2.6.38.6-27.fc15.x86_64 #1 System manufacturer System Product Name/E35M1-M PRO
:RIP: 0010:[<ffffffff81080b0a>]  [<ffffffff81080b0a>] arch_local_irq_restore+0x6/0xd
:RSP: 0018:ffff8800b79034b0  EFLAGS: 00000286
:RAX: ffff880127d1c000 RBX: ffffffff81010145 RCX: 0000000019bfcc04
:RDX: ffff880127d1dc80 RSI: 0000000000000286 RDI: 0000000000000286
:RBP: ffff8800b79034b0 R08: ffff880127d1dc80 R09: 0000000000000000
:R10: ffffffff81b3fb01 R11: ffffffff81b40d00 R12: ffffffff8100a593
:R13: ffff8800b7903428 R14: 0000000119e526e5 R15: ffff8801240b2430
:FS:  00007f2718d3c7e0(0000) GS:ffff8800b7900000(0000) knlGS:0000000000000000
:CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
:CR2: 00007fa6a18bd000 CR3: 0000000102bec000 CR4: 00000000000006e0
:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
:Process watchdog/1 (pid: 12, threadinfo ffff880127d2c000, task ffff880127d22e40)
:Stack:
: ffff8800b79034c0 ffffffff8147588c ffff8800b7903520 ffffffff81061554
: ffffffff00000001 01ff88011ac72f00 ffff88003782e110 0000000000000286
: ffffffff81475dd3 ffff8801240b23a8 0000000000000000 ffff88011ac72f00
:Call Trace:
This and bug 711353, bug 711355 and bug 711356 are all symptoms of the same problem, so I'm going to close those as duplicates. The question is why this lockup occurred. Can you give some more information about the type of backup amanda was doing? Anything involving nfs perhaps?
*** Bug 711353 has been marked as a duplicate of this bug. ***
*** Bug 711355 has been marked as a duplicate of this bug. ***
*** Bug 711356 has been marked as a duplicate of this bug. ***
Sure. I am sorry for the duplicates; I know very little about kernel internals, so I have to trust that abrt can dedup.

These were all level 0 backups. (My raid set, md raid5, suddenly started to eat itself during the crashes mentioned in my other bug reports, so these are all fresh starts, or part of a fresh-start backup.) I have four VMs (qemu) on another box, each with four disksets: /etc, /var, /home, and /. One of the homes is quite large, and it is almost always during that one (with others running in parallel) that the problems start. I have the maximum number of dumpers set at four. Things often run smoothly until taper runs; then these bugs and the crash mentioned elsewhere show up. In other words, chunker and dumper pretty much run fine on their own, and it is usually once taper kicks in that things go wrong. If I use the built-in network card, the problems happen sooner and more often end in a hard lock.

Also, I keep getting "disabling interrupt 18" messages during backups, or every few hours. I think there must be a bug somewhere, because I use these cards on other machines with no problem. I have never used a Zacate processor before, so maybe the problem is there or in the firmware on this board. Sorry if this paragraph is not relevant; I am just trying to make sure you know everything I do.

 18:       2235     298041   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, p4p1

Oh, and NO nfs! It happens whether amanda is using udp or tcp.

Thank you,
Trever
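For reference, the dumper limit mentioned above normally corresponds to the inparallel setting in amanda.conf; a minimal sketch of how one might check (and lower) it, where the config name "DailySet1" and the /etc/amanda path are placeholders only:

  # "DailySet1" and the /etc/amanda path are assumptions; use the real config name.
  grep -E 'inparallel|maxdumps' /etc/amanda/DailySet1/amanda.conf
  # inparallel caps the total number of dumpers the server runs at once;
  # maxdumps caps simultaneous dumps from a single client. Dropping inparallel
  # from 4 to 2 is one way to reduce the parallel write load for testing.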
Just some added information: it is actually 6 VMs. With 4 dumpers, the CPU tends toward 0% idle in top. I tried 2 dumpers; that averages about one-third idle, ranging between 20% and 68%. It still froze for 30-45 seconds here and there, and eventually it still hit the same hard lock.
I am wondering if this is in the raid code. If I run several concurrent dds to a non-RAID area plus one or more iperf runs at once, there is no problem. I can often even do this to the raid area, but when massive long writes run in parallel, things go weird and start crashing. So I am thinking that maybe it is in RAID locking. I tried turning off ext4 barriers; that didn't change anything (and no, the RAID was eating itself both before that and after I turned barriers back on).

This is md RAID 5 on four 2 TB WDC SATA drives with a 64k RAID chunk size, with the file system created with --stride=16 and --stripe-width=48 (those numbers are based purely on what I read as recommended, nothing else). Again, ext4 file systems. The stack is wdc -> md -> lvm -> ext4. LVM was created with only the required options, nothing else.

Bugs that may or may not be related which I have reported:
https://bugzilla.redhat.com/show_bug.cgi?id=707686
https://bugzilla.redhat.com/show_bug.cgi?id=704462
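For what it's worth, those numbers do match the usual ext4-on-md arithmetic for this layout. A minimal sketch, assuming a 64 KiB chunk, 4 KiB ext4 blocks, and 3 data disks (4-drive RAID 5); the device path below is purely a placeholder:

  # stride       = chunk size / fs block size = 64 KiB / 4 KiB = 16
  # stripe-width = stride * data disks        = 16 * 3         = 48
  # Hypothetical device name; substitute the real LV:
  mkfs.ext4 -E stride=16,stripe-width=48 /dev/vg_backup/lv_backup

So the extended options quoted above are consistent with a 64k chunk across three data disks; they are allocation hints rather than anything that changes locking behaviour.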
Oh, about the LVM: the VG contains the whole raid set, and the LV was created to fill the VG. (This is a machine that does backups, after all.)
Oh, and this was all IPv4. The machine is IPv6-capable, but the other backups (rsync to the main drive) don't use the raid set and are running well, so I haven't had much time to try rebuilding the raid set and running with IPv6 now that my servers have all gone dual stack.
Is this possibly a nohz bug? As I have upgraded my systems to F15, I am starting to see stalls (20-40 second freezes) on most if not all of them. This didn't happen under very similar loads on F14.
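One way to narrow that down (sketch only; the grub.conf entry below is illustrative, and assumes F15's legacy GRUB) would be to boot with the dynamic tick disabled and see whether the stalls persist:

  # Append nohz=off to the kernel line in /boot/grub/grub.conf, e.g.:
  #   kernel /vmlinuz-2.6.38.8-32.fc15.x86_64 ro root=... rhgb quiet nohz=off
  # After rebooting, confirm the parameter took effect:
  cat /proc/cmdline

If the freezes disappear with nohz=off, that would point at the tickless code; if they continue, nohz is probably not the culprit.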
Created attachment 505395 [details]
Further backtraces on freezes
I just realized one change between the old box and the new box (this bug is all from the new box and may not include the details of what worked and what didn't). That change is nf_conntrack_amanda. The old box never had it loaded, as I wasn't bothering with higher-level security yet. This may or may not be part of the issue, since the hardware changed too. (The motherboard in this box is about 8 years newer than in the old one.)
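A quick way to test that hypothesis (sketch only) would be to run a backup cycle with the helper module out of the picture and see whether the lockups still happen:

  # Is the helper currently loaded?
  lsmod | grep nf_conntrack_amanda
  # Try unloading it before a backup run (this fails if connections still use it):
  modprobe -r nf_conntrack_amanda
  # On Fedora the helper is typically loaded via IPTABLES_MODULES in
  # /etc/sysconfig/iptables-config, so check there if it keeps coming back:
  grep IPTABLES_MODULES /etc/sysconfig/iptables-config

If the soft lockups still occur with nf_conntrack_amanda unloaded, the helper can probably be ruled out.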
Created attachment 505613 [details]
Another freeze backtrace

This one has a lot in common with the earlier traces, but some entries are different. I hope it provides more useful information.
I should mention that any backtraces from after June 16 at 6:16 AM MDT are from kernel-2.6.38.8-32.fc15.x86_64.
I see r8169 in the traces. Try a better network adapter.
If you can, any suggestions so I don't make any mistakes?
(In reply to comment #16)
> If you can, any suggestions so I don't make any mistakes?

e1000 and tg3 seem to be the most popular for heavy-duty use.
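In case it helps when picking the replacement, two stock commands will confirm which chipset and driver the current onboard port uses (nothing assumed here beyond the p4p1 interface name already visible in the /proc/interrupts line above):

  # PCI device and the kernel driver bound to it:
  lspci -nnk | grep -A3 -i ethernet
  # Driver, version, and firmware for the interface:
  ethtool -i p4p1

Anything reporting r8169 there is the onboard Realtek; e1000/e1000e (Intel) and tg3 (Broadcom) cards show up under their own driver names.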
I switched from the Realtek 8169 to an Intel e1000e PCIe card. I have not been able to reproduce any of these problems since, even under very heavy load. The processor is also much more idle (nearly fully used with the 8169, versus about 30-70% idle most of the time, often more than 50%, with the Intel card). I do not know whether the 8169 chipset is broken or the driver is, but the problem lies with one of the two.

Somewhere in all of these bugs I mentioned that the md driver was starting to eat the raid set. That was caused by problems on reboot; I fixed those even before switching the card, and the raid setup stopped being destroyed.
*** This bug has been marked as a duplicate of bug 710841 ***