| Summary: | [abrt] kernel: BUG: soft lockup - CPU#1 stuck for 68s! [watchdog/1:12] |
|---|---|
| Product: | [Fedora] Fedora |
| Component: | kernel |
| Version: | 15 |
| Hardware: | x86_64 |
| OS: | Unspecified |
| Status: | CLOSED DUPLICATE |
| Severity: | unspecified |
| Priority: | unspecified |
| Reporter: | Trever Adams <trever> |
| Assignee: | Kernel Maintainer List <kernel-maint> |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| CC: | aquini, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Whiteboard: | abrt_hash:81495c6ef202c98211566d9770e333cf1dfd9763 |
| Doc Type: | Bug Fix |
| Last Closed: | 2011-07-19 12:47:15 UTC |
Description
Trever Adams
2011-06-07 09:42:57 UTC
This and bug 711353, bug 711355 and bug 711356 are all symptoms of the same problem, so I'm going to close those as duplicates. The question is why this lockup occurred. Can you give some more information about the type of backup amanda was doing? Anything involving nfs perhaps?

*** Bug 711353 has been marked as a duplicate of this bug. ***

*** Bug 711355 has been marked as a duplicate of this bug. ***

*** Bug 711356 has been marked as a duplicate of this bug. ***

Sure. I am sorry for the duplicates. As I know very little about kernel internals, I have to trust that abrt can dedup. These were all level 0 backups. (My RAID set, md RAID 5, suddenly started to eat itself on crashes that are mentioned in my other bug reports, so these are all fresh starts or part of a fresh-start backup.)

I have four VMs (qemu) on another box, each with four disk sets: /etc, /var, /home, /. One of the homes is quite large. It is almost always during that one (with others running in parallel) that it starts having problems. I have the maximum number of dumpers set at four. Often things run smoothly until taper runs, then it starts to hit these bugs and the crash mentioned elsewhere. So chunker and dumper run fine; it pretty much goes fine until taper shows up.

If I use the built-in network card, these problems happen faster and more often end in a hard lock. Also, I keep getting "disabling interrupt 18" during backups, or every several hours. I think there must be a bug somewhere, because I use these cards on other machines with no problem. I have never used a Zacate processor before, so maybe the problem is there or in the firmware on this board. Sorry if this paragraph is not relevant; I am just trying to make sure you know everything I do.

18: 2235 298041 IO-APIC-fasteoi ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, p4p1

Oh, and NO nfs! It happens whether amanda is using udp or tcp.

Thank you,
Trever

Just some added information: it is actually 6 VMs. At 4 dumpers, it tends toward 0% idle in top. I tried 2 dumpers; it seems to average about 1/3 idle, falling between 20-68% idle. It still would freeze for 30-45 seconds here and there and still eventually hit the same hard lock.

I am wondering if this is in the RAID code. If I do several concurrent dds to a non-RAID area and an iperf, or several, at once, there is no problem. I can even often do this to the RAID area, but when I get massive long writes going in parallel, things go weird and start crashing. So I am thinking that maybe it is in RAID locking. I tried turning off ext4 barriers; that didn't change anything (no, the RAID was eating itself before that and after I turned it back on). This is md RAID 5 on four 2 TB WDC drives (SATA) with a 64k RAID chunk size, with the file system set up with --stride=16 and --stripe-width=48 (these numbers are based on things I read and nothing else... I simply used what I read was recommended). Again, ext4 file systems. It goes wdc -> md -> lvm -> ext4. LVM was created with only the required options, nothing else.

Bugs that may or may not be related which I have reported:
https://bugzilla.redhat.com/show_bug.cgi?id=707686
https://bugzilla.redhat.com/show_bug.cgi?id=704462

Oh, about the LVM: the VG has all of the RAID, and the LV was created to fill the VG. (This is a machine that does backups, after all.) Oh, and this was all IPv4. The machine is IPv6, but other backups don't use the RAID set (rsync to the main drive) and are running well, so I haven't had much time to try rebuilding the RAID set and running with IPv6 now that my servers have all gone dual stack.
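For reference, the storage layout described above (four SATA disks -> md RAID 5 with a 64k chunk -> LVM -> ext4 with stride=16 / stripe-width=48) corresponds roughly to the following commands. This is only a sketch for orientation; the device names (/dev/sdb through /dev/sde, /dev/md0) and the vg_backup/lv_backup names are placeholders, not taken from the report.

```sh
# Assemble md RAID 5 from four whole disks with a 64 KiB chunk size.
mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=64 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# LVM on top: one PV covering the array, one VG, one LV filling the VG.
pvcreate /dev/md0
vgcreate vg_backup /dev/md0
lvcreate -l 100%FREE -n lv_backup vg_backup

# ext4 with RAID-aware layout hints:
#   stride       = chunk / block size  = 64 KiB / 4 KiB = 16
#   stripe-width = stride * data disks = 16 * (4 - 1)   = 48
mkfs.ext4 -b 4096 -E stride=16,stripe-width=48 /dev/vg_backup/lv_backup
```

With a 64 KiB chunk, 4 KiB ext4 blocks, and three data disks in a four-disk RAID 5, stride = 64/4 = 16 and stripe-width = 16 * 3 = 48, so the numbers given in the report are at least internally consistent.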
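The parallel-write-plus-network test mentioned in the same comment can be approximated along these lines; the /raid mount point, the iperf server address, and the sizes are illustrative assumptions, not values from the report.

```sh
# Several large sequential writers hitting the RAID-backed filesystem at once,
# using O_DIRECT so the writes actually reach md rather than just the page cache.
for i in 1 2 3 4; do
    dd if=/dev/zero of=/raid/stress.$i bs=1M count=4096 oflag=direct &
done

# Network load in parallel (classic iperf client syntax): 4 streams for 60 seconds.
iperf -c 192.168.1.10 -P 4 -t 60 &

wait
```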
Is this possibly a nohz bug? As I have upgraded all of my systems to F15, I am starting to see stalls (20-40 second freezes) on most if not all of my systems. This didn't happen under very similar loads in F14.
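One low-risk way to test the nohz hypothesis above is a single boot with the dynamic tick disabled; `nohz=off` is a standard kernel boot parameter in this kernel series. The bootloader path below assumes the stock Fedora 15 GRUB legacy setup, which may differ on the reporter's machine.

```sh
# Was the running kernel booted with any nohz= setting?
cat /proc/cmdline

# For a one-off test, append nohz=off to the kernel line in the GRUB legacy
# config and reboot, e.g.:
#   kernel /vmlinuz-2.6.38.8-32.fc15.x86_64 ro root=... nohz=off
grep -n kernel /boot/grub/grub.conf
```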
Created attachment 505395 [details]
Further backtraces on freezes
I just realized one change between the old box and the new box (this bug was all on the new box and may not include the details of what worked and what didn't). That change is nf_conntrack_amanda. The old box never had this installed, as I wasn't bothering with higher-level security yet. This may or may not be part of the issue, as the hardware changed too. (The motherboard is about 8 years newer in this box than in the old one.)
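To check the nf_conntrack_amanda hypothesis, one could confirm the helper is actually loaded during a backup and temporarily remove it for a test run. This is only a sketch of such a test, not something done in the report.

```sh
# Is the amanda connection-tracking helper (and its NAT companion) loaded?
lsmod | grep -E 'nf_(nat|conntrack)_amanda'

# Unload it for a test backup (modprobe -r simply reports an error if the
# module is still in use; it may be reloaded by the firewall setup on restart).
modprobe -r nf_nat_amanda nf_conntrack_amanda

# Watch for new soft-lockup messages while the test backup runs.
dmesg | grep -i 'soft lockup'
```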
Created attachment 505613 [details]
Another freeze backtrace
This one has a lot of similar things but some different ones. I hope this will provide more useful information.
I should mention that any backtraces after June 16 at 6:16 AM MDT are from kernel-2.6.38.8-32.fc15.x86_64.

I see r8169 in the traces. Try a better network adapter.

If you can, any suggestions so I don't make any mistakes?

(In reply to comment #16)
> If you can, any suggestions so I don't make any mistakes?

e1000 and tg3 seem to be the most popular for heavy duty use.

I switched the Realtek 8169 for an Intel e1000e PCIe card. I have not been able to duplicate any of these problems since, even under very heavy load. The processor is also much more idle (nearly completely used with the 8169, and about 30-70% idle most of the time, more than 50% quite often, with the latter card). I do not know whether the 8169 chipset is broken or the driver is, but the problem lies with one of the two.

Somewhere in all of these bugs I mentioned that the md driver was starting to eat the RAID set. This was caused by problems on reboot. I fixed those even before switching the card, and the RAID setup stopped being destroyed.

*** This bug has been marked as a duplicate of bug 710841 ***
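For completeness, the r8169-versus-Intel driver question discussed above can be confirmed from the running system. The interface name p4p1 is the one shown in the /proc/interrupts line earlier in this report; the rest is a generic sketch.

```sh
# Which kernel driver is bound to each Ethernet controller?
lspci -nnk | grep -i -A 3 ethernet

# Driver name and version for the interface seen in the traces.
ethtool -i p4p1
```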