Bug 205610
Summary: Assertion failure in log_do_checkpoint
Product: Red Hat Enterprise Linux 4
Component: kernel
Version: 4.4
Hardware: x86_64
OS: Linux
Status: CLOSED DUPLICATE
Severity: high
Priority: medium
Reporter: Stuart Hayes <stuart_hayes>
Assignee: Eric Sandeen <esandeen>
QA Contact: Brian Brock <bbrock>
CC: admin, jbaron, jfeeney, markscarbrough, plankers, rdejean, sct, tao, wwlinuxengineering
Doc Type: Bug Fix
Last Closed: 2007-04-11 22:17:46 UTC

Description
Stuart Hayes
2006-09-07 16:38:30 UTC
Searching the mailing lists, I came up with this. I have no idea if this will actually fix the bug we're seeing or not. This patch apparently did go into the mainstream kernel, and is not in RHEL4 U4.
http://marc.theaimsgroup.com/?l=linux-kernel&m=111877954110234&w=2

The patch from comment #1 appears not to be relevant to RHEL4. However, I have found this, which seems to indicate that there are other race conditions in this code that can trigger this bug, and this patch attempts to fix those.
http://marc.theaimsgroup.com/?l=linux-kernel&m=114659569713241&w=2

Stuart, what's the status on reproducing this one?

The same system ran for 120+ hours with no failure. At that point, RHEL4 was reinstalled, and it ran for another 120+ hours with no failure. So we're assuming this is a really rare race condition.

Took a brief look at the patch linked above; it looks like it removes this assertion altogether and so would be a large-hammer approach to avoiding it. The downside is that the patch is *really* big, and needs to be carefully looked at to make sure it doesn't break kabi. A reproducer is really needed here at this point.

*** Bug 206393 has been marked as a duplicate of this bug. ***

Hi folks, I am seeing this as well, on a Dell PowerEdge 1850 running RHEL AS 4 kernel version 2.6.9-42.0.3.ELsmp. This server is a mail server (sendmail), and under particularly high I/O loads it panics with the same error (yes, the server is named 'UP'):

Oct 31 15:26:33 up kernel: Assertion failure in log_do_checkpoint() at fs/jbd/checkpoint.c:363: "drop_count != 0 || cleanup_ret != 0"
Oct 31 15:26:33 up kernel: ------------[ cut here ]------------
Oct 31 15:26:33 up kernel: kernel BUG at fs/jbd/checkpoint.c:363!
Oct 31 15:26:33 up kernel: invalid operand: 0000 [#1]
Oct 31 15:26:33 up kernel: SMP
Oct 31 15:26:33 up kernel: Modules linked in: md5 ipv6 i2c_dev i2c_core ipmi_devintf ipmi_si ipmi_msghandler ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables button battery ac joydev uhci_hcd ehci_hcd hw_random e1000 floppy sg dm_snapshot dm_zero dm_mirror ext3 jbd dm_mod megaraid_mbox megaraid_mm sd_mod scsi_mod
Oct 31 15:26:33 up kernel: CPU: 2
Oct 31 15:26:33 up kernel: EIP: 0060:[<f8880fc0>] Not tainted VLI
Oct 31 15:26:33 up kernel: EFLAGS: 00010216 (2.6.9-42.0.3.ELsmp)
Oct 31 15:26:33 up kernel: EIP is at log_do_checkpoint+0x10e/0x148 [jbd]
Oct 31 15:26:33 up kernel: eax: 0000006e ebx: f13df2fc ecx: d3a29d84 edx: f8884dac
Oct 31 15:26:33 up kernel: esi: f7e0de00 edi: f7091200 ebp: f13df2fc esp: d3a29d80
Oct 31 15:26:33 up kernel: ds: 007b es: 007b ss: 0068
Oct 31 15:26:33 up kernel: Process sendmail (pid: 5020, threadinfo=d3a29000 task=cf870630)

I have seen this two days in a row as an automated system flushes its queue at this mail server. This isn't quite the ability to reproduce it, but I'd be glad to do anything you might need to better document the issue.

Bob, thanks for the report. I'm thinking that the patch referenced in comment #2 likely addresses the issue (if you hit the assertion a lot, we could verify this), but it has KABI issues which must be addressed, and this is going to take some time to work out.

Thanks Eric. I understand the KABI issues and don't envy the work that needs to happen to maintain that. Do you have any suggestions regarding a workaround in the meantime? My only thought was putting a mainstream kernel on the box, built with roughly the same config as the 2.6.9-43.0.3, while the fix is pending.

I'm afraid that I do not have a good workaround for you at the moment. I'll see if I can work something out soon for this bug... looks like it's getting a bit more popular.

Need to move this one for 4.5 per Dell request.

Set ACK review flags.

3 IT's associated with this bug.

I've finally had some time to look at this one a bit more. The patch in comment #2 looks pretty good from the perspective that since it went upstream, it's had no further changes, and it's been there for a long time for soaking. No good ideas about the KABI issues yet though: adding the extra struct member clobbers jbd KABI.

However, there seems to be another obvious bug that was fixed in 2.6.10 here:
http://linux.bkbits.net:8080/linux-2.6/?PAGE=gnupatch&REV=1.3104.1.2

> Fix a bug in list scanning that can cause us to skip the last buffer on the
> checkpoint list (and hence fail to do any progress under some rather
> unfavorable conditions).

That change looks obviously correct & probably should go in, although I'm not sure it will address all the problems folks are seeing here.

Dell said it would test the fix when available. Currently, they are trying to create a test environment where the bug manifests in a more consistent manner.

I haven't heard anything since I asked last week. Will ask again.

Charles, any word on getting this reproduced? It appears as though the lack of a reproducer is gating the resolution. Setting to Needinfo.

We are still exploring ways to reproduce this issue.

Well, I'm seeing the same problem here, on RHEL 4 with kernels 2.6.9-42 and 2.6.9-42.0.8, on an IBM x360 server. This server is also a mail server, running qpopper pop and dovecot imap server for 15,000 students and 2,000 faculty/staff. So yes, it gets pretty busy with I/O at times. Disk storage is IBM DS4700. This past week it's gotten really bad, crashing once or twice per day. Bosses are complaining, haha. I'm going to try the small patch mentioned in comment #16 to see if that helps. Will post our results.

Ray, thanks. I look forward to the results of your test...

Well, it's been up for 40 hours with no problems. I'll be out all next week snowboarding in Breckenridge. So if it stays up for a week solid and I get no calls on the mountain, that'll be the true test.

Ray, still going strong?

If anyone else is hitting this, can you please try the same patch? If you need a kernel rpm built w/ the change I'll gladly do that.

*** Bug 203847 has been marked as a duplicate of this bug. ***

(In reply to comment #28)
> Ray, still going strong?
>
> If anyone else is hitting this, can you please try the same patch? If you need
> a kernel rpm built w/ the change I'll gladly do that.

Yep, still up after 10 days. Below is a list of our crashes/reboots until I patched 2.6.9-42.0.8.ELsmp on Feb 21. I'd say this patch definitely helped.

<pre>
reboot system boot 2.6.9-42.0.8.ELs Wed Feb 21 23:36 (10+23:11)
reboot system boot 2.6.9-42.0.8.ELs Wed Feb 21 11:03 (12:29)
reboot system boot 2.6.9-42.0.8.ELs Wed Feb 21 00:34 (22:59)
reboot system boot 2.6.9-42.ELsmp Tue Feb 20 14:31 (09:59)
reboot system boot 2.6.9-42.ELsmp Tue Feb 20 04:15 (20:14)
reboot system boot 2.6.9-42.ELsmp Mon Feb 19 07:30 (1+17:00)
reboot system boot 2.6.9-42.ELsmp Sun Feb 18 14:27 (2+10:03)
reboot system boot 2.6.9-42.ELsmp Fri Feb 16 12:37 (4+11:53)
reboot system boot 2.6.9-42.ELsmp Wed Feb 14 07:13 (6+17:17)
</pre>

Because it is an obviously correct fix (though at the time of filing, not obviously the correct fix for -this- issue in particular), the patch in comment #16 got its own bug, 224638:

jbd __cleanup_transaction skips last buffer on checkpoint list

It's been submitted for inclusion in a future RHEL4 kernel release. Since it fixes at least one reporter's problems with this bug, dup'ing of one bug to the other may be in order.

Eric, please post the patch for review.

Created attachment 150590
jbd patch

Peter, the patch was already sent internally, and ACKed, for bug 224638. Attaching here for completeness. It was not clear at the time that it would resolve the issues in -this- bug, because there was another possibility that might have been causing the problem, so I created a new bug for this simple, obviously correct fix. So far, though, it looks like this patch *is* resolving this issue for the reporters who have chimed in. So, not sure what to do with this bug, administration-wise?

Can engineering provide a test kernel or package as applicable so Dell can verify if this problem has been resolved with this patch? Thanks.

I have placed 42.0.10 kernels with the added fix for bug 224638 (which seems to have resolved this issue for some people) at:

http://people.redhat.com/esandeen/bz224638/

If anyone needs a kernel other than i686 and x86_64, smp, please let me know. Note these kernels are only for testing, and are not official RHEL4 released kernels.

Thanks,
-Eric

Although this bugzilla was approved for RHEL 4.5, it remained in an unresolved status and no patch was included in the 4.5 kernel. Moved to 4.6.

Duping to bug 224638 for now; the patch which seems to fix this issue was sent under that bug.

*** This bug has been marked as a duplicate of 224638 ***
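
Illustrative note on the class of bug named in bug 224638 ("skips last buffer on checkpoint list"): the sketch below is not the jbd code and is not taken from the attached patch. It is a minimal Python model, under the assumption that the checkpoint list behaves like a circular doubly linked list scanned from its head. The names `Node`, `make_ring`, `scan_skipping_last` and `scan_all` are hypothetical. It only shows how a loop-termination test on the wrong node can leave the last entry unvisited.

```python
class Node:
    """One entry on a circular, doubly linked list (hypothetical stand-in for a buffer)."""

    def __init__(self, name):
        self.name = name
        self.next = self  # a single node forms a ring with itself
        self.prev = self


def make_ring(names):
    """Build a circular list from names and return its head (None if empty)."""
    head = None
    for name in names:
        node = Node(name)
        if head is None:
            head = node
        else:
            tail = head.prev
            tail.next = node
            node.prev = tail
            node.next = head
            head.prev = node
    return head


def scan_skipping_last(head):
    """Flawed scan: stops once the *upcoming* node is the last one, so it is never visited."""
    visited = []
    if head is None:
        return visited
    last = head.prev
    nxt = head
    while True:
        node = nxt
        visited.append(node.name)
        nxt = node.next
        if nxt is last:  # wrong node tested: exits before processing `last`
            break
    return visited


def scan_all(head):
    """Corrected scan: stops only after the last node itself has been visited."""
    visited = []
    if head is None:
        return visited
    last = head.prev
    nxt = head
    while True:
        node = nxt
        visited.append(node.name)
        nxt = node.next
        if node is last:  # test the node just processed
            break
    return visited
```

With a three-entry ring, `scan_skipping_last` returns only the first two names while `scan_all` returns all three. If the one entry that could make progress happens to be the last on the list, a scan of the flawed shape reports no progress at all, which is consistent with the changelog's description of failing "to do any progress under some rather unfavorable conditions".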